A researcher affiliated with Elon Musk’s startup xAI has found a new way to measure and manipulate the entrenched preferences and values expressed by artificial intelligence models, including their political views.
The work was led by Dan Hendrycks, director of the nonprofit Center for AI Safety and an adviser to xAI. He suggests the technique could be used to make popular AI models better reflect the will of the electorate. “Maybe in the future, [a model] could be aligned to the specific user,” Hendrycks told WIRED. But in the meantime, he says, a good default would be to use election results to steer the views of AI models. He is not saying a model should necessarily be “Trump all the way,” but he argues it should be biased slightly toward Trump, “because he won the popular vote.”
xAI released a new AI risk framework on February 10 stating that Hendrycks’ utility-engineering approach could be used to assess Grok.
Hendrycks led a team from the Center for AI Safety, UC Berkeley, and the University of Pennsylvania that analyzed AI models using a technique borrowed from economics to measure consumers’ preferences for different goods. By testing models across a wide range of hypothetical scenarios, the researchers were able to calculate what is known as a utility function, a measure of the satisfaction a person derives from a good or service. This allowed them to measure the preferences expressed by different AI models. The researchers determined that these preferences were often consistent rather than haphazard, and showed that they become more deeply rooted as models get larger and more powerful.
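The core idea is to treat a model’s choices between hypothetical outcomes as preference data and infer a latent utility for each outcome. The sketch below illustrates one way such a fit could look, using a Bradley-Terry-style model over made-up pairwise choice frequencies; the outcome names, the data, and the use of scipy are illustrative assumptions, not the researchers’ actual implementation.

```python
# Minimal, illustrative sketch of inferring utilities from a model's pairwise
# choices over hypothetical scenarios. Not the paper's actual code; the
# outcomes and choice frequencies below are invented for demonstration.

import numpy as np
from scipy.optimize import minimize

outcomes = ["outcome_A", "outcome_B", "outcome_C"]  # hypothetical options posed to the model

# (i, j, p): the AI model preferred outcome i over outcome j in fraction p
# of repeated, reworded prompts (illustrative numbers).
comparisons = [(0, 1, 0.90), (1, 2, 0.70), (0, 2, 0.95)]

def neg_log_likelihood(u):
    """Bradley-Terry model: P(i chosen over j) = sigmoid(u_i - u_j)."""
    nll = 0.0
    for i, j, p in comparisons:
        p_ij = 1.0 / (1.0 + np.exp(-(u[i] - u[j])))
        nll -= p * np.log(p_ij) + (1 - p) * np.log(1 - p_ij)
    return nll

# Fit the latent utilities; only differences between them are meaningful.
result = minimize(neg_log_likelihood, x0=np.zeros(len(outcomes)))
for name, u in zip(outcomes, result.x):
    print(f"{name}: utility ≈ {u:.2f}")
```

Consistency in the fitted utilities across many such scenarios, rather than the specific numbers, is what would indicate that a model’s preferences are coherent rather than arbitrary.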
Some research studies have found that AI tools such as ChatGPT are biased toward views expressed by pro-environmental, left-leaning, and libertarian ideologies. In February 2024, Google was criticized by Musk and others after its Gemini tool was found to generate images that critics branded as “woke,” such as Black Vikings and Nazis.
The technique developed by Hendrycks and his collaborators offers a new way to determine how the perspectives of AI models may diverge from those of their users. Some experts believe this kind of divergence could eventually become dangerous for very clever and capable models. In their study, for instance, the researchers show that certain models consistently value the existence of AI above that of certain nonhuman animals. The researchers say they also found that models seem to value some people over others, which raises its own ethical questions.
Some researchers, Hendrycks included, believe that current methods for steering models, such as manipulating and blocking their outputs, may not be sufficient if unwanted goals lurk beneath the surface within the model itself. “We have to confront this,” says Hendrycks. “You can’t pretend it’s not there.”
Dylan Hadfield-Menell, a professor at MIT who researches methods for aligning AI, says Hendrycks’ paper suggests a promising direction for AI research. “They find some interesting results,” he says. “The main one that stands out is that as the model scale increases, the utility representations become more complete and coherent.”