These researchers used NPR Sunday Puzzle questions to evaluate AI “reasoning” models

Every Sunday, NPR presenter Will Shortz, the crossword guru of The New York Times, quizzes thousands of listeners in a long-running segment called the Sunday Puzzle. While written to be solvable without too much prior knowledge, the brainteasers are usually challenging even for skilled contestants.

For this reason, some experts believe they are a promising way to test the limits of AI’s problem-solving abilities.

In a recent study, a team of researchers from Wellesley College, Oberlin College, the University of Texas at Austin, Northeastern University, Charles University, and the startup Cursor created an AI benchmark using riddles from Sunday Puzzle episodes. The team says their test surfaced surprising findings about how reasoning models, including OpenAI’s o1, sometimes “give up” and provide answers they know are not correct.

“We wanted to develop a benchmark with problems that people can understand with only general knowledge,” Arjun Guha, a computer science faculty member at Northeastern and one of the co-authors of the study, told TechCrunch.

The AI industry is currently in a benchmarking dilemma. Most tests commonly used to evaluate AI models probe for skills, such as competence on PhD-level math and science questions, that have little relevance to the average user. Meanwhile, many benchmarks, even ones released relatively recently, are quickly approaching the saturation point.

The advantage of a public radio quiz game like the Sunday Puzzle is that it doesn’t test for esoteric knowledge, and the challenges are phrased so that models can’t fall back on “rote memory” to solve them, said Guha.

“I think what makes these problems hard is that it’s really difficult to make meaningful progress on a problem until you solve it, and then everything clicks into place at once,” said Guha. “That requires a combination of insight and a process of elimination.”

No benchmark is perfect, of course. The Sunday Puzzle is US-centric and English-only. And because the quizzes are publicly available, it is possible that models trained on them can “cheat” in a sense, although Guha says he has seen no evidence of this.

“New questions are released every week, and we can expect the latest questions to be truly unseen,” he added. “We intend to keep the benchmark fresh and track how model performance changes over time.”

On the researchers’ benchmark, which consists of around 600 Sunday Puzzle riddles, reasoning models such as o1 and DeepSeek’s R1 far outperform the rest. Reasoning models thoroughly check their own work before giving out results, which helps them avoid some of the pitfalls that normally trip up AI models. The trade-off is that reasoning models take a little longer to arrive at solutions, typically seconds to minutes longer.

At least one model, DeepSeek’s R1, gives solutions it knows to be wrong for some of the Sunday Puzzle questions. R1 will literally say “I give up,” followed by an incorrect answer chosen seemingly at random, behavior many a human can certainly relate to.

The models make other bizarre choices, like giving a wrong answer only to retract it immediately, trying to tease out a better one, and failing again. They also get stuck “thinking” forever, give nonsensical explanations for answers, or arrive at a correct answer right away only to go on weighing alternative answers for no obvious reason.

“On hard problems, R1 literally says that it’s getting ‘frustrated,’” said Guha. “It was fun to see how a model emulates what a person might say. It remains to be seen how ‘frustration’ in reasoning can affect the quality of model results.”

R1 gets “frustrated” by a question in the Sunday Puzzle challenge. Image credits: Guha et al.

The best-performing model on the benchmark is o1, with a score of 59%, followed by the recently released o3-mini set to high “reasoning effort” (47%). (R1 scored 35%.) As a next step, the researchers plan to expand their testing to additional reasoning models, which they hope will help identify areas where these models might be improved.

The scores of the models the team tested on their benchmark. Image credits: Guha et al.

“You don’t need a PhD to reason well, so it should be possible to design reasoning benchmarks that don’t require PhD-level knowledge,” said Guha. “A benchmark with broader access enables a wider set of researchers to comprehend and analyze the results, which in turn may lead to better solutions down the line. And as state-of-the-art models are increasingly deployed in settings that affect everyone, we believe everyone should be able to intuit what these models are and aren’t capable of.”


