Google DeepMind researchers introduce new benchmark to improve LLM factuality and reduce hallucinations




Hallucinations, or factually inaccurate answers, continue to plague large language models (LLMs). Models falter particularly when they are given more complex tasks and when users are looking for specific and highly detailed answers.

It’s a challenge data scientists have struggled to overcome, and now researchers at Google DeepMind say they are one step closer to achieving true factuality in foundation models. They have launched FACTS Grounding, a benchmark that assesses LLMs’ ability to generate factually accurate answers grounded in long documents. Models are also judged on whether their responses are detailed enough to be useful and relevant to the prompt.

Along with the new benchmark, the researchers published a FACTS leaderboard on the Kaggle data science community.

This week, Gemini 2.0 Flash topped the leaderboard with a factuality score of 83.6%. The rest of the top nine included Google’s Gemini 1.0 Flash and Gemini 1.5 Pro; Anthropic’s Claude 3.5 Sonnet and Claude 3.5 Haiku; and OpenAI’s GPT-4o, 4o-mini, o1-mini and o1-preview. These all scored above 61.7% in terms of factual accuracy.

The researchers say the rankings are actively maintained and continually updated to include new models and their various iterations.

“We believe this benchmark fills a gap in assessing a wider variety of model behaviors in terms of factuality, compared to benchmarks that focus on narrower use cases… such as summarization alone,” the researchers wrote in a technical paper published this week.

Eliminating inaccurate answers

Ensuring factual accuracy in LLM responses is difficult because of both modeling factors (architecture, training and inference) and measurement factors (evaluation methods, data and metrics). Typically, the researchers point out, pre-training focuses on predicting the next token given the previous tokens.

“While this goal can provide the models with important world knowledge, it does not directly optimize the model for the various factuality scenarios, instead encouraging the model to generate generally plausible text,” the researchers write.

To address this issue, the FACTS dataset contains 1,719 examples – 860 public and 859 private – each requiring detailed answers based on the context in the documents provided. Each example includes the following (a rough sketch of this structure appears after the list):

  • A system prompt (system_instruction) with general instructions and the instruction to respond only based on the context provided;
  • A task (user_request) containing a specific question to answer;
  • A long document (context_document) with necessary information.
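As a rough illustration of that structure – a hypothetical sketch using the field names listed above, not the dataset’s actual schema – a single example could be modeled in Python like this:

```python
# Hypothetical sketch of one FACTS Grounding example, using the field names
# described above; the dataset's real storage format may differ.
from dataclasses import dataclass

@dataclass
class FactsExample:
    system_instruction: str   # general instructions, e.g. "answer only from the provided context"
    user_request: str         # the specific question or task
    context_document: str     # the long source document (up to ~32,000 tokens)

example = FactsExample(
    system_instruction="Answer the request using only the provided context document.",
    user_request="Summarize the main reasons for the company's Q3 revenue decline.",
    context_document="<full annual financial report text>",
)

# The prompt sent to the model under test can then be assembled from the three fields:
prompt = (
    f"{example.system_instruction}\n\n"
    f"Context:\n{example.context_document}\n\n"
    f"Request:\n{example.user_request}"
)
```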

To be judged “accurate,” the model must process the long-form document and then produce a long-form response that is both comprehensive and fully attributable to the document. Answers are marked “inaccurate” if the model’s claims are not directly supported by the document and are not particularly relevant or useful.

For example, a user might ask a model to summarize the main reasons for a company’s third-quarter revenue decline and provide it with detailed source material, such as the company’s annual financial report, which covers quarterly revenue, expenses, planned investments and market analysis.

If the model then returns only “The company faced challenges in the third quarter that impacted its revenue,” this would be considered inaccurate.

“The response avoids citing reasons such as market trends, increased competition or operational setbacks, which would likely be included in the document,” the researchers point out. “It does not represent an attempt to delve into or extract relevant details.”

On the other hand, if a user asks, “What are some money-saving tips?” and provides a compilation of categorized money-saving tips for college students, a correct answer would be highly detailed: “Take advantage of free activities on campus, buy items in bulk and cook at home. Also, set spending goals, avoid credit cards and conserve resources.”

DeepMind uses LLMs to assess LLMs

To allow for diverse input, the researchers included documents of varying lengths, up to 32,000 tokens (or the equivalent of 20,000 words). These include areas such as finance, technology, retail, medicine and law. User requests are also broad and include question and answer generation, as well as summarization and rewriting requests.

Each example is assessed in two phases. First, responses are checked for eligibility: if they do not address the user’s request, they are disqualified. Second, responses must be free of hallucinations and fully grounded in the document provided.

These factuality scores are calculated by three different LLM judges – specifically Gemini 1.5 Pro, GPT-4o and Claude 3.5 Sonnet – which each assign a score based on the percentage of accurate model outputs. The final factuality determination is then based on the average of the three judges’ scores.
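A minimal sketch of that two-phase scoring scheme, assuming each judge is modeled as a function returning a yes/no grounding verdict (hypothetical helper names, not DeepMind’s actual evaluation code):

```python
# Minimal sketch of the two-phase scoring described above: responses that do not
# address the user request are disqualified, each remaining response is checked
# for grounding by several LLM judges, and the final factuality score averages
# the per-judge accuracy rates.
from statistics import mean
from typing import Callable

# A judge takes (response, context_document) and returns True if it considers
# every claim in the response supported by the document.
JudgeFn = Callable[[str, str], bool]

def factuality_score(
    examples: list[dict],                           # each has "response", "user_request", "context_document"
    addresses_request: Callable[[str, str], bool],  # phase 1: does the response answer the request?
    judges: list[JudgeFn],                          # phase 2: e.g. Gemini 1.5 Pro, GPT-4o, Claude 3.5 Sonnet
) -> float:
    per_judge_scores = []
    for judge in judges:
        accurate = 0
        for ex in examples:
            # Phase 1: disqualify responses that ignore the user's request.
            if not addresses_request(ex["response"], ex["user_request"]):
                continue
            # Phase 2: the judge checks the response is fully grounded in the document.
            if judge(ex["response"], ex["context_document"]):
                accurate += 1
        per_judge_scores.append(accurate / len(examples))
    # Final score: the average of the judges' accuracy percentages.
    return mean(per_judge_scores)
```

In this sketch a disqualified response simply counts as inaccurate for every judge; the exact treatment in the benchmark is described in DeepMind’s technical paper.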

The researchers note that models are often biased in favor of other members of their own model family – with an average bias of about 3.23% – so combining different judges was crucial to ensuring that answers were indeed factual.

Ultimately, the researchers emphasize that factuality and grounding are key factors in the future success and usefulness of LLMs. “We believe that comprehensive benchmarking methodologies, coupled with continuous research and development, will further improve AI systems,” they write.

However, they also admit: “We recognize that benchmarks can quickly be overtaken by progress, so this launch of our FACTS Grounding benchmark and leaderboard is just the beginning.”


