Ensuring responsible AI: An overview of DeepMind’s FACTS framework


Introduction

DeepMind introduces FACTS Grounding, an online leaderboard and benchmark designed to evaluate the factual accuracy of language models' long-form responses with respect to the context provided in user prompts. The benchmark requires models to generate text that fulfills a specific user request and is grounded in a provided document, which can be up to 32,000 tokens long. Evaluation proceeds in two phases: first, responses that do not fulfill the user request are disqualified; second, the remaining responses are judged on how well they are grounded in the provided document. Factuality scores from multiple judge models are aggregated to minimize evaluation bias.
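
As a rough illustration of this flow, the sketch below scores a single response in two phases; the judge functions are hypothetical stubs standing in for calls to the actual judge LLMs and are not part of any published FACTS Grounding API.

```python
# Illustrative sketch of the two-phase evaluation flow, not the official
# implementation. The judge functions are hypothetical stubs.
from statistics import mean

def judge_eligibility(judge_name: str, user_request: str, response: str) -> bool:
    """Stub: would ask the judge model whether the response fulfills the request."""
    return len(response.strip()) > 0  # placeholder logic only

def judge_grounding(judge_name: str, context_doc: str, response: str) -> float:
    """Stub: would ask the judge model whether the response is supported by the context."""
    return 1.0  # placeholder verdict: 1.0 = grounded, 0.0 = not grounded

def evaluate_response(context_doc: str, user_request: str,
                      response: str, judge_models: list[str]) -> float:
    # Phase 1: disqualify responses that do not fulfill the user request.
    if not all(judge_eligibility(j, user_request, response) for j in judge_models):
        return 0.0  # ineligible responses earn no factuality credit
    # Phase 2: score grounding in the context document and average across
    # judges to reduce the bias of any single judge model.
    return mean(judge_grounding(j, context_doc, response) for j in judge_models)

score = evaluate_response("...context document...", "Summarize the document.",
                          "A grounded summary.", ["judge_a", "judge_b", "judge_c"])
```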

Table 1- Examples from FACTS Grounding (Public)

Factuality in LLMs

Factuality in LLMs refers to the ability to generate factually accurate responses in information-seeking scenarios. This research focuses on two main scenarios: factuality with respect to a given context (user requests and grounding documents) and factuality with respect to external sources and general world knowledge. The primary challenge in achieving factuality lies in both modeling (architecture, training, and inference) and measurement (evaluation methodology, data, and metrics). While LLM pretraining may teach models salient world knowledge, it does not directly optimize for factuality; its next-word prediction objective rewards generally plausible text, whether or not that text is factually correct.
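
For context, the standard autoregressive pretraining objective (a textbook formulation, not specific to this paper) makes this explicit: the loss rewards likely next tokens and contains no term that rewards truthfulness.

```latex
% Standard next-token prediction loss: maximize the likelihood of each token
% given the preceding context. Nothing in this objective directly rewards
% factual correctness; plausible continuations score well whether or not
% they are true.
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_{<t}\right)
```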

Modeling and measurement challenges

LLMs are inherently difficult to optimize for factual accuracy due to the nature of their pretraining objectives. Post-training methods, such as supervised fine-tuning and reinforcement learning, can improve factuality but may compromise other desirable attributes like creativity and novelty. Measurement is challenging because it requires a thorough inspection of each claim in the models’ responses, especially in long-form generation tasks.

Benchmarking with FACTS Grounding

The FACTS Grounding benchmark focuses on evaluating the factuality of model responses with respect to a provided context document. It encompasses a wide range of user requests, including summarization, fact-finding, and information analysis, while ensuring responses are fully grounded in the input document. The benchmark includes 860 public examples and 859 private examples, with prompts requiring long-form input and output. Each example includes a context document, a user request, and system instructions for generating responses exclusively from the provided context.
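
Conceptually, a single benchmark example can be pictured as the following record; the field names are illustrative assumptions, not the released dataset's actual schema.

```python
# Rough shape of a single FACTS Grounding example; field names are
# illustrative and may differ from the released dataset's schema.
from dataclasses import dataclass

@dataclass
class FactsGroundingExample:
    system_instruction: str   # e.g. "Answer only from the provided context."
    user_request: str         # e.g. a summarization or fact-finding request
    context_document: str     # grounding document, up to ~32,000 tokens

example = FactsGroundingExample(
    system_instruction="Answer the question using only the document below.",
    user_request="Summarize the key findings of the report.",
    context_document="(long source document goes here)",
)
```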

Figure 1- Distributions of context domain and of task requested by the user as a percent of the total set of prompts in the benchmark.

Data quality and metrics

The evaluation set was designed to ensure models rely exclusively on the provided context, with non-trivial user requests that do not require domain expertise or complex reasoning. The final dataset contains context documents with a mean length of 2,500 tokens and a maximum length of 32,000 tokens. The factuality score is calculated by aggregating verdicts from multiple judge models after disqualifying ineligible responses.
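
A simplified sketch of how such a dataset-level score could be computed is shown below; it zeroes out disqualified responses and averages per-example judge verdicts. The field names and aggregation details are assumptions, not the paper's exact implementation.

```python
# Sketch of dataset-level scoring: disqualified responses contribute 0,
# eligible responses contribute the mean verdict of the judge models.
def factuality_score(results: list[dict]) -> float:
    """results: one dict per benchmark example with keys
    'eligible' (bool) and 'judge_verdicts' (list of 0/1 votes)."""
    per_example = []
    for r in results:
        if not r["eligible"]:
            per_example.append(0.0)  # disqualified response
        else:
            votes = r["judge_verdicts"]
            per_example.append(sum(votes) / len(votes))
    return 100.0 * sum(per_example) / len(per_example)  # percentage over the set

print(factuality_score([
    {"eligible": True, "judge_verdicts": [1, 1, 0]},
    {"eligible": False, "judge_verdicts": [1, 1, 1]},
]))
```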

Table 2- Evaluation of different judge models and evaluation prompts on a private test set (N=402, class ratio 87:13). The prompt template chosen for each model (selected by Macro-F1) is shown in bold.

Results

The tested models included various versions of Gemini and GPT-4o, among others. A ranking aggregation method combined the individual judge models' rankings into a unified ranking using a Condorcet algorithm. Using multiple judge models increases computational cost but is essential to mitigate judge bias: the authors found that models generally rate their own outputs higher than those of other models, with a mean score increase of +3.23% when a model acts as judge of its own responses. Disqualifying ineligible responses reduced the final factuality score by 1%-5% and caused minor shifts in model rankings.
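
The paper states only that a Condorcet algorithm is used; as an illustration, the sketch below aggregates per-judge rankings with Copeland's method, one Condorcet-consistent rule. The choice of this specific rule is an assumption, not the paper's documented procedure.

```python
# Combine per-judge rankings into one ranking with a Condorcet-consistent
# rule (Copeland's method): rank models by their number of pairwise wins.
from itertools import combinations

def copeland_ranking(rankings: list[list[str]]) -> list[str]:
    """rankings: each inner list orders model names from best to worst,
    one list per judge model; all lists contain the same models."""
    models = rankings[0]
    wins = {m: 0 for m in models}
    for a, b in combinations(models, 2):
        # Count how many judges rank a above b.
        a_pref = sum(r.index(a) < r.index(b) for r in rankings)
        b_pref = len(rankings) - a_pref
        if a_pref > b_pref:
            wins[a] += 1
        elif b_pref > a_pref:
            wins[b] += 1
    return sorted(models, key=lambda m: wins[m], reverse=True)

print(copeland_ranking([
    ["model_a", "model_b", "model_c"],
    ["model_b", "model_a", "model_c"],
    ["model_a", "model_c", "model_b"],
]))
```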

Table 3- FACTS Grounding results representing the unadjusted factuality score (no disqualification). Results are reported with a 95% confidence interval.
Table 4- FACTS Grounding results representing the final factuality score after disqualifying ineligible responses. Results are reported with a 95% confidence interval.

Conclusion

The FACTS Grounding leaderboard is a rigorous benchmark for evaluating the factual accuracy of language models in generating long-form responses grounded in provided documents. It highlights the importance of treating factuality as an explicit evaluation criterion, since models are not directly trained to optimize for it. This work encourages researchers to use the benchmark to advance the factual capabilities of models and to improve methodologies for evaluating factuality.

Maxim is a platform for testing and evaluating LLM applications. Test your Gen AI application's performance with Maxim.