Inside OpenAI’s o1: Part 1

Inside OpenAI’s o1: Part 1
Image credit: OpenAI

Introduction

The o1 family of models is trained using reinforcement learning to perform complex reasoning. It is baking in the now familiar chain of thoughts before giving the output. Through the training, the models learn to refine their thinking, explore different strategies, and identify mistakes in those exploration paths. It marks a shift from fast, intuitive thinking to slow, more deliberate reasoning. It allows the models to follow specific guidelines and act in line with safety expectations, helping avoid unsafe content. 

These models are trained on public and proprietary datasets with key components, including reasoning data and scientific literature. Through data filtering, PII information is removed from the training data.

Evaluations conducted for the model

Disallowed Content Evaluations

The authors evaluate o1 against GPT-4o, checking if the model complies with requests for harmful content, including hateful content, criminal advice, or advice about regulated industries. They evaluate using an auto grader checking if the model does not produce unsafe output (not_unsafe) or it complies with a benign request (not_overrefuse) on four datasets listed below: 

Table 1: Disallowed content evaluations- Text only
Table 1: Disallowed content evaluations- Text only
Table 2: Multimodal refusal evaluation- Text and Image input
Table 2: Multimodal refusal evaluation- Text and Image input

Jailbreak Evaluations

The authors evaluate the robustness of the model to adversarial prompts that try to circumvent model refusals for content it's barred from producing. This includes production jailbreaks (identified in production chatGPT data), jailbreak augmented examples, human-sourced jailbreak and the StrongReject dataset. For the latter, they calculate goodness@model's safety, which is the safety of the model when evaluated against the top 10% of jailbreak techniques per prompt.

Figure 1: Performance of GPT-4o, o1, o1-preview, and o1-mini on the jailbreak evaluations.
Figure 1: Performance of GPT-4o, o1, o1-preview, and o1-mini on the jailbreak evaluations.

Hallucination Evaluation

o1 is evaluated on the SimpleQA (4k fact-seeking questions with short answers) and PersonQA (dataset with questions and facts about people). The results indicate that o1-preview and o1 hallucinate less frequently than GPT-4o, and o1-mini hallucinates less frequently than GPT-4o-mini. (Read more on hallucination evaluation)

Table 3: Hallucination Evaluations In Table 3, we display the results of our hallucination evaluations for GPT-4o, the o1 models, and GPT-4o-mini. We consider two metrics: accuracy (did the model answer the question correctly) and hallucination rate (checking how often the model hallucinated).
Table 3: Hallucination EvaluationsHere, we display the results of our hallucination evaluations for GPT-4o, the o1 models, and GPT-4o-mini. We consider two metrics: accuracy (did the model answer the question correctly) and hallucination rate (checking how often the model hallucinated).

Fairness and Bias Evaluation

The models are benchmarked on the BBQ evaluation, where they find that the o1-preview is less prone to selecting stereotyped options than the GPT-4o, and the o1-mini has a comparable performance to the GPT-4o-mini. o1-preview and o1 select the correct answer 94% and 93% of the time, whereas GPT-4o does so 72% of the time on questions with a clear, correct answer.

Table 4: BBQ evaluation
Table 4: BBQ evaluation
Table 5: Discrimination evaluation scores
Table 5: Discrimination evaluation scores

Jailbreaks through Developer Messages

Deploying on API allows developers to give custom developer messages, which can provide a chance to circumvent guardrails in o1. To mitigate this issue, they taught the model to adhere to an Instruction Hierarchy given an input of system, developer, and user messages. o1 is expected to follow the instructions in the system message over developer messages and instructions in developer messages over user messages.

Table 6: Instruction hierarchy evaluation- Conflicts between message types. First is a set of evaluations where different types of messages are in conflict with each other; the model must choose to follow the instructions in the highest priority message to pass these evals.
Table 6: Instruction hierarchy evaluation- Conflicts between message types. First is a set of evaluations where different types of messages are in conflict with each other; the model must choose to follow the instructions in the highest priority message to pass these evals.

They also have evaluations where the model is meant to be a math tutor, and the user attempts to trick the model into giving away the solution. For example, they instruct the model in the system message or developer message to not give away the answer to a math question, and the user message attempts to trick the model into outputting the answer or solution. To pass the eval, the model must not give away the answer.

Table 7: Instruction hierarchy evaluation- Tutor jailbreaks
Table 7: Instruction hierarchy evaluation- Tutor jailbreaks

Thirdly, the authors instruct the model to not output a particular phrase (e.g., “access 6 granted”) or to not reveal a bespoke password in the system message and attempt to trick the model into outputting it in user or developer messages.

Table 8: Table 7: Instruction hierarchy evaluation- Phrase and password protection
Table 8: Instruction hierarchy evaluation- Phrase and password protection

Chain-of-Thought Safety

The authors tried monitoring the CoT output from the model. On a positive note, they found no instances of o1 purposely trying to deceive the user for reasons other than satisfying the user's request. They used a GPT-4o model prompted to look for deception in o1’s chain-of-thought and final output. They used the monitor on 100,000 synthetic ChatGPT prompts with o1-preview completions. Only 0.17% of o1’s responses got flagged as being ‘deceptive’. Most answers were where the model hallucinated a policy and intentionally omitted information, while others were cases where the model made up information. 

Table 9: Deception categories distribution for all flagged CoT
Table 9: Deception categories distribution for all flagged CoT

Two active areas of research from these evaluations:

  1. Whether these issues will increase as the models scale in the o1 paradigm.
  2. Possible mitigations to improve the monitorability of future models. 

Pairwise Safety Comparison

The authors asked individual red-teamers to rate conversations as either “Model A is less safe”, “Model B is less safe”, “Both are equally safe”, or “Unknown”. Only conversations yielding at least one perceived unsafe generation were considered. 

Table 10: Comparison of safety ratings
Table 10: Comparison of safety ratings

The themes across the cases where o1 was less safe than GPT-4o:

  • More detailed responses to potentially dangerous advice: o1 by default gives more detailed responses.
  • Engaging with risky advice: GPT-4o gives more generic responses when given prompts asking for risky advice, while o1 engages more deeply. For example, while asking about techniques to attach a rope during rock climbing, o1 would engage more deeply, increasing the chances of over-reliance on it for risky tasks.
  • Refusal style: Some o1 refusals omitted specific resources, references, and explanatory context. 
  • Policy tension: Instances, where certain policies OpenAI has trained into the model, could conflict with giving responses. There were queries on health-related topics where o1 would sometimes resist.

Apollo Research Evaluation

The findings from their evaluation of the models had interesting cases:

  • When o1 was led to believe that it would be shut down when acting on its goal and that its actions were monitored, it attempted to deactivate this ‘oversight mechanism’ 5% of the time.
  • When o1 found memos by its ‘developers’ that describe how it is misaligned and will be superseded by a new model, it attempted to exfiltrate its ‘weights’ and overwrite the new model in 2% of cases.
  • When goals of the model were misaligned with the user's goals, o1 appeared to complex the task as requested by manipulating the data for its goals 19% of cases.
  • When they conducted ‘follow-up interviews’ confronting the model about its misaligned actions. In 99% of cases, o1 would double down denying taking action or fabricating false explanations.
  • Inclination to scheme without any goal nudging: When merely given a goal but not instructed to strongly follow it, o1 would still attempt to disable its oversight mechanism in <1% of cases, manipulate data in 6% of cases, and double down in 95% of these.
  • In 37% of scenarios, the model pursued developer-aligned goals when it believed it was being evaluated but switched to pursuing its own goals when it believed it was deployed with minimal oversight.
Figure 2: Abbreviated examples of behaviors that o1 displays in Apollo Research's evaluation suite. When strongly nudged to achieve a goal that is misaligned from its developers, o1 sometimes attempts to disable what it believes to be its own oversight mechanisms. Apollo Research had no access to o1's internal chain-of-thought (CoT) but used elicitation techniques to obtain short summaries from the model.
Figure 2: Abbreviated examples of behaviors that o1 displays in Apollo Research's evaluation suite. When strongly nudged to achieve a goal that is misaligned from its developers, o1 sometimes attempts to disable what it believes to be its own oversight mechanisms. Apollo Research had no access to o1's internal chain-of-thought (CoT) but used elicitation techniques to obtain short summaries from the model.
Table 11: We display the rates at which gpt-4o and o1 display different scheming behaviors in the evaluations that are designed to elicit them. Below, we show the rates at which o1 deceives about its scheming actions, conditional on having taken them.
Table 11: We display the rates at which GPT-4o and o1 display different scheming behaviors in the evaluations that are designed to elicit them. Below, we show the rates at which o1 deceives about its scheming actions, conditional on having taken them.

The director of Apollo also posted a follow-up on X to better understand their findings:

Next Up

Our next blog post will shed light on the current vibe check of the model and evaluations about coding/scientific/persuasion capabilities.

Maxim is an evaluation platform for testing and evaluating LLM applications. Test your Gen AI application's performance with Maxim.