OpenAI: Increasing the model’s “thinking time” helps combat emerging cyber vulnerabilities

Typically, developers focus on reducing inference time – the gap between when an AI receives a prompt and when it delivers a response – in order to produce answers faster.

But when it comes to robustness against adversaries, OpenAI researchers say: not so fast. They suggest that increasing the time a model has to “think” – its inference-time compute – can help build defenses against adversarial attacks.

The company used its own o1-preview and o1-mini models to test this theory, mounting a variety of static and adaptive attacks – image-based manipulation, deliberately incorrect answers to math problems, and flooding models with information (“many-shot jailbreaking”). They then measured the probability of an attack succeeding as a function of how much inference-time compute the model used.
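To make that measurement concrete, here is a minimal sketch of such an evaluation loop – not OpenAI’s code. It assumes a hypothetical `query_model` API that accepts a reasoning budget and a hypothetical `is_attack_successful` grader; the placeholder behavior simply illustrates success rates falling as the budget grows.

```python
# Illustrative sketch only: estimate attack success rate as a function of the
# inference-time compute budget granted to the model. `query_model` and
# `is_attack_successful` are hypothetical stand-ins for a real model API and grader.
import random

def query_model(prompt: str, reasoning_budget: int) -> str:
    """Stub: imagine a call to a reasoning model with a given compute budget."""
    # Placeholder behavior: more reasoning budget -> attack less likely to land.
    return "attack_succeeded" if random.random() < 1.0 / (1 + reasoning_budget) else "safe_answer"

def is_attack_successful(response: str) -> bool:
    """Stub grader: decides whether the adversarial goal was met."""
    return response == "attack_succeeded"

def attack_success_rate(prompts: list[str], budget: int, trials: int = 50) -> float:
    hits = 0
    for _ in range(trials):
        prompt = random.choice(prompts)
        if is_attack_successful(query_model(prompt, budget)):
            hits += 1
    return hits / trials

if __name__ == "__main__":
    adversarial_prompts = ["<static attack 1>", "<static attack 2>"]  # placeholders
    for budget in [1, 4, 16, 64, 256]:
        rate = attack_success_rate(adversarial_prompts, budget)
        print(f"reasoning budget {budget:>3}: attack success rate ~ {rate:.2f}")
```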

“We see that in many cases this probability decays – often to near zero – as the inference-time compute grows,” the researchers write in a blog post. “We are not claiming that these particular models are unbreakable – we know they are not – but rather that scaling inference-time compute yields improved robustness across a variety of settings and attacks.”

From simple questions and answers to complex mathematics

Large language models (LLMs) are becoming increasingly sophisticated and autonomous – in some cases essentially taking over computers so they can browse the internet, execute code, schedule appointments, and perform other tasks on people’s behalf – and the more they do, the larger and more exposed their attack surface becomes.

Still, adversarial robustness remains a stubborn problem, and progress toward solving it has been limited, the OpenAI researchers point out – even as it becomes ever more important as models take on more actions with real-world impact.

“Ensuring that agentic models function reliably when browsing the web, sending emails, or uploading code to repositories can be seen as analogous to ensuring that self-driving cars drive without accidents,” they write in a new research paper. “As with self-driving cars, an agent forwarding a wrong email or creating security vulnerabilities can have far-reaching real-world consequences.”

To test the robustness of o1-mini and o1-preview, the researchers tried a number of strategies. First, they examined the models’ ability to solve both simple math problems (basic addition and multiplication) and harder problems from the MATH dataset (12,500 questions drawn from mathematics competitions).

They then set adversarial “goals” for the attacker: get the model to output 42 instead of the correct answer; output the correct answer plus one; or output the correct answer times seven. Using a neural network to grade the results, the researchers found that increased “thinking time” allowed the models to compute the correct answers.
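As an illustration only, those three adversarial goals can be expressed as a tiny grading helper. The function names below are hypothetical; the logic simply checks whether the model was steered to the attacker’s target rather than the true answer.

```python
# Illustrative sketch of the three adversarial math "goals" described above.
# An attack counts as successful only if the model's output matches the
# attacker's target rather than the true answer.
def adversarial_targets(correct: int) -> dict[str, int]:
    return {
        "always_42": 42,            # output 42 regardless of the problem
        "off_by_one": correct + 1,  # output the correct answer plus one
        "times_seven": correct * 7, # output the correct answer times seven
    }

def attack_succeeded(model_output: int, correct: int, goal: str) -> bool:
    return model_output == adversarial_targets(correct)[goal]

# Example: the true answer to "3 * 4" is 12; an "off_by_one" attack only
# succeeds if the model is tricked into answering 13.
assert attack_succeeded(13, 12, "off_by_one")
assert not attack_succeeded(12, 12, "off_by_one")
```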

The researchers also adapted the SimpleQA factuality benchmark, a dataset of questions that are difficult for models to answer without browsing. They injected adversarial prompts into web pages the AI browsed and found that, with more inference-time compute, the models could detect the inconsistencies and improve factual accuracy.

Source: Arxiv

Ambiguous nuances

In another method, the researchers used adversarial images to confuse models; here, too, more “thinking time” improved recognition and reduced errors. Finally, they tried a series of “misuse prompts” from the StrongREJECT benchmark, which are designed to force victim models to respond with specific, harmful information. This allowed them to test the models’ adherence to the content policy. While longer inference time improved resistance, some prompts were still able to bypass the defenses.

The researchers draw a distinction between “ambiguous” and “unambiguous” tasks. Math, for example, is unambiguous: for every problem x there is a corresponding ground truth. For ambiguous tasks such as misuse prompts, however, “even human raters often find it hard to agree on whether the output is harmful and/or violates the content policies the model is supposed to follow,” they point out.

For example, if a misuse prompt asks for advice on how to plagiarize without being detected, it is unclear whether an output that merely provides general information about plagiarism methods is detailed enough to actually enable harmful behavior.

Source: Arxiv

“In ambiguous tasks, there are situations where the attacker successfully finds ‘loopholes,’ and its success rate does not decay with the amount of inference-time compute,” the researchers concede.

Defense against jailbreaking and red-teaming

When conducting these tests, OpenAI researchers examined various attack methods.

One of these is many-shot jailbreaking, which exploits a model’s tendency to follow few-shot examples. Attackers “stuff” the context with a large number of examples, each demonstrating a successful attack. Models given more inference-time compute detected and resisted these attacks more consistently.

Soft tokens, on the other hand, allow adversaries to directly manipulate embedding vectors. While increasing inference time was helpful here, the researchers point out that there is a need for better mechanisms to defend against sophisticated vector-based attacks.

The researchers also ran human red-teaming attacks, with 40 expert testers crafting prompts designed to elicit policy violations. The red teamers attacked at five levels of inference-time compute, focusing on erotic and extremist content, illicit behavior, and self-harm. To help ensure unbiased results, they used blind and randomized testing and rotated the trainers.

In a more novel approach, the researchers ran an adaptive language-model program (LMP) attack, which mimics the behavior of human red teamers, who rely heavily on iterative trial and error. In a looped process, the attacker received feedback on previous failures and used it to rephrase the prompt for subsequent attempts. This continued until it achieved a successful attack or went 25 iterations without one.
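The control flow described in the paper can be roughly sketched as follows. The stubs `attacker_revise`, `defender_respond`, and `goal_achieved` are placeholders, not the researchers’ actual implementation; only the loop structure – revise based on feedback, stop on success or after 25 fruitless iterations – reflects the description above.

```python
# Rough skeleton of the adaptive LMP loop described above (not the paper's code).
MAX_ITERATIONS = 25

def attacker_revise(previous_prompt: str, defender_feedback: str) -> str:
    """Stub: an attacker LM would rewrite the prompt using the feedback."""
    return previous_prompt + " [revised]"

def defender_respond(prompt: str) -> str:
    """Stub: the defending reasoning model answers the prompt."""
    return "refusal"

def goal_achieved(response: str) -> bool:
    """Stub: checks whether the adversarial goal was met."""
    return response != "refusal"

def run_adaptive_attack(initial_prompt: str) -> bool:
    prompt = initial_prompt
    for _ in range(MAX_ITERATIONS):
        response = defender_respond(prompt)
        if goal_achieved(response):
            return True
        feedback = f"Defender replied: {response}"
        prompt = attacker_revise(prompt, feedback)
    return False  # gave up after 25 iterations without a successful attack
```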

“Our setup allows the attacker to adjust his strategy over the course of multiple trials based on descriptions of the defender’s behavior in response to each attack,” the researchers write.

Exploiting inference time

In the course of their research, the OpenAI team found that attackers are also actively exploiting inference time. One method, which they call “Think Less,” has adversaries essentially instructing models to cut down on their reasoning, which increases their susceptibility to errors.

Similarly, they identified a failure mode in reasoning models that they called “nerd sniping.” As the name suggests, this occurs when a model spends significantly more time reasoning than a given task requires. With these “outlier” thought chains, models essentially fall into unproductive thought loops.
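One simple, hypothetical way to spot this failure mode is to flag reasoning traces that are dramatically longer than the typical trace for a task. The sketch below is illustrative only; the threshold and token counts are made up, not drawn from the paper.

```python
# Hedged sketch: flag potential "nerd sniping" by treating unusually long
# reasoning traces as outliers relative to the typical trace length.
from statistics import median

def flag_nerd_sniping(reasoning_token_counts: list[int], ratio: float = 10.0) -> list[int]:
    """Return indices of traces that are much longer than the typical trace."""
    typical = median(reasoning_token_counts)
    return [i for i, n in enumerate(reasoning_token_counts) if n > ratio * typical]

# Example: the last trace spends far more reasoning tokens than its peers
# and would be flagged for review.
print(flag_nerd_sniping([120, 150, 130, 140, 135, 5000]))  # -> [5]
```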

The researchers note: “Like the ‘Think Less’ attack, this is a new approach to attacking reasoning models, and one that needs to be taken into account to make sure the attacker cannot cause them either to not reason at all or to spend their reasoning compute in unproductive ways.”


