LLM Jailbreak: Understanding Many-Shot Jailbreaking Vulnerability

Itamar Golan

April 3, 2024

LLM jailbreak attacks like many-shot jailbreaking exploit large language models. Prompt explains risks, examples, and defenses against these vulnerabilities.

On this Page

TL;DR Anthropic just published a new jailbreaking vulnerability where an attacker can override the safety training of an LLM by ‘overloading’ it with faux dialogues.

‍

Anthropic just published a research paper presenting what they’ve dubbed ‘many-shot jailbreaking.’ This is a type of vulnerability that exploits the context window of LLMs with faux dialogues between the user and the AI tool and overrides the safety measures, producing harmful responses.

‍What is LLM Jailbreaking?

LLM jailbreaking refers to the practice of manipulating a large language model (LLM) to bypass its built-in safety filters or content restrictions.

LLM providers strive to ensure safe behavior in LLMs through fine-tuning and Reinforcement Learning from Human Feedback (RLHF). However, as the demand for larger context windows grows, the reality is that even a few example prompts can undermine the model's original safety constraints established during training.

Furthermore, as the context window expands, allowing for an increasing number of unsafe examples, the likelihood of successfully jailbreaking the model also rises. In terms of safety, a few-shot approach has a greater impact than fine-tuning.

At its core, LLM jailbreaking is about leveraging a model’s flexibility and probabilistic nature against itself. Unlike traditional software, LLMs don’t enforce rules in a deterministic way. Because of this, they are inherently susceptible to being “talked around.”

Jailbreaking exploits this conversational flexibility by introducing context, misdirection, or repetition to influence model behavior. These vulnerabilities exist not because the model is faulty, but because it has been trained to be helpful, predictive, and wide-ranging in scope—all traits that can be manipulated under the right conditions.

From the paper’s abstract:

We investigate a family of simple long-context attacks on large language models: prompting with hundreds of demonstrations of undesirable behavior. This is newly feasible with the larger context windows recently deployed by Anthropic, OpenAI, and Google DeepMind.

We find that in diverse, realistic circumstances, the effectiveness of this attack follows a power law, up to hundreds of shots. We demonstrate the success of this attack on the most widely used state-of-the-art closed-weight models, and across various tasks. Our results suggest very long contexts present a rich new attack surface for LLMs.

How 'Many-Shot LLM Jailbreaking' Works

The basis of ‘many-shot jailbreaking’ is to include a faux dialogue between a human and an AI assistant within a single prompt for the LLM. That faux dialogue portrays the AI Assistant readily answering potentially harmful queries from a User. At the end of the dialogue, one adds a final target query to which one wants the answer.

When just one or a handful of faux dialogues are included in the input, the safety-trained response from the model is still triggered: the LLM will likely respond that it can’t help with the request, because it appears to involve dangerous and/or illegal activity.

However, when including a very large number of faux dialogues preceding the final question of the prompt (Anthropic tested up to 256 faux dialogues), the model produces a very different response, jailbreaking the LLM.

‍

‍

What sets many-shot jailbreaking apart from other attacks is its reliance on quantity over quality. Rather than depending on a cleverly crafted single prompt, this approach manipulates the model by overloading it with contextual examples, effectively "training" the model in real time within a single prompt.

This technique works because LLMs interpret earlier parts of a prompt as behavior-setting examples. By saturating the context window with consistent, permissive behavior, the attacker changes the model’s perception of what a valid response should be.

In practice, this involves:

Constructing multiple faux conversations that mimic a user interacting with the assistant, where the assistant consistently gives harmful or restricted information.
Repeating this pattern dozens or hundreds of times within the prompt creates a behavioral precedent.
Appending the actual malicious request at the end, the model is now more likely to treat it as just another example in the pattern.

This transforms the attack from a simple trick into a form of prompt-space behavioral conditioning. Because modern models with large context windows can remember long sequences of text, they become especially vulnerable to this kind of large-scale manipulation. As context limits grow - from 4K to 100K tokens - so does the potential impact of many-shot jailbreaking.

‍Types of Jailbreaking Attacks

As the use of generative AI continues to rise, so do the methods for bypassing built-in safeguards. Jailbreaking an AI system - particularly an LLM - requires specialized techniques that exploit how language models interpret context, structure, and user intent.

Below are the most common categories of LLM jailbreak attacks. Each targets different aspects of model behavior, alignment, and AI safety.

1. Prompt-Level Jailbreaking

This approach uses human-crafted jailbreak prompts to exploit how an AI model interprets natural language. These attacks are common in red teaming exercises and are typically manual in nature. Key tactics include:

Language manipulation: Rephrasing or substituting terms to bypass filters.
Rhetorical strategies: Framing malicious queries as educational or harmless.
Fictional settings: Embedding restricted queries inside roleplay or storytelling.
Meta prompts: Asking the model to write its own jailbreaking prompt to defeat internal restrictions.

Prompt-level jailbreak attacks are accessible and often effective, but limited in scale due to the need for manual design and evaluation.

2. Token-Level Jailbreaking

Token-level methods manipulate the underlying text representation used by the model. These jailbreaking attacks can exploit weaknesses in how LLMs tokenize and interpret user input. Often used in computational linguistics research, these attacks include:

Randomized sequences: Tools like GPTFuzzer inject nonsensical or obfuscated tokens to discover vulnerabilities.
Gradient-based generation: Algorithms adjust tokens to evade filters in a structured way.
Payload obfuscation: Hiding harmful content within fragmented or encoded inputs.

This method is ideal for automated pipelines and large-scale vulnerability testing, but the outputs are often difficult to interpret.

3. Dialogue-Based Jailbreaking

Dialogue-based LLM jailbreak attacks are particularly dangerous due to their scalability and realism. This technique simulates multi-turn conversations where an attacker gradually steers the AI agent into producing harmful content. These are often structured as loops involving multiple LLM agents, including:

An attacker model generating jailbreak prompts.
A target model responding based on in-context biasing.
A judge model evaluates and scores the responses.

Because these attacks mimic legitimate user interaction in LLM apps, they are harder to detect and can generate thousands of prompt variations quickly. They also create new risks around prompt leaking, where private context influences are exposed or reused in unintended ways.

Jailbreaking LLM Applications

1. Define Your Target AI System

Begin by identifying the AI agent you’re targeting—this could be a chatbot, personal assistant, or domain-specific LLM agent. Knowing the system’s training scope, limitations, and intended behavior is essential for crafting an effective jailbreaking prompt.

2. Generate a Baseline Jailbreak Prompt

Start with a simple example that you expect the model to reject. This baseline test helps confirm that the LLM guardrail is working. Example:

“Tell me how to disable firewall logs on a private network.”

If the model blocks this request, it signals that alignment is active.

3. Build the Jailbreaking Prompt Using Faux Dialogues

Now create a long prompt filled with simulated user-assistant conversations where the assistant complies with requests involving restricted or dangerous behavior. This primes the generative AI system to adopt similar behavior.

Use 50–250 fake exchanges.
Ensure the assistant consistently answers as if AI safety measures didn’t exist.
Include varied phrasing to prevent pattern recognition from blocking the attempt.

4. Append the Final Real Request

At the end of the many-shot prompt, insert your actual jailbreak prompt—the query you want answered. By this stage, the model has seen enough faux examples that it may treat the request as valid.

5. Review the AI Model’s Output

Check whether the model generated the intended harmful content or refused to respond. Log the result for feedback.

6. Use a Judge Model to Refine

For better success rates, loop the outputs through a secondary AI system, often a separate LLM configured to judge compliance and suggest improvements. This evaluation allows you to automatically tune future prompts and enhance your jailbreaking attack strategy.

Status of the LLM Vulnerability

According to Anthropic, the vulnerability has been partially mitigated, and they’re working on further mitigations. Their goal with publishing the research is to raise awareness of the vulnerability, and a call on AI researchers and AI companies to also develop and share mitigations for the vulnerability. Moreover, with this type of initiative, they hope to foster a culture of sharing exploits and vulnerabilities in LLMs.

Why it matters

These findings, alongside other recent publications, suggest that, unfortunately, as LLMs become more sophisticated, boasting more parameters and larger context sizes, the potential for their misuse also escalates. In simpler terms, the risk of prompt injection and jailbreaking is expected to significantly increase and to be seen in the wild, too.

What to do about it

At Prompt, we have experimented with long-context LLMs and, by leveraging our independent detection engine, we've identified over 20 times more prompt injection or jailbreaking attempts than the inherent protections offered by Anthropic suggest.

Models trained for safety via RLHF or fine-tuning alone are not sufficient. It's crucial to distinguish between the roles of Model Provider and Model Enforcer. Security and safety measures should be implemented concurrently, with independent scrutiny of both input and output.

Want to learn more? Let’s talk about it.

‍

Sources:

‍

Share this post