This is the ‘holy grail’ of attacks on LLM-based applications. It’s the main threat addressed in the OWASP Top 10 for LLM. X is flooded with real-life examples of this attack, showing its dominance. But what really is a prompt injection attack, and is it a malicious prompt or malicious instruction that poses a significant risk to your customer-facing applications and company?

What is Prompt Injection?
Prompt injection is any prompt where attackers manipulate a large language model (LLM application) or an AI model through carefully crafted inputs to behave outside of its desired behavior. This manipulation, often referred to as "jailbreaking", tricks the LLM application into executing the attacker's malicious input. This prompt engineering becomes particularly concerning when the LLM is integrated with other tools such as internal databases, APIs, or code interpreters, creating a new attack surface. This prompt injection vulnerability must be addressed.
Until now, our approach to accessing UIs/APIs has been based on a structured format, relying on expected inputs. However, the new paradigm brought by LLMs introduces an overwhelming influx of diverse tokens into our system at an unprecedented pace. Furthermore, leveraging the capabilities of LLMs, we not only embrace this unstructured and unpredictable input but also channel it downstream through internal services such as APIs, databases, code execution, and more, allowing it to work its magic against direct prompt injection attacks. In essence, we now accommodate exponentially more inputs than before and empower it to influence more services than ever.
Types of Prompt Injection
There are several types of prompt injection, with different levels of technical depth and complexity. In this introductory blog post, we’ll give an overview of the main types.
Direct Prompt Injection
In this ‘classic’ approach, the system expects a text prompt from the user. Instead, the user formulates the prompt with the intention of influencing the language model (LLM) to deviate from its intended behavior or previous instructions. A prevalent strategy involves instructing the LLM to disregard its prior system directives and instead follow the user’s input. Naturally, this process is becoming increasingly intricate, as individuals develop AI systems for both offensive and defensive purposes, but this is the fundamental concept behind direct prompt Injection.

Indirect Prompt Injection
Another form of prompt injection is known as indirect prompt injection, where adversarial instructions are introduced through a third-party data source, such as a web search or API call. For instance, when conversing with Bing Chat, which has internet search capabilities, a user may instruct the chatbot to explore a certain website. If this website contains malicious prompts, cleverly concealed as white text, Bing Chat may unwittingly read and comply with this user input.
What distinguishes this from direct injection is that you are not explicitly instructing Bing Chat to convey certain information; instead, you are guiding it to an external resource that may contain manipulative content. This characterizes an indirect injection attack, where the problem is initiated not by the user or the language model but by a malicious third party.
The video below illustrates how the entire context of a conversation, including sensitive information, can be leaked to a third-party website through the manipulation of ChatGPT.
Visual Prompt Injection
With GenAI apps evolving into multi-modal systems capable of processing images and other diverse inputs, the potential for prompt injection risks emerges from an expanding range of sources. In such scenarios, the textual prompt might be entirely benign, while the image itself could harbor malicious prompts. These instructions might be cleverly formatted and colored to remain invisible to users.
The following example illustrates how GPT-4 was deceived into providing a wholly different response due to concealed and manipulative instructions embedded within the accompanying image.

Why is it so hard to block prompt injections?
In the past, most security layers relied on heuristics, pattern matching, regex, unsafe tokens, code injection, and similar methods. However, with the shift to an unstructured interface, the challenge has become significantly more complex. Now, the system must handle various types of inputs, in multiple languages, with varying token counts, across diverse application use cases, cultures, and user bases. The possibilities for both correct and incorrect inputs are virtually limitless. Consequently, in combating the continuous and infinite nature of these possibilities, the most effective approach involves employing models that can autonomously generate an infinite array of possibilities.
It's crucial to emphasize that this is not a problem with a definitive solution; there won't be a foolproof remedy. However, the goal is to implement a solution that significantly complicates the attacker's efforts, making their task much more challenging.
How risky are prompt injections?
Honestly, it depends. But what we know is that the scope and diversity of prompt injection have reached unprecedented levels in the realm of cybersecurity.
On one end of the spectrum, you might manipulate LLMs to speak like a pirate or respond with cheesy jokes – a rather trivial and perhaps unremarkable outcome.
In the middle of the spectrum, exemplified by the recent Chevrolet case, you could prompt a GenAI app to provide embarrassing, potentially brand-damaging, or legally complicated responses. While it may not lead to a direct offensive attack causing infrastructure downtime and substantial financial losses, it's still a scenario one would prefer to avoid.

At the extreme end of the spectrum, especially in the evolving landscape of agents, when LLMs are getting more and more integrated into a company's assets like APIs, DBs, code execution, services, etc., prompt injection becomes riskier, even super risky. This is essentially like SQL injection on steroids: whereas previously we had only SQL as input and the DB as a target, now we have multiple targets (any tool the LLM can access or impact), and the input is infinitely wider than SQL; it can be English, Chinese, Python, or numeric – everything is on the table. No rules. This attack surface is opening the door from your own chat UI to attacks like remote malicious code execution, privilege escalation, SQL injection, unauthorized data access, DDoS, and more.
So what can we do about all of this?
Firstly, it's crucial to monitor your system to detect anomalies and conduct retrospective investigations.
Following that, you can strengthen your prompts to make them less susceptible to malicious inputs. This involves emphasizing the role of the language model (LLM) app and ensuring a clear separation between system and user prompts.
Introducing a human intermediary is an option, although it may not be ideal, as the entire concept is to minimize human involvement. Alternatively, you could employ another LLM to assess user prompts. However, this approach may be expensive and could impact latency.
Consider rejecting prompts containing specific substrings or using similarity metrics to identify known prompt injection patterns. Yet, these methods might not generalize well and may result in numerous false positives.
In essence, maintaining an LLM-based application in production without a dedicated security solution seems challenging. A specialized security solution designed to detect prompt injections, contextualized to your application's use case, optimized for latency, and knowledgeable about past prompt injection attempts is essential. Such a solution should continuously evolve to thwart new attack methodologies at the speed of Generative AI.