This is the ‘holy grail’ of attacks on LLM-based applications. It’s the main threat addressed in the OWASP Top 10 for LLM. X is flooded with real life examples of this attack, showing its dominance. But what really is prompt injection, and is it a real threat that poses a significant risk to your customer-facing applications and company?
What is Prompt Injection?
Prompt injection is any prompt where attackers manipulate a large language model (LLM) through carefully crafted inputs to behave outside of its desired behavior. This manipulation, often referred to as "jailbreaking", tricks the LLM into executing the attacker's intentions. This threat becomes particularly concerning when the LLM is integrated with other tools such as internal databases, APIs, or code interpreters, creating a new attack surface.
Until now, our approach to accessing UIs/APIs was based on a structured format, relying on expected inputs. However, the new paradigm brought by LLMs introduces an overwhelming influx of diverse tokens into our system at an unprecedented pace. Furthermore, leveraging the capabilities of LLMs, we not only embrace this unstructured and unpredictable input but also channel it downstream through internal services such as APIs, databases, code execution, and more, allowing it to work its magic. In essence, we now accommodate exponentially more inputs than before and empower it to influence more services than ever.
Types of Prompt Injection
There are several types of prompt injection, with different levels of technical depth and complexity. In this introductory blog post we’ll give an overview of the main types.
Direct Prompt Injection
In this ‘classic’ approach, the system expects a text prompt from the user. Instead, the user formulates the prompt with the intention of influencing the language model (LLM) to deviate from its intended behavior. A prevalent strategy involves instructing the LLM to disregard its prior system directives and instead follow the user's instructions. Naturally, this process is becoming increasingly intricate, as individuals develop AI systems for both offensive and defensive purposes, but this is the fundamental concept behind direct prompt Injection.
Indirect Prompt Injection
Another form of prompt injection is known as indirect prompt injection, where adversarial instructions are introduced through a third-party data source, such as a web search or API call. For instance, in a conversation with Bing chat, which has internet search capabilities, you may instruct it to explore a random website. If this website contains malicious prompts, cleverly concealed as white text, Bing chat might unwittingly read and comply with these instructions. What distinguishes this from direct injection is that you are not explicitly instructing Bing chat to convey certain information; instead, you are guiding it to an external resource that may contain manipulative content. This characterizes an indirect injection attack, where the problem is initiated not by the user or the language model but by a malicious third party.
In the video below, an example illustrates how the entire context of a conversation, including sensitive information, is leaked to a third-party website due to manipulation of ChatGPT.
Visual Prompt Injection
As GenAI apps evolve into multi-modal systems capable of processing diverse inputs, such as images, the potential for injection arises from various origins. In such scenarios, the textual prompt might be entirely benign, while the image itself could harbor malicious instructions. These instructions might be cleverly formatted and colored to remain imperceptible to users. The following example illustrates how GPT-4 was deceived into providing a wholly different response due to concealed and manipulative instructions embedded within the accompanying image.
Why is it so hard to block prompt injections?
In the past, most security layers relied on heuristics, pattern matching, regex, unsafe tokens, and similar methods. However, with the shift to an unstructured interface, the challenge has become significantly more complex. Now, the system must handle various types of inputs, in multiple languages, with varying token counts, across diverse application use cases, cultures, and user bases. The possibilities for both correct and incorrect inputs are virtually limitless. Consequently, in combating the continuous and infinite nature of these possibilities, the most effective approach involves employing models that can autonomously generate an infinite array of possibilities.
It's crucial to emphasize that this is not a problem with a definitive solution; there won't be a foolproof remedy. However, the goal is to implement a solution that significantly complicates the attacker's efforts, making their task much more challenging.
How risky are prompt injections?
Honestly, it depends. But what we know is that the scope and diversity of prompt injection has reached unprecedented levels in the realm of cybersecurity.
On one end of the spectrum, you might manipulate language models (LLMs) to speak like a pirate or respond with cheesy jokes – a rather trivial and perhaps unremarkable outcome.
In the middle of the spectrum, exemplified by the recent Chevrolet case, you could prompt a GenAI app to provide embarrassing, potentially brand-damaging, or legally complicated responses. While it may not lead to a direct offensive attack causing infrastructure downtime and substantial financial losses, it's still a scenario one would prefer to avoid.
On the other extreme end of the spectrum, especially in the evolving landscape of agents, when LLMs are getting more and more integrated into a company's assets like APIs, DBs, code execution, services, etc. , prompt injection becomes riskier, even super risky. This is essentially like SQL injection on steroids, because then we had only SQL as input and the DB as a target, and now you have multiple targets (any tool the LLM can access or impact), and the input is infinitely wider than SQL; it can be English, Chinese, Python, or numeric—everything is on the table. No rules. This attack surface is opening the door from your own chat UI to attacks like remote malicious code execution, privilege escalation, SQL injection, unauthorized data access, DDoS, and more.
So what can we do about all of this?
Firstly, it's crucial to monitor your system to detect anomalies and conduct retrospective investigations.
Following that, you can strengthen your prompts to make them less susceptible to malicious inputs. This involves emphasizing the role of the language model (LLM) app and ensuring a clear separation between system and user prompts.
Introducing a human intermediary is an option, although it may not be ideal as the entire concept is to minimize human involvement.
Alternatively, you could employ another LLM to assess user prompts. However, this approach may be expensive and could impact latency.
Consider rejecting prompts containing specific substrings or using similarity metrics to identify known prompt injection patterns. Yet, these methods might not generalize well and may result in numerous false positives.
In essence, maintaining an LLM-based application in production without a dedicated security solution seems challenging. A specialized security solution designed to detect prompt injections, contextualized to your application's use case, optimized for latency, and knowledgeable about past prompt injection attempts is essential. Such a solution should continuously evolve to thwart new attack methodologies at the speed of Generative AI.