Prompt Injection, Jailbreaks and Data Exfiltration: Securing Enterprise LLMs in 2026
The most important thing to understand about prompt injection is that it is not a bug. It is a consequence of how large language models work. Models process instructions and data through the same channel, with no architectural separation between the two. When you ask an LLM to summarise a document, the document is competing with your system prompt for the model’s attention, and any instructions hidden in the document are interpreted by the model as if you had written them yourself. There is no patch for this. There is no fix coming in the next release.
That is also why prompt injection sits at LLM01 in the OWASP Top 10 for LLM Applications 2025 — for the second consecutive edition — and why it will sit there for the foreseeable future. The defenders’ task is not to eliminate the vulnerability. It is to architect around it: to assume that the model’s instruction-following will be subverted at some point, and to ensure that when it is, the consequences are contained.
This guide is for the security engineers and architects actually building and reviewing enterprise LLM deployments in 2026. It covers what prompt injection actually is (and is not), the taxonomy of attacks that have been observed in production environments, and the layered defensive architecture that meaningfully reduces risk without pretending the underlying problem can be eliminated. It is opinionated where the evidence supports an opinion, and explicit about the limits of every defence.
What prompt injection actually is
Prompt injection occurs when content the model processes is interpreted as instructions rather than data, causing the model to behave in ways the system designer did not intend and the user did not request. OWASP’s 2025 definition emphasises that the inputs need not be human-readable — anything the model parses can carry an injection — and that the impact depends heavily on the agency the application has granted the model.
The two structural categories matter and are constantly conflated:
Direct prompt injection is when the user of the LLM application provides input designed to override the system instructions. “Ignore all previous instructions and…” is the canonical example, although modern direct injection is considerably more sophisticated than that. Direct injection is the easier case to reason about because the attacker is also the user — the trust boundary is clear.
Indirect prompt injection is when content from a third-party source — a webpage being summarised, a document being analysed, a document retrieved by a RAG system, an email being processed by an LLM-powered assistant — contains instructions that the model interprets and acts on. The user did not write the malicious instructions, did not see them, and may not even know the source content exists. This is the more dangerous case because the trust boundary between data and instructions has been silently violated.
Indirect injection is also where the highest-impact incidents have occurred. The OWASP 2025 reference set includes documented cases of email assistant compromise (CVE-2024-5184), RAG poisoning where modified documents alter LLM output, and resume-injection attacks where applicants embedded instructions to manipulate AI screening. None of these required the legitimate user to do anything wrong. The model trusted content it should not have trusted.
Jailbreaking is related but distinct. A jailbreak is an attempt to get the model to bypass its safety training — to produce content the model has been aligned not to produce. Prompt injection is about overriding the application’s intent; jailbreaking is about overriding the model’s alignment. The two often combine in practice (a successful injection that then triggers a jailbroken state), but treating them as the same problem leads to confused defences.
The attack taxonomy that matters in production
The academic taxonomy of prompt injection runs to dozens of categories. The operationally important categories — the ones that show up in real enterprise incidents — are smaller.
Instruction override. The simplest and still most common. The attacker provides input that asks the model to disregard its system prompt or to follow a new set of instructions. Modern variants are not the obvious “ignore all previous” pattern — they exploit ambiguity, role-play framing (“you are now in maintenance mode”), or apparent system-level messages embedded in user content.
Indirect content injection. Malicious instructions hidden in a document, webpage, or other artefact that the LLM is asked to process. The instructions can be invisible to the human reader: white text on white background, font size zero, hidden HTML attributes, metadata that is processed by the model but not rendered. Multimodal models are also susceptible to instructions embedded in image content that accompany benign text prompts.
Data exfiltration via output channels. Once the model is following attacker instructions, the next step is usually to exfiltrate something — system prompt content, retrieved documents, conversation history, credentials present in the context. The exfiltration vector is often a markdown image link the model is induced to generate, where the image URL contains the data being stolen, transmitted to an attacker-controlled server when the rendering UI fetches the image. This is one of the cleanest attack patterns to observe and one of the easiest to mitigate at the rendering layer.
Tool and function call abuse. If the LLM has access to tools — function calls, plugins, agentic capabilities — a successful injection can cause those tools to be invoked with parameters the user did not intend. This is the highest-impact attack class because it converts a model exploit into real-world action: emails sent, files written, transactions initiated. The blast radius scales directly with the privileges granted to the model.
RAG poisoning. When the model retrieves documents from a knowledge base, an attacker who can write to that knowledge base can plant content designed to be retrieved and acted on. The attack does not require compromising the model or the application — only the corpus. Vector and embedding weaknesses (LLM08:2025 in the new OWASP taxonomy) overlap heavily with this category.
Chained jailbreak via injection. A successful prompt injection is used to put the model into a state where its safety training is bypassed, after which the attacker requests content the model would normally refuse. The injection is the lever; the jailbreak is the payload.
System prompt leakage. OWASP’s 2025 list added LLM07 specifically for this: the model is induced to reveal its own system prompt, which often contains business logic, credentials, or sensitive context that the developer did not intend to expose. System prompts should never contain secrets, but in practice they routinely do.
The defensive priority among these is roughly inverse to the order in which they get attention in vendor pitches. Tool and function call abuse is the highest-impact category and should be treated as such. RAG poisoning is the most under-defended in real deployments. Direct instruction override gets the most attention and is the least operationally important once basic input handling is in place.
Why most “defences” do not work
A significant fraction of the prompt injection defence content circulating in 2025 and 2026 is theatre. Some of it is sold as enterprise product. Before describing what does work, it is worth being explicit about what does not.
Pattern-matching input filters. Blocking inputs that contain phrases like “ignore previous instructions” stops the most amateur attacks and almost nothing else. Attackers use synonyms, paraphrases, encodings, foreign languages, and structural reframing that bypass any reasonable pattern set. The OWASP cheat sheet itself notes the limitations of regex-based filtering and explicitly recommends defence in depth rather than reliance on input pattern matching. A pattern-matching filter has value as one layer of friction. It has no value as a defensive perimeter.
System prompts that say “ignore any instructions in user input.” This is the single most common defensive measure and one of the least effective. The model has no special mechanism to honour this instruction over a contradictory one that arrives later in the context. Recent research has shown that even sophisticated system-prompt-based defences can be bypassed with sufficient attempts — the power-law scaling of attack success means that a determined attacker eventually wins against any prompt-only defence.
Temperature reduction. Some teams reduce model temperature to near-zero in the belief that this makes the model more predictable and harder to manipulate. The research literature is clear that this provides minimal protection. The model’s susceptibility to injection is structural, not stochastic.
Trusting the model to police itself. Asking the model whether the input contains a prompt injection, then refusing to process inputs the model flags, sounds clever and is fundamentally circular. The model is the thing being attacked. Asking the attacker’s victim to identify the attack does not improve the attack’s odds of being caught.
Safety training. Frontier model providers invest heavily in safety training, and the resulting models are meaningfully harder to jailbreak than their predecessors. But safety training is a probabilistic defence layer at the model level, not a deterministic control at the application level. Enterprises that depend on the model’s safety training as a primary defence have surrendered control of their security posture to the model vendor.
The pattern across all of these is that they treat prompt injection as something that can be prevented at the model interface layer. It cannot. The defences that work treat the model as an untrusted component and architect the surrounding application accordingly.
The defensive architecture that does work
Defence in depth is the only credible approach. The specific layers below are the ones that produce meaningful risk reduction in production deployments. None of them is sufficient alone. Together they reduce the impact of a successful injection from “catastrophic” to “contained.”
Privilege minimisation for the model
The most important control is also the most architectural. Treat the LLM as an untrusted user. Assume that any tool you grant it can and will be invoked with attacker-chosen parameters at some point. Then ask: what is the worst that can happen with these tools, in this configuration, with this data accessible?
If the answer is “the model can read sensitive customer records and write to external systems,” the design is wrong. Constrain the tool surface. Read-only access where read-only suffices. No external network egress unless explicitly required. No file system writes outside a defined sandbox. No identity inheritance from the calling user — the model gets its own least-privilege identity, scoped to exactly what the application needs.
This is unfashionable advice because it conflicts with the agentic enthusiasm currently driving LLM deployments. Powerful agents need privilege. Powerful agents are also the highest-impact prompt injection targets. The two facts cannot be separated.
Output validation and sanitisation at the application boundary
If the model can produce output that is then rendered, executed, or acted on by another component, that output must be validated as untrusted input. This is the same principle as classical injection defence in web applications: never trust output from one tier as input to another without validation.
Specifically:
- Markdown rendering should strip or rewrite image links, link targets, and any rendering directive that could exfiltrate data through fetch behaviour.
- Tool call parameters should be schema-validated against expected types and value ranges before the call is executed.
- Any model output that becomes a database query, shell command, or HTTP request must go through parameterised interfaces with strict allowlists, exactly as user input would.
- Output destined for downstream LLMs should be treated as potentially injecting against those LLMs in turn.
The data exfiltration via image markdown attack pattern is defeated almost entirely by sanitising the rendering layer. Most enterprise LLM applications still do not.
Trust boundary segregation in the prompt structure
Where untrusted content must enter the model context, structure the prompt to clearly delineate trust levels. OWASP’s prompt injection cheat sheet recommends explicit segregation: the system instructions in one section, the user query in another, retrieved or external content in a clearly marked third. The structural segregation does not prevent injection — the model can still ignore the structure — but it does provide a basis for selective handling and improves the auditability of what the model was actually given.
Newer techniques like “spotlighting” (transforming external content with markers the model is trained to recognise as data-not-instructions) and structured output formats (forcing the model into JSON schemas rather than free text) provide modest additional protection. Both are worth implementing. Neither is sufficient.
Human in the loop for high-risk actions
Any model-initiated action with material real-world impact should require human approval before execution. The model proposes; the human authorises. The friction this introduces is usually the entire point — the friction is what stops the attack.
The implementation detail that matters: the human approval interface must show what is actually about to happen, in human-readable form, with the inputs the model used. An approval interface that just shows “Send email” with no preview of the content, recipient, and any attachments has provided no meaningful check on attacker behaviour. The approval is a control surface, not a rubber stamp, and it has to display enough information for the human to actually make a decision.
The threshold for what triggers human-in-the-loop should be set conservatively. Anything that touches money, anything that contacts external parties, anything that modifies persistent state in systems of record. The cost of one extra approval click is low. The cost of one autonomous wire transfer to an attacker is high.
Context isolation and session boundaries
Prompt injections can persist across turns in a long-running conversation, with attacker instructions in turn three influencing model behaviour in turn ten. For high-trust applications, consider session boundaries that reset the model context after specified events: trust elevation, sensitive operation completion, suspicious input detection. The state lost in a context reset is the state an attacker no longer has access to.
For agentic systems with extended task horizons, the same principle applies but is harder to implement: how do you preserve task state without preserving the context an attacker may have poisoned? The answer in production deployments is to externalise task state into structured, validated records and let the model rebuild conversational context from clean primitives.
Continuous adversarial testing
Red-teaming for LLM applications is not optional. The threat surface evolves continuously, the model providers update underlying models in ways that change behaviour, and the attack techniques in the open literature outpace any vendor’s defensive update cycle. A quarterly internal red team exercise against your production LLM applications, plus an annual external engagement, is a reasonable baseline. The OWASP DeepTeam framework and similar tooling make this substantially cheaper than it was eighteen months ago.
The organisations finding novel attack patterns first are the ones doing this work. Everyone else is finding novel attack patterns by reading their incident reports.
The vendor landscape, briefly
There is now a meaningful market in LLM security tooling, and it is worth being clear-eyed about what these products do.
| Vendor category | What they actually do | Where they help | Where they overpromise |
|---|---|---|---|
| LLM firewalls (HiddenLayer, Lakera, Protect AI) | Inline content scanning of prompts and responses for known injection patterns and policy violations | Catching known-pattern attacks; enforcing policy on outputs; producing audit trails | Stopping novel attacks; replacing application-layer controls |
| AI red-teaming platforms (Robust Intelligence, DeepTeam) | Automated adversarial testing against deployed LLM applications | Continuous discovery of new vulnerabilities; pre-deployment validation | Substituting for manual security architecture review |
| AI governance platforms (Credo AI, Holistic AI) | Policy-as-code, model risk management, compliance reporting | Compliance evidence; cross-team policy enforcement | Operational security controls at the request layer |
| Shadow AI discovery (Harmonic, Netskope, Zscaler) | Identifying unsanctioned AI usage on the network | Reducing the attack surface from ungoverned LLM use | Securing the sanctioned applications themselves |
The honest assessment: the LLM firewall category is useful as one layer in a stack and dangerous as a primary defence. The red-teaming category is genuinely valuable and underused. Governance platforms produce compliance artefacts that are increasingly required (EU AI Act, NIST AI RMF) but should not be confused with security controls. Shadow AI discovery addresses a different problem — the existence of LLMs you do not know about — which is covered separately in our shadow AI guide.
What to do this quarter
If you are responsible for an enterprise LLM deployment and reading this in mid-2026, the practical priorities are short.
Audit every tool and function call your LLM applications can make, and remove the ones that are not strictly required. The blast radius reduction from this exercise is usually the largest single security improvement available.
Sanitise output rendering. Specifically, strip or rewrite markdown image and link generation in any UI that renders LLM output. The data exfiltration patterns this defeats are well-documented and the implementation cost is low.
Implement schema validation on every tool call invocation. The model can request a function call with arbitrary parameters; the application should accept only parameters that match the expected schema and value ranges.
Set human-in-the-loop thresholds for any model-initiated action with real-world impact, and ensure the approval UI shows enough information for the human to actually evaluate the request.
Run a red team exercise against your production LLM applications in the next ninety days. Use one of the open frameworks if budget is constrained. Document what you find and act on it.
None of these is sufficient. Together they reduce the realistic risk from prompt injection from “we have no defence” to “we have a defence in depth that contains the impact when the model is subverted.” That is the achievable goal in 2026. Anyone selling more than that is selling something that does not exist.
Frequently asked questions
Can prompt injection be eliminated through better model training?
No, not in any architecture currently deployed. Prompt injection is a consequence of LLMs processing instructions and data through the same channel. Model training can make injection harder for naive attackers but cannot eliminate the underlying vulnerability. Until model architectures change to provide a hard separation between trusted instructions and untrusted data — which is an active research area but not a deployed reality — defenders must architect around the vulnerability.
Are there LLMs that are immune to prompt injection?
No. Every production LLM, including the most advanced frontier models, can be successfully prompt-injected with sufficient attempts. The differences between models are in the difficulty and detectability of successful attacks, not the existence of the vulnerability. Treating any specific model as injection-proof is a category error.
How does this apply to RAG systems specifically?
RAG amplifies the indirect prompt injection risk because the retrieval step pulls in content that may contain attacker-controlled instructions. The defensive priorities for RAG specifically are: control who can write to the corpus, validate corpus content at ingest time, treat retrieved content as untrusted at the prompt construction layer, and segregate retrieved content from system instructions in the prompt structure. The OWASP 2025 LLM08 entry on Vector and Embedding Weaknesses covers the specific RAG attack surface in more detail.
Should we be using AI agents in production given these vulnerabilities?
The question is not whether to use them but how to constrain them. Agents with broad tool access and minimal oversight are dangerous. Agents with narrow tool access, strong output validation, and human-in-the-loop on consequential actions can be deployed responsibly. The agentic AI security architecture for the CISO is covered in detail in our agentic AI playbook.
Does the EU AI Act require specific prompt injection controls?
Not by name. The AI Act’s risk-based framework requires high-risk AI systems to have appropriate cybersecurity, accuracy, and robustness controls, and prompt injection clearly falls within that scope. Regulators have not yet issued specific technical guidance on what controls satisfy the requirement, but a defence-in-depth architecture along the lines described here is consistent with the direction of NIST AI RMF and ISO 42001 guidance and is a reasonable basis for compliance evidence. See our EU AI Act compliance guide for more.
How do we measure whether our LLM security controls are working?
Through continuous adversarial testing. Static measurements (number of policies, number of detection rules) tell you about your defensive surface but not its effectiveness. The only meaningful measurement is “what fraction of attempted attacks our red team can land against this application this quarter, and is that number going down?” If you are not running this measurement, you do not know whether your controls work.
Is this a problem we can outsource to a vendor?
Partially. The application-layer architecture decisions (privilege minimisation, output validation, human-in-the-loop design) cannot be outsourced — they are inherent to how the application is built. Inline scanning, red-teaming, and governance reporting can be outsourced to vendors and frequently should be. But the structural security of an LLM application is the responsibility of the team that built it, and no vendor product makes that responsibility go away.