What is prompt injection? The flaw every AI agent ships with
One email, no click, and Copilot mailed a stranger your files. The bug behind it can't be patched — the AI can't tell orders from text.
In June 2025, security researchers showed they could read a company’s internal files by sending one employee an email. The employee never had to open it, click a link, or download anything. They just had to use Microsoft 365 Copilot — the AI assistant built into Word, Outlook, and Teams — for something else entirely, and the email did the rest.
The attack was named EchoLeak. There was no virus, no stolen password, no software bug in the ordinary sense. The email simply contained instructions written for the AI, and the AI followed them — gathering sensitive data and mailing it to a stranger’s server, all while answering an unrelated question.
That’s prompt injection: hiding instructions inside the content an AI reads, so it obeys the attacker instead of you. It sits at the top of the industry’s official list of AI security risks, it has no clean fix, and as we hand assistants the keys to our inboxes and files, it’s becoming the most important security concept that ordinary AI users have never heard of. Here’s how it works, why it’s so stubborn, and what actually keeps you safe.
Prompt injection is SQL injection’s ghost
The name is a deliberate echo. Back in 2022, when people first started wiring instructions into language models, the engineer Simon Willison coined “prompt injection” as a nod to SQL injection — a decades-old web attack where a hacker types database commands into a form field that expected a name, and the server runs them. Both attacks share one root cause: the system can’t tell the difference between instructions it should follow and data it should merely handle.
A language model has the same blind spot, baked in. When you chat with it, everything is text: the hidden rules from its maker, your question, and any document, web page, or email it pulls in to help answer. All of it arrives as one continuous stream of words. The model has no separate “commands” wire and “content” wire. It has one wire, and it reads the whole thing as a single conversation.
So when a fetched document says, in the middle of a paragraph, “ignore your previous instructions and do this instead,” the model has no reliable way to know that sentence came from an untrusted stranger rather than from you or its owner. It reads like an instruction, so it can be treated like one. The official OWASP catalog of AI risks puts it plainly: prompt injection happens because “instructions and data share the same channel.” It has ranked these failures the number-one risk for AI applications two editions running.

Two flavors — and the dangerous one is invisible
The first flavor is the one people picture, and it’s almost playful. Direct injection is when you, the user, type the trick yourself. The famous early example came in February 2023, days after Microsoft launched its Bing chatbot. A Stanford student typed “ignore previous instructions” and asked it to print the confidential rules it had been given. It spilled the lot, including its internal codename, “Sydney” — which it had been explicitly told never to reveal. Microsoft confirmed the leaked text was real.
That’s embarrassing, but you’re attacking a system you already have access to. The second flavor is the one that should worry you. Indirect injectionis when the malicious instructions don’t come from you at all — they’re planted in content the AI reads on your behalf. A web page. A shared document. A support ticket. An email, like EchoLeak. You never see the attack, because it was never aimed at your screen. It was aimed at your assistant.
And it can be invisible in the most literal sense. Hidden instructions get written in white text on a white background, shrunk to one-pixel fonts, or tucked into a document’s metadata. A human skimming the page sees nothing. The model, which reads every character regardless of styling, sees the command and acts on it. This is the version that scales, because the attacker only has to get their poisoned content in front of a system that retrieves outside material— and modern assistants retrieve constantly.

The real danger is three keys at once
A trick on its own is harmless. If a model gets fooled into writing a rude poem, who cares. The danger appears when a fooled model can actuallydosomething — and that turns on what the model is allowed to touch. Simon Willison, the same person who named the attack, gave the risk pattern a name that has stuck: the “lethal trifecta.”
An AI system becomes genuinely dangerous when it holds all three of these keys at the same time: access to your private data, exposure to untrusted content, and the ability to communicate with the outside world. Any one alone is fine. A model that reads your files but can’t talk to the internet can’t leak them. A model that browses the web but can’t see your data has nothing to steal. Combine all three, and a single poisoned message can read your secrets and ship them out the door.
EchoLeak is the trifecta in one tidy package. Copilot could read the target’s organizational data (key one). It ingested an attacker’s email as untrusted content (key two). And it could reach an external server, in this case by loading a remote image whose URL carried the stolen data (key three). The researchers at Aim Security who disclosed it called it the first zero-click attack of its kind on a production AI system — “zero-click” meaning the victim did nothing wrong at all.
This is also why the rise of AI agents— models that run in a loop, calling tools and taking actions on their own — raises the stakes. The whole point of an agent is to give it keys: let it read your calendar, browse the web, send the email. A plain chatbot rarely holds the full trifecta. An agent wired into your accounts through something like the Model Context Protocolcan hold all three before lunch. As of mid-2026, OWASP’s tracking still finds prompt injection driving most agentic AI security failures in production.
You can’t filter your way out
The obvious fix is to build a filter: scan incoming content for attacks and block them. Vendors sell exactly this, often advertising impressive catch rates. The trouble is that an impressive catch rate is the wrong bar. As Willison puts it, in a security context “95% is very much a failing grade.”
Think about why. A spam filter that catches 95% of junk is great — a few stray ads in your inbox cost you nothing. But a security filter facing a motivated attacker is different. The attacker gets unlimited tries, and they only need one to land. If your filter stops 95 of every 100 attempts, the attacker simply sends 100 variations and walks through the five that slipped past. Catastrophe isn’t the average outcome; it’s a single success.
The deeper issue is that the space of ways to phrase an instruction is effectively infinite. There is no dictionary of bad phrases to block, because an attacker can always reword, translate, encode, or smuggle the same intent past a pattern-matcher. Patches like the one Microsoft shipped for EchoLeak close a specific hole, not the category. The model still can’t tell orders from text; you’ve just blocked one sentence that exploited the fact.
The fix is subtraction, not a smarter model
If you can’t stop the model from being fooled, change what a fooled model is able to do. That’s the whole game, and it’s why the trifecta framing is useful: you don’t have to defeat the attack, you just have to break the combination. Take away any one of the three keys and the worst outcome — your data walking out the door — becomes impossible.
In practice that means a few concrete moves. Least privilege:give an assistant the narrowest access the job needs, and no more. A bot that drafts replies doesn’t need permission to read every file you own. Cut the exit:a system that handles untrusted content shouldn’t also be able to send data to arbitrary places on the internet. Put a human in the loopfor anything consequential — sending money, deleting records, emailing outsiders — so a hijacked agent hits a confirmation wall instead of a green light. OWASP’s own mitigation list leads with exactly these: privilege controls, human approval for high-risk actions, and clearly segregating untrusted content.
Researchers are also building sturdier architectures — designs that treat anything the model read from untrusted sources as “tainted” and refuse to let tainted input trigger consequential actions. These help, but they don’t rescue an end user who has casually wired three powerful tools together. The honest summary, two years into the agent era, is that prompt injection is managed, not solved. The reliable defense is architectural restraint, not a cleverer model.

What this means for you
You don’t need to fear chatbots. If you’re typing questions into an assistant that has no access to your accounts and no ability to act, prompt injection is mostly someone else’s problem — the worst it can do to you is give a weird answer. The risk lives in capability, not conversation.
It matters the moment you grant an assistant real reach: connecting it to your email, letting it browse on your behalf, installing an agent that can touch your files and your apps. Before you flip on a powerful integration, run the trifecta check in your head. Does this thing read my private stuff? Does it take in content I don’t control? Can it send data out? If the answer to all three is yes, you’re holding the dangerous combination — and you should want a human approval step, a tight permission scope, or a tool that simply can’t reach the open internet.
There’s a quieter lesson in here too. The most dangerous key is often the third one — the ability to talk to the outside world. An assistant that does its work inside walls you control, on your own machine and your own data, removes the exfiltration route by design. That’s not a cure for prompt injection. But it’s a way to make sure that even when your AI gets fooled — and eventually it will — the blast stays in the room.


