Explainer · RAG · Hallucination · June 18, 2026 · 11 min read

What is RAG? How AI looks things up instead of guessing

Your chatbot just quoted a refund policy that doesn’t exist. RAG is the one-line fix: make it read the document before it answers.

Same question, two ways to answer

A model guesses from memory. RAG lets it look first.

“What’s our refund window for enterprise plans?”

Closed book

“Enterprise plans come with a standard 30-day refund window.”

Fluent, confident, and invented. Your actual policy is 14 days. The model never saw it, so it filled the gap.

Open book · RAG

“Enterprise plans have a 14-day refund window from the invoice date.”

billing-policy.pdf · p.4 · retrieved, then answered.

Ask a chatbot a question about your company’s refund policy and watch what happens. It answers instantly, in clean prose, with total confidence: “Enterprise plans come with a standard 30-day refund window.” The tone is perfect. The number is invented. Your real policy is 14 days. That fact lives in a PDF the model has never read.

This is the gap between a model that sounds right and one that is right. A plain chatbot has no way to know what it doesn’t know, so it fills the hole with the most plausible-looking guess. RAG (retrieval-augmented generation) closes that gap with a move so simple it’s almost funny: before the model answers, it looks the answer up.

RAG is the technique of giving an AI model the relevant documents first, then asking it to answer using them, turning a closed-book exam into an open-book one. It’s the most widely used fix for the single biggest problem in practical AI, and the engine behind nearly every “chat with your documents” feature you’ve seen. Here’s how it works, why it helps, and where it still breaks.

A chatbot answers from memory

A language model is, at heart, a very good guesser. During training it read an enormous amount of text and learned the statistical shape of language: which words tend to follow which. That knowledge is baked into its weights and frozen the day training ends. Ask it something, and it generates the most likely next words based on that frozen memory. It has no live connection to your files, your database, or this morning’s news.

Two failure modes follow directly. First, the model can’t know anything specific to you: your contracts, your codebase, your Slack history were never in its training data. Second, when it hits the edge of what it knows, it doesn’t stop; it makes something up, because guessing the next word is the only thing it does. The polished wrong answer about your refund window is both problems at once.

You could retrain the model on your documents, but that’s slow, expensive, and out of date the moment a file changes. RAG takes the other path. Leave the model exactly as it is, and change what you put in front of it.

An open book with its pages lit against a dark background. — Closed-book, the model recites from memory. Open-book, it reads the passage in front of it first. That swap is the whole idea. Photo by Olga Tutunaru on Unsplash.

RAG hands the model an open book

The trick is to stop relying on the model’s memory and start feeding it the source material at the moment you ask. The name spells out the three steps. Retrieve: search a collection of your documents for the passages most relevant to the question. Augment: paste those passages into the prompt as context. Generate: let the model write its answer from that supplied text rather than its hazy recall.

Picture the difference as two students. The first sits a closed-book exam and writes down whatever he half-remembers, some of it right, some confidently wrong. The second is handed the textbook, told which three pages to read, and asked to answer from those. Same student, same brain. The second one is far more likely to be right, and you can check his work against the page he used.

The pattern isn’t new. It was named in a 2020 paper from Patrick Lewis and colleagues at Facebook AI, presented at NeurIPS, which paired a generator with a searchable index of Wikipedia and set state-of-the-art results on three open-domain question-answering benchmarks. Six years on, it’s the default way to point a general-purpose model at private or fast-changing knowledge.

The three letters, in order

1. Retrieve

Take the question, search a library of your documents, and pull back the handful of passages most likely to hold the answer.

2. Augment

Paste those passages into the prompt, above the question, as the context the model must answer from.

3. Generate

The model writes its answer using the supplied passages, not its hazy memory of the whole internet.

Retrieval-augmented generation. The generation part is the chatbot you already know; the retrieval part is everything that happens before it speaks.
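The three steps can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the three-document knowledge base, the word-overlap retriever, and the `echo_llm` placeholder are all assumptions made up for the example, and a real system would swap in a proper search index and an actual model call.

```python
from typing import Callable

# Hypothetical mini knowledge base; a real system would index many chunks.
DOCS = {
    "billing-policy.pdf p.4": "Enterprise plans have a 14-day refund window from the invoice date.",
    "security-overview.pdf p.2": "All customer data is encrypted at rest and in transit.",
    "onboarding-guide.pdf p.1": "New workspaces are created from the admin console.",
}


def tokens(text: str) -> set[str]:
    """Lowercase, replace punctuation with spaces, and split into a set of words."""
    cleaned = "".join(c if c.isalnum() else " " for c in text.lower())
    return set(cleaned.split())


def retrieve(question: str, k: int = 2) -> list[tuple[str, str]]:
    """Step 1: return the k passages sharing the most words with the question.

    Word overlap is a toy stand-in for the meaning-based search described later.
    """
    q = tokens(question)
    ranked = sorted(DOCS.items(), key=lambda item: len(q & tokens(item[1])), reverse=True)
    return ranked[:k]


def augment(question: str, passages: list[tuple[str, str]]) -> str:
    """Step 2: paste the passages into the prompt, above the question."""
    context = "\n".join(f"[{source}] {text}" for source, text in passages)
    return (
        "Answer using only the context below. If the answer is not there, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


def generate(prompt: str, llm: Callable[[str], str]) -> str:
    """Step 3: hand the augmented prompt to whatever model you use."""
    return llm(prompt)


def echo_llm(prompt: str) -> str:
    """Placeholder model: returns the first context line instead of calling a real LLM."""
    return prompt.split("Context:\n", 1)[1].splitlines()[0]


question = "What's our refund window for enterprise plans?"
answer = generate(augment(question, retrieve(question)), echo_llm)
print(answer)
```

Because the source label travels with each passage into the prompt, the answer can carry a pointer back to the page it came from.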

It finds the page by meaning, not keywords

The clever part is step one. Old-fashioned search matches words: type “refund,” get documents containing “refund.” But a user might ask “how do I get my money back?” and the policy might say “reimbursement is processed within…”: no shared words, same meaning. Keyword search misses it. RAG mostly doesn’t, because it searches by meaning.

It does this with embeddings. An embedding model converts a chunk of text into a long list of numbers (a vector) that captures its meaning, so that passages about similar ideas land near each other in mathematical space. “Money back” and “reimbursement” end up as neighbors even though they look nothing alike. OpenAI’s text-embedding-3-small turns each passage into 1,536 numbers and costs about $0.02 per million tokens: pennies to index a whole knowledge base.
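To make "near each other" concrete, here is a small sketch using cosine similarity, one common way to compare embeddings. The three-number vectors are hand-made for illustration (a real embedding model would produce them, at far higher dimension), and the 5-million-token knowledge base in the cost estimate is a hypothetical size, not a figure from any real project.

```python
import math

# Invented 3-number "embeddings" for illustration only.
# A real model such as text-embedding-3-small returns 1,536 numbers per passage.
TOY_VECTORS = {
    "money back": [0.9, 0.1, 0.2],
    "reimbursement": [0.8, 0.2, 0.1],
    "password reset": [0.1, 0.9, 0.3],
}


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: close to 1.0 means similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def indexing_cost_usd(total_tokens: int, price_per_million: float = 0.02) -> float:
    """Rough one-time embedding cost at the per-million-token price quoted above."""
    return total_tokens / 1_000_000 * price_per_million


near = cosine_similarity(TOY_VECTORS["money back"], TOY_VECTORS["reimbursement"])
far = cosine_similarity(TOY_VECTORS["money back"], TOY_VECTORS["password reset"])
print(f"money back vs reimbursement:  {near:.2f}")
print(f"money back vs password reset: {far:.2f}")

# Hypothetical knowledge base of 5 million tokens.
print(f"estimated indexing cost: ${indexing_cost_usd(5_000_000):.2f}")
```

With these toy vectors the two refund-related phrases score far higher against each other than either does against the password phrase, which is the property a retriever relies on.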

Those vectors live in a vector database, which is built to do one thing fast: given the vector for a question, find the nearest stored vectors in milliseconds, even across millions of entries. That’s the “retrieve” step in practice: embed the question, ask the database for its closest neighbors, hand the matching passages to the model. A traditional database looks for exact matches; a vector database looks for things that mean nearly the same.
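The retrieve step can be sketched as a nearest-neighbor lookup. `TinyVectorIndex` below is a hypothetical brute-force stand-in that compares the question against every stored vector; real vector databases typically use approximate indexes to stay fast across millions of entries. The vectors and passage labels are invented for the example.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


class TinyVectorIndex:
    """Brute-force index: scores the query against every stored vector."""

    def __init__(self) -> None:
        self._items: list[tuple[str, list[float]]] = []

    def add(self, passage: str, vector: list[float]) -> None:
        self._items.append((passage, vector))

    def search(self, query_vector: list[float], k: int = 2) -> list[tuple[str, float]]:
        """Return the k stored passages closest in direction to the query."""
        scored = [(passage, cosine_similarity(query_vector, vec)) for passage, vec in self._items]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]


index = TinyVectorIndex()
index.add("reimbursement", [0.8, 0.2, 0.1])   # toy vectors, not real embeddings
index.add("password-reset", [0.1, 0.9, 0.3])
index.add("refund-window", [0.9, 0.1, 0.2])

# Pretend this is the embedded question "how do I get my money back?"
hits = index.search([0.85, 0.15, 0.15], k=2)
for passage, score in hits:
    print(f"{passage}: {score:.3f}")
```

The two refund-related passages come back as the closest neighbors, and the password passage is left out.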

Rows of labelled wooden card-catalog drawers in a library. — A vector index is a card catalog for meaning: every passage filed by what it’s about, so a question can find its answer even when they share no words. Photo by Jan Antonin Kolar on Unsplash.

Grounding cuts hallucination, not to zero

Does it actually work? The evidence says yes, substantially. It also says that anyone promising RAG eliminates hallucination is selling you something. The honest version is “much less, and easier to catch.”

The numbers are real and they’re large. A 2024 study on causal-reasoning tasks reported the average hallucination rate falling from roughly 50% to 13.9% once the same model was given retrieved sources. A 2025 public-health system, MEGA-RAG, cut hallucinations by more than 40% against its baselines. Grounding a model in real text reliably moves the error rate down by a lot.
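It helps to separate two ways of reading the causal-reasoning figure: the drop in percentage points and the relative drop. The sketch below works both out from the numbers quoted above. Since the starting rate is given only as "roughly 50%", treat the results as approximate.

```python
def relative_reduction(before: float, after: float) -> float:
    """Share of the original error rate that was removed (0.0 to 1.0)."""
    return (before - after) / before


before_pct, after_pct = 50.0, 13.9  # hallucination rates quoted above, in percent

points = before_pct - after_pct
relative = relative_reduction(before_pct, after_pct)

print(f"absolute drop: {points:.1f} percentage points")
print(f"relative drop: {relative:.0%}")
```

That comes out to about 36 percentage points, or roughly a 72% relative reduction. The remaining 13.9% is the part retrieval did not fix.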

Figure: What grounding buys you. Relative change in made-up answers when the same model is given retrieved sources, without vs. with RAG (lower is better). Panels: causal-reasoning tasks (general LLM → same LLM with retrieval); public-health Q&A (≈40% fewer hallucinations vs. baseline).

Causal-reasoning figures from Sng et al. (2024); public-health figure from the MEGA-RAG study (2025). Big drops, but none of them reach zero.

But not to zero, and the most sobering evidence comes from the law. A Stanford study led by Varun Magesh tested the big commercial legal-research tools (the ones that market RAG as the cure for hallucination) and found they still invented or misstated things on 17% to 33% of queries. Lower than a raw chatbot, but nowhere near the “hallucination-free” claims on the box. RAG narrows the gap; it does not seal it. The payoff is that every answer now points to a source you can open and check.

RAG is only as good as its retrieval

Here’s the part the marketing skips: a RAG system is only as smart as the passages it pulls. If the retrieval step hands the model the wrong three paragraphs, the model will write a fluent, confident answer based on the wrong three paragraphs. Garbage in, grounded garbage out. Most disappointing “chat with your docs” demos fail here, not at the model.

Two practical traps dominate. The first is chunking: how you slice documents before indexing. Cut too small and a passage loses the context that made it meaningful; cut too large and each chunk is a muddle of topics that matches everything weakly and nothing well. The second is ordering. Models pay most attention to the start and end of what you give them and tend to lose track of material buried in the middle, so burying the best passage in position six of ten can waste a perfect retrieval.
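Both traps have simple first-pass mitigations, sketched below. `chunk_words` splits text into overlapping word windows so a sentence cut at a boundary still appears whole in a neighboring chunk, and `order_for_prompt` places the strongest passages at the edges of the context. The sizes and the alternating heuristic are assumptions chosen for illustration, not tuned recommendations.

```python
def chunk_words(text: str, size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into windows of `size` words, each sharing `overlap` words with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be at least 0 and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def order_for_prompt(best_first: list[str]) -> list[str]:
    """Put the best passage first and the second best last, pushing the weakest to the middle."""
    front, back = [], []
    for rank, passage in enumerate(best_first):
        (front if rank % 2 == 0 else back).append(passage)
    return front + back[::-1]


sample = " ".join(f"w{i}" for i in range(12))
for chunk in chunk_words(sample, size=5, overlap=2):
    print(chunk)

print(order_for_prompt(["p1", "p2", "p3", "p4", "p5"]))
```

Splitting on words is the crudest option. Splitting on sentences, headings, or paragraphs usually preserves more meaning, at the cost of uneven chunk lengths.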

This is also why stuffing everything into a giant context window isn’t a free substitute for retrieval. More text is not more understanding; past a point it’s noise that dilutes the signal, and the usable window is smaller than the advertised one. Good RAG is the opposite move: fetch less, but fetch exactly the right less. Serious systems add a second pass (a reranker that re-scores the top candidates) precisely because the first retrieval is rarely perfect.
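A second pass can be sketched as a function that re-scores the first-pass candidates and keeps only the best few. The `overlap_score` function here is a deliberately crude stand-in: production rerankers are usually trained models that read the question and passage together. The candidate passages are invented for the example.

```python
from typing import Callable


def overlap_score(question: str, passage: str) -> float:
    """Toy relevance score: share of question words that appear in the passage."""
    q = set(question.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0


def rerank(
    question: str,
    candidates: list[str],
    score: Callable[[str, str], float] = overlap_score,
    keep: int = 2,
) -> list[str]:
    """Re-score first-pass candidates and keep only the top `keep`."""
    return sorted(candidates, key=lambda passage: score(question, passage), reverse=True)[:keep]


# Hypothetical first-pass results, in the order the retriever returned them.
first_pass = [
    "Refunds for monthly plans are handled by support.",
    "Our offices have a window facing the park.",
    "Enterprise plans have a 14-day refund window from the invoice date.",
]

top = rerank("enterprise refund window", first_pass)
print(top[0])
```

The passage the first pass ranked last ends up on top, which is the whole point of the second look: fetch broadly, then keep only what actually answers the question.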

Long rows of full bookshelves receding into the distance. — The library can be the whole web, or one shelf of your own files. The smaller and more trusted the shelf, the better the answers. Photo by Peter Herrmann on Unsplash.

RAG vs. fine-tuning vs. a giant window

RAG is one of three ways to make a general model work with your specific knowledge, and people muddle them constantly. Fine-tuning adjusts the model’s weights by training it further on your data. A long context window just means pasting more text into a single prompt. RAG looks things up at question time. They solve different problems.

The rule of thumb: fine-tune to change how the model behaves (its tone, its format, a narrow skill) because that’s baked in and hard to express as a document. Use RAG to change what it knows (facts, policies, product details), especially anything that changes often, because you can edit a file and the next answer reflects it with no retraining. Lean on a big window for a one-off question about a single document you already have in hand.

Three ways to make a model know your stuff

| Approach | Best for | Cost to keep current | Freshness |
| --- | --- | --- | --- |
| RAG | Facts that change, or that the model never saw | Low: index once, update anytime | Live: edit a file, the answer updates |
| Fine-tuning | Teaching a style, format, or skill | High: retrain to change anything | Frozen at training time |
| Bigger context | One-off questions over a single document | Pay per token, every single call | Only what you paste in this time |

They aren’t rivals. A polished system often fine-tunes for tone, uses RAG for facts, and leans on a long window for the document in front of it right now.

The freshness column is the one that decides most real projects. A fine-tuned model is frozen at training time; teaching it this week’s price change means another training run. A RAG system learns the new price the instant you update the source file. For knowledge that moves (and most business knowledge moves), that difference is the whole game.
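The freshness claim is easy to see in miniature. `LiveIndex` below is a hypothetical in-memory index keyed by source file, with substring matching standing in for real retrieval; the file name and prices are invented. In a real system the equivalent move is re-embedding the changed chunks and overwriting their old entries.

```python
class LiveIndex:
    """Toy index: one passage per source file, replaced whenever the file changes."""

    def __init__(self) -> None:
        self._passages: dict[str, str] = {}

    def upsert(self, source: str, text: str) -> None:
        """Insert a new passage or overwrite the old one for this source."""
        self._passages[source] = text

    def lookup(self, keyword: str) -> list[tuple[str, str]]:
        """Return (source, passage) pairs whose text contains the keyword."""
        needle = keyword.lower()
        return [(s, t) for s, t in self._passages.items() if needle in t.lower()]


index = LiveIndex()
index.upsert("pricing.md", "The Pro plan costs $20 per month.")  # hypothetical price
print(index.lookup("pro plan"))

# The price changes: edit the source and re-index that one file. No training run.
index.upsert("pricing.md", "The Pro plan costs $25 per month.")
print(index.lookup("pro plan"))
```

The second lookup returns only the updated passage, so the next prompt built from it carries the new price.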

The most useful RAG is over your own files

The killer use of RAG isn’t answering trivia about the world; the model is already decent at that. It’s answering questions about the documents only you have: the contract, the codebase, the three years of meeting notes, the medical records. That’s exactly the material you have the strongest reasons not to upload to someone else’s server.

Which is the quiet case for running RAG locally. Both halves (the embedding model that indexes your files and the language model that reads them) can run on a modern laptop, so the documents never leave your machine. The library is your own folder; the librarian works on-device. For sensitive or regulated material, that’s not a nice-to-have. It’s often the only architecture you’re allowed to ship. This is the bet CSuite is built on.

Strip away the jargon and RAG is the difference between a smart friend answering off the top of their head and the same friend with the right document open in front of them. The model didn’t get smarter; you gave it something to read. When an AI answer needs to be trustworthy and about your world, that’s the move: let it look first, then ask it to speak.

RAG: quick answers

Does RAG retrain the model?

No. The model’s weights never change. RAG works entirely at question time by slipping relevant passages into the prompt. That’s why you can add a document and get a correct answer about it seconds later, with no training run.

Does RAG eliminate hallucination?

No. It reduces it. Grounding a model in real sources cuts made-up answers sharply, but studies of production tools still find errors in a meaningful share of responses. RAG makes the model far more likely to be right, and far easier to fact-check, because every claim points to a source.

Do I need a vector database?

For a few dozen files, plain keyword search is often enough. Vector search earns its keep once you have thousands of passages and questions that don’t share words with the answer. Most serious setups use both at once.
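One way to use both at once is to run the keyword search and the vector search separately, then merge the two ranked lists. The sketch below uses reciprocal rank fusion, a common merging technique chosen here for illustration; the article does not say which method any given setup uses. The file names and rankings are invented, and `k=60` is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists: each document earns 1 / (k + rank) per list it appears in."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical results for the same question from two different searches.
keyword_hits = ["faq.md", "billing-policy.pdf", "changelog.md"]
vector_hits = ["billing-policy.pdf", "refund-howto.md", "faq.md"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

Documents that both searches rank highly rise to the top, while a result that only one search found still gets a place further down the list.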

Is RAG the same as an AI agent?

Related, not identical. RAG is the retrieve-then-answer pattern. An agent is a model that decides which steps to take in a loop, and one of the tools it often reaches for is a RAG search over your documents.
