Your million-token context window is lying to you
Every frontier model now advertises a million tokens. The number you actually get — the size at which the model still answers correctly — is much smaller. Here's the gap, the benchmarks, the bill, and a playbook that doesn't pretend.
There’s a number on your model card and a number on your bill. They’re advertised as the same number. They are not. Claude Sonnet 4.6 says one million tokens. Gemini 2.5 Pro says one million. GPT-5.5 says up to one million. The promo screenshot is the same on every product page: paste a whole repo, paste every PDF you own, paste the Lord of the Rings, ask a question. Sometimes it works. The rest of the time the model finds the easy needle and misses the hard one, contradicts itself in the middle of the document, or quietly costs triple what you expected. The advertised number is the size of the input buffer. The number you actually want — the size at which the model still answers correctly — is smaller. Sometimes much smaller.
The advertised number is a buffer size
“Context window” sounds like a quality guarantee. It isn’t. It’s a cap on the attention mask — the longest sequence the model is willing to accept before it starts truncating or erroring. Whether the model actually uses all of those tokens, and whether it answers correctly when the relevant fact is buried at position 700,000, is a separate question that the marketing pages do not answer.
NVIDIA’s RULER paper was the first widely cited attempt to measure the gap. The authors define an “effective context length” as the longest input at which a model still scores above 85.6% on a thirteen-task suite, the score Llama-2-7B hits at its native 4K. The paper’s headline table, which puts each model’s claimed window in one column and its measured effective window in the next, is the kind of thing you wish vendors put on their own pricing pages.
Half the models the paper evaluated could not maintain quality even at 32K, despite advertising 32K or more. GPT-4-1106 — the strongest commercial model in the original cohort — lost half its claimed window. Yi-34B-200K kept one-sixth. The number on the box is not the number you can trust.
Two failure modes, one number
The reason long-context numbers feel so trustworthy on launch day is that the test everyone runs is Needle In A Haystack (NIAH): drop a single weird sentence into a long document and ask the model to recite it back. Anthropic’s Claude 3 launch reported >99% NIAH recall on Opus across 200K. Google’s Gemini 1.5 paper reported 99.7% recall at 1M and 99.2% at 10M. By the end of 2024, nearly every frontier model passed NIAH at its full advertised window.
Then people built RULER, which adds twelve harder tasks on top of the single-needle test — multiple needles, multi-hop tracing, variable tracking, common-word extraction, multi-document QA — and the floor fell out. Single-needle retrieval is a memory primitive. Multi-hop reasoning over scattered facts is the actual work. Models that look perfect on one collapse on the other, because the two capabilities scale differently with sequence length. NIAH stayed flat to 1M; RULER bent at a fraction of that. The benchmarks disagreeing is not noise — it’s the reason the marketing number disagreed with your eval results.
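To make the gap concrete, here is a minimal sketch of the two prompt shapes. The filler text, needle wording, and word-count accounting are crude stand-ins rather than RULER’s actual generators; the point is only that the multi-hop variant forces the model to chain scattered facts instead of copying one line back.

```python
import random

FILLER = "The grass is green. The sky is blue. The sun is yellow. "  # 12 words

def single_needle(n_words: int, depth: float) -> tuple[str, str]:
    """Classic NIAH: one odd sentence buried at a relative depth in filler."""
    needle = "The magic number for project Aurora is 7481."
    words = (FILLER * (n_words // 12)).split()
    words.insert(int(len(words) * depth), needle)
    prompt = " ".join(words) + "\n\nWhat is the magic number for project Aurora?"
    return prompt, "7481"

def multi_hop(n_words: int, hops: int = 4) -> tuple[str, str]:
    """RULER-style variable tracking: the answer requires chaining every needle."""
    words = (FILLER * (n_words // 12)).split()
    value = str(random.randint(1000, 9999))
    names = [f"VAR_{i}" for i in range(hops)]
    needles = [f"{names[0]} = {value}."]
    needles += [f"{names[i]} = {names[i - 1]}." for i in range(1, hops)]
    for needle in needles:  # scatter the chain across the haystack
        words.insert(random.randrange(len(words) + 1), needle)
    prompt = " ".join(words) + f"\n\nWhat is the value of {names[-1]}?"
    return prompt, value
```

Sweep the length, score exact-match on the expected string, and you get the two curves the benchmarks disagree about: the first stays flat far longer than the second.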

The U-shape: lost in the middle
The other failure mode has been documented since 2023. Liu and colleagues at Stanford published “Lost in the Middle” (TACL 2024) showing that even when the relevant fact is somewhere in the prompt, the model is much more likely to find it if it’s near the start or near the end. Bury it in the middle and accuracy can drop by 20 points or more on multi-document QA — a U-shaped curve that holds across model sizes, including base models that haven’t been instruction-tuned. The shape didn’t come from RLHF; it came from how attention learned to allocate itself.
The practical version of this is brutal: if your RAG pipeline retrieves ten chunks and the most relevant one happens to land at chunk five, the model is meaningfully worse at finding it than if you’d sorted the relevant chunk to position one or position ten. Re-rankers help. Smaller windows help more. Letting the model decide which chunks to read helps most. A cheap partial fix is sketched below.
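The sketch: reorder relevance-sorted chunks so quality decreases toward the middle instead of down the list. This is the generic interleaving trick, not any particular library’s implementation:

```python
def edge_reorder(chunks: list[str]) -> list[str]:
    """Reorder relevance-sorted chunks (best first) so the strongest ones
    sit at the edges of the prompt and the weakest fill the middle."""
    front, back = [], []
    for i, chunk in enumerate(chunks):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

# edge_reorder(["r1", "r2", "r3", "r4", "r5"])
# -> ["r1", "r3", "r5", "r4", "r2"]: best chunk first, runner-up last.
```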
The bill scales linearly. Latency more than that.
The cost story behind a long context is uglier than the per-token sticker. Two things bend the curve. First, vendors quietly tier prices at the 200K boundary: Gemini 2.5 Pro doubles input from $1.25/M to $2.50/M on prompts above 200K, and Claude Sonnet 4’s 1M tier charges 2× input and 1.5× output above the same threshold. Sonnet 4.6 and Opus 4.7 dropped that surcharge for the standard 1M window, but the broader pattern still holds: stuff a million tokens in, and somewhere in your bill the line item gets rewritten.
The second bend is latency. Prefill, the work the model does on the prompt before the first output token, grows worse than linearly: the feed-forward compute is linear in input tokens, but attention compute is quadratic in sequence length, and the kernels also have to physically stream an ever-growing KV cache through HBM. A 1K-token prompt and a 500K-token prompt do not return the first token at the same speed; the long one takes seconds, sometimes tens of seconds, before the stream starts. The sketch below is back-of-envelope math from public pricing, and it’s the shape that matters more than the absolute numbers.
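A minimal calculator, using the tiered Gemini 2.5 Pro input prices quoted above. The prefill constants are illustrative assumptions, not measured vendor numbers; calibrate them against your own provider before trusting anything but the shape:

```python
def input_cost_usd(prompt_tokens: int,
                   base_per_m: float = 1.25,     # Gemini 2.5 Pro, <= 200K
                   long_per_m: float = 2.50,     # Gemini 2.5 Pro, > 200K
                   tier_boundary: int = 200_000) -> float:
    """Tiered input cost. The higher rate applies to the WHOLE request
    once it crosses the boundary, not just the tokens past it."""
    rate = long_per_m if prompt_tokens > tier_boundary else base_per_m
    return prompt_tokens / 1e6 * rate

def prefill_seconds(prompt_tokens: int,
                    linear_tps: float = 20_000,   # assumed prefill throughput
                    quad_coeff: float = 2e-11) -> float:
    """Rough time-to-first-token: a linear term plus a quadratic attention
    term. Both constants are assumptions for illustration only."""
    return prompt_tokens / linear_tps + quad_coeff * prompt_tokens ** 2

for n in (1_000, 50_000, 200_000, 200_001, 500_000, 1_000_000):
    print(f"{n:>9,} tokens  ${input_cost_usd(n):6.3f}  ~{prefill_seconds(n):5.1f}s")
```

The 200K discontinuity jumps out immediately: 200,000 input tokens cost $0.25 while 200,001 cost $0.50, because the higher rate reprices the entire request.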
The way teams actually pay for this is a hidden tax on every long-context call: they retry. The model gets the wrong answer, the operator stuffs more context in, the bill goes up another linear step, the latency budget evaporates, and the relevant fact is still in the middle. One prompt-cache hit cuts the cost ladder by 90% on the read, but doesn’t change the recall problem. Caching makes a long prompt cheaper. It does not make it shorter.
Where each model actually cliffs
Databricks ran a long-context RAG study in late 2024 across four QA datasets and published the per-model degradation curves. Their summary is the cleanest practitioner data we have on where the cliffs sit:
- GPT-4o & Claude 3.5 Sonnet: flat curves to ~96K with little degradation. The good news of the cohort. Newer frontier models (Sonnet 4.6, Opus 4.7, GPT-5.5) hold quality further but inherit the same shape.
- GPT-4 Turbo: visible degradation past 16K, an earlier cliff than the 64K effective length RULER measured for the 1106 snapshot. Real RAG tasks bite sooner than synthetic retrieval.
- Claude 3 Sonnet: performance falls past 16K, and copyright refusals jump from 3.7% to 49.5% past 32K. The refusal was technically a different bug, but it shipped to users as an identical failure.
- Llama 3.1 405B: flat to 32K, then steps down. Open-weight reality matches RULER.
- Mixtral 8x7B: effective context closer to 4K than 32K on RAG tasks. Open MoEs are particularly fragile here.
Anthropic’s own context-window docs now ship explicit guidance about chunking and retrieval for long prompts — an unusually candid acknowledgement from a vendor that a 1M window is a tool, not a license to dump everything into the prompt.
A playbook that doesn’t lie
None of the above means long context is useless. It means the cost/quality curve is sharper than vendors imply. Five rules that hold across providers:
- Retrieve before you dump. A focused 8K-token prompt, assembled from a vector index over the same corpus, consistently beats stuffing the whole corpus into a 200K window. The experiments in the original Lost-in-the-Middle paper and every long-context RAG study since have come to the same conclusion. RAG isn’t obsolete because windows got longer; it got more valuable because it stops you from hitting the parts of the window where the model gets unreliable.
- Put the critical bits at the edges. If you must use a long prompt, put the question, the instructions, and the most load-bearing context at the start or the end. The middle is a graveyard. On retrieval pipelines, this means re-ranking so the top hit lands at position 1 or position N, not somewhere in the fourth chunk.
- Eval at production length, not at 4K. Quality at 4K tells you nothing about quality at 64K. Build a small eval set that mirrors the prompt size you actually deploy. The cost of running it quarterly is a few dollars; the cost of finding out from a customer is much higher.
- Watch the 200K cliff in your invoices. The Gemini and Claude Sonnet 4 surcharges trigger on the whole request above 200K input tokens, not just the tokens past the boundary. A single retry that pushes you over the line can double the call. If you’re hovering near 200K, the right move is usually to chunk-and-summarize first, not to pay the cliff; a pre-flight guard is sketched after this list.
- Cache anything stable. Long, identical prefixes (system prompt, tool schemas, retrieved corpora that don’t turn over per call) are the highest-leverage line item in your bill. Caching cuts the cost without solving the recall problem, but it changes the calculus on where you draw the “is this prompt worth shrinking?” line.
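As a concrete version of rule four, a minimal sketch of that pre-flight guard. The token heuristic and the chunk list are placeholders for whatever your stack actually uses:

```python
TIER_BOUNDARY = 200_000   # tokens; the whole request reprices above this

def approx_tokens(text: str) -> int:
    # Crude heuristic (~4 chars/token for English). Swap in your provider's
    # real tokenizer for anything that will run close to the boundary.
    return len(text) // 4

def guard_context(chunks: list[str], question: str,
                  headroom: int = 10_000) -> list[str]:
    """Keep relevance-sorted chunks (best first) until the request would
    crowd the pricing boundary, leaving headroom so one retry with extra
    context can't push the call over the cliff."""
    budget = TIER_BOUNDARY - headroom - approx_tokens(question)
    kept, used = [], 0
    for chunk in chunks:
        cost = approx_tokens(chunk)
        if used + cost > budget:
            break  # chunk-and-summarize the rest instead of paying 2x
        kept.append(chunk)
        used += cost
    return kept
```

The headroom is the point: it is what keeps a single retry from repricing the whole call.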
And the local footnote
The whole reason to care about input pricing tiers is that you’re paying a per-token bill. On a local model the meter is your own GPU, and the practical limits are RAM and prefill time, not surcharges. That doesn’t make the recall problem disappear — an open-weight 70B class model has the same lost-in-the-middle U-shape as a frontier API, and frequently a steeper RULER cliff — but it does change the trade. With a quantized open-weight model running on your machine you can afford to run more, smaller passes — chunk, summarize, re-rank, then ask the question — for the same wall-clock minute that one cloud call would have spent stuck in prefill. The intermediate steps cost zero tokens.
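A minimal sketch of that loop, assuming an OpenAI-compatible local server of the kind llama.cpp and similar tools expose; the URL, model name, and prompt wording are placeholders:

```python
from openai import OpenAI  # any OpenAI-compatible client works here

# Assumption: a llama.cpp / vLLM-style server listening locally.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

def ask(prompt: str, max_tokens: int = 512) -> str:
    resp = client.chat.completions.create(
        model="local",  # placeholder; many local servers ignore this field
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
    )
    return resp.choices[0].message.content

def chunked_answer(document: str, question: str, chunk_chars: int = 8_000) -> str:
    # Pass 1 (map): pull question-relevant facts out of each small chunk.
    chunks = [document[i:i + chunk_chars]
              for i in range(0, len(document), chunk_chars)]
    notes = [ask(f"List facts relevant to: {question}\n\n{c}") for c in chunks]
    # Pass 2 (reduce): answer from the short, focused notes.
    return ask(f"Using these notes, answer: {question}\n\n" + "\n".join(notes))
```

Each pass stays far below the model’s cliff, and on local hardware the only meter running is your clock.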
The honest framing of long context in 2026 is that the buffer is finally big enough to hold whatever you want to ask. The model is not always smart enough to use all of it well. The five rules above don’t go away when the next million-token model ships; they relax a little with each generation, and by a smaller amount than the marketing implies. RULER will keep being a more useful number than the spec sheet. Treat the advertised window as a ceiling, not a target.