Prompt caching is the biggest discount on your AI bill
Three vendors, three cache mechanics, and a 50–90% discount sitting on the table. Here's how prompt caching actually works in 2026 — and how to design prompts that hit it.
Most AI bills are bigger than they need to be by a factor of two to ten. The reason isn’t which model you picked or how clever your prompt is. It’s that you keep paying full price to re-process the same bytes — the system prompt, the few-shot examples, the documents you stuffed into the context, the tool definitions — on every single API call. The fix has been live for almost two years on every major frontier API and it’s called prompt caching. The headline numbers are in the hero above. The reason this post exists is that almost nobody who isn’t already an LLM-platform engineer is actually using it.
The discount nobody invoices
Token billing pretends every prompt is fresh. The economics of inference don’t. Once a transformer has computed key-value tensors for a stretch of tokens, those KV pairs are deterministic — the same tokens at the same positions produce the same internal state. If the next request starts with the same prefix, the work has already been done. Holding onto the KVs and reusing them is, computationally, almost free. Charging the customer for it again is a habit, not a cost.
Anthropic broke that habit on August 14, 2024 with the public-beta launch of prompt caching for Claude. OpenAI followed on October 1, 2024 at DevDay. Google had shipped explicit context caching for Gemini in May 2024, and extended it to fully implicit caching on May 8, 2025 for Gemini 2.5 and newer. Every shop that runs RAG, agents, or multi-turn chat against these APIs is now leaving 50–90% of their input bill on the table by default — or capturing it, depending on how the prompts are shaped.
The grid above is the cheat sheet. The rest of this post is the mechanics under each row, the workload arithmetic that makes the discount real, and the design rules that decide whether you actually hit the cache or just write to it and pay extra.
What caching actually does
All three vendors share the same shape: cache prefixes, not whole prompts. The cache key is the exact byte sequence from the start of the request up to a breakpoint — tools, then system prompt, then any messages you mark. Change a single token before the breakpoint and the cache misses cleanly; the suffix can vary freely. This is why chat apps that put the user’s question first get nothing from caching, and why agent frameworks that pin a fat system prompt get almost everything.

The economics come down to three numbers per vendor: the discount on a cache read, the premium (or storage cost) on a cache write, and the time-to-live before the entry evicts. The cache-write premium is the part most teams underweight: writing to the cache is more expensive than a normal input token. If your traffic doesn’t reuse the prefix at least a couple of times before the TTL expires, you spent extra to store something nobody read.
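A back-of-the-envelope check fits in a few lines of Python. This is a minimal sketch; the rates, multipliers, and reuse counts are illustrative placeholders, so plug in your vendor's actual numbers:

```python
def cache_worth_it(prefix_tokens: int, base_rate_per_mtok: float,
                   write_multiplier: float, read_multiplier: float,
                   reuses_within_ttl: int) -> bool:
    """Rough break-even: does caching a prefix beat re-sending it at full price?"""
    mtok = prefix_tokens / 1_000_000
    # Pay full rate on the first send and on every repeat.
    uncached = base_rate_per_mtok * mtok * (1 + reuses_within_ttl)
    # Pay the write premium once, then the discounted read rate on every repeat.
    cached = (base_rate_per_mtok * write_multiplier * mtok
              + base_rate_per_mtok * read_multiplier * mtok * reuses_within_ttl)
    return cached < uncached

# Example: 50K-token prefix, $3/M input, 1.25x write premium, 10% read rate.
print(cache_worth_it(50_000, 3.00, 1.25, 0.10, 0))  # False: a write nobody reads is pure cost
print(cache_worth_it(50_000, 3.00, 1.25, 0.10, 1))  # True: one reuse already pays it back
```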
Anthropic — explicit, 90% off
Claude’s caching is the most generous and the most operator-aware of the three. You can set up to four cache_control breakpoints in your request — on tools, the system prompt, or specific message blocks — and Anthropic caches everything before each breakpoint. Reads cost 10% of the base input rate. Writes cost 1.25× the base rate at the default 5-minute TTL, or 2× at the optional 1-hour TTL. The minimum cacheable prefix is 1,024 tokens on older Sonnets and 4,096 on Opus 4.7, Haiku 4.5, and the latest Sonnet line.
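The request shape with the Python SDK looks roughly like the sketch below; the model string and prompt contents are placeholders, and the optional 1-hour TTL may sit behind a beta header depending on your SDK version, so treat it as a sketch of the mechanics rather than production code:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LONG_SYSTEM_PROMPT = "..."  # placeholder: RAG context + tool docs + few-shot examples

response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder; use the model you actually run
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Everything up to and including this block becomes the cached prefix
            # (default 5-minute TTL, refreshed on every read).
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "What does clause 7 commit us to?"}],
)

# The usage block shows how the prefix was billed on this particular call.
print(response.usage.cache_creation_input_tokens)  # tokens written to the cache
print(response.usage.cache_read_input_tokens)      # tokens read back at the cached rate
```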
Concretely on Sonnet 4.6 ($3 / $15 per million in/out): a 50K-token cached system prompt costs $0.1875 the first time you write it and $0.015 every time you read it, a tenth of the $0.15 you would otherwise pay to re-send the same 50K on every call. Anthropic’s launch numbers reported a chat-with-book test going from 11.5s to 2.4s and a 10-turn conversation dropping from ~10s to ~2.5s. The latency story is real because the model literally skips the prefill compute on cache hits, not because the network is faster.
The most underrated detail: the 5-minute TTL refreshes on every read at no additional cost. A busy agent that hits its cached system prompt every minute keeps the entry alive indefinitely without paying to re-warm it. The 1-hour TTL exists for the case where reads are sparse but you still want to amortize the write — nightly batch jobs, infrequent customer follow-ups, embeddings backfills.
OpenAI — automatic, 50% off
OpenAI took the opposite path: nothing to configure, no flags, no breakpoints. Any prompt of 1,024 tokens or more sent to gpt-4o, gpt-4o-mini, o1-preview, o1-mini, or any newer model is hashed in 128-token chunks; if the prefix matches a recent request, you pay 50% less on the matching tokens. Cached entries live 5–10 minutes idle and up to an hour during sustained traffic. There is no cache-write premium and no storage cost.
That sounds strictly better than Anthropic’s explicit model until you notice what you give up: visibility and control. You can inspect usage.prompt_tokens_details.cached_tokens per call to see hits, but you can’t pin a particular prefix, can’t choose a longer TTL, and can’t guarantee the cache is warm before a latency-sensitive moment. For a chat app this is fine; for an agent framework that wants deterministic cost behavior under load, the absence of breakpoints is a real ergonomic gap.
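What you can do is measure. A minimal sketch with the OpenAI Python SDK (model name and prompt are placeholders) that reads the cached-token count back out of the usage block:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

STABLE_PREFIX = "..."  # placeholder: system prompt + tool docs, at least 1,024 tokens

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder; any caching-enabled model reports the same field
    messages=[
        {"role": "system", "content": STABLE_PREFIX},
        {"role": "user", "content": "Summarize the attached contract."},
    ],
)

usage = response.usage
cached = usage.prompt_tokens_details.cached_tokens  # 0 on a cold prefix
hit_rate = cached / usage.prompt_tokens if usage.prompt_tokens else 0.0
print(f"cached {cached} of {usage.prompt_tokens} prompt tokens ({hit_rate:.0%})")
```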
The 50% number is also smaller than the headline 90% you’ll see on Anthropic and Google. Whether that matters depends on what fraction of your bill is input vs. output. Long-system-prompt agentic workloads skew 10:1 input to output, so a 50% input discount is a 45% bill reduction. Output-heavy chat (long generations, short prompts) sees a much smaller delta.
Google — explicit, then implicit
Gemini caching has the most options and the most footnotes. The API exposes both modes: explicit caching, where you create a cache object with a TTL and reference it by handle, and implicit caching, which is on by default for Gemini 2.5+ and behaves like OpenAI’s automatic match. The cached-input price is 10% of uncached on Gemini 2.5 Pro ($0.125 vs $1.25 per million tokens at ≤200K context) and Gemini 2.5 Flash ($0.03 vs $0.30 per million for text / image / video).
The thing only Google charges for is storage. Explicit caching bills $4.50 per million tokens per hour on 2.5 Pro and $1.00 per million per hour on 2.5 Flash, regardless of whether you read it. That sounds expensive until you do the arithmetic: 50K tokens stored for an hour on Pro is $0.225, and each read of the same 50K saves roughly $0.056 ($0.00625 at the cached rate versus $0.0625 uncached). Five reads in the hour and you’re net positive. Implicit caching has no storage fee but offers no guarantees about whether your prefix will actually be there.
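For the explicit path, the shape with the google-genai Python SDK is roughly the following; the model name, TTL, and contents are placeholders, and the config class names can shift between SDK versions, so check the current docs before relying on it:

```python
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

BIG_CONTEXT = "..."  # placeholder: the 50K-token document you want to pin

# Create the cache once; storage is billed for as long as the TTL runs.
cache = client.caches.create(
    model="gemini-2.5-pro",  # placeholder model name
    config=types.CreateCachedContentConfig(
        system_instruction="You are a careful contract analyst.",
        contents=[BIG_CONTEXT],
        ttl="3600s",  # one hour of storage
    ),
)

# Every later request references the cache by handle and pays the cached-read rate.
response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents="What does clause 7 commit us to?",
    config=types.GenerateContentConfig(cached_content=cache.name),
)
print(response.text)
```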
Implicit caching on Gemini 2.5 needs a 2,048-token prefix on Flash and a 1,024-token floor on the explicit endpoints. The 4,096 minimum some docs cite is for Gemini 3 Pro Preview — worth checking the model card before you assume your 1.5K-token system prompt qualifies.
A real workload, three bills
The numbers above are abstract. Here’s a concrete scenario: an agent run with a 50,000-token system prompt (RAG context + tool definitions + few-shot examples), 10 turns of 1,000-token user input and 1,000-token assistant output. This is roughly a typical “the AI helps me work through a doc” session. Every turn re-sends the full system prompt. Without caching, the bill is dominated by 10 re-reads of those 50K tokens. With caching, you pay for the prefix once (or zero times, with implicit caching) and read it cheaply 10 times.
Anthropic delivers the deepest absolute discount — 70% off this workload — because the cache-read price is the lowest on the market and the write premium is a one-time charge. OpenAI’s 41% is the smallest of the three and reflects the 50% read discount. Gemini 2.5 Pro lands in the middle at 46%, and the storage line is what closes the gap; for a long-running agent that keeps hitting the same cache for hours, Pro’s economics actually pull ahead because the implicit path skips storage entirely while still giving the 90% read discount.
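If you want to rerun that arithmetic against your own traffic shape, it fits in a few lines. The sketch below assumes the session described above, uses illustrative per-million list prices, caches the conversation history incrementally along with the system prompt, and models Gemini on the implicit path with no storage fee (which is why its number lands above the explicit-cache figure quoted in the grid):

```python
def session_bill(in_rate, out_rate, read_mult, write_mult,
                 system_tokens=50_000, turns=10, user_tok=1_000, asst_tok=1_000):
    """Compare an uncached vs. prefix-cached multi-turn session (rates are $/M tokens)."""
    uncached = cached = 0.0
    history = 0       # accumulated user+assistant turns re-sent on every call
    prev_prompt = 0   # prefix cached by the previous call
    for turn in range(turns):
        prompt = system_tokens + history + user_tok
        reused, written = prev_prompt, prompt - prev_prompt
        uncached += prompt * in_rate / 1e6 + asst_tok * out_rate / 1e6
        cached += (reused * in_rate * read_mult
                   + written * in_rate * write_mult
                   + asst_tok * out_rate) / 1e6
        history += user_tok + asst_tok
        prev_prompt = prompt
    return uncached, cached

# Placeholder rates: (input, output, read multiplier, write multiplier).
scenarios = {
    "anthropic": (3.00, 15.00, 0.10, 1.25),
    "openai":    (2.50, 10.00, 0.50, 1.00),  # no write premium
    "gemini":    (1.25, 10.00, 0.10, 1.00),  # implicit path, no storage modeled
}
for name, args in scenarios.items():
    full, discounted = session_bill(*args)
    print(f"{name}: ${full:.2f} -> ${discounted:.2f} ({1 - discounted / full:.0%} saved)")
```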
What gets cached: design rules
Caching only pays off if the prefix is actually stable. The rules are the same on every vendor:
- Pin the stable bytes to the front. Tools, system prompt, retrieved documents, few-shot examples — in roughly that order. The user’s message goes last. If you slip a timestamp or a per-request UUID into the system prompt, you just invalidated every downstream cache; remove it or move it to the suffix.
- Reuse the exact same byte sequence. One stray space, a re-ordered tool definition, a JSON key written in a different order, and the cache misses. JSON tool definitions are notorious here — serialize them once, store the string, and reuse it verbatim (the sketch after this list does exactly that).
- Mind the minimums. Below the per-model floor (1K / 2K / 4K depending on vendor and model) caching is disabled silently — no error, no discount. If you’re building agentic workflows, pad the system prompt with style guides or schema docs that double as actual context. You almost certainly have things worth saying.
- Start logging the cache key. Trace logs that record the full prompt are fine; trace logs that diff consecutive prompts will alert you to invalidations you didn’t mean to cause. Most teams don’t notice their cache hit rate is at 30% until they instrument for it.
- Set explicit breakpoints when you can. On Anthropic, the four breakpoints are a budget. Use one on tools (longest stable chunk), one on the system prompt, one on retrieved documents, and save the fourth for the last user/assistant pair if you’re running long sessions. On OpenAI, you can’t set them; on Gemini, prefer explicit caching for hot prefixes and lean on implicit for cold ones.
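A small Python sketch of the serialization and diffing points above; the tool definitions are hypothetical and the warning is just a print, but the canonical-bytes and prefix-hash ideas carry over to whatever tracing you already run:

```python
import hashlib
import json

# Hypothetical tool definitions; in practice these come from your agent framework.
TOOLS = [
    {"name": "search_docs", "description": "Full-text search over the doc store"},
    {"name": "get_doc", "description": "Fetch a document by id"},
]

# Serialize once, with sorted keys and fixed separators, and reuse the string verbatim
# so every request re-sends byte-identical tool bytes.
TOOLS_JSON = json.dumps(TOOLS, sort_keys=True, separators=(",", ":"))

_last_prefix_hash = None

def check_prefix_stability(system_prompt: str) -> None:
    """Hash the cacheable prefix on every request and warn when it drifts.
    A changing hash means something upstream (a timestamp, a reordered tool,
    stray whitespace) is silently invalidating the cache."""
    global _last_prefix_hash
    digest = hashlib.sha256((TOOLS_JSON + system_prompt).encode()).hexdigest()
    if _last_prefix_hash is not None and digest != _last_prefix_hash:
        print(f"warning: cacheable prefix changed ({_last_prefix_hash[:8]} -> {digest[:8]})")
    _last_prefix_hash = digest
```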
Where caching breaks
The failure modes are mostly variants of “the prefix wasn’t actually stable.” Specifically:
- RAG pipelines that retrieve different docs per turn. Caching the system prompt still works, but the retrieved chunk changes the cacheable boundary mid-prompt. Solution: cache up to but not including retrieval, then accept that the retrieved tokens go full price. The system-prompt portion is usually 80% of the input weight anyway.
- Streaming many small prompts in parallel. The first request writes the cache; the next nine in-flight requests racing alongside it might miss because the entry isn’t live yet. Either await the first response before fanning out, or accept the warm-up tax on the first batch.
- Long sessions that exceed the TTL. A user who walks away for an hour and comes back to a chat app on Anthropic with the default 5-minute TTL will pay full freight on their next message unless you preemptively bumped to the 1-hour TTL. A cheap trick: start every session at the 5-minute TTL, switch to 1-hour after the third turn.
- Cache-write premiums on traffic that won’t reuse. One-shot evals, single-user calls, A/B prompt experiments — if there’s no reuse within the TTL, the 1.25× or 2× write premium on Anthropic is pure cost. Gate cache_control behind a “will this prefix be reused” heuristic; the cheapest of those is a request count over the past hour (a minimal version is sketched after this list).
- Provider boundaries. Caches are scoped per provider and usually per organization. If you front a multi-provider router (OpenRouter, LiteLLM, your own switch) the cache only warms wherever the request actually went. Sticky routing matters more than people think.
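Here is a minimal version of that reuse gate, assuming you can key each request by the prefix it shares; the one-hour window and two-request threshold are arbitrary defaults for illustration, not vendor guidance:

```python
import time
from collections import defaultdict, deque

_recent = defaultdict(deque)  # prefix key -> timestamps of recent requests

def should_cache(prefix_key: str, window_s: float = 3600.0, min_reuse: int = 2) -> bool:
    """Only ask the API to cache a prefix once it has shown up often enough recently
    that the write premium is likely to be paid back within the TTL."""
    now = time.monotonic()
    seen = _recent[prefix_key]
    seen.append(now)
    while seen and now - seen[0] > window_s:  # drop sightings outside the window
        seen.popleft()
    return len(seen) >= min_reuse

# Usage: set cache_control on the prefix only when should_cache(prefix_hash) is True.
```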
And the local-model footnote
The whole reason to care about the cache-read price is that you’re paying a per-token bill. Local inference doesn’t have that problem — the only resource is your laptop’s GPU and the electricity to run it — and the engines underneath actually do cache KVs across requests for free. llama.cpp and Ollama both keep the previous request’s KV cache in memory and skip the prefill if the next request shares its prefix; the discount is 100%, the TTL is “until something else needs the RAM,” and nobody bills you for it.
The flip side is that local inference is rate-limited by your hardware, not your wallet. CSuite’s split — Ollama for text, transformers.js + ONNX for multimodal, BYO keys for cloud spikes — means you can keep the steady state cheap on local KV reuse and only pay token bills (with caching turned on) when you genuinely need a frontier cloud model. The most useful thing you can do this week is whichever of these shortens the gap between the prefix you re-send on every call and the prefix the cache can match: pin the system prompt, hash your tool definitions, instrument cached_tokens in your traces, and then look at how the bill changes. The 90% is real. It just won’t come find you.


