Most AI work can wait. Your bill shouldn't.
Every major AI vendor charges half price if you'll wait a few hours. Most teams pay full price for work nobody is waiting on. A workload-sorting frame, and the four vendors' fine print.
Open the pricing page for any major AI lab. There are two columns. The left column is what you pay if you need an answer in the next second. The right column is half that — what you pay if you’ll accept the same answer within the next day. Same model. Same prompt. Same tokens. Half the bill. The only thing the discount asks for is patience that most workloads already have.
Almost nobody uses it. Talk to a team running a couple of thousand dollars a month through an API and you’ll find evals run live, nightly classification jobs run live, weekly digest emails run live, a backlog of moderation queue items run live. The 50% is sitting on the table because someone wrote the first version of the code against the synchronous endpoint and nobody went back. This post is the going-back memo. It is also a useful frame: not every minute of AI work is worth the same money.
Two columns, half the bill
The mechanic is the same across vendors. You upload a file of requests — the same JSON shape you’d send live, just batched in a list — and you get a file of responses back when the batch is done. The system gives itself up to 24 hours to finish. In practice most batches finish faster. Anthropic says most batches complete in under an hour and OpenAI says batches “often complete more quickly” than the 24-hour ceiling. The clock is the asking price for the discount; you usually get change back.
Anthropic launched the Message Batches API in public beta on October 8, 2024, with general availability on December 17 of the same year. OpenAI had shipped its Batch API earlier, in April 2024. Google’s Gemini Batch Mode arrived in 2025, and AWS Bedrock rolled batch inference into its service-tier menu alongside Standard, Priority, and Flex. By early 2026 the 50% rate is not a beta, not an experiment, and not a regional rollout. It’s the default for patient work.
That table is the cheat sheet. The interesting part is upstream of it: you don’t earn the discount by writing a different SDK call. You earn it by being honest about which of your AI calls actually needed to be alive.
A test for “is this urgent?”
The question isn’t whether the workload could be made synchronous. Almost any workload can. The question is whether a human is waiting on the next token. If there is no human in the loop, or if the human in the loop has already left the page, the latency budget is not seconds — it’s “before the next time someone opens this report.”
Walk through your own call log with that filter. A chat reply is urgent. A code-completion popup is urgent. A live RAG answer is urgent. But the eval run that scores last week’s prompts on a held-out test set isn’t; it just has to be done before tomorrow’s standup. The summarization pass over a 200,000-row support-ticket dump isn’t; it has to be done before the QBR. Generating product descriptions for the new SKU batch isn’t; it has to be ready before the catalog goes live. In each of those cases you are not buying speed; you are buying tokens. And tokens are cheaper after dark.

The four vendors, side by side
Every major hosted LLM provider now runs the same trade. The shape is almost identical: an asynchronous endpoint, a 50% rate, a 24-hour ceiling, results in a file. The differences are in the limits. OpenAI caps a single batch at 50,000 requests or 200 MB; Anthropic goes higher at 100,000 requests or 256 MB; Google’s inline path accepts payloads up to 20 MB and file uploads up to 2 GB; Bedrock passes everything through S3 buckets.
Coverage is broad. Anthropic batches every active Claude model. OpenAI batches Responses, Chat Completions, Embeddings, Moderations, Images, and Video endpoints. Google batches Gemini models across the same modalities. Bedrock supports Claude, Llama, Mistral, Amazon Nova / Titan, Cohere, and AI21 through a single batch interface. There’s no model you’d plausibly want for an offline workload that isn’t batchable on at least one of these providers.
- Chat replyUser typed, spinner spinning
- Code completionIDE popup, tab to accept
- Live RAG answerSearch result on the page
- Voice agent turnPhone call, latency budget <800ms
- Eval runScore 5,000 prompts before standup
- Classification backlogTag a year of support tickets
- Embeddings backfillRe-embed corpus on model change
- Catalog contentProduct descriptions for new SKUs
- Nightly digestSummarize today's docs by 7am
- Dataset labelingBuild training data for a fine-tune
What batch is actually for
Patient workloads are the dull half of the AI bill. They’re also the larger half at most companies, once you measure tokens instead of product surface area. A few concrete cases that are almost always batch-shaped:
- Evals. Scoring a prompt change against a held-out set of 5,000 examples is exactly what batch is for. Nobody is waiting on response 3,217. You want results in your dashboard by morning.
- Bulk classification.Tagging a year of support tickets, labeling 200K product images for an alt-text rollout, categorizing a corpus of legal filings — classification at scale is the canonical batch workload. Quora cited summarization and highlight extraction as their batch use case at launch.
- Embeddings backfills.Re-embedding a corpus when you change models or chunking strategies. The downstream index rebuild doesn’t care whether the embeddings landed at 3am or 3pm.
- Content moderation backlogs.Live moderation (chat, posts) belongs on the synchronous tier. The post-hoc sweep over historical content that nobody flagged the first time doesn’t.
- Generated content for catalogs. Product descriptions, alt text, marketing copy variants, release-note drafts. The deadline is the publish date, not the per-row generation moment.
- Dataset labeling. The training data pipelines that feed your fine-tunes, your distillations, your synthetic augmentations. Producers of training data should never pay live prices.
A useful mental model: anywhere you would have once written a cron job, batch is the AI equivalent. The fact that the discount is 50% and not, say, 5% tells you the vendors actually want you off the synchronous fleet for this work. It frees capacity for the requests where every millisecond matters and somebody is staring at a spinner.

The honest gotchas
Batch is not a free lunch. The 24-hour ceiling is the obvious cost; the less obvious ones are where most teams trip:
- The 24-hour ceiling is a ceiling, not a guarantee. Anthropic explicitly notes that under high demand “you may see more requests expiring after 24 hours.” OpenAI uses softer language: completion is “often” faster, but the 24-hour limit is hard. Design for retry on expiration, not just on error.
- No streaming. No live tool calls.Batch returns a completed response, not a token stream. Any workflow that depends on intermediate tokens — live tool use, interactive agents asking the user a follow-up — can’t be batched. The batch endpoint is for one-shot completions and tool-augmented single-turn requests where the loop is closed.
- Partial failures are normal. A 50,000-request batch is going to have a few hundred rows that hit a content filter, exceed the context window for a particular input, or time out. The result file flags those rows; your downstream code has to treat them as a retry queue, not as a fatal error.
- Rate-limit and spend-limit interactions are unusual. Anthropic notes that batches can briefly exceed your workspace spend limit because of concurrent processing. The batch tier has its own rate limits separate from the live tier, which is mostly good news, but means a single oversized batch can still get queued.
- No Zero Data Retention on Anthropic. The Message Batches API is not eligible for ZDR; the standard 30-day retention applies. Most batches are internal-data jobs where this is fine, but the legal team should know which workloads moved.
- Results expire. Anthropic keeps batch results for 29 days; OpenAI auto-deletes the output file 30 days after completion. Download and persist the results in your own storage if you need them long-term.
None of these are deal-breakers. They’re the kind of constraints you’d expect of an asynchronous API. The point is that “just run it on batch” is not entirely a one-liner: you need a job runner, an error-handling shape, and a place for the result files. Most teams already have all three.
Stack it with caching for 75–95% off
The two discounts compose. Batch is 50% off any token; prompt caching is 50–90% off any token that hits a cached prefix. If a request does both, the multipliers stack. Anthropic publishes the math: on Claude Sonnet 4.6, a cached input token in a batch request costs $0.15 per millionagainst the standard $3 input rate — a 95% discount. On Haiku 4.5 the stacked rate is $0.05 per million. OpenAI stacks to roughly 75% off on cached batch inputs because its cache discount is the smaller 50%.
That changes the workload math more than the headline suggests. A classification batch with a 10,000-token system prompt and 200,000 rows used to pay full price on the system prompt 200,000 times. With caching alone the system prompt cost drops 90% across the batch. With caching plus batch, the same prompt is 19× cheaper than the synchronous, uncached version. The companion piece on prompt caching as the biggest discount in your AI bill walks through the caching mechanics; the short version is: pin the stable bytes (tools, system prompt, retrieved documents) to the front of every request and set the cache-control breakpoints before you submit the batch. Anthropic recommends the 1-hour cache TTL inside batches specifically because batch processing can outlast the default 5-minute window.
Local doesn’t have a fast lane
The whole urgent-vs-patient distinction is a hosted-API artifact. On a local model, every request runs on the one GPU you have. There is no separate batch fleet to migrate to. The pricing is whatever your laptop’s wall power says — a few cents an hour for sustained inference on a modern Apple Silicon machine, less still on a Linux box with an idle 4090. The economics flip: there’s no synchronous tax to dodge, just the question of whether you want to wait for the model on your machine or pay a cloud bill to skip the wait.
That’s the case for the split CSuite uses by default: local models for the constant trickle of small tasks (rewrites, translations, single-document summaries, voice transcription), batch APIs for the rare big jobs (corpus-scale classification, eval runs, embeddings backfills), and the live synchronous tier kept narrow — only the things a human is genuinely staring at. The post on cost-per-task accounting argues this re-sorting at the workload level; batch is the lever for the third pile.

Sort your workloads tomorrow
The exercise is shorter than the post. Pull your last month of AI usage, group it by feature, and put a check next to every feature where a human is staring at the response. Everything without a check is candidate batch traffic. Pick the largest one by token volume and rewrite it to submit a batch instead of a stream of synchronous calls. You’ll know within a day whether the latency is tolerable; if it is, the bill for that workload halves the next billing cycle, and you can stack caching for another 10–40% on top.
AI vendors are pricing your patience because they’d rather have the option to schedule your work than have you fight live traffic for capacity. That’s a deal you can take or leave; what you can’t do is pretend it’s not on offer. The two columns are right there on the pricing page. Most of your work belongs in the right one.


