EconomicsEngineeringField GuideMay 19, 202611 min read

Most AI work can wait. Your bill shouldn't.

Every major AI vendor charges half price if you'll wait a few hours. Most teams pay full price for work nobody is waiting on. A workload-sorting frame, and the four vendors' fine print.

By Atul

Same model, same prompt · Claude Sonnet 4.6

Live, in milliseconds

$3.00

per million input tokens

Within 24 hours50% off

$1.50

per million batch input tokens

OpenAI

50% off · 24h

Anthropic

50% off · 24h

Google

50% off · 24h

AWS Bedrock

50% off · 24h

Open the pricing page for any major AI lab. There are two columns. The left column is what you pay if you need an answer in the next second. The right column is half that: what you pay if you’ll accept the same answer within the next day. Same model. Same prompt. Same tokens. Half the bill. The only thing the discount asks for is patience that most workloads already have.

Almost nobody uses it. Talk to a team running a couple of thousand dollars a month through an API and you’ll find evals run live, nightly classification jobs run live, weekly digest emails run live, a backlog of moderation queue items run live. The 50% is sitting on the table because someone wrote the first version of the code against the synchronous endpoint and nobody went back. This post is the going-back memo. It is also a useful frame: not every minute of AI work is worth the same money.

Two columns, half the bill

The mechanic is the same across vendors. You upload a file of requests (the same JSON shape you’d send live, just batched in a list) and you get a file of responses back when the batch is done. The system gives itself up to 24 hours to finish. In practice most batches finish faster. Anthropic says most batches complete in under an hour and OpenAI says batches “often complete more quickly” than the 24-hour ceiling. The clock is the asking price for the discount; you usually get change back.

Anthropic launched the Message Batches API in public beta on October 8, 2024, with general availability on December 17 of the same year. OpenAI had shipped its Batch API earlier, in April 2024. Google’s Gemini Batch Mode arrived in 2025, and AWS Bedrock rolled batch inference into its service-tier menu alongside Standard, Priority, and Flex. By early 2026 the 50% rate is not a beta, not an experiment, and not a regional rollout. It’s the default for patient work.

Batch APIs at the four major hosted AI providers

Dimension

OpenAI

Anthropic

Google

AWS Bedrock

Discount

50% in & out

SLA

24h ceiling

Typical finish

Often <1h

Most <1h

Often <1h

Varies

Max requests / batch

50,000

100,000

Inline or 2 GB file

S3 file

Max file size

200 MB

256 MB

20 MB inline / 2 GB

S3 (no fixed cap)

Endpoints / modalities

Chat, Responses, Embeddings, Images, Video

All Messages features

Gemini text/image/video

Claude, Llama, Mistral, Nova, Cohere, AI21

Result retention

30 days

29 days

Varies

Your S3 bucket

GA date

Apr 2024

Dec 2024

2025

2024

That table is the cheat sheet. The interesting part is upstream of it: you don’t earn the discount by writing a different SDK call. You earn it by being honest about which of your AI calls actually needed to be alive.

A test for “is this urgent?”

The question isn’t whether the workload could be made synchronous. Almost any workload can. The question is whether a human is waiting on the next token. If there is no human in the loop, or if the human in the loop has already left the page, the latency budget is not seconds : it’s “before the next time someone opens this report.”

Walk through your own call log with that filter. A chat reply is urgent. A code-completion popup is urgent. A live RAG answer is urgent. But the eval run that scores last week’s prompts on a held-out test set isn’t; it just has to be done before tomorrow’s standup. The summarization pass over a 200,000-row support-ticket dump isn’t; it has to be done before the QBR. Generating product descriptions for the new SKU batch isn’t; it has to be ready before the catalog goes live. In each of those cases you are not buying speed; you are buying tokens. And tokens are cheaper after dark.

A long industrial conveyor line in a warehouse with metal parts queued in even rows. — Batch is the night shift: the same work, done by the same machines, when nobody is waiting at the counter. Photo by Trans Russia on Unsplash.

The four vendors, side by side

Every major hosted LLM provider now runs the same trade. The shape is almost identical: an asynchronous endpoint, a 50% rate, a 24-hour ceiling, results in a file. The differences are in the limits. OpenAI caps a single batch at 50,000 requests or 200 MB; Anthropic goes higher at 100,000 requests or 256 MB; Google’s inline path accepts payloads up to 20 MB and file uploads up to 2 GB; Bedrock passes everything through S3 buckets.

Coverage is broad. Anthropic batches every active Claude model. OpenAI batches Responses, Chat Completions, Embeddings, Moderations, Images, and Video endpoints. Google batches Gemini models across the same modalities. Bedrock supports Claude, Llama, Mistral, Amazon Nova / Titan, Cohere, and AI21 through a single batch interface. There’s no model you’d plausibly want for an offline workload that isn’t batchable on at least one of these providers.

Sort your AI workloads

Live tier · full price

Chat reply
User typed, spinner spinning
Code completion
IDE popup, tab to accept
Live RAG answer
Search result on the page
Voice agent turn
Phone call, latency budget <800ms

Batch tier · 50% off

Eval run
Score 5,000 prompts before standup
Classification backlog
Tag a year of support tickets
Embeddings backfill
Re-embed corpus on model change
Catalog content
Product descriptions for new SKUs
Nightly digest
Summarize today's docs by 7am
Dataset labeling
Build training data for a fine-tune

What batch is actually for

Patient workloads are the dull half of the AI bill. They’re also the larger half at most companies, once you measure tokens instead of product surface area. A few concrete cases that are almost always batch-shaped:

Evals. Scoring a prompt change against a held-out set of 5,000 examples is exactly what batch is for. Nobody is waiting on response 3,217. You want results in your dashboard by morning.
Bulk classification. Tagging a year of support tickets, labeling 200K product images for an alt-text rollout, categorizing a corpus of legal filings. Classification at scale is the canonical batch workload. Quora cited summarization and highlight extraction as their batch use case at launch.
Embeddings backfills. Re-embedding a corpus when you change models or chunking strategies. The downstream index rebuild doesn’t care whether the embeddings landed at 3am or 3pm.
Content moderation backlogs. Live moderation (chat, posts) belongs on the synchronous tier. The post-hoc sweep over historical content that nobody flagged the first time doesn’t.
Generated content for catalogs. Product descriptions, alt text, marketing copy variants, release-note drafts. The deadline is the publish date, not the per-row generation moment.
Dataset labeling. The training data pipelines that feed your fine-tunes, your distillations, your synthetic augmentations. Producers of training data should never pay live prices.

A useful mental model: anywhere you would have once written a cron job, batch is the AI equivalent. The fact that the discount is 50% and not, say, 5% tells you the vendors actually want you off the synchronous fleet for this work. It frees capacity for the requests where every millisecond matters and somebody is staring at a spinner.

A server hall corridor at night, indicator lights glowing dim blue along stacked racks. — Batch jobs run on the same hardware as the live tier, just when the live tier has spare capacity. Photo by Taylor Vick on Unsplash.

The honest gotchas

Batch is not a free lunch. The 24-hour ceiling is the obvious cost; the less obvious ones are where most teams trip:

The 24-hour ceiling is a ceiling, not a guarantee. Anthropic explicitly notes that under high demand “you may see more requests expiring after 24 hours.” OpenAI uses softer language: completion is “often” faster, but the 24-hour limit is hard. Design for retry on expiration, not just on error.
No streaming. No live tool calls. Batch returns a completed response, not a token stream. Any workflow that depends on intermediate tokens (live tool use, interactive agents asking the user a follow-up) can’t be batched. The batch endpoint is for one-shot completions and tool-augmented single-turn requests where the loop is closed.
Partial failures are normal. A 50,000-request batch is going to have a few hundred rows that hit a content filter, exceed the context window for a particular input, or time out. The result file flags those rows; your downstream code has to treat them as a retry queue, not as a fatal error.
Rate-limit and spend-limit interactions are unusual. Anthropic notes that batches can briefly exceed your workspace spend limit because of concurrent processing. The batch tier has its own rate limits separate from the live tier, which is mostly good news, but means a single oversized batch can still get queued.
No Zero Data Retention on Anthropic. The Message Batches API is not eligible for ZDR; the standard 30-day retention applies. Most batches are internal-data jobs where this is fine, but the legal team should know which workloads moved.
Results expire. Anthropic keeps batch results for 29 days; OpenAI auto-deletes the output file 30 days after completion. Download and persist the results in your own storage if you need them long-term.

None of these are deal-breakers. They’re the kind of constraints you’d expect of an asynchronous API. The point is that “just run it on batch” is not entirely a one-liner: you need a job runner, an error-handling shape, and a place for the result files. Most teams already have all three.

Stack it with caching for 75–95% off

The two discounts compose. Batch is 50% off any token; prompt caching is 50–90% off any token that hits a cached prefix. If a request does both, the multipliers stack. Anthropic publishes the math: on Claude Sonnet 4.6, a cached input token in a batch request costs $0.15 per million against the standard $3 input rate : a 95% discount. On Haiku 4.5 the stacked rate is $0.05 per million. OpenAI stacks to roughly 75% off on cached batch inputs because its cache discount is the smaller 50%.

That changes the workload math more than the headline suggests. A classification batch with a 10,000-token system prompt and 200,000 rows used to pay full price on the system prompt 200,000 times. With caching alone the system prompt cost drops 90% across the batch. With caching plus batch, the same prompt is 19× cheaper than the synchronous, uncached version. The companion piece on prompt caching as the biggest discount in your AI bill walks through the caching mechanics; the short version is: pin the stable bytes (tools, system prompt, retrieved documents) to the front of every request and set the cache-control breakpoints before you submit the batch. Anthropic recommends the 1-hour cache TTL inside batches specifically because batch processing can outlast the default 5-minute window.

Cost per million input tokens · cached prefix on Claude Sonnet 4.6

Discounts multiply, not add. The cheapest input token on a frontier model in 2026 is 1/20th the headline rate.

Synchronous, no cacheClaude Sonnet 4.6 input

$3.00 / M

Batch only50% off

$1.50 / M

Caching only90% off cached read

$0.30 / M

Batch + cachingStacked, 95% off

$0.15 / M

Local doesn’t have a fast lane

The whole urgent-vs-patient distinction is a hosted-API artifact. On a local model, every request runs on the one GPU you have. There is no separate batch fleet to migrate to. The pricing is whatever your laptop’s wall power says: a few cents an hour for sustained inference on a modern Apple Silicon machine, less still on a Linux box with an idle 4090. The economics flip: there’s no synchronous tax to dodge, just the question of whether you want to wait for the model on your machine or pay a cloud bill to skip the wait.

That’s the case for the split CSuite uses by default: local models for the constant trickle of small tasks (rewrites, translations, single-document summaries, voice transcription), batch APIs for the rare big jobs (corpus-scale classification, eval runs, embeddings backfills), and the live synchronous tier kept narrow: only the things a human is genuinely staring at. The post on cost-per-task accounting argues this re-sorting at the workload level; batch is the lever for the third pile.

A desk lamp lighting a cluttered workspace late at night, paperwork stacked under warm light. — A 24-hour SLA isn’t patience. It’s scheduling. The job starts when the deadline allows it, not when the request is submitted. Photo by Haewon Oh on Unsplash.

Sort your workloads tomorrow

The exercise is shorter than the post. Pull your last month of AI usage, group it by feature, and put a check next to every feature where a human is staring at the response. Everything without a check is candidate batch traffic. Pick the largest one by token volume and rewrite it to submit a batch instead of a stream of synchronous calls. You’ll know within a day whether the latency is tolerable; if it is, the bill for that workload halves the next billing cycle, and you can stack caching for another 10–40% on top.

AI vendors are pricing your patience because they’d rather have the option to schedule your work than have you fight live traffic for capacity. That’s a deal you can take or leave; what you can’t do is pretend it’s not on offer. The two columns are right there on the pricing page. Most of your work belongs in the right one.

Most AI work can wait. Your bill shouldn't.

Two columns, half the bill

A test for “is this urgent?”

The four vendors, side by side

What batch is actually for

The honest gotchas

Stack it with caching for 75–95% off

Local doesn’t have a fast lane

Sort your workloads tomorrow

Sora vs Veo vs Kling in 2026: one shutdown, one successor, one survivor

ByteDance models with real examples: Seedream and Seedance

Most AI apps are wrappers, and you're paying the markup

One-time payment. Yours forever.