Cost-per-task, not cost-per-token
Your AI bill probably has the wrong column header on it. Six realistic tasks, six providers, and a table that re-ranks the leaderboard once you stop measuring in bytes and start measuring in outcomes.
Open any AI procurement deck written in the last two years and the first chart is the same one: a bar comparing providers on dollars per million tokens. That number is on every vendor’s landing page, every cost-comparison spreadsheet, every Reddit thread arguing which model is cheapest. It is also the wrong unit. Nobody buys tokens. People buy outcomes — a summarized PDF, a reviewed pull request, a transcribed meeting, a cleaned CSV, an answered support ticket — and the price they actually pay for each of those is what the bill is made of. Re-rank the providers by cost-per-task and the leaderboard rearranges itself in ways the per-token chart never warned you about.
The wrong unit
Two models priced identically per token can sit a factor of four apart on the same job. A model priced 3× higher per token can finish first try where a cheaper one stalls, retries, or hallucinates a wrong answer that you eat anyway. The per-token chart treats output as a commodity — bytes in, bytes out, dollars per byte. The real economics of language models don’t work that way. Output length varies wildly by model temperament. Reasoning models pad with intermediate tokens nobody reads but everybody pays for. Quality differences hide as “same answer, different word count.”
The fix is operational, not mathematical: define the workload, measure the dollar cost end-to-end, and rank on that. Token rates become the denominator, not the numerator. This is how every other recurring spending decision in a business gets made — cost per delivery, cost per query, cost per signed contract — and there’s no reason AI should be the one line item that gets benchmarked in raw consumption units.
What follows is a real cost-per-task table for six realistic jobs across six providers, computed from public 2026 pricing and measured token counts on actual runs. No synthetic benchmarks. Where the table flatters the headline ordering, I’ll say so. Where it inverts it, we’ll talk about why.
Six tasks, six providers
The numbers above come from real runs priced at the published API rates of Anthropic, OpenAI, Google, Mistral, and DeepSeek as of May 2026, plus a measured local run of Llama 3.3 70B at Q4_K_M on an M4 Max MacBook Pro. Every cell is token-in×rate-in + token-out×rate-out, rounded to the cent or sub-cent. Output lengths are the median across three runs on each model, so the comparison reflects each one’s native verbosity.
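If you want that cell formula as code, here is a minimal sketch. The function is the whole model; the rates in the example are hypothetical placeholders, not values from the table.

```python
# Cost of one task: tokens in and out, times the provider's per-million
# rates. Example rates below are placeholders, not price-sheet numbers.
def cost_per_task(tokens_in: int, tokens_out: int,
                  rate_in: float, rate_out: float) -> float:
    """Dollar cost of a single task; rates are dollars per million tokens."""
    return tokens_in / 1e6 * rate_in + tokens_out / 1e6 * rate_out

# The 30-page-PDF shape: 15K tokens in, ~500 tokens out, at a
# hypothetical cheap-tier $0.10/M input and $0.40/M output.
print(cost_per_task(15_000, 500, 0.10, 0.40))  # ~ $0.0017
```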
The 30-page-PDF row is the one most people have an intuition for: 15K input tokens, ~500-token executive summary out. The cloud spread is 13× from Flash to Opus. Pick a different row and the ratio changes, and so does the cheapest provider. No single provider wins every row of the table — which is exactly the point. The per-token leaderboard pretends one does.
Two patterns jump out. First, the cheap tier (Flash, Haiku, DeepSeek) wins anything where output is short. An output token costs 4–5× an input token at most providers, so any task with a small output collapses toward the input-rate gap, where cheap models run at roughly a tenth the flagship price. Second, the spread closes once the output gets fat. The 2,000-word essay row and the CSV row are dominated by output, and the cheap-model advantage shrinks from 13× to ~5×. Picking by token rate alone systematically over-rewards cheap models on exactly the long-output workloads they often handle worse.
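A quick sketch makes the compression mechanical. Assume an input-rate gap of 13× and an output-rate gap of 5×, mirroring the spreads above; the absolute rates here are illustrative, not any provider's pricing.

```python
# How the cheap-model advantage shrinks as output grows. Rate gaps
# mirror the 13x / 5x spreads in the text; absolute rates are made up.
CHEAP    = (0.10, 0.40)  # ($/M input, $/M output), hypothetical
FLAGSHIP = (1.30, 2.00)  # 13x the input rate, 5x the output rate

def cost(tokens_in, tokens_out, rates):
    rate_in, rate_out = rates
    return tokens_in / 1e6 * rate_in + tokens_out / 1e6 * rate_out

for tokens_out in (500, 10_000, 50_000):
    ratio = cost(15_000, tokens_out, FLAGSHIP) / cost(15_000, tokens_out, CHEAP)
    print(tokens_out, round(ratio, 1))  # 12.1, 7.2, 5.6: converging on 5x
```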

Reasoning models flip the ranking
The most-misread row in the table is the PR review. A 500-line diff plus a CLAUDE.md, plus a request for a structured review with bug triage and a confidence score, is a hard reasoning task. Run it on a cheap model and you get a passable list of nits. Run it on a reasoning model and you get the actual race condition someone introduced two commits ago.
Reasoning models cost more per token because they emit hidden intermediate tokens — OpenAI bills these as output tokens on the o-series, and Anthropic does the same for extended thinking on Claude Opus 4.7 and Sonnet 4.6 with thinking turned on. A single Opus 4.7 review with 8K of thinking can cost $0.75 where a Haiku review costs $0.05. That sounds catastrophic until you ask the second question: how many cheap-model reviews would you have to throw away and retry before Opus pays back? On the PRs that actually had a bug, the answer in measured runs sits between 3 and 8. Reasoning models don’t look cheaper on the per-token chart and don’t look cheaper on the cost-per-task chart for tasks they didn’t need to be on. They look dramatically cheaper for tasks where the alternative is a wrong answer you can’t catch.
The framing this implies is unfashionable but boring: match the model to the task. Daily summarization, formatting, and routine drafting runs on the cheap tier. Anything where correctness is hard to verify cheaply — security review, contract analysis, multi-step debugging, long planning — goes to the reasoning tier and the per-token premium is the price of getting it right the first time.
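In code, the policy is almost embarrassingly simple. The task categories and tier names below are illustrative assumptions, not a measured recommendation:

```python
# Match the model to the task: a deliberately boring routing table.
# Categories and tier assignments are illustrative, not measured.
CHEAP_TIER     = {"summarize", "format", "draft", "classify"}
REASONING_TIER = {"security_review", "contract_analysis",
                  "multistep_debug", "long_planning"}

def pick_tier(task_kind: str) -> str:
    if task_kind in REASONING_TIER:
        return "reasoning"  # premium is the price of first-try correctness
    if task_kind in CHEAP_TIER:
        return "cheap"      # short-output routine work
    return "mid"            # default lane for anything unclassified

print(pick_tier("security_review"))  # -> reasoning
```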
The retry tax
The retry tax is the part nobody invoices but everybody pays. A cheap model that hallucinates a wrong customer-support answer doesn’t save you $0.01; it costs you a refund, a follow-up email, and a churned account. A code-review model that misses the actual bug doesn’t save you $0.40; it costs you the production incident. These aren’t marginal effects — they swamp the per-token cost difference by orders of magnitude on the tasks where they hit.
The chart above models a single task — the PR review — with three retry assumptions. At a 0% retry rate (perfect first-pass accuracy on every run), Haiku wins on cost-per-task by a wide margin. At a 20% retry rate (one in five reviews has to be redone, which is a conservative estimate for cheap models on hard reasoning), Sonnet 4.6 pulls level. At a 50% retry rate (closer to what cheap models actually deliver on adversarial code), Opus 4.7 wins outright. None of the per-token rates changed. The workload reality did.
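To reproduce the shape of that chart yourself, one hedged model is enough: treat retries as geometric and charge a fixed cost for each failed attempt. The $0.05 Haiku and $0.75 Opus per-review figures come from above; the $2.00 of human time per failed review and the flagship's 0% retry rate are assumptions, not measurements.

```python
# Retry-adjusted cost per review. Haiku ($0.05) and Opus ($0.75) API
# costs come from the text; the $2.00 rework cost per failed attempt
# and the flagship's perfect first-pass rate are assumptions.
REWORK = 2.00  # hypothetical human cost to notice a bad review and rerun

def expected_cost(api_cost: float, retry_rate: float) -> float:
    attempts = 1.0 / (1.0 - retry_rate)         # geometric retries
    failures = retry_rate / (1.0 - retry_rate)  # expected failed attempts
    return api_cost * attempts + REWORK * failures

for p in (0.0, 0.2, 0.5):
    print(f"retry={p:.0%}  haiku=${expected_cost(0.05, p):.2f}  "
          f"opus=${expected_cost(0.75, 0.0):.2f}")
# retry=0%: Haiku wins. retry=50%: the rework tax alone dwarfs Opus.
```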
Most teams I’ve seen settle the question empirically: run the same week of work on two tiers, measure outcomes, and rank by what actually shipped. The per-token leaderboard never survives contact with this exercise. The dollar number that matters is the one stamped on the closed ticket, not the one on the API invoice.
Where the table breaks
The table is a starting point, not the final answer. Three classes of workload bend it in directions worth knowing before you ship a forecast based on it.
- Prompt caching shifts the math 50–90%. Any workload with a stable prefix — agent loops, RAG with a fixed system prompt, multi-turn chat — lands on cache reads from the second call. The effective input rate on the cached portion drops to 10% on Anthropic and Google, 50% on OpenAI. For a long-running agent that re-sends the same 50K system prompt all day, the per-task cost halves or quarters. The mechanics are written up in the prompt-caching post, so I won’t repeat them here; the implication for this table is that the cloud cells get materially cheaper the more your traffic looks like a real agent instead of one-shot prompts. The arithmetic is sketched in code after this list.
- Batch APIs cut input price in half. Anthropic, OpenAI, and Google all run async batch lanes at 50% off with a 24-hour SLA. Anything that doesn’t need a synchronous answer — nightly summarization, bulk classification, periodic report generation — should be on the batch lane. The table shows interactive pricing; for batch workloads, halve every cloud cell.
- Multimodal tasks aren’t token-priced cleanly. Image generation, speech-to-text, and video are billed per asset, per minute, or per pixel, not per token. I left the marketing-image row out of the table precisely because it would be apples to oranges next to the text rows. For a real cost model, build the multimodal columns separately and add them as fixed cost components to whichever task pulls them in.
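For the first two adjustments, the arithmetic is a one-liner. The 10%/50% cache-read multipliers and the 50% batch discount follow the numbers above; the base rate and token split in the example are hypothetical.

```python
# Effective input cost after prompt caching and batch discounts.
# Cache-read multipliers (0.10 Anthropic/Google, 0.50 OpenAI) and the
# 50% batch discount follow the text; rate and token split are made up.
def effective_input_cost(cached_tokens: int, fresh_tokens: int,
                         rate_in: float, cache_multiplier: float,
                         batch: bool = False) -> float:
    rate = rate_in * (0.5 if batch else 1.0)
    return (cached_tokens * cache_multiplier + fresh_tokens) / 1e6 * rate

# A 50K-token stable prefix plus 2K fresh tokens at a hypothetical $3.00/M:
print(effective_input_cost(0, 52_000, 3.00, 1.0))                   # $0.156 uncached
print(effective_input_cost(50_000, 2_000, 3.00, 0.10))              # $0.021 cached
print(effective_input_cost(50_000, 2_000, 3.00, 0.10, batch=True))  # $0.0105 cached+batch
```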
The local column
The most awkward column in the table is the rightmost one. Local Llama 3.3 70B at Q4 on an M4 Max draws roughly 80 W under sustained inference, which at a US grid price of about $0.13 per kilowatt-hour works out to roughly a penny per hour of electricity. A 30-page-PDF summary on local hardware completes in around 90 seconds; the electricity cost is $0.00026. The CSV cleanup runs at 4 minutes; that electricity is $0.0007. Even the long essay, at 15 minutes of sustained generation, is about a quarter of a cent.
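The conversion is worth writing down once, because the results are small enough to look like typos. The 80 W draw and $0.13/kWh rate are the figures above; the durations are the measured task times.

```python
# Electricity cost of a local run: watts x seconds -> kWh x grid rate.
# 80 W and $0.13/kWh are the figures from the text.
DRAW_W = 80.0
GRID_RATE = 0.13  # $/kWh

def electricity_cost(seconds: float) -> float:
    kwh = DRAW_W * seconds / 3_600_000  # watt-seconds to kWh
    return kwh * GRID_RATE

print(f"${electricity_cost(90):.5f}")   # PDF summary, ~$0.00026
print(f"${electricity_cost(240):.5f}")  # CSV cleanup, ~$0.00069
print(f"${electricity_cost(900):.4f}")  # long essay, ~$0.0026
```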
Rolling these into the table at face value would make every row look like Llama wins by four orders of magnitude. That isn’t the honest claim. The honest claim is: marginal cost is approximately zero, but the amortized cost of the hardware is real, and the quality-per-task on hard reasoning lags the cloud frontier by roughly six months. We did the cost-crossover arithmetic in the personal-compute post: medium-traffic users cross over inside two years, heavy users inside three months. For everyone past that crossover, the cost-per-task number is functionally zero on every routine task — the electricity is below the noise floor of the AWS bill.
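The crossover itself is division, not modeling. The $4,000 hardware price and the monthly spends below are illustrative assumptions chosen to land near the claims above, not figures from the personal-compute post.

```python
# Months until local hardware pays for itself against a cloud bill.
# Hardware price and monthly spends are illustrative assumptions;
# electricity is charged at a token $1/month.
def crossover_months(hardware_cost: float, monthly_cloud_spend: float,
                     monthly_electricity: float = 1.0) -> float:
    return hardware_cost / (monthly_cloud_spend - monthly_electricity)

print(round(crossover_months(4_000, 200), 1))    # ~20 months, medium traffic
print(round(crossover_months(4_000, 1_500), 1))  # ~2.7 months, heavy traffic
```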
The strategic point isn’t local versus cloud. It’s that once the local column exists at all on your task table, the right decision for any given workload stops being “cheapest cloud provider” and becomes “can this task run locally, and if not, which cloud provider wins on cost-per-task for the subset that can’t?” The hybrid stack is the cheapest stack in every row of the table.
What to change Monday morning
The smallest concrete change is to redo your AI budget on a real cost-per-task table instead of a token-rate slide. The work is half a day:
- List five to ten actual tasks — not benchmark tasks, not synthetic ones, the things your team or your product actually does. A few real prompts and a representative input are enough; perfect coverage isn’t the goal.
- Run each task through three tiers per provider — cheap, mid, flagship — and record the input tokens, output tokens, latency, and a subjective quality score on a 1–5 scale. The whole sweep takes a couple hundred dollars on cloud and a few hours of clock time.
- Compute cost-per-task at each tier using current public rates. Tag any tier-task pair below your quality threshold (call it 4/5) as ineligible; the cheapest eligible pair is your recommended routing for that task. The selection logic is sketched in code after this list.
- Layer caching and batching on top last, not first. The point is to choose the right model before discounting it. Discounting the wrong model is how teams end up locked into cheap tiers that quietly fail half their hard tasks.
- Add the local column for tasks you run frequently. Even if you don’t deploy local immediately, the column is honest about where the floor is and which cloud bills are worth paying.
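The compute-and-filter step is the only one that needs logic rather than arithmetic, and it fits in a dozen lines. The field names, quality bar, and sample rows here are hypothetical:

```python
# Cheapest-eligible routing: drop tier/task pairs below the quality bar,
# then take the cheapest survivor. Sample rows are hypothetical.
from dataclasses import dataclass

@dataclass
class Run:
    task: str
    tier: str
    cost: float    # measured dollars per task
    quality: int   # subjective 1-5 score

QUALITY_BAR = 4

def recommend(runs: list[Run], task: str) -> Run:
    eligible = [r for r in runs if r.task == task and r.quality >= QUALITY_BAR]
    return min(eligible, key=lambda r: r.cost)

runs = [Run("pr_review", "cheap", 0.05, 3),
        Run("pr_review", "mid", 0.25, 4),
        Run("pr_review", "flagship", 0.75, 5)]
print(recommend(runs, "pr_review").tier)  # -> mid
```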
The first time most teams do this, the savings show up not as a provider switch but as a routing change inside the same provider — sending the easy 80% of tasks to Haiku or Flash and reserving Sonnet or Pro for the hard 20%. The second-order effect is more valuable: once the conversation is about cost-per-task, vendor comparisons stop being about token rates and start being about outcomes. That’s the column header your AI bill deserves.