What one hour of serious AI use actually costs
I priced a real mixed hour of AI — chat, images, video, voice — across the providers people use. The total was $13.57, and 88% of it was one clip.
Per-token pricing is useless to a human being. Nobody wakes up wanting to buy a million tokens. They want to answer twenty questions, read a long contract, make a few images, cut a short video, narrate a script, and transcribe a meeting — an hour of real, mixed work. So I priced exactly that. One hour of serious, multi-modal AI use, run against the public 2026 rates of the providers most people actually reach for. The total came to $13.57. The surprise wasn’t the size of the bill. It was the shape of it.
An hour costs about $13.57 — and you’ll guess wrong about why
Ask anyone who hasn’t looked at an API invoice to guess where the money goes in an hour of AI work, and they say “the chatbot.” It’s the part they touch most, so it feels expensive. In fact it’s about a twentieth of the receipt. Twenty back-and-forth chat exchanges plus a fifty-thousand-token document analysis, the entire thinking workload of the hour, cost about seventy cents on a flagship model. The eight images cost a third of a dollar. The voiceover and the transcription, together, cost fifty-one cents.
Then you generate two minutes of video, and the bill jumps to thirteen and a half dollars. One clip. A single short asset costs more than everything else in the hour combined, by a factor of eight. The hour isn’t expensive. The hour has one expensive thing in it, and it’s the thing you spent the least time on.
Put it in human terms. Everything you’d call “the work” — the thinking, the reading, the pictures, the talking, the listening — costs about $1.57, less than a vending-machine coffee. The two-minute video costs the rest, about as much as lunch. If that ratio feels backwards, it’s because the parts that feel effortful to a person are nearly free to a machine, and the part that feels like magic is the one still priced like a film crew. The whole post is an argument for internalizing that one inversion before you ever read an invoice.
What “one hour of serious use” actually means
A per-hour number is only as honest as the workload behind it, so here’s the whole definition, in the open, before any prices. This is a deliberately mixed hour — the kind a solo founder, a marketer, or a curious power user might actually put together in an afternoon, touching every modality instead of living in the chat box. Challenge the mix if you like; that’s the point of publishing it.
These are conservative, middle-of-the-road quantities. The chat figure assumes 800 input and 600 output tokens per exchange — a real conversation with context, not one-liners. The long-context job is a single fifty-page document summarized to a page. The video is two minutes at 720p, which by 2026 standards is a modest ask. Nothing here is cherry-picked to flatter or inflate any provider. The workload is fixed; only the price tags move.
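For anyone who wants to rerun the numbers, the workload fits in a few lines of Python. This is a sketch rather than the spreadsheet behind the post: the class and field names are illustrative, and the 1,000-token summary length is an assumption standing in for "a page."

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HourWorkload:
    """The fixed mixed hour, by modality (names are illustrative)."""
    chat_exchanges: int = 20
    chat_input_tokens: int = 800       # per exchange
    chat_output_tokens: int = 600      # per exchange
    doc_input_tokens: int = 50_000     # the fifty-page document
    doc_summary_tokens: int = 1_000    # ASSUMPTION: "a page" of summary
    images: int = 8                    # 1024-pixel images
    video_seconds: int = 120           # two minutes at 720p
    narration_minutes: int = 5
    transcription_minutes: int = 10

    @property
    def input_tokens(self) -> int:
        # All text sent to the model during the hour.
        return self.chat_exchanges * self.chat_input_tokens + self.doc_input_tokens

    @property
    def output_tokens(self) -> int:
        # All text the model writes back during the hour.
        return self.chat_exchanges * self.chat_output_tokens + self.doc_summary_tokens


hour = HourWorkload()
print(f"input tokens:  {hour.input_tokens:,}")
print(f"output tokens: {hour.output_tokens:,}")
```

Freezing the dataclass mirrors the rule of the exercise: the workload stays put and only the rate card changes.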
Three rules keep the comparison honest. Every price is the public, list-rate, pay-as-you-go cost — no committed-spend discounts, no free tiers, no promotional credits. Every job is a single shot with no prompt caching and no async batch lane, both of which can halve or quarter the cloud cost but only for workloads shaped to use them. And every number traces to a vendor page I actually pulled, linked in the sections below. If your real hour reuses a long system prompt all day, or runs overnight on a batch queue, your bill will be lower than these figures — the point here is the unoptimized baseline everyone starts from.

The 2-minute video clip is 88% of the bill
Here is the entire post in one chart. Six modalities, one hour, and a single bar that one segment swallows whole.
At OpenAI’s published rate of $0.10 per second, Sora 2 turns two minutes of 720p video into a $12 line item. That is not an outlier. Google’s Veo 3.1 Standard runs $0.40 a second — $48 for the same two minutes. The cheaper end of the market doesn’t rescue you either: Kuaishou’s Kling 2.5 Turbo on fal.ai is $0.07 a second, and Runway’s Gen-4 Turbo bottoms out around $0.05. Even at the floor, two minutes of video costs six dollars — still more than every other modality in the hour put together.
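The per-second arithmetic is simple enough to check by hand, but a small sketch makes it easy to swap in your own clip length. The rates are the four quoted above; the dictionary and function names are illustrative.

```python
# Per-second list rates quoted in this section (USD).
VIDEO_RATES_USD_PER_SECOND = {
    "Sora 2": 0.10,
    "Veo 3.1 Standard": 0.40,
    "Kling 2.5 Turbo (fal.ai)": 0.07,
    "Runway Gen-4 Turbo": 0.05,
}

CLIP_SECONDS = 120  # the two-minute clip in the workload


def clip_cost(rate_per_second: float, seconds: int = CLIP_SECONDS) -> float:
    """Cost of one clip at a flat per-second rate, rounded to cents."""
    return round(rate_per_second * seconds, 2)


clip_costs = {
    model: clip_cost(rate) for model, rate in VIDEO_RATES_USD_PER_SECOND.items()
}

for model, cost in clip_costs.items():
    print(f"{model:26s} ${cost:6.2f}")
```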
The reason is physical, not arbitrary. A second of generated video is dozens of generated frames (24 to 30 at common frame rates), each one effectively an image, with temporal coherence stitched across them, so a two-minute clip runs to thousands of frames. Video pricing is image pricing multiplied by a frame rate and a coherence tax. The unit of video isn’t the token; it’s the second, and the second is expensive because it contains so much. Until that math changes, video will dominate any mixed AI bill that contains it, which means the single most important budgeting question isn’t which chat model you picked. It’s whether this hour generates video at all.
Scale it down and the trap gets clearer. A single eight-second shot — the length of one cut in a social ad — runs about $0.80 on Sora 2 and $3.20 on Veo Standard. That feels cheap, so you generate it five times to get the take you want, and now one usable shot cost $4 to $16. A two-minute sequence is just fifteen of those shots, and the re-rolls are where the real money hides. Generated video is the one modality where “just try a few more” is a budgeting decision, not a free habit. The text and image modalities forgave that instinct years ago; video still charges studio rates for the privilege of changing your mind.
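The re-roll math can be sketched the same way. The shot length, take count, and rates come from the paragraph above; the function names are illustrative, and the model assumes every take is billed at the full per-second rate, which may not match how a given vendor treats failed or cancelled generations.

```python
SHOT_SECONDS = 8          # one cut in a social ad
SEQUENCE_SECONDS = 120    # the two-minute clip

RATES_USD_PER_SECOND = {"Sora 2": 0.10, "Veo 3.1 Standard": 0.40}


def shot_cost(rate_per_second: float, takes: int = 1) -> float:
    """Cost of one usable shot if it takes `takes` generations to get it."""
    return round(rate_per_second * SHOT_SECONDS * takes, 2)


def sequence_cost(rate_per_second: float, takes_per_shot: int) -> float:
    """Cost of a full sequence when every shot needs the same number of takes."""
    shots = SEQUENCE_SECONDS // SHOT_SECONDS  # 15 shots of 8 seconds
    return round(shots * rate_per_second * SHOT_SECONDS * takes_per_shot, 2)


for model, rate in RATES_USD_PER_SECOND.items():
    print(
        f"{model}: one take ${shot_cost(rate):.2f}, "
        f"five takes ${shot_cost(rate, takes=5):.2f}, "
        f"two minutes at five takes per shot ${sequence_cost(rate, 5):.2f}"
    )
```

The last figure is the one worth watching: the take count multiplies the whole sequence, not just one shot.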

Text got so cheap it rounds to zero
The thinking part of the hour — twenty real chat exchanges plus a fifty-page document analysis — is the part people assume costs the most. Across the six most-used text providers, the entire text workload of the hour costs somewhere between four cents and seventy-two cents. Not per query. For all of it.
DeepSeek prices the whole hour of text at four cents. xAI’s Grok 4.3 comes in at twelve. Gemini 3.1 Pro is twenty-nine cents. Even the top of the range — OpenAI’s GPT-5.5 at $5/$30 per million tokens, or Claude Opus at $5/$25 — lands under seventy-five cents for the hour. The spread between the cheapest and priciest text provider is 18×, which sounds dramatic until you notice that 18× of nothing is still nothing. The gap between the cheapest and dearest text model is sixty-eight cents. The video clip costs $12. Choosing your chat model on price is optimizing the rounding error.
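Here is a rough reconstruction of the top of that range. The per-million-token rates are the two quoted above; the token totals follow from the workload definition, except for the summary length, which is assumed to be 1,000 tokens. The DeepSeek figure is taken as quoted rather than derived, since its rates are not listed here.

```python
INPUT_TOKENS = 20 * 800 + 50_000   # chat context plus the document
OUTPUT_TOKENS = 20 * 600 + 1_000   # chat replies plus an ASSUMED 1,000-token summary


def text_cost(in_rate_per_mtok: float, out_rate_per_mtok: float) -> float:
    """USD for the hour's text at the given per-million-token rates."""
    return (
        INPUT_TOKENS * in_rate_per_mtok + OUTPUT_TOKENS * out_rate_per_mtok
    ) / 1_000_000


gpt = text_cost(5.00, 30.00)    # GPT-5.5 at $5/$30
opus = text_cost(5.00, 25.00)   # Claude Opus at $5/$25

CHEAPEST_TEXT_HOUR = 0.04       # DeepSeek figure as quoted, not derived here

print(f"GPT-5.5:     ${gpt:.3f}")
print(f"Claude Opus: ${opus:.3f}")
print(f"spread vs cheapest: {gpt / CHEAPEST_TEXT_HOUR:.0f}x")
print(f"gap vs cheapest:    ${gpt - CHEAPEST_TEXT_HOUR:.2f}")
```

Changing the assumed summary length moves the result by a cent or two at most, which is rather the point.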
This is worth sitting with, because it inverts the instinct most teams carry into AI budgeting. The conversation has spent two years arguing about which frontier chat model is cheapest per token. That argument was always real but never large. Pick the text model that does your job best; the price difference will not show up on the receipt for an hour like this one. We made the broader version of this case in cost-per-task, not cost-per-token — per-hour pricing is the same argument pointed at your whole workflow instead of one job.
Images and audio tell the same story in a different currency. Eight 1024-pixel images run $0.09 on OpenAI’s cheapest tier, $0.32 on Black Forest Labs’ Flux 1.1 Pro or Google’s standard Imagen 4, and at most $1.34 if you splurge on the high tier. Five minutes of narration is $0.07 on Google’s cheap text-to-speech and $0.45 on ElevenLabs; ten minutes of transcription is three to six cents on AssemblyAI or OpenAI’s Whisper. Add the most expensive option in every non-video category and the hour’s images, voice, transcription, and text together still come to under three dollars. There is exactly one line on this receipt that can ruin your budget, and you already know which one it is.
The spread is 8× — and it’s almost all your video model
Run the same fixed hour three ways — cheapest credible options, popular defaults, premium picks — and the total swings from $6.17 to $50.84. An 8× spread on an identical workload. But look at where the swing lives.
In all three baskets, video is between 88% and 97% of the total. Subtract it, and the “cheapest” and “premium” hours that looked 8× apart collapse to $0.17 versus $2.84: both under three dollars, both rounding error next to a single clip. Of the $44.67 gap between the cheap basket and the premium basket, $42 is the difference between Runway’s $6 of video and Veo’s $48. Every other modality choice you agonize over (flagship versus mini chat model, premium versus standard image tier, ElevenLabs versus Google’s cheap TTS tier) moves the hour by cents. The video model moves it by tens of dollars.
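The video share of each basket can be recomputed from the totals alone. The totals and video costs below are the ones quoted in this post; treating the $13.57 hour as the "popular defaults" basket is an assumption, and the dictionary layout is illustrative.

```python
# (total, video) in USD for the same fixed hour priced three ways.
BASKETS = {
    "cheapest": {"total": 6.17, "video": 6.00},    # Runway Gen-4 Turbo video
    "defaults": {"total": 13.57, "video": 12.00},  # ASSUMED to be the Sora 2 hour
    "premium": {"total": 50.84, "video": 48.00},   # Veo 3.1 Standard video
}


def video_share(basket: dict) -> float:
    """Fraction of the hour's bill that is video."""
    return basket["video"] / basket["total"]


def non_video(basket: dict) -> float:
    """Everything else in the hour, rounded to cents."""
    return round(basket["total"] - basket["video"], 2)


for name, basket in BASKETS.items():
    print(
        f"{name:9s} total ${basket['total']:6.2f}  "
        f"video {video_share(basket):5.1%}  rest ${non_video(basket):.2f}"
    )

spread = BASKETS["premium"]["total"] / BASKETS["cheapest"]["total"]
print(f"spread: {spread:.1f}x")
```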
The practical takeaway is almost embarrassingly simple. If your hour has no video in it, every provider on the market prices it at under three dollars, and you should choose on quality and privacy, not price. If your hour does have video, that one decision is your entire budget, and it’s the only price you need to shop hard.

Where the per-hour math breaks
A single number is a useful handle, not a universal law. The $13.57 holds for one person doing a mixed hour at list prices. Four things bend it, and they bend it in opposite directions depending on who you are.
- For individuals, the hour is roughly fixed. You can’t negotiate, you pay list, and the per-hour cost is stable enough to budget against. Pick on quality and the modalities you actually use; the price will follow the workload, not the vendor.
- For teams, it compounds linearly and starts to matter. Ten people each doing one video-heavy hour per working day is a four-figure monthly line, and ten people doing them all day is a five-figure one. The lever isn’t the chat model; it’s a video budget and a rule about when a generated clip is worth $12. Async work helps too: the major vendors cut prices roughly in half for jobs that can wait a few hours, which we covered in most AI work can wait.
- For enterprises, the chart stops applying. Committed spend, negotiated rates, and private deployments break public list pricing entirely. The per-hour number is a baseline to negotiate down from, not the price you’ll pay.
- For local-first users, the hour bends toward zero. The text, image, and audio modalities — everything but the video clip — already run on consumer hardware for the cost of electricity. That’s the $1.57 of the receipt that quietly disappears when the model lives on your machine. We did the cost-crossover arithmetic in personal compute is back; video is the one modality still firmly worth renting.
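To see how the team case compounds, here is a minimal sketch of the linear scaling. The $13.57 hour is the figure from this post; the 21 working days per month and the function name are assumptions, and the result is a list-price baseline with no batch or committed-spend discounts applied.

```python
HOUR_COST_USD = 13.57      # the mixed hour at list prices
WORKDAYS_PER_MONTH = 21    # ASSUMPTION: a typical working month


def monthly_team_cost(
    people: int,
    hours_per_day: float,
    hour_cost: float = HOUR_COST_USD,
    workdays: int = WORKDAYS_PER_MONTH,
) -> float:
    """List-price monthly bill if every person runs the same mixed hour."""
    return round(people * hours_per_day * workdays * hour_cost, 2)


one_hour_each = monthly_team_cost(people=10, hours_per_day=1)
all_day = monthly_team_cost(people=10, hours_per_day=8)

print(f"10 people, 1 hour a day:  ${one_hour_each:,.2f}")
print(f"10 people, 8 hours a day: ${all_day:,.2f}")
```

The order of magnitude depends almost entirely on the hours-per-day input, so it is worth measuring that before arguing about vendors.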
How to price your own hour
You don’t need my workload — you need yours. The exercise takes ten minutes and changes how you read every AI invoice afterward.
- Write down a real hour of your own work, by modality. How many chats, how many images, how many seconds of video, how many minutes of audio. Be honest about the video; it’s the only number that will dominate the result.
- Price the non-video modalities once and stop worrying about them. Text, images, and audio for a normal hour land under three dollars on every provider. Choose them on quality, latency, and where your data goes, not on a rate card.
- Shop the video model hard, or cut the video. This is the entire budget. Halving your seconds of generated video saves more than switching every other tool you use combined. A $6 model versus a $48 model on the same clip is the only price negotiation that pays.
- Move what can wait to a batch lane, and what can run locally off the meter. Async batch halves the cloud cost of anything that doesn’t need an instant answer; local inference takes the non-video modalities to electricity. Stack both and the hour’s payable portion shrinks to the video alone.
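The steps above can be sketched as a tiny calculator. Everything here is illustrative: the function name and dictionary keys are made up, the per-unit rates are back-derived from totals quoted in this post (for example, $0.32 for eight images gives $0.04 each), and the sample hour is hypothetical. Swap in your own quantities and your own vendors' rates.

```python
def price_hour(quantities: dict[str, float], rates: dict[str, float]) -> dict[str, float]:
    """Cost per modality in USD; `rates` holds the price of one unit of each key."""
    missing = quantities.keys() - rates.keys()
    if missing:
        raise KeyError(f"no rate for: {sorted(missing)}")
    return {key: quantities[key] * rates[key] for key in quantities}


# Illustrative rate card, derived from figures quoted in this post.
RATES = {
    "input_mtok": 5.00,             # per million input tokens (flagship chat)
    "output_mtok": 30.00,           # per million output tokens
    "image": 0.04,                  # $0.32 / 8 images
    "video_second": 0.10,           # Sora 2 list rate
    "narration_minute": 0.09,       # $0.45 / 5 minutes
    "transcription_minute": 0.006,  # $0.06 / 10 minutes
}

# HYPOTHETICAL hour: lighter on chat and video than the one in this post.
my_hour = {
    "input_mtok": 0.02,
    "output_mtok": 0.01,
    "image": 4,
    "video_second": 30,
    "narration_minute": 0,
    "transcription_minute": 20,
}

breakdown = price_hour(my_hour, RATES)
total = sum(breakdown.values())
biggest = max(breakdown, key=breakdown.get)

for key, cost in sorted(breakdown.items(), key=lambda item: -item[1]):
    print(f"{key:22s} ${cost:5.2f}  {cost / total:5.1%}")
print(f"total ${total:.2f}, dominated by {biggest}")
```

Even with the video cut to thirty seconds, it remains the largest line in this sample hour, which is the pattern to look for in your own.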
The headline number is a conversation-starter, not a verdict: one hour of serious AI use costs about $13.57 at 2026 list prices. But the number you should actually remember is the shape underneath it. An hour of AI thinking, seeing, listening, and speaking costs about a dollar and a half. An hour with two minutes of video in it costs nearly nine times that. Price the hour you actually work, and the budget writes itself: it’s the clip, and almost nothing else.


