Explainer · Tokens · Economics · June 10, 2026 · 11 min read

What is a token? The thing your AI actually reads, and bills you for

Why does AI say “strawberry” has two r’s? It can’t see letters, only tokens. The one unit behind its mistakes and your bill.

What the model actually reads

10 letters · 3 tokens

You see letters. The model sees tokens.

What you type: strawberry

What the model gets: st · raw · berry

Three of those letters are an r. The model is asked to count them in a row it was never handed. It only ever received st, raw, berry.

Ask almost any chatbot how many times the letter r appears in “strawberry,” and for a long stretch of 2024 and 2025 it answered “two.” The correct answer is three. People posted the screenshots as proof that AI was dumb: a machine that can write a sonnet but can’t count to three.

It isn’t dumb. It is, in a precise and fixable way, blind. The model never received the word as the letters s-t-r-a-w-b-e-r-r-y. It received three chunks (st, raw, berry) and was asked to count something inside a thing it cannot see. Those chunks have a name, and once you understand it, a surprising amount of AI stops being mysterious: the dumb mistakes, the strange bills, the reason a sentence in Hindi costs more than the same sentence in English.

A token is the basic unit of text an AI model reads, writes, and bills you for: usually a chunk of a few characters, not a whole word and not a single letter. Everything a language model does, it does in tokens. This post is the plain-English tour: what a token is, why it explains the famous failures, where tokens come from, and why they sit on every invoice you will ever pay for AI.

Your AI reads in tokens, not letters

Humans read text as letters that build into words. A language model doesn’t. Before a single word of your prompt reaches the model, a step called tokenization chops it into tokens: short, common chunks of characters drawn from a fixed list that was set once, before training, and never changes. The model only ever sees that stream of tokens. Letters, as such, are not its alphabet.

How big is a token? In English, the rough rule is about four characters, or three-quarters of a word. Google states it plainly in its developer docs: “a token is equivalent to about 4 characters,” and “100 tokens is equal to about 60–80 English words.” A short, common word like “cat” is one token. A longer or rarer word is several. The chunking isn’t random; it follows how often the pieces show up in ordinary text.

That single design choice is the hinge this whole post turns on. It is why the model is fluent and fast, why it costs what it costs, and why it trips over tasks a six-year-old finds trivial. Keep the picture in your head: you hand the machine words; it quietly swaps them for tokens before it reads a thing.

The conversion you can do in your head

Characters: ~4 characters → 1 token

Words: 1 word → ~1.3 tokens

This post: ~2,000 words → ~2,700 tokens

Rules of thumb for English, from Anthropic and Google: roughly 4 characters or three-quarters of a word per token. Other languages run far less efficiently. More on that below.
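The rules of thumb above fit in a few lines of code. This is a rough estimator for English text only, not a real tokenizer; the function names are made up for illustration, and actual counts depend on the model's tokenizer.

```python
import math

# Rules of thumb for English quoted above. Real tokenizers will differ.
CHARS_PER_TOKEN = 4
WORDS_PER_TOKEN = 0.75  # i.e. roughly 1.3 tokens per word


def estimate_tokens_from_chars(text: str) -> int:
    """Estimate tokens from character count (about 4 characters per token)."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def estimate_tokens_from_words(word_count: int) -> int:
    """Estimate tokens from word count (about 0.75 words per token)."""
    return round(word_count / WORDS_PER_TOKEN)


print(estimate_tokens_from_words(2_000))  # 2667, in line with the ~2,700 above
print(estimate_tokens_from_chars("You see letters. The model sees tokens."))
```

Treat the result as a sanity check, not a bill: the only exact count comes from the provider's own tokenizer.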

Why it can’t count the r’s in strawberry

Now the strawberry mystery solves itself. To the GPT-4o tokenizer, the word splits into three tokens (st, raw, and berry), a breakdown documented in a 2024 paper, “Why Do Large Language Models Struggle to Count Letters?” The model is handed those three chunks and asked how many r’s are inside. The letters were dissolved into the chunks before it ever looked. It is guessing at a quantity it was structurally prevented from seeing.

This is the root of a whole family of failures. Reversing a word, spotting a simple letter-swap cipher, splitting a word into syllables, rhyming on an exact ending, doing arithmetic digit by digit: anything that needs the characters inside a token tends to wobble, because that information was compressed away. A second paper, from 2025, “The Strawberry Problem,” puts it cleanly: tokenization “severs the connection between words and their characters.”

Researchers are honest that tokenization isn’t the only culprit: the counting paper argues the attention math and the model’s internal sizing also limit how high it can reliably count. But the telling proof is the fix: when the Strawberry Problem team bolted on a mechanism that lets a token look back at its own characters, near-zero accuracy on letter tasks shot up. Give the model eyes for letters and the problem clears. The blindness was the cause.
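To make the blindness concrete, here is a small sketch of the two views of the same word. The three chunks are the split quoted above; the numeric IDs are invented for illustration and are not real vocabulary entries.

```python
word = "strawberry"
chunks = ["st", "raw", "berry"]  # the split quoted above

# The chunks are just the word cut into pieces.
assert "".join(chunks) == word

# Hypothetical token IDs, made up for this example.
vocab = {"st": 101, "raw": 102, "berry": 103}
token_ids = [vocab[chunk] for chunk in chunks]

# What the model is handed: three opaque numbers, no letters in sight.
print(token_ids)  # [101, 102, 103]

# What ordinary code sees: characters, so counting is trivial.
print(word.count("r"))  # 3

# A per-chunk character lookup, which is the information the model lacks.
r_per_chunk = [chunk.count("r") for chunk in chunks]
print(r_per_chunk, sum(r_per_chunk))  # [0, 1, 2] 3
```

Nothing in the list of IDs says that 103 contains two r’s. The model has to have memorized that fact about the chunk, or be given some way to look inside it.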

Image: a close-up cluster of ripe red strawberries. You can count the seeds. The model can’t count the r’s: it never receives the word one letter at a time. Photo by Jez Timms on Unsplash.

Where tokens come from: a vocabulary of pieces

Tokens aren’t invented on the fly. Before the model is trained, its builders fix a vocabulary (typically tens of thousands of tokens, and a couple hundred thousand in some recent models) using an algorithm called byte-pair encoding, or BPE. The idea is almost childishly simple. Start with the smallest pieces, then keep merging whichever pair shows up most often into a new, bigger piece, over and over, until you hit your target vocabulary size.

Hugging Face’s tokenizer course walks through it: the first merges glue two characters together; as training goes on, the merges build longer and longer subwords. The most common words end up as a single token because they earned it by frequency. Rare words never get their own slot, so they’re spelled out from smaller pieces, which is exactly why “strawberry,” common as a fruit but rare as a string of characters, lands as three chunks instead of one.
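The merge loop is short enough to sketch. This is a toy character-level version trained on a made-up word list, not any production tokenizer; the function names and the tiny corpus are illustrative assumptions.

```python
from collections import Counter


def merge_pair(symbols: tuple, pair: tuple) -> tuple:
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def train_bpe(words: list[str], num_merges: int) -> list[tuple]:
    """Learn merge rules by repeatedly merging the most frequent adjacent pair."""
    corpus = Counter(tuple(word) for word in words)  # start from single characters
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


def tokenize(word: str, merges: list[tuple]) -> list[str]:
    """Apply the learned merges, in the order they were learned, to a new word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy corpus: "the" is frequent, "zebra" is rare.
corpus_words = ["the"] * 5 + ["then"] * 2 + ["zebra"]
merges = train_bpe(corpus_words, num_merges=2)
print(tokenize("the", merges))    # ['the']
print(tokenize("zebra", merges))  # ['z', 'e', 'b', 'r', 'a']
```

After only two merges, the frequent word has earned a single token while the rare one is still spelled out piece by piece, which is the same pattern the paragraph above describes at full scale.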

Modern GPT-style models use a byte-level variant. The base vocabulary is the 256 possible bytes, which guarantees there is no input (no emoji, no obscure script, no typo) that the tokenizer chokes on. Anything unfamiliar simply falls back to more, smaller tokens. A tokenizer, in other words, is a printer’s type case: a fixed drawer of reusable pieces, big ones for the words you use constantly and a pile of little ones to set anything else.
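The byte-level fallback is easy to see with plain UTF-8. The sketch below shows only the base layer, where every input becomes a sequence of values from 0 to 255; a real byte-level tokenizer then merges those bytes into larger tokens. The helper name is made up for illustration.

```python
def to_byte_tokens(text: str) -> list[int]:
    """Worst-case fallback: one base token per UTF-8 byte (values 0-255)."""
    return list(text.encode("utf-8"))


for sample in ["cat", "é", "🍓"]:
    ids = to_byte_tokens(sample)
    # Every value is in the 256-entry base vocabulary, so nothing is "unknown".
    assert all(0 <= b < 256 for b in ids)
    print(repr(sample), len(ids), ids)

# The bytes always decode back to the original text.
assert bytes(to_byte_tokens("🍓")).decode("utf-8") == "🍓"
```

The plain ASCII word needs three bytes, the accented letter two, and the single emoji four. Unfamiliar input never fails; it just costs more pieces.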

Image: a wooden tray of metal movable-type printing blocks. A tokenizer is a type case: a fixed drawer of reusable pieces you assemble every text from. Common pieces are big; rare ones get spelled out from small ones. Photo by Darren Ee on Unsplash.

The token is the unit on your invoice

Here is where tokens stop being trivia and start touching your wallet. Every major AI provider prices by the token, and they split the bill in two: tokens you send in (your prompt) and tokens the model writes back (its answer). The Claude API price list reads in dollars per million tokens: Haiku 4.5 at $1 in and $5 out, Sonnet 4.6 at $3 and $15, Opus 4.8 at $5 and $25. Notice the pattern: on every model in that list, output costs five times input.

Translate that into work. A 500-word email is about 667 tokens. The 4,000-word report below is roughly 5,300. Multiply by the rate and the sums are tiny per call, which is the whole trap: they’re invisible until you run the job ten thousand times. This is why the smart way to compare models is by the cost of a finished task, not the cost per token: a pricier model that answers in fewer tokens can come out cheaper on the actual job.

One job, three price tags

Summarize a 4,000-word report into 300 words: about 5,300 tokens in, 400 tokens out.

Model · $/M in · $/M out · This job

Haiku 4.5 · $1 · $5 · ~$0.007

Sonnet 4.6 · $3 · $15 · ~$0.022

Opus 4.8 · $5 · $25 · ~$0.037

Estimated from Claude API prices (June 2026). Output tokens cost five times input on every model, so the short summary pulls more weight than its length suggests.
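The arithmetic behind those price tags can be checked in a few lines. The rates are the ones quoted in this post; the dictionary and function names are illustrative, and real invoices may include items this sketch ignores.

```python
# Dollars per million tokens (input, output), as quoted in this post.
PRICES = {
    "Haiku 4.5": (1.00, 5.00),
    "Sonnet 4.6": (3.00, 15.00),
    "Opus 4.8": (5.00, 25.00),
}


def job_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in dollars of one call: input and output are billed separately."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


# The summarization job: about 5,300 tokens in, 400 tokens out.
for model in PRICES:
    one = job_cost(model, 5_300, 400)
    print(f"{model}: ${one:.4f} per job, ${one * 10_000:,.0f} per 10,000 jobs")
```

At one call the numbers round to fractions of a cent. At ten thousand calls the same job runs from about $73 to about $365, which is where the per-task comparison starts to matter.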

One number should make the point stick. When Anthropic shipped a new tokenizer with Opus 4.7, it noted the change “may use up to 35% more tokens for the same fixed text.” Same words, same model family, up to about a third more tokens (and, at the same per-token price, up to a third more cost) from a quiet swap in how text gets chopped. Tokens are the meter. If you can’t see them, you can’t see the bill. (The cheapest tokens of all are the ones you reuse: a cache hit costs a tenth of a fresh input token.)
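Both effects in that paragraph are simple multipliers, sketched below under stated assumptions: the 35% figure is treated as a worst case, the cache discount is the one-tenth quoted above, and the $5 per million input rate is the Opus figure from this post. The function names are made up, and real caching has rules this ignores.

```python
def inflate(tokens: int, extra: float = 0.35) -> int:
    """Worst-case token count after a tokenizer change (up to 35% more)."""
    return round(tokens * (1 + extra))


def input_cost(tokens: int, price_per_million: float,
               cached_fraction: float = 0.0, cache_discount: float = 0.10) -> float:
    """Input cost in dollars when part of the prompt is served from cache."""
    cached = tokens * cached_fraction
    fresh = tokens - cached
    return (fresh + cached * cache_discount) * price_per_million / 1_000_000


base = 5_300                # tokens for the same fixed text, old tokenizer
worst = inflate(base)       # same text, new tokenizer, worst case
print(base, worst)          # 5300 7155

print(f"${input_cost(base, 5.0):.5f}")                        # fresh, old count
print(f"${input_cost(worst, 5.0):.5f}")                       # fresh, inflated
print(f"${input_cost(worst, 5.0, cached_fraction=1.0):.5f}")  # fully cached
```

The per-token price never moved in this example. Only the count did, and a fully cached prompt more than cancels the inflation.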

In another language, the same sentence costs more

Tokenizers learn their pieces mostly from English-heavy text, so English gets the efficient, single-token words. Other languages get the leftovers. The same sentence, translated, breaks into more tokens, sometimes far more. A widely cited 2023 study by Petrov and colleagues measured the gap at up to 15× across languages for identical meaning.

That gap is not academic. Because you pay per token, a user writing in Burmese or Telugu can be charged several times what an English user pays for the same request. They also wait longer, since more tokens take more time to generate, and they hit the context-window ceiling sooner because their text eats more of it. The token is a unit of cost, speed, and capacity all at once, and it isn’t handed out evenly.

Same meaning, more tokens · cost relative to English

English: 1× baseline

French / Spanish: ~2×

Hindi / Arabic: up to ~5×

Burmese / Shan: up to ~15×

Representative ranges from Petrov et al. (2023), which measured up to 15× more tokens for the same text across languages. More tokens means a bigger bill, slower replies, and less that fits in the context window.
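Here is what those multipliers do to a single request. The figures are the representative upper bounds from the chart above, not measurements for any particular tokenizer, and the $3 per million input rate is the Sonnet price quoted earlier in this post; the names are illustrative.

```python
# Representative upper-bound multipliers relative to English, from the chart above.
TOKEN_MULTIPLIER = {
    "English": 1,
    "French / Spanish": 2,
    "Hindi / Arabic": 5,
    "Burmese / Shan": 15,
}

INPUT_PRICE_PER_MILLION = 3.00  # dollars; Sonnet 4.6 input rate quoted in this post


def tokens_for(language: str, english_tokens: int) -> int:
    """Approximate token count for the same text in another language."""
    return english_tokens * TOKEN_MULTIPLIER[language]


english_tokens = 667  # the 500-word email from earlier
for language in TOKEN_MULTIPLIER:
    tokens = tokens_for(language, english_tokens)
    cost = tokens * INPUT_PRICE_PER_MILLION / 1_000_000
    print(f"{language}: {tokens:,} tokens, ${cost:.4f}")
```

The price per token is identical on every row. The meaning is identical too. Only the token count changes, and the cost, latency, and context usage scale with it.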

It compounds with the length problem, too: longer text means more tokens, and the more tokens you stuff in, the less reliably models use the ones in the middle. That is the gap between an advertised context window and a usable one. Every one of these limits is denominated in the same currency.

Image: a wooden wall with the word welcome written in many different languages. The same greeting, every language, but not the same token count. Write to a model in some scripts and you pay several times over for the identical meaning. Photo by Markus Kammermann on Unsplash.

Pictures and sound are tokens too

Tokens aren’t only for text. When a model can see and hear, images and audio get converted into tokens as well: the same currency, so everything lands in one context window and one bill. A picture genuinely is worth a thousand-ish tokens.

The numbers are concrete. In Google’s Gemini API, a small image counts as 258 tokens, video runs at 263 tokens per second, and audio at 32 tokens per second, so one minute of video is roughly 16,000 tokens before the audio track is even counted. OpenAI’s vision models tile an image into 512-pixel squares: 85 base tokens plus 170 per tile. Upload a photo to a chatbot and you just spent more tokens than this paragraph.
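Those rates turn into a quick estimator. The constants are the ones quoted above. The tile function is a simplification that assumes the image is already at its final size; providers may resize an image before tiling, which this sketch does not model. All names are illustrative.

```python
import math

# Rates quoted above for Google's Gemini API.
GEMINI_SMALL_IMAGE_TOKENS = 258
GEMINI_VIDEO_TOKENS_PER_SECOND = 263
GEMINI_AUDIO_TOKENS_PER_SECOND = 32


def clip_tokens(seconds: int, include_audio: bool = True) -> int:
    """Approximate tokens for a video clip, with or without its audio track."""
    rate = GEMINI_VIDEO_TOKENS_PER_SECOND
    if include_audio:
        rate += GEMINI_AUDIO_TOKENS_PER_SECOND
    return seconds * rate


def tiled_image_tokens(width: int, height: int, tile: int = 512,
                       base: int = 85, per_tile: int = 170) -> int:
    """Tile-based estimate: a base cost plus a fixed cost per 512-pixel tile.

    Assumes no resizing happens before tiling (a simplification).
    """
    tiles = math.ceil(width / tile) * math.ceil(height / tile)
    return base + per_tile * tiles


print(clip_tokens(60, include_audio=False))  # 15780
print(clip_tokens(60))                       # 17700
print(tiled_image_tokens(1024, 1024))        # 765  (4 tiles)
print(tiled_image_tokens(512, 512))          # 255  (1 tile)
```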

Learn to see the tokens

You don’t need to count tokens by hand to benefit from knowing they exist. A few habits follow directly from everything above.

Estimate before you spend. Characters divided by four, or words divided by 0.75, gets you close enough to sanity-check a bill before it arrives.
Don’t ask a raw model to spell. For letter-level work (counting characters, exact reversals), give it a tool that runs real code, or break the word into spaced-out letters so each one becomes its own token.
Watch the bill in other languages. If your users write in non-Latin scripts, budget for multiples, not parity. The model fee is the same; the token count is not.
Remember images and audio are tokens. They fill the context window and the invoice just like text. A long screenshot can cost more than the question attached to it.
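The second habit is the easiest to show. Below is the kind of code a tool-using model could run instead of guessing, along with the spaced-out form you can paste into a prompt. The function names are made up for this sketch, and whether each spaced letter lands as its own token depends on the tokenizer.

```python
def count_letter(word: str, letter: str) -> int:
    """Count a letter with real code, which sees characters rather than tokens."""
    return word.lower().count(letter.lower())


def spell_out(word: str) -> str:
    """Space the letters apart so each one is likely to become its own token."""
    return " ".join(word)


print(count_letter("strawberry", "r"))  # 3
print(spell_out("strawberry"))          # s t r a w b e r r y
```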

The token is the atom of modern AI. Pricing is measured in it, context limits are measured in it, and the quirks that make a brilliant model flub a kindergarten task all live at that layer. Learn to see the tokens underneath the words, and the machine stops looking like magic and starts looking like something you can budget, debug, and trust.

Tokens: quick answers

Is a token the same as a word?

No. In English a token averages about three-quarters of a word. Common words (“the,” “and”) are a single token; longer or rarer words get split into several pieces. Spaces and punctuation count too.

Why does AI miscount the letters in a word?

Because it never sees the letters. Its smallest unit is the token, and a word like “strawberry” arrives as the chunks st, raw, berry, not as ten separate characters it can tally.

Do I pay per word or per token?

Per token, and input and output are billed separately. On current Claude models, every output token costs five times an input token, so a short answer to a long prompt can still dominate the bill.

Why does non-English text cost more?

The same sentence breaks into more tokens in most languages than in English, commonly 2–3×, and up to 15× for the worst-served scripts. More tokens, bigger bill, less room in the context window.

Strip away the jargon and a token is just the bite-size piece your AI actually chews on. It can’t taste the letters inside the bite, which is why it miscounts; it’s charged by the bite, which is why your bill behaves the way it does. See the bites, and the rest follows.
