What is a token? The thing your AI actually reads — and bills you for
Why does AI say “strawberry” has two r’s? It can’t see letters — only tokens. The one unit behind its mistakes and your bill.
Ask almost any chatbot how many times the letter r appears in “strawberry,” and for a long stretch of 2024 and 2025 it answered “two.” The correct answer is three. People posted the screenshots as proof that AI was dumb: a machine that can write a sonnet but can’t count to three.
It isn’t dumb. It is, in a precise and fixable way, blind. The model never received the word as the letters s-t-r-a-w-b-e-r-r-y. It received three chunks (st, raw, berry) and was asked to count something inside a thing it cannot see. Those chunks have a name, and once you understand it, a surprising amount of AI stops being mysterious: the dumb mistakes, the strange bills, the reason a sentence in Hindi costs more than the same sentence in English.
A token is the basic unit of text an AI model reads, writes, and bills you for — usually a chunk of a few characters, not a whole word and not a single letter. Everything a language model does, it does in tokens. This post is the plain-English tour: what a token is, why it explains the famous failures, where tokens come from, and why they sit on every invoice you will ever pay for AI.
Your AI reads in tokens, not letters
Humans read text as letters that build into words. A language model doesn’t. Before a single word of your prompt reaches the model, a step called tokenization chops it into tokens: short, common chunks of characters drawn from a fixed list that is learned once and then never changes. The model only ever sees that stream of tokens. Letters, as such, are not its alphabet.
How big is a token? In English, the rough rule is about four characters, or three-quarters of a word. Google states it plainly in its developer docs: “a token is equivalent to about 4 characters,” and “100 tokens is equal to about 60–80 English words.” A short, common word like “cat” is one token. A longer or rarer word is several. The chunking isn’t random — it follows how often the pieces show up in ordinary text.
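The rule of thumb is easy to turn into a quick estimator. The sketch below is only that: the two constants are the rough English averages quoted above, the function names are made up for illustration, and a real tokenizer will give different counts, especially outside English.

```python
import math

# Rough English averages quoted in the text; not exact for any real tokenizer.
CHARS_PER_TOKEN = 4
WORDS_PER_TOKEN = 0.75  # about three-quarters of a word per token


def estimate_tokens_from_chars(text: str) -> int:
    """Estimate tokens from the character count, rounding up."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def estimate_tokens_from_words(word_count: int) -> int:
    """Estimate tokens from a word count."""
    return round(word_count / WORDS_PER_TOKEN)


if __name__ == "__main__":
    print(estimate_tokens_from_words(500))   # 667
    print(estimate_tokens_from_words(4000))  # 5333
    print(estimate_tokens_from_chars("How many r's are in strawberry?"))
```

Treat the result as a sanity check, not a measurement. When the number matters, count with the provider's own tokenizer.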
That single design choice is the hinge this whole post turns on. It is why the model is fluent and fast, why it costs what it costs, and why it trips over tasks a six-year-old finds trivial. Keep the picture in your head: you hand the machine words; it quietly swaps them for tokens before it reads a thing.
Why it can’t count the r’s in strawberry
Now the strawberry mystery solves itself. To the GPT-4o tokenizer, the word splits into three tokens (st, raw, and berry), a breakdown documented in a 2024 paper, “Why Do Large Language Models Struggle to Count Letters?” The model is handed those three chunks and asked how many r’s are inside. The letters were dissolved into the chunks before it ever looked. It is guessing at a quantity it was structurally prevented from seeing.
This is the root of a whole family of failures. Reversing a word, spotting a simple letter-swap cipher, splitting a word into syllables, rhyming on an exact ending, doing arithmetic digit by digit: anything that needs the characters inside a token tends to wobble, because that information was compressed away. A second paper, “The Strawberry Problem,” published in 2025, puts it cleanly: tokenization “severs the connection between words and their characters.”
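A few lines of code make the blindness concrete. This is a toy illustration, not a real tokenizer: the three-way split is the one quoted above, but the ID numbers are invented for the example.

```python
# Hypothetical token IDs, made up for illustration. Real vocabularies
# assign different numbers; only the st / raw / berry split is from the text.
TOY_VOCAB = {"st": 101, "raw": 102, "berry": 103}

pieces = ["st", "raw", "berry"]
token_ids = [TOY_VOCAB[piece] for piece in pieces]

# What a person sees: ten characters, three of them "r".
word = "".join(pieces)
print(word, word.count("r"))  # strawberry 3

# What the model receives: three opaque numbers with no letters inside.
print(token_ids)  # [101, 102, 103]

# The r's exist only inside the pieces, a level the IDs do not expose.
print({piece: piece.count("r") for piece in pieces})
# {'st': 0, 'raw': 1, 'berry': 2}
```

Counting r's in the string is trivial. Counting r's in `[101, 102, 103]` requires having memorized what each number spells, which is the shaky thing the model is being asked to do.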
Researchers are honest that tokenization isn’t the only culprit — the counting paper argues the attention math and the model’s internal sizing also limit how high it can reliably count. But the telling proof is the fix: when the Strawberry Problem team bolted on a mechanism that lets a token look back at its own characters, near-zero accuracy on letter tasks shot up. Give the model eyes for letters and the problem clears. The blindness was the point.

Where tokens come from: a vocabulary of pieces
Tokens aren’t invented on the fly. Before the model is trained, its makers build a fixed vocabulary, typically tens to hundreds of thousands of tokens, using an algorithm called byte-pair encoding, or BPE. The idea is almost childishly simple. Start with the smallest pieces, then keep merging whichever pair shows up most often into a new, bigger piece, over and over, until you hit your target vocabulary size.
Hugging Face’s tokenizer course walks through it: the first merges glue two characters together; as training goes on, the merges build longer and longer subwords. The most common words end up as a single token because they earned it by frequency. Rare words never get their own slot, so they’re spelled out from smaller pieces — which is exactly why “strawberry,” common as a fruit but rare as a string of characters, lands as three chunks instead of one.
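The merge loop described above fits in a page of Python. What follows is a minimal teaching sketch, assuming a tiny made-up corpus of word counts; the function name `train_bpe` and the corpus are illustrative, and production tokenizers add pre-tokenization, byte handling, and much larger data.

```python
from collections import Counter


def train_bpe(word_freqs: dict[str, int], num_merges: int):
    """Toy BPE: repeatedly merge the most frequent adjacent pair."""
    # Every word starts out split into single characters.
    splits = {word: tuple(word) for word in word_freqs}
    merges = []
    for _ in range(num_merges):
        # Count adjacent pairs, weighted by how often each word occurs.
        pair_counts = Counter()
        for word, freq in word_freqs.items():
            symbols = splits[word]
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # nothing left to merge
        # Ties go to the pair seen first.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, gluing each occurrence of the best pair.
        for word, symbols in list(splits.items()):
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            splits[word] = tuple(merged)
    return merges, splits


# Made-up corpus: "the" is frequent, "strawberry" is rare.
corpus = {"the": 10, "then": 4, "strawberry": 1}
merges, splits = train_bpe(corpus, num_merges=2)
print(merges)                # [('t', 'h'), ('th', 'e')]
print(splits["the"])         # ('the',)
print(splits["then"])        # ('the', 'n')
print(len(splits["strawberry"]))  # 10
```

After two merges the frequent word has earned a single token, while the rare one is still ten separate pieces. Run more merges on more text and the rare word gets stitched into a few larger chunks, but it may never earn a slot of its own.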
Modern GPT-style models use a byte-level variant. The base vocabulary is the 256 possible bytes, which guarantees there is no input — no emoji, no obscure script, no typo — that the tokenizer chokes on. Anything unfamiliar simply falls back to more, smaller tokens. A tokenizer, in other words, is a printer’s type case: a fixed drawer of reusable pieces, big ones for the words you use constantly and a pile of little ones to set anything else.
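The byte-level guarantee can be seen with nothing but Python's built-in UTF-8 encoding. This sketch shows only the floor of the system, one token per byte, under the assumption that no merges apply; a real byte-level BPE tokenizer would merge most of these bytes into larger tokens. The helper name `to_byte_tokens` is made up for the example.

```python
def to_byte_tokens(text: str) -> list[int]:
    """Worst-case fallback: one token per UTF-8 byte, each in 0..255."""
    return list(text.encode("utf-8"))


for sample in ["cat", "🍓", "नमस्ते"]:
    ids = to_byte_tokens(sample)
    # Every value fits the 256-entry base vocabulary, whatever the script.
    assert all(0 <= value <= 255 for value in ids)
    # The mapping is lossless: the bytes decode back to the original text.
    assert bytes(ids).decode("utf-8") == sample
    print(repr(sample), "->", len(ids), "byte-level tokens")
```

Three ASCII letters need three bytes, while a single emoji needs four. That is the same mechanism, in miniature, that makes unfamiliar text more expensive: it always fits, but it takes more pieces.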

The token is the unit on your invoice
Here is where tokens stop being trivia and start touching your wallet. Every major AI provider prices by the token, and they split the bill in two: tokens you send in (your prompt) and tokens the model writes back (its answer). The Claude API price list reads in dollars per million tokens: Haiku 4.5 at $1 in and $5 out, Sonnet 4.6 at $3 and $15, Opus 4.8 at $5 and $25. Notice the pattern — output costs five times input, everywhere.
Translate that into work. A 500-word email is about 667 tokens. A 4,000-word report is roughly 5,300. Multiply by the rate and the sums are tiny per call, which is the whole trap: they’re invisible until you run the job ten thousand times. This is why the smart way to compare models is by the cost of a finished task, not the cost per token: a pricier model that answers in fewer tokens can come out cheaper on the actual job.
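The arithmetic is worth doing once by hand, or in a few lines of code. The sketch below uses the per-million-token prices quoted above and a hypothetical job (read a report of about 5,300 tokens, write a summary of about 667); the function name and the job are illustrative, and prices change, so check the current price list before relying on the numbers.

```python
# Dollars per million tokens (input, output), copied from the prices
# quoted in the text. Assumed current only as of that quote.
PRICES = {
    "Haiku 4.5": (1.00, 5.00),
    "Sonnet 4.6": (3.00, 15.00),
    "Opus 4.8": (5.00, 25.00),
}


def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in dollars of one call: input and output are billed separately."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


# Hypothetical job: a 5,300-token report in, a 667-token summary out.
per_call = call_cost("Sonnet 4.6", 5_300, 667)
print(f"${per_call:.4f} per call")                    # $0.0259 per call
print(f"${per_call * 10_000:,.2f} for 10,000 calls")  # $259.05 for 10,000 calls
```

Under three cents a call is easy to ignore. The same job run ten thousand times is a line item, and swapping in a model that needs fewer output tokens changes the total more than the sticker price suggests.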
One number should make the point stick. When Anthropic shipped a new tokenizer with Opus 4.7, it noted the change “may use up to 35% more tokens for the same fixed text.” Same words, same model family, up to roughly a third more tokens, and therefore up to a third more cost, from a quiet swap in how text gets chopped. Tokens are the meter. If you can’t see them, you can’t see the bill. (The cheapest tokens of all are the ones you reuse: a cache hit costs a tenth of a fresh input token.)
In another language, the same sentence costs more
Tokenizers learn their pieces mostly from English-heavy text, so English gets the efficient, single-token words. Other languages get the leftovers. The same sentence, translated, breaks into more tokens — sometimes far more. A widely cited 2023 study by Petrov and colleagues measured the gap at up to 15× across languages for identical meaning.
That gap is not academic. Because you pay per token, a user writing in Burmese or Telugu can be charged several times what an English user pays for the same request. They also wait longer, since more tokens take more time to generate, and they hit the context-window ceiling sooner because their text eats more of it. The token is a unit of cost, speed, and capacity all at once — and it isn’t handed out evenly.
It compounds with the length problem, too: longer text means more tokens, and the more tokens you stuff in, the less reliably models use the ones in the middle — the gap between an advertised context window and a usable one. Every one of these limits is denominated in the same currency.

Pictures and sound are tokens too
Tokens aren’t only for text. When a model can see and hear, images and audio get converted into tokens as well — the same currency, so everything lands in one context window and one bill. A picture genuinely is worth a thousand-ish tokens.
The numbers are concrete. In Google’s Gemini API, a small image counts as 258 tokens, video runs at 263 tokens per second, and audio at 32 tokens per second, so one minute of video is about 15,800 tokens before anyone says a word. OpenAI’s vision models tile an image into 512-pixel squares: 85 base tokens plus 170 per tile. Upload a photo to a chatbot and you just spent more tokens than this paragraph.
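Those figures turn into a small calculator. This is a sketch under stated assumptions: the constants are the ones quoted above, the function names are invented, and the image function simply counts 512-pixel tiles over the raw dimensions. Real APIs may resize an image before tiling, so actual counts can be lower.

```python
import math

# Figures quoted in the text; check the providers' docs for current values.
GEMINI_VIDEO_TOKENS_PER_SEC = 263
GEMINI_AUDIO_TOKENS_PER_SEC = 32
OPENAI_BASE_TOKENS = 85
OPENAI_TOKENS_PER_TILE = 170
TILE_SIZE = 512  # pixels per tile side


def gemini_clip_tokens(seconds: int, with_audio: bool = True) -> int:
    """Tokens for a video clip, optionally including its audio track."""
    per_second = GEMINI_VIDEO_TOKENS_PER_SEC
    if with_audio:
        per_second += GEMINI_AUDIO_TOKENS_PER_SEC
    return seconds * per_second


def tiled_image_tokens(width: int, height: int) -> int:
    """Base cost plus a per-tile cost. Assumes no resizing before tiling."""
    tiles = math.ceil(width / TILE_SIZE) * math.ceil(height / TILE_SIZE)
    return OPENAI_BASE_TOKENS + OPENAI_TOKENS_PER_TILE * tiles


print(gemini_clip_tokens(60, with_audio=False))  # 15780
print(gemini_clip_tokens(60))                    # 17700
print(tiled_image_tokens(1024, 1024))            # 765
```

A 1024 by 1024 image covers four tiles, which is where the "thousand-ish tokens" per picture comes from.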
Learn to see the tokens
You don’t need to count tokens by hand to benefit from knowing they exist. A few habits follow directly from everything above.
- Estimate before you spend. Characters divided by four, or words divided by 0.75, gets you close enough to sanity-check a bill before it arrives.
- Don’t ask a raw model to spell. For letter-level work, such as counting characters or exact reversals, give it a tool that runs real code, or break the word into spaced-out letters so each one becomes its own token.
- Watch the bill in other languages. If your users write in non-Latin scripts, budget for multiples, not parity. The model fee is the same; the token count is not.
- Remember images and audio are tokens. They fill the context window and the invoice just like text. A long screenshot can cost more than the question attached to it.
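The spelling habit is the easiest one to put into practice. The helpers below are the kind of thing a code-running tool would execute on the model's behalf; the function names are illustrative. Whether each spaced-out letter really lands as its own token depends on the tokenizer, so treat `spell_out` as a workaround to test, not a guarantee.

```python
def count_letter(word: str, letter: str) -> int:
    """Count a letter by looking at actual characters, ignoring case."""
    return word.lower().count(letter.lower())


def reverse_word(word: str) -> str:
    """Reverse a word character by character."""
    return word[::-1]


def spell_out(word: str) -> str:
    """Put a space between letters so a tokenizer is likely to split them."""
    return " ".join(word)


print(count_letter("strawberry", "r"))  # 3
print(reverse_word("strawberry"))       # yrrebwarts
print(spell_out("strawberry"))          # s t r a w b e r r y
```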
The token is the atom of modern AI. Pricing is measured in it, context limits are measured in it, and the quirks that make a brilliant model flub a kindergarten task all live at that layer. Learn to see the tokens underneath the words, and the machine stops looking like magic — and starts looking like something you can budget, debug, and trust.
Tokens: quick answers
Strip away the jargon and a token is just the bite-size piece your AI actually chews on. It can’t taste the letters inside the bite, which is why it miscounts; it’s charged by the bite, which is why your bill behaves the way it does. See the bites, and the rest follows.


