Field Guide · Evals · Models · June 3, 2026 · 9 min read

The only AI benchmark that matters has 20 questions, and you write it

The model that tops every leaderboard can finish last on your work. The fix takes one afternoon and twenty examples you already have.

By Atul

The same five models, two rankings (illustrative)

Public leaderboard: #1 Big-lab flagship · #2 Crowd-pleaser · #3 Quiet open-weight · #4 Cheap + fast · #5 Last quarter's star

Your 20-example eval: #1 Quiet open-weight · #2 Cheap + fast · #3 Big-lab flagship · #4 Last quarter's star · #5 Crowd-pleaser

The reshuffle is the point, not the names. The only ranking that decides your pick is the one run on your inputs.

A new model tops the leaderboard. Your feed fills with charts. You switch, run it on the work you actually do, and it’s worse. Not catastrophically, just… off. The summaries miss the point your old model caught. The code compiles but ignores your conventions. You quietly switch back, a little embarrassed, and wonder whether you imagined the whole thing.

You didn’t. The leaderboard measured something real; it just wasn’t your work. The number that should drive your choice isn’t on any public chart, because nobody else has your inputs, your standards, or your definition of a good answer. The good news: you can produce that number yourself in an afternoon, using about twenty examples you already have lying around. This is the case for writing your own benchmark and retiring your faith in everyone else’s.

The model that tops the leaderboard probably isn’t your best pick

A public leaderboard is an average over thousands of strangers’ prompts. Your job is one narrow slice of that distribution: a legal tone, a house code style, a specific way you like meetings summarized. A model can win the average and lose your slice, and the chart will never tell you, because your slice was a rounding error in its total.

It’s worse than averaging away your use case. The most-cited ranking, LMArena’s Chatbot Arena, scores models on which answer a crowd prefers in a blind side-by-side. Crowds reward things that have little to do with whether the answer is correct. As Simon Willison noted in his write-up of the 2025 criticism, “model characteristics that resonate with evaluators… may not directly relate to the quality of the underlying model”: bullet lists, friendly padding, and answers of a flattering length all climb the board. A model tuned to please a crowd is not the same as a model tuned to do your job, and the leaderboard cannot tell the two apart.

Picture two models. One is a brilliant generalist that writes gorgeous prose and tops the chart. The other is mediocre at poetry, jokes, and trivia (the bulk of what a crowd throws at an arena) but unusually careful with numbers and citations. If your work is drafting financial memos, the second model is the right hire and the first is a liability, and the public ranking has them in the exact wrong order. The chart isn’t lying to you. It’s answering a question you never asked.

A black flip-board display with columns labelled TIME and NUMBER, showing rows of changing digits. — A board full of numbers that aren’t about you. Photo by Declan Sun on Unsplash.

The benchmarks behind the rankings are saturated, contaminated, or gamed

Even the static, academic benchmarks (the ones that feel objective because they have right answers) have mostly stopped ranking anything useful. Two forces broke them. The first is saturation: when every frontier model scores in the same narrow band near the top, the gaps shrink into statistical noise. MMLU, the benchmark that anchored model marketing for years, now sits between roughly 88% and 94% for every serious model. The second is contamination: the questions, or text closely derived from them, leak into training data, so the model recalls the answer instead of reasoning to it.

The benchmarks that anchored the charts · and why they stopped ranking

| Benchmark | Born | Top scores | Status |
| --- | --- | --- | --- |
| MMLU | 2020 | 88–94% | Saturated |
| HellaSwag | 2019 | >95% | Saturated |
| HumanEval | 2021 | >90% | Contaminated |
| GSM8K | 2021 | >95% | Contaminated |
| GPQA Diamond | 2023 | ~78% → 90s | Closing fast |

When the top of a benchmark fills up, the gap between models becomes noise, and a noisy gap can’t rank anything. Saturation and contamination figures via public benchmark trackers.

Then there’s outright gaming. In April 2025, a team from Cohere Labs, AI2, Princeton, Stanford and others published “The Leaderboard Illusion,” documenting how a handful of well-resourced labs quietly test many private variants of a model and disclose only the best score. The paper found Meta had tested 27 private Llama 4 variants before launch. Sampling was lopsided too: Google and OpenAI accounted for roughly 19% and 20% of all Arena battles, while 83 open-weight models combined shared under 30%, and the authors estimate that data advantage alone can translate into relative gains of up to 112% on the arena’s own distribution.

The Llama 4 episode made it concrete. The “experimental” build Meta submitted landed at number two; when the plain release version was added to the board on April 11, 2025, it ranked 32nd, below year-old models. LMArena changed its rules and banned the kind of crowd-tuned variant Meta had used. The board can be fixed; the lesson can’t be unlearned. A public number is a target, and targets get optimized.

A leaderboard can be gamed. A test you wrote can’t.

Goodhart’s law is the whole story here: when a measure becomes a target, it stops being a good measure. Every public benchmark is a target for every lab on earth, with billions of dollars and a launch-day headline riding on the score. Of course they get optimized. Of course the number drifts away from what it once tracked.

Your eval has exactly one property the public ones can never recover: no one is optimizing against it. A lab can’t cherry-pick a variant that beats questions it has never seen. It can’t pad for a crowd that is just you, grading your own outputs against your own standard. The set is private, drawn from your real inputs, and that privacy is its entire advantage. A model that scores well on twenty examples you kept to yourself scored well because it’s good at your work, not because someone aimed at the test.

Twenty examples from your own work beat ten thousand from someone else’s

The instinct is that a real benchmark needs to be big. It doesn’t. Anthropic’s own guidance for teams building on Claude recommends starting with 20 to 50 tasks drawn from real failures, noting that early changes have large effect sizes, so small samples are enough to see them. The signal is concentrated in a handful of cases that actually discriminate between models: the edge cases that burned you, not a thousand easy ones that every model gets right.

Think about where models actually differ. Every serious model handles the easy 80% of your inputs the same way: ask any of them to summarize a clean paragraph and you can’t tell them apart. The separation lives in the hard 20%: the ambiguous instruction, the document with a contradiction buried on page nine, the request that tempts the model to invent a figure. Twenty cases that lean into that 20% sort the field faster than a thousand that don’t, because they probe the only places the models disagree.
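One way to find the cases that do the sorting, once you have a first round of pass/fail results, is to keep only the inputs where the candidates disagree. The sketch below is a minimal illustration, not a prescribed method: the function name, the model names, and the results are all made up for the example.

```python
from typing import Dict, List


def discriminating_cases(results: Dict[str, List[bool]]) -> List[int]:
    """Return the indices of cases where at least two models got different verdicts."""
    n_cases = len(next(iter(results.values())))
    if any(len(verdicts) != n_cases for verdicts in results.values()):
        raise ValueError("every model needs exactly one verdict per case")

    keep = []
    for i in range(n_cases):
        outcomes = {verdicts[i] for verdicts in results.values()}
        if len(outcomes) > 1:  # some model passed and some model failed
            keep.append(i)
    return keep


# Illustrative verdicts only: True means the output met the checklist.
results = {
    "model_a": [True, True, True, False, True],
    "model_b": [True, True, False, False, True],
    "model_c": [True, True, True, False, False],
}
print(discriminating_cases(results))  # [2, 4]
```

Cases 0 and 1 are the easy 80%: everyone passes, so they rank nothing. Case 3 fails for every model, so it does not separate today's field either, though it may be worth keeping as a bar for the next release.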

Hamel Husain, who has shipped evals for AI products at GitHub and as a consultant, puts the cost of skipping this bluntly: unsuccessful AI products, he argues, “almost always share a common root cause: a failure to create robust evaluation systems.” A generic benchmark measures generic competence. A twenty-example set built from your worst recent failures measures the thing you actually care about: whether this model survives the inputs that broke the last one.

One team’s 20-example eval · pass count per category (of 5)

Illustrative. The public chart-topper is competent everywhere and best at nothing that matters here, and finishes third.

How to build your eval in an afternoon

None of this requires a framework, a platform, or a budget. It requires a spreadsheet and a couple of hours. Here is the whole loop.

1. Collect twenty real inputs. Pull the actual prompts, documents, or tickets you’ve sent an AI in the last month. Skew hard toward the cases that went wrong: the summary that missed the point, the email that got the tone wrong. Easy cases waste a slot.
2. Write down what a good answer must contain. For each input, a one-line checklist or a short reference answer: “names the deadline, keeps a formal tone, invents no figures.” A rubric, not a vibe. This is the step that turns “it felt better” into something you can score.
3. Run all twenty through two or three candidates. Same inputs, same settings, outputs saved side by side. This is where a local-first setup helps: you can put an open-weight model that runs on your laptop in the same race as a frontier API and judge them on identical terms.
4. Grade the first pass yourself. Twenty outputs against twenty checklists takes thirty minutes, and you’ll learn what “good” even means for your task while you do it.
5. Then automate the grading. Hand the rubric to a strong model and ask for a pass/fail plus a reason. The MT-Bench study found GPT-4 agreed with human judges about 85% of the time, higher than humans agreed with each other (81%). An LLM-as-judge is a reasonable stand-in once your rubric is sharp.
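If you outgrow the spreadsheet, the loop above fits in a few dozen lines. What follows is a sketch under stated assumptions, not a reference implementation: `Case`, `run_eval`, `build_judge_prompt`, and `parse_verdict` are names invented here, the judge prompt wording is only one plausible phrasing, and the models are stand-in functions so the example runs without any API. In practice each model would be a thin wrapper around whatever client you already use.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Case:
    prompt: str           # a real input you have sent before
    checklist: List[str]  # what a good answer must contain


Model = Callable[[str], str]         # prompt -> output
Judge = Callable[[Case, str], bool]  # (case, output) -> pass/fail


def build_judge_prompt(case: Case, output: str) -> str:
    """Assemble the text a grading model would see (wording is an assumption)."""
    items = "\n".join(f"- {item}" for item in case.checklist)
    return (
        "Grade the answer against every checklist item.\n"
        f"Checklist:\n{items}\n\n"
        f"Task:\n{case.prompt}\n\n"
        f"Answer:\n{output}\n\n"
        "Reply PASS or FAIL on the first line, then one sentence of reasoning."
    )


def parse_verdict(reply: str) -> bool:
    """Read the judge's first line; anything other than PASS counts as a fail."""
    lines = reply.strip().splitlines()
    return bool(lines) and lines[0].strip().upper().startswith("PASS")


def run_eval(cases: List[Case], models: Dict[str, Model], judge: Judge) -> Dict[str, List[bool]]:
    """Run every case through every model and grade each output."""
    return {
        name: [judge(case, model(case.prompt)) for case in cases]
        for name, model in models.items()
    }


def keyword_judge(case: Case, output: str) -> bool:
    """Crude stand-in for a human or LLM grader: every checklist item must appear verbatim."""
    return all(item.lower() in output.lower() for item in case.checklist)


# Hypothetical cases and stub models, for illustration only.
cases = [
    Case("Summarize: the report is due Friday.", ["friday"]),
    Case("Reply formally to the client about the delay.", ["apologize"]),
]
models = {
    "stub_a": lambda prompt: "The report is due Friday.",
    "stub_b": lambda prompt: "We apologize for the delay.",
}

results = run_eval(cases, models, keyword_judge)
for name, verdicts in results.items():
    print(name, sum(verdicts), "of", len(verdicts))
```

To move from step 4 to step 5, replace `keyword_judge` with a function that sends `build_judge_prompt(case, output)` to your grading model and passes the reply to `parse_verdict`. Keeping the judge pluggable also lets you spot-check it against your own first-pass grades before trusting it.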

One discipline keeps you honest: don’t over-read a one-point gap. With twenty examples, a single flipped answer is noise, and Anthropic recommends reporting the standard error alongside the score for exactly this reason. A 17-versus-16 result is a tie. A 17-versus-11 result is a decision.
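The arithmetic behind that rule of thumb is short enough to check yourself. The sketch below uses the textbook standard error of a pass rate, sqrt(p(1-p)/n), and treats the two runs as independent samples. Both are simplifying assumptions: the normal approximation is rough at twenty examples, and because both models see the same cases, a paired comparison would be tighter. The function names are invented for the example.

```python
import math
from typing import Tuple


def pass_rate_and_se(passes: int, n: int) -> Tuple[float, float]:
    """Pass rate and its standard error under a simple binomial model."""
    p = passes / n
    return p, math.sqrt(p * (1 - p) / n)


def gap_in_standard_errors(passes_a: int, passes_b: int, n: int) -> float:
    """How many standard errors separate two models scored on the same n cases."""
    p_a, se_a = pass_rate_and_se(passes_a, n)
    p_b, se_b = pass_rate_and_se(passes_b, n)
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
    if se_diff == 0:
        # Each model scored either 0% or 100%; the formula has no spread to work with.
        return 0.0 if p_a == p_b else math.copysign(math.inf, p_a - p_b)
    return (p_a - p_b) / se_diff


rate, se = pass_rate_and_se(17, 20)
print(f"17/20 = {rate:.2f} +/- {se:.2f}")
print(f"17 vs 16: {gap_in_standard_errors(17, 16, 20):.2f} standard errors")
print(f"17 vs 11: {gap_in_standard_errors(17, 11, 20):.2f} standard errors")
```

Under these assumptions, 17 versus 16 comes out well under one standard error, while 17 versus 11 comes out a little over two, which is roughly where a gap conventionally starts to count as more than noise.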

A hand writing a checklist with empty checkboxes in a grid-paper notebook; a hand holding a red pen over a blank spiral notebook on a wooden desk. Two columns of an afternoon: the checklist that defines “good,” and the red pen that enforces it. Photos by Glenn Carstens-Peters and Kelly Sikkema on Unsplash.

Keep the test. Swap the models.

Here is the part that makes the afternoon pay for itself for years. The eval is a durable asset; the models are not. They get renamed, deprecated, repriced, and discontinued on someone else’s schedule, a churn we mapped in “AI products are mortal.” Your twenty examples outlive all of it.

So the next launch-day chart stops being a reason to switch and becomes a reason to run your test. New model drops, you feed it the same twenty inputs, you read the score in ten minutes, and you decide with evidence instead of a thread. It works just as well for the model on your laptop as the one behind an API, which is the honest way to settle whether you actually need a frontier model for a given job, or whether a cheaper, more private one clears your bar. It’s also the antidote to the marketing gaps we covered in the long-context mirage: a spec sheet makes a promise, your eval checks it.
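Keeping the test means keeping the verdicts too, so a new release can be compared case by case against the model you use today. A minimal sketch, assuming the verdicts live in a small JSON file; the file name, the model names, and both helper functions are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List, Tuple


def record_run(log_path: Path, model_name: str, passed: List[bool]) -> Dict[str, List[bool]]:
    """Store one model's per-case verdicts in a JSON log and return the full history."""
    history = json.loads(log_path.read_text()) if log_path.exists() else {}
    history[model_name] = passed
    log_path.write_text(json.dumps(history, indent=2))
    return history


def flipped_cases(
    history: Dict[str, List[bool]], incumbent: str, challenger: str
) -> Tuple[List[int], List[int]]:
    """Indices the challenger newly passes (gained) and newly fails (lost)."""
    pairs = list(zip(history[incumbent], history[challenger]))
    gained = [i for i, (old, new) in enumerate(pairs) if new and not old]
    lost = [i for i, (old, new) in enumerate(pairs) if old and not new]
    return gained, lost


# Illustrative verdicts only, written to a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "eval_history.json"
    record_run(log, "incumbent", [True, False, True, True])
    history = record_run(log, "new_release", [True, True, False, True])
    gained, lost = flipped_cases(history, "incumbent", "new_release")

print("gained:", gained, "lost:", lost)  # gained: [1] lost: [2]
```

The totals here are identical, three of four each, yet the two models fail on different inputs. Reading which cases flipped often tells you more than the score does.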

And re-running it is nearly free. Sending twenty short prompts through even a flagship model costs a few cents in tokens: a habit you can afford on every launch day for the rest of the year. The expensive part, deciding what “good” means for your work, you pay for exactly once. Everything after that is a ten-minute check against a standard you already wrote down.
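You can sanity-check the cost for your own setup with one multiplication. Every number below is a placeholder, not a quoted price: the token counts and the per-million-token rates are assumptions chosen only to show the arithmetic, so substitute your provider's current pricing and your real prompt sizes.

```python
def eval_run_cost(
    n_cases: int,
    input_tokens_per_case: int,
    output_tokens_per_case: int,
    usd_per_million_input: float,
    usd_per_million_output: float,
) -> float:
    """Token cost in USD of running the whole eval once through one model."""
    input_cost = n_cases * input_tokens_per_case * usd_per_million_input / 1_000_000
    output_cost = n_cases * output_tokens_per_case * usd_per_million_output / 1_000_000
    return input_cost + output_cost


# Placeholder assumptions: 20 cases, 300 tokens in and 200 tokens out per case,
# at hypothetical rates of $3 and $15 per million tokens.
cost = eval_run_cost(20, 300, 200, 3.00, 15.00)
print(f"${cost:.3f} per run")
```

Longer documents or an LLM judge on top will raise the total, which is a good reason to plug in your own numbers before trusting anyone's estimate, including this one.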

A public leaderboard is a weather forecast for a city you don’t live in. Interesting context; useless for deciding whether to grab an umbrella at your own front door. Spend the afternoon. Twenty examples, a rubric, two or three models, one honest number that belongs to you. After that, the charts can do whatever they like; you’ll already know which model does your work, and that’s the only ranking that was ever going to matter.
