What is an AI agent? A model in a loop, explained
The AI that fixed 2% of real bugs in 2023 fixes 95% today. The difference isn’t only a smarter model; it’s a loop. Here’s how it works.
In October 2023, researchers handed the best AI model in the world a list of real bugs from public code repositories and asked it to fix them. It managed 1.96%. This month, the top of the same benchmark’s leaderboard reads 95.0%. Models improved enormously in those 32 months — but the bigger change is what we stopped doing: asking for the answer in one shot.
We gave the model a goal instead of a question, handed it tools, and let it try, look at what happened, and try again. That arrangement has a name, and it is the most overused word in technology right now.
An AI agent is a model in a loop: it takes a goal, picks an action, uses a tool, reads the result, and goes around again — until it decides the job is done. That one sentence explains why agents suddenly work, why they cost several times what chat costs, and why they fail in ways a chatbot never could. This post unpacks all three.
An agent is a model in a loop, not a smarter model
Strip the buzzword and there are only four parts. A goal: “find me a flight under $400,” not “what are some flight websites?” Tools: things the model can actually do — search the web, run code, read a file, call an API. A loop: after every action, the result goes back into the model, which decides what to do next. And a stopping condition: some way to judge that the goal is met — or that it’s time to give up and ask a human.
Anthropic, whose engineering guide is the closest thing the field has to a shared definition, puts it in one line: agents are “just LLMs using tools based on environmental feedback in a loop.” Note what’s missing from that sentence: any claim about intelligence. The model inside an agent is often the same model behind the chat box. What changed is the wiring around it.
The homely analogy is already in your living room. A robot vacuum gets a goal (clean floor), has tools (wheels, brushes, bump sensors), runs a loop (move, hit something, adjust), and has a stopping condition (dock when done). Nobody steers it. Nobody scripted its exact path. You judge it by the state of the floor.
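The four parts are easier to see in code than in prose. What follows is a minimal sketch, not anyone's production harness: `toy_model`, `search_flights`, and the fare data are invented stand-ins, and a real agent would replace `toy_model` with an actual model call.

```python
from typing import Callable

# Hypothetical tool: the name and the fare data are invented for illustration.
def search_flights(max_price: int) -> list[dict]:
    fares = [{"id": "A1", "price": 520}, {"id": "B2", "price": 385}]
    return [f for f in fares if f["price"] <= max_price]

TOOLS: dict[str, Callable[..., object]] = {"search_flights": search_flights}

def toy_model(goal: str, history: list[dict]) -> dict:
    """Stand-in for a real model call: picks the next action from the transcript."""
    if not history:
        return {"action": "search_flights", "args": {"max_price": 400}}
    last_result = history[-1]["result"]
    if last_result:
        return {"action": "stop", "answer": last_result[0]["id"]}
    return {"action": "stop", "answer": None}  # nothing found: hand back to a human

def run_agent(goal: str, max_turns: int = 5):
    history: list[dict] = []
    for _ in range(max_turns):                           # the loop
        step = toy_model(goal, history)                  # the model picks an action
        if step["action"] == "stop":                     # the stopping condition
            return step["answer"], history
        result = TOOLS[step["action"]](**step["args"])   # the tool call
        history.append({"action": step["action"], "result": result})  # feedback
    return None, history                                 # turn limit reached

answer, transcript = run_agent("find me a flight under $400")
print(answer)  # B2
```

Swap in a smarter `toy_model` and nothing else in `run_agent` changes, which is the point: the loop is wiring, and the model is a component inside it.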

The loop, not the model, closed the gap from 2% to 95%
Why does the loop matter so much? Because almost every real task is impossible in one pass — not hard, impossible. Fixing a bug means reading the code first, and you don’t know which file matters until you look. Booking a flight means seeing prices before choosing one. A single model call, however brilliant, answers from what’s in front of it. A loop can go find out.
SWE-bench, the benchmark from the numbers above, makes this concrete. Each task is a real GitHub issue from a real codebase: make the failing tests pass. In 2023, models were shown the issue and some code and asked to emit a patch — one shot, no feedback. Claude 2, the best of that crop, resolved 1.96%. Today’s leaders run as agents: they navigate the repository, run the test suite, read the errors, edit, and re-run until the suite is green. Same kind of exam, and the loop turns it from a memory test into actual work.
The honest version of the story is that both layers improved — the models are far better at choosing actions, and the harnesses are far better at offering them. But the order matters: the loop is what lets a model’s intelligence touch the world at all. That’s why “agent” is an architecture word, not a marketing tier.
Agent vs. workflow: one question — who picks the next step?
Agents get confused with their better-behaved sibling, the AI workflow. Both use models, both use tools, both run multi-step jobs. The difference fits in a question: who decided the steps?
In a workflow, a person did, in advance. Transcribe the meeting, then summarize, then draft the follow-up email — the same three steps in the same order, every time, with the model filling in content. Anthropic’s guide calls these “predefined code paths.” In an agent, the model decides at runtime. It might take three steps or thirty. Two runs of the same goal can take different routes.
Neither is the upgrade of the other. A workflow is predictable: same steps, same cost, auditable in advance — which is exactly what you want for the report you generate every Monday. An agent handles what you couldn’t script: the bug you haven’t diagnosed, the question whose answer determines the next question. The guide’s own advice is to default down, not up: find “the simplest solution possible” — often a single well-crafted model call — and add machinery only when the task demands it.
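The "who picks the next step" question shows up directly in the control flow. Here is a rough sketch using the meeting example; every function in it is a made-up stub, and `toy_chooser` stands in for the model's runtime decision.

```python
# Stub steps: real versions would call a model or a transcription service.
def transcribe(audio: str) -> str:
    return f"transcript of {audio}"

def summarize(text: str) -> str:
    return f"summary of {text}"

def draft_email(text: str) -> str:
    return f"email based on {text}"

STEPS = {"transcribe": transcribe, "summarize": summarize, "draft_email": draft_email}

def workflow(audio: str) -> tuple[str, list[str]]:
    """A person fixed the order in advance: the same three steps every time."""
    order = ["transcribe", "summarize", "draft_email"]
    value = audio
    for name in order:
        value = STEPS[name](value)
    return value, order

def agent(audio: str, choose_next) -> tuple[str, list[str]]:
    """choose_next (a stand-in for the model) decides the route at runtime."""
    value, taken = audio, []
    while (name := choose_next(value, taken)) is not None:
        value = STEPS[name](value)
        taken.append(name)
    return value, taken

def toy_chooser(value: str, taken: list[str]):
    # Hypothetical policy: skip the summary when the text is already short.
    if not taken:
        return "transcribe"
    if "summarize" not in taken and "draft_email" not in taken and len(value) > 30:
        return "summarize"
    if "draft_email" not in taken:
        return "draft_email"
    return None  # done

print(workflow("standup.wav")[1])
print(agent("standup.wav", toy_chooser)[1])
print(agent("quarterly-planning-review.wav", toy_chooser)[1])
```

The workflow reports the same three steps for any input. The agent takes two steps on the short recording and three on the long one, which is exactly the unpredictability you are paying for, or avoiding.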
The loop is a meter that runs
Autonomy has a price, and it’s denominated in tokens. Every trip around the loop sends the model the goal, the tool results so far, and the growing transcript of its own reasoning — and bills for all of it, again. A chat exchange is one metered call. An agent is a meter that runs until the loop decides to stop.
Anthropic published the multipliers from its own production systems: agents burn about 4× the tokens of a chat interaction, and multi-agent systems about 15×. The same write-up reports that in one internal evaluation, token usage by itself explained 80% of the performance variance — the systems that thought longer and searched more did better. Read those two findings together and the economics of agents fall out: performance is bought with tokens, so an agent that works is, almost by definition, an agent that spends.
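To see why resending the transcript on every trip gets expensive, here is a back-of-the-envelope sketch. It assumes the simplest possible billing model: the full context is sent as input each turn, each turn adds a fixed number of tokens, and nothing is cached or trimmed. The token counts are made up, and the output is not meant to reproduce the 4× or 15× figures.

```python
def input_tokens_billed(turns: int, goal_tokens: int, added_per_turn: int) -> int:
    """Total input tokens when the whole transcript is re-sent every turn."""
    context = goal_tokens
    total = 0
    for _ in range(turns):
        total += context            # the full context so far is billed again
        context += added_per_turn   # tool result and reasoning join the transcript
    return total

one_shot = input_tokens_billed(turns=1, goal_tokens=200, added_per_turn=500)
ten_turns = input_tokens_billed(turns=10, goal_tokens=200, added_per_turn=500)
twenty_turns = input_tokens_billed(turns=20, goal_tokens=200, added_per_turn=500)
print(one_shot, ten_turns, twenty_turns)  # 200 24500 99000
```

Under these assumptions, doubling the number of turns roughly quadruples the bill, because each new turn pays for all the turns before it. Real systems soften this with caching and context trimming, but the shape of the curve is why turn count matters more than it first appears.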

For a daily-driver tool this is manageable — dollars, not cents, per serious task. The trap is the unbounded case: a loop that can’t tell it’s stuck will happily re-read the same files all afternoon. Every production agent framework ends up with the same safety rail for this reason: a budget cap and a maximum number of turns, after which the agent must stop and show a human what it has.
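That safety rail is small enough to sketch in a few lines. This is an illustrative version rather than any particular framework's API: `Budget`, `run_with_budget`, and the `step` callable are all hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class Budget:
    max_turns: int
    max_tokens: int
    turns: int = 0
    tokens: int = 0

    def charge(self, tokens_used: int) -> None:
        self.turns += 1
        self.tokens += tokens_used

    def exhausted(self) -> bool:
        return self.turns >= self.max_turns or self.tokens >= self.max_tokens

def run_with_budget(step, budget: Budget):
    """step() is a hypothetical callable returning (done, tokens_used, note)."""
    notes = []
    while not budget.exhausted():
        done, tokens_used, note = step()
        budget.charge(tokens_used)
        notes.append(note)
        if done:
            return "done", notes
    # Out of turns or tokens: stop and show a human what we have so far.
    return "stopped_for_human", notes

def stuck_step():
    # Simulates a loop that cannot tell it is stuck.
    return False, 1_000, "re-read main.py"

status, notes = run_with_budget(stuck_step, Budget(max_turns=8, max_tokens=5_000))
print(status, len(notes))  # stopped_for_human 5
```

Here the token cap trips before the turn cap: five turns at 1,000 tokens each exhaust the 5,000-token budget, and the loop hands back its notes instead of burning the afternoon.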
Agents are impressive once and unreliable eight times
Here is the number that separates demos from deployments. τ-bench, a benchmark that tests agents on customer-service tasks — airline rebookings, retail returns, with policies to follow — found that GPT-4o-based agents completed fewer than 50% of tasks. The sharper finding is what happened on repeat: asked to handle the same task eight times in a row, the agent got it right every time in fewer than 25% of cases. Not a new task — the same one.
That’s the difference between a colleague and a slot machine. A process you’d hand to a junior employee has to work the eighth time, too. And the math is unforgiving: per-run reliability compounds. If each run succeeds 90% of the time and the runs are independent, the chance of seven clean daily runs in a row is 0.9^7, or about 48%. Even an agent that genuinely clears 90% becomes a coin flip over a week.
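The compounding is worth running yourself with your own numbers. This small sketch assumes every run is independent with the same success rate, which real agents will not match exactly; treat it as a rough guide.

```python
def chance_all_succeed(per_run: float, runs: int) -> float:
    """Probability that every one of `runs` independent runs succeeds."""
    return per_run ** runs

def per_run_needed(target: float, runs: int) -> float:
    """Per-run success rate required to hit `target` across all `runs`."""
    return target ** (1 / runs)

print(round(chance_all_succeed(0.90, 7), 3))  # 0.478
print(round(chance_all_succeed(0.99, 7), 3))  # 0.932
print(round(per_run_needed(0.90, 7), 3))      # 0.985
```

Read the last line as the deployment bar: to get through a week of daily runs with 90% confidence, each individual run has to succeed about 98.5% of the time.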
This is why the corporate version of the story is messier than the demos. Gartner predicted that over 40% of agentic AI projects will be canceled by the end of 2027 — costs, unclear value, weak risk controls — and estimated that of the thousands of vendors selling “agents,” only about 130 offer the real thing. The rest is rebranded chatbots and scripted automation: “agent washing.” If you’re evaluating one, ignore the label and the leaderboard, and run it on twenty of your own tasks — more than once each.
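Running your own tasks more than once each takes very little code. Below is a minimal sketch of such a harness; `evaluate` and `flaky_agent` are hypothetical names, and the flaky agent is a seeded coin toss standing in for whatever product you are actually testing.

```python
import random

def evaluate(agent, tasks, repeats: int = 8):
    """Run every task `repeats` times.

    Returns (share of individual runs that passed,
             share of tasks that passed on every run).
    """
    results = [[bool(agent(task)) for _ in range(repeats)] for task in tasks]
    single_run_rate = sum(sum(runs) for runs in results) / (len(tasks) * repeats)
    every_run_rate = sum(all(runs) for runs in results) / len(tasks)
    return single_run_rate, every_run_rate

rng = random.Random(0)  # seeded so the demo is repeatable

def flaky_agent(task: str) -> bool:
    # Stand-in for a real agent: succeeds about 80% of the time, at random.
    return rng.random() < 0.8

tasks = [f"task-{i}" for i in range(20)]
single, every = evaluate(flaky_agent, tasks, repeats=8)
print(f"passed per run: {single:.0%}, passed all 8 runs: {every:.0%}")
```

The two numbers usually tell different stories. The first is what a leaderboard reports; the second is what you would live with.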
The leash is getting longer on a measurable clock
None of this is a verdict on agents; it’s a snapshot of a moving object — and the motion has been measured. METR, an AI evaluation lab, tracks the length of task — in human working time — that frontier models can finish autonomously at least half the time. Their headline finding: that horizon has doubled roughly every seven months for six years.
Their January 2026 update put the frontier at about 320 minutes — five-plus hours of human work — and clocked the doubling time since 2024 at under three months. In 2022, the same measure was a couple of minutes. The practical translation: the loop you couldn’t trust with a ten-minute task two years ago is now finishing afternoon-sized ones, and the trend line hasn’t bent yet.
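Doubling times are easy to misjudge by feel, so here is the arithmetic as a short sketch. The function names are invented, "a couple of minutes" is taken as exactly 2 for illustration, and the projection line is plain extrapolation, not a forecast from METR.

```python
import math

def doublings_between(start_minutes: float, end_minutes: float) -> float:
    """How many doublings separate two task horizons."""
    return math.log2(end_minutes / start_minutes)

def horizon_minutes(start_minutes: float, months_elapsed: float, doubling_months: float) -> float:
    """Horizon after `months_elapsed`, if it doubles every `doubling_months`."""
    return start_minutes * 2 ** (months_elapsed / doubling_months)

# From roughly 2 minutes to roughly 320 minutes:
print(round(doublings_between(2, 320), 2))   # 7.32

# If the seven-month doubling simply continued from 320 minutes:
print(horizon_minutes(320, months_elapsed=7, doubling_months=7))  # 640.0
```

A little over seven doublings covers the distance from a two-minute task to a five-hour one. The second call shows how sensitive the curve is to the doubling time you plug in, which is why the measured rate matters more than any single data point.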

The half-the-time caveat is doing real work in that sentence, though — 50% reliability is a research milestone, not a deployment standard. Treat autonomy like airspace: granted in increments, measured in minutes, expanded as the safety record earns it.
Use the smallest thing that works
Strip the hype and the decision is a ladder with three rungs.
- One model call when the task fits in a single pass: summarize, draft, translate, classify. Most tasks live here, and the chat box already does this well.
- A workflow when you know the steps and repeat them: the same pipeline every week, predictable cost, auditable in advance.
- An agent when you can state the goal but not the steps, and the result is checkable, the actions are reversible, and the budget is capped.
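The ladder above can be written down as a checklist you climb from the bottom. This is one reading of it, sketched as a function; `choose_rung` and its parameter names are invented here, and the final fallback is an assumption about what to do when the agent conditions are not met.

```python
def choose_rung(
    fits_single_pass: bool,
    steps_known: bool,
    checkable: bool,
    reversible: bool,
    budget_capped: bool,
) -> str:
    """Return the smallest rung that fits, checking the cheapest option first."""
    if fits_single_pass:
        return "one model call"
    if steps_known:
        return "workflow"
    if checkable and reversible and budget_capped:
        return "agent"
    # Assumption: if the safeguards are missing, fix that before adding autonomy.
    return "not yet: add a check, an undo, or a cap first"

# Summarizing a document: fits in one pass.
print(choose_rung(True, True, True, True, True))      # one model call
# The Monday report: known steps, repeated weekly.
print(choose_rung(False, True, True, True, True))     # workflow
# An undiagnosed bug with a test suite, a branch to revert, and a turn cap.
print(choose_rung(False, False, True, True, True))    # agent
```

The order of the checks is the design choice: an agent is only considered after the two cheaper rungs have been ruled out.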
And keep the definition as your filter. “Agent” on a pricing page tells you nothing; the loop does. Does it choose its own next step from feedback? Does it know when to stop? What happens on the eighth run? A model in a loop with tools and a stopping condition — everything else is packaging.
The word will keep inflating; the mechanism won’t. Goal, tools, loop, stop. See those four parts clearly and you can price an agent, test an agent, and decide what you’d actually trust one to do — this year, and each longer-leashed year after.


