Explainer · Agents · AI Models · June 11, 2026 · 11 min read

What is an AI agent? A model in a loop, explained

The AI that fixed 2% of real bugs in 2023 fixes 95% today. The difference isn’t a smarter model: it’s a loop. Here’s how it works.

One agent run, traced

Goal in · loop · done out

An agent is a model, tools, and a loop.

Goal: Make the failing checkout test pass

Think: Run the test suite and see what actually breaks.
Act: run_tests()
See: 2 failures, both in cart totals

Think: Open the cart code around the failing assertion.
Act: read_file(cart.ts)
See: rounding applied after tax, not before

Think: Fix the order of operations, then re-run.
Act: edit_file(cart.ts) · run_tests()
See: all 14 tests pass

Stop: Goal met. The loop decides it’s done.

Nobody scripted those nine steps. The model chose each one based on what the previous tool call returned. That loop (not a smarter model) is the entire definition of an agent.

In October 2023, researchers handed the leading AI models of the day a list of real bugs from public code repositories and asked them to fix each one. The best of them managed 1.96%. This month, the top of the benchmark’s Verified leaderboard reads 95.0%. Models improved enormously in those 32 months, but the bigger change is what we stopped doing: asking for the answer in one shot.

We gave the model a goal instead of a question, handed it tools, and let it try, look at what happened, and try again. That arrangement has a name, and it is the most overused word in technology right now.

An AI agent is a model in a loop: it takes a goal, picks an action, uses a tool, reads the result, and goes around again, until it decides the job is done. That one sentence explains why agents suddenly work, why they cost several times what chat costs, and why they fail in ways a chatbot never could. This post unpacks all three.

An agent is a model in a loop, not a smarter model

Strip the buzzword and there are only four parts. A goal: “find me a flight under $400,” not “what are some flight websites?” Tools: things the model can actually do, such as searching the web, running code, reading a file, or calling an API. A loop: after every action, the result goes back into the model, which decides what to do next. And a stopping condition: some way to judge that the goal is met, or that it’s time to give up and ask a human.
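Those four parts fit in a few lines of code. What follows is a minimal Python sketch, not any vendor’s implementation: `scripted_model` is a hypothetical stand-in for a real model call, and `run_tests` and `read_file` are fake tools that return canned strings. The shape is the point: choose, act, read the result, go around again, stop.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    tool: str           # name of the tool the model chose, or "stop"
    argument: str = ""  # a single string argument, kept simple for the sketch


# Hypothetical tools: canned results instead of real side effects.
def run_tests(_: str) -> str:
    return "2 failures, both in cart totals"


def read_file(path: str) -> str:
    return f"contents of {path}"


TOOLS: dict[str, Callable[[str], str]] = {
    "run_tests": run_tests,
    "read_file": read_file,
}


def scripted_model(goal: str, transcript: list[str]) -> Step:
    """Stand-in for a model call: picks the next step from what it has seen so far."""
    if not transcript:
        return Step("run_tests")
    if "failures" in transcript[-1]:
        return Step("read_file", "cart.ts")
    return Step("stop")


def run_agent(goal: str,
              model: Callable[[str, list[str]], Step],
              tools: dict[str, Callable[[str], str]],
              max_turns: int = 10) -> list[str]:
    """Goal in, loop, done out. max_turns is the give-up half of the stopping condition."""
    transcript: list[str] = []
    for _ in range(max_turns):
        step = model(goal, transcript)             # the model picks the action
        if step.tool == "stop":                    # the model judges the goal met
            break
        result = tools[step.tool](step.argument)   # a tool does the work
        transcript.append(f"{step.tool}({step.argument}) -> {result}")  # result feeds back
    return transcript


if __name__ == "__main__":
    for line in run_agent("Make the failing checkout test pass", scripted_model, TOOLS):
        print(line)
```

Swap `scripted_model` for a real model call and `run_agent` itself would not need to change. The wiring is separate from the intelligence.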

Anthropic, whose engineering guide is the closest thing the field has to a shared definition, puts it in one line: agents are “just LLMs using tools based on environmental feedback in a loop.” Note what’s missing from that sentence: any claim about intelligence. The model inside an agent is often the same model behind the chat box. What changed is the wiring around it.

The homely analogy is already in your living room. A robot vacuum gets a goal (clean floor), has tools (wheels, brushes, bump sensors), runs a loop (move, hit something, adjust), and has a stopping condition (dock when done). Nobody steers it. Nobody scripted its exact path. You judge it by the state of the floor.

A robot vacuum emerging from under a couch on a wooden floor. — You say “clean the floor,” not “forward 2 meters, turn left.” Sense, act, check, repeat, stop when done: a robot vacuum is the loop made visible. Photo by Onur Binay on Unsplash.

The loop, not the model, closed the gap from 2% to 95%

Why does the loop matter so much? Because almost every real task is impossible in one pass. Not hard, impossible. Fixing a bug means reading the code first, and you don’t know which file matters until you look. Booking a flight means seeing prices before choosing one. A single model call, however brilliant, answers from what’s in front of it. A loop can go find out.

SWE-bench, the benchmark from the numbers above, makes this concrete. Each task is a real GitHub issue from a real codebase: make the failing tests pass. In 2023, models were shown the issue and some code and asked to emit a patch, one shot, no feedback. Claude 2, the best of that crop, resolved 1.96%. Today’s leaders run as agents: they navigate the repository, run the test suite, read the errors, edit, and re-run until the suite is green. Same kind of exam, and the loop turns it from a memory test into actual work.

Same exam, 32 months apart · SWE-bench, real GitHub issues

October 2023 · one-shot answer: 1.96%. Best model (Claude 2), handed the issue and asked for a patch in one pass.

June 2026 · agent in a loop: 95.0%. Best agent (Claude Fable 5), allowed to read files, run tests, and retry until done.

From the original SWE-bench paper to the June 2026 Verified leaderboard. The models got far stronger too, but no model answers a codebase question in one shot. The harness made the score possible.

The honest version of the story is that both layers improved: the models are far better at choosing actions, and the harnesses are far better at offering them. But the order matters: the loop is what lets a model’s intelligence touch the world at all. That’s why “agent” is an architecture word, not a marketing tier.

Agent vs. workflow: one question, who picks the next step?

Agents get confused with their better-behaved sibling, the AI workflow. Both use models, both use tools, both run multi-step jobs. The difference fits in a question: who decided the steps?

In a workflow, a person did, in advance. Transcribe the meeting, then summarize, then draft the follow-up email: the same three steps in the same order, every time, with the model filling in content. Anthropic’s guide calls these “predefined code paths.” In an agent, the model decides at runtime. It might take three steps or thirty. Two runs of the same goal can take different routes.

Chat · workflow · agent: the difference in three rows

| | Chat | Workflow | Agent |
| --- | --- | --- | --- |
| Who picks the next step | You, every turn | The builder, in advance | The model, as it goes |
| When it stops | When you stop typing | After the last step | When it judges the goal met |
| Cost of one run | One call | Fixed: n steps, n calls | Unknown until it stops |

Anthropic’s engineering guide draws the same line: workflows run on “predefined code paths”; agents “dynamically direct their own processes and tool usage.”
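The same distinction shows up in code as a question of where the list of steps lives. This is an illustrative Python sketch under simple assumptions: `transcribe`, `summarize`, and `draft_email` are hypothetical stand-ins for model-backed steps, and `choose_next` plays the model’s role of picking an action at runtime.

```python
from typing import Callable, Optional


# Hypothetical stand-ins; in practice each would wrap a model call.
def transcribe(audio: str) -> str:
    return f"transcript of {audio}"


def summarize(text: str) -> str:
    return f"summary of {text}"


def draft_email(text: str) -> str:
    return f"email about {text}"


def run_workflow(audio: str) -> tuple[str, int]:
    """Workflow: the builder fixed the steps and their order in advance."""
    steps = [transcribe, summarize, draft_email]
    value = audio
    for step in steps:
        value = step(value)
    return value, len(steps)  # the call count is known before the run starts


def run_agent_steps(goal: str,
                    choose_next: Callable[[str, list[str]], Optional[str]],
                    actions: dict[str, Callable[[str], str]],
                    max_turns: int = 30) -> tuple[list[str], int]:
    """Agent: choose_next picks each step at runtime from the history so far."""
    history: list[str] = []
    for _ in range(max_turns):
        name = choose_next(goal, history)
        if name is None:  # the chooser decided the goal is met
            break
        previous = history[-1] if history else goal
        history.append(actions[name](previous))
    return history, len(history)  # the call count is only known after it stops
```

In `run_workflow` the step list is a literal in the source file, so the route and the cost can be audited before anything runs. In `run_agent_steps` the route exists only after the run.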

Neither is the upgrade of the other. A workflow is predictable: same steps, same cost, auditable in advance, which is exactly what you want for the report you generate every Monday. An agent handles what you couldn’t script: the bug you haven’t diagnosed, the question whose answer determines the next question. The guide’s own advice is to default down, not up: find “the simplest solution possible” (often a single well-crafted model call) and add machinery only when the task demands it.

The loop is a meter that runs

Autonomy has a price, and it’s denominated in tokens. Every trip around the loop sends the model the goal, the tool results so far, and the growing transcript of its own reasoning, and bills for all of it, again. A chat exchange is one metered call. An agent is a meter that runs until the loop decides to stop.
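A back-of-the-envelope model shows why the meter speeds up as it runs. The sketch below assumes the simplest possible billing: every turn re-sends the whole context as input, and every turn adds a fixed number of tokens to it. It ignores output tokens and any prompt caching, and the numbers in the usage note are hypothetical, not measurements.

```python
def input_tokens_billed(turns: int, goal_tokens: int, tokens_added_per_turn: int) -> int:
    """Total input tokens billed when each turn re-sends the full, growing context."""
    total = 0
    context = goal_tokens
    for _ in range(turns):
        total += context                  # the whole context is sent, and billed, again
        context += tokens_added_per_turn  # tool results and reasoning join the transcript
    return total


if __name__ == "__main__":
    for turns in (1, 3, 10):
        print(turns, input_tokens_billed(turns, goal_tokens=100, tokens_added_per_turn=500))
```

Under those assumptions a single call bills 100 input tokens and a ten-turn run bills 23,500. The total grows with the square of the turn count, because each new turn pays again for every turn before it.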

Anthropic published the multipliers from its own production systems: agents burn about 4× the tokens of a chat interaction, and multi-agent systems about 15×. The same write-up reports that in one internal evaluation, token usage by itself explained 80% of the performance variance: the systems that thought longer and searched more did better. Read those two findings together and the economics of agents fall out: performance is bought with tokens, so an agent that works is, almost by definition, an agent that spends.

A vintage General Electric analog voltmeter with the needle high on the dial. — A chat call is a price tag; an agent is a meter. It keeps charging until the loop decides to stop. Photo by Thomas Kelley on Unsplash.

For a daily-driver tool this is manageable: dollars, not cents, per serious task. The trap is the unbounded case: a loop that can’t tell it’s stuck will happily re-read the same files all afternoon. Every production agent framework ends up with the same safety rail for this reason: a budget cap and a maximum number of turns, after which the agent must stop and show a human what it has.
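That safety rail is small enough to sketch. The Python below is an assumption-laden illustration rather than any framework’s API: `step` is a hypothetical callable that runs one turn of the loop and reports whether the goal is met and how many tokens the turn cost.

```python
from typing import Callable, NamedTuple


class RunResult(NamedTuple):
    status: str       # "done" or "needs_human"
    turns: int        # turns actually taken
    tokens_used: int  # tokens spent across those turns


def run_with_budget(step: Callable[[], tuple[bool, int]],
                    max_turns: int,
                    max_tokens: int) -> RunResult:
    """Run step() until it reports done, or until either cap is hit."""
    tokens_used = 0
    for turn in range(1, max_turns + 1):
        done, tokens = step()
        tokens_used += tokens
        if done:
            return RunResult("done", turn, tokens_used)
        if tokens_used >= max_tokens:
            # Budget cap: stop and hand what we have to a person.
            return RunResult("needs_human", turn, tokens_used)
    # Turn cap: the loop could not finish in the allowed number of turns.
    return RunResult("needs_human", max_turns, tokens_used)
```

A loop that never makes progress, such as `lambda: (False, 1000)`, ends after five turns with a 5,000-token cap instead of running all afternoon.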

Agents are impressive once and unreliable eight times

Here is the number that separates demos from deployments. τ-bench, a benchmark that tests agents on customer-service tasks (airline rebookings, retail returns, with policies to follow), found that GPT-4o-based agents completed fewer than 50% of tasks. The sharper finding is what happened on repeat: asked to handle the same task eight times in a row, the agent got it right every time in fewer than 25% of cases. Not a new task. The same one.

That’s the difference between a colleague and a slot machine. A process you’d hand to a junior employee has to work the eighth time, too. And the math is unforgiving: per-run reliability compounds, so even an agent that genuinely clears 90% becomes a coin flip over a week of daily runs.

What “90% reliable” turns into on repeat

Odds that an agent which succeeds 90% of the time gets the same job right every time, k runs in a row.

| Runs in a row (k) | Odds every run succeeds |
| --- | --- |
| 1 | 90% |
| 2 | 81% |
| 4 | 66% |
| 8 | 43% |

Pure arithmetic: 0.9 multiplied by itself k times. Measured agents do worse: τ-bench found GPT-4o-based agents passed under 50% of customer-service tasks once, and under 25% eight times in a row.
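The arithmetic is one line of Python. The sketch assumes each run succeeds or fails independently of the others, which real agents may not satisfy; the function name is illustrative.

```python
def all_runs_succeed(per_run_success: float, runs: int) -> float:
    """Probability that every one of `runs` independent attempts succeeds."""
    return per_run_success ** runs


if __name__ == "__main__":
    for k in (1, 2, 4, 8):
        print(f"{k} in a row: {all_runs_succeed(0.9, k):.0%}")
```

With `per_run_success=0.9` this reproduces the 90%, 81%, 66%, and 43% figures above. Seven runs in a row, a week of daily use, comes out at about 48%.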

This is why the corporate version of the story is messier than the demos. Gartner predicted that over 40% of agentic AI projects will be canceled by the end of 2027, citing costs, unclear value, and weak risk controls, and estimated that of the thousands of vendors selling “agents,” only about 130 offer the real thing. The rest is rebranded chatbots and scripted automation: “agent washing.” If you’re evaluating one, ignore the label and the leaderboard, and run it on twenty of your own tasks. More than once each.
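A do-it-yourself version of that test can be sketched in a few lines of Python. It is a simplification, not τ-bench’s own estimator: `agent` is a hypothetical callable that takes a task description and returns whether the run passed your check, and the harness reports how often a task passes once versus every time.

```python
from typing import Callable


def evaluate(agent: Callable[[str], bool], tasks: list[str], repeats: int) -> dict[str, float]:
    """Run every task `repeats` times; compare first-run success with every-run success."""
    if not tasks or repeats < 1:
        raise ValueError("need at least one task and one repeat")
    passed_first = 0
    passed_all = 0
    for task in tasks:
        outcomes = [agent(task) for _ in range(repeats)]
        passed_first += outcomes[0]   # did it work the first time?
        passed_all += all(outcomes)   # did it work every time?
    n = len(tasks)
    return {"first_run": passed_first / n, "every_run": passed_all / n}
```

The gap between the two numbers is the thing to watch. A high `first_run` with a low `every_run` is the slot-machine pattern described above.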

The leash is getting longer on a measurable clock

None of this is a verdict on agents; it’s a snapshot of a moving object, and the motion has been measured. METR, an AI evaluation lab, tracks the length of task (in human working time) that frontier models can finish autonomously at least half the time. Their headline finding: that horizon has doubled roughly every seven months for six years.

Their January 2026 update put the frontier at about 320 minutes (five-plus hours of human work) and clocked the doubling time since 2024 at under three months. In 2022, the same measure was a couple of minutes. The practical translation: the loop you couldn’t trust with a ten-minute task two years ago is now finishing afternoon-sized ones, and the trend line hasn’t bent yet.
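The doubling claim is easy to sanity-check with exponent arithmetic. The Python below is only that, arithmetic on the figures quoted above; extending the curve forward is an assumption, not a forecast, and both function names are illustrative.

```python
import math


def growth_factor(elapsed_months: float, doubling_months: float) -> float:
    """How many times longer the task horizon gets over `elapsed_months`."""
    return 2 ** (elapsed_months / doubling_months)


def months_to_reach(current_minutes: float, target_minutes: float,
                    doubling_months: float) -> float:
    """Months needed to grow from the current horizon to the target, if the trend holds."""
    return doubling_months * math.log2(target_minutes / current_minutes)


if __name__ == "__main__":
    # Six years of doubling every seven months.
    print(round(growth_factor(72, 7)))
```

Six years at a seven-month doubling time is a bit more than ten doublings, which works out to an increase of over a thousandfold.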

An airport control tower lit at dusk behind a perimeter fence. — Autopilot flies the plane; the tower decides how much airspace it gets. Agent autonomy is granted the same way: in minutes of unsupervised work, extended as trust is earned. Photo by Parker Sturdivant on Unsplash.

The half-the-time caveat is doing real work in that sentence, though: 50% reliability is a research milestone, not a deployment standard. Treat autonomy like airspace: granted in increments, measured in minutes, expanded as the safety record earns it.

Use the smallest thing that works

Strip the hype and the decision is a ladder with three rungs.

One model call when the task fits in a single pass: summarize, draft, translate, classify. Most tasks live here, and the chat box already does this well.
A workflow when you know the steps and repeat them: the same pipeline every week, predictable cost, auditable in advance.
An agent when you can state the goal but not the steps, and the result is checkable, the actions are reversible, and the budget is capped.
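The ladder can be written down as a short decision function. This Python sketch encodes the three rungs in order; the parameter names are illustrative, and the final fallback is an assumption, since the ladder above does not say what to do when an agent’s preconditions fail.

```python
def choose_approach(*,
                    fits_single_pass: bool,
                    steps_known: bool,
                    result_checkable: bool,
                    actions_reversible: bool,
                    budget_capped: bool) -> str:
    """Pick the lowest rung that fits the task."""
    if fits_single_pass:
        return "single model call"
    if steps_known:
        return "workflow"
    if result_checkable and actions_reversible and budget_capped:
        return "agent"
    # Assumption: none of the rungs fit, so a person stays in charge of the steps.
    return "keep a human in the loop"
```

The checks run from cheapest to most autonomous on purpose, so a task that fits in one pass never reaches the agent branch.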

And keep the definition as your filter. “Agent” on a pricing page tells you nothing; the loop does. Does it choose its own next step from feedback? Does it know when to stop? What happens on the eighth run? A model in a loop with tools and a stopping condition. Everything else is packaging.

Agents: quick answers

Is ChatGPT an agent?

The chat box isn’t. The moment it starts taking actions on its own (browsing, running code, retrying after an error), you’re watching an agent. Same model; the loop is the difference.

What's the difference between an agent and a workflow?

Who picks the next step. In a workflow, a person wired the steps in advance and the AI fills in slots. In an agent, the model reads the result of each action and chooses the next one itself.

Why do agents cost more than chat?

Every trip around the loop re-reads the growing transcript and bills for it. Anthropic measured agents at roughly 4× the tokens of a chat exchange, and multi-agent systems at about 15×.

Can I trust an agent to run unsupervised?

For minutes-long tasks with a clear pass/fail check, increasingly yes. For long or irreversible work, no. Success rates that look fine per run compound into coin flips over a week of daily runs. Cap the budget, log every action, and keep the undo button.

The word will keep inflating; the mechanism won’t. Goal, tools, loop, stop. See those four parts clearly and you can price an agent, test an agent, and decide what you’d actually trust one to do, this year, and each longer-leashed year after.
