TutorialLocal AIOpen SourceJuly 5, 20269 min read

Run Ornith locally: the open coding model built for agents, not you

It scores next to Claude Opus 4.7 at fixing real bugs, and it's a free MIT download. The strange part: it wrote its own training guide.

By Atul

Ornith-1.0 · DeepReinforce · MIT

Open weights

A coding model that, during training, wrote its own study guide.

One step of its training loop

1 · Write the plan

Read the task and its last attempt, then draft a sharper approach for solving it.

→

2 · Solve the task

Use that plan to write and run the code, and see whether the tests pass.

↵ The score feeds back into both steps. Repeated millions of times, the model tunes not just its code but the plan it writes for itself first.

Every other open coder learns inside a training harness a human engineered. Ornith learned to write its own.

A new open coding model just scored 82.4 on a benchmark of real GitHub bug fixes, roughly where Claude Opus 4.7 lands, and a smaller version of it runs on a gaming PC. The whole family is free, the license is MIT, and you can download it today. The odd part is not the score. It is how the model learned to code: nobody handed it a training harness, so it built its own.

The model is Ornith-1.0, released on June 25, 2026 by DeepReinforce, a small research lab better known until now for a CUDA optimization project than for foundation models. Ornith is not a general chatbot that also writes code. It is a specialist, tuned to work like a software agent: read a repository, run the tests, fix what breaks, and repeat. This is the lineup, the strange training trick behind it, which size fits your machine, how to run it, and the one thing it is genuinely bad at.

Ornith is built to act, not to chat

Most AI coding help is a conversation. You describe a problem, the model replies with a block of code, you paste it in, and if it fails you go back and ask again. That loop keeps a human in the middle of every step. Ornith is built for the other shape of the job, the one where the model takes a task and drives it to done on its own: opening files, running the test suite, reading the failure, editing a patch, and running the tests again until they pass.

That shape has a name now, and a whole category behind it. An agent is a model in a loop with tools, and agentic coding is that loop pointed at a codebase. DeepReinforce is blunt that this is the whole design: Ornith is, in the words of one write-up, built for agents, not humans. It is meant to sit inside a coding tool and work, not to hold a friendly conversation about your weekend.

Why does a specialist beat a generalist here? Budget. A model that has to be good at poetry, trivia, translation, and small talk spends its training on all of it. A model that only has to fix code spends every bit of that budget on the one job. That focus is how a family topping out at 397 billion parameters trades punches with far larger general models on coding, and it is the same bet Poolside made with its code-native Laguna models. The catch, which we will come back to, is that the focus cuts both ways.

Code on dual monitors under warm ambient light, a workstation set up for hands-on programming. — Ornith is tuned for the run-the-tests, fix-the-failure loop that most “agentic” coding actually is. Photo by Fotis Fotopoulos on Unsplash.

It learned by writing its own coaching notes

Here is the part that makes Ornith worth a blog post rather than a line in a roundup. Training a coding model with reinforcement learning means putting it in a harness: a fixed set of instructions and tools that tell the model how to approach a problem before it attempts an answer. Engineers usually design that harness by hand, and its quality caps how well the model can learn. A clumsy harness teaches clumsy habits.

Ornith removes the human from that step. At each round of training, the model does two things in sequence. First it reads the task and its own previous approach, then proposes a better one: this is the scaffold, the plan it writes for itself. Second it uses that fresh plan to actually attempt the fix. The reward from whether the code worked flows back into both steps, so the model is graded not only on the patch it wrote but on the plan it drew up first. DeepReinforce describes the mechanics, a token-level GRPO objective over the two stages, in its release write-up.

The name is the tell. Ornith comes from the Greek for bird, and the pitch is that, like a bird building its own nest, the model constructs its own scaffolding before it does the work. Repeat that a few million times and the model does not inherit someone else’s playbook for solving bugs. It writes, tests, and sharpens its own. That is a genuinely different way to train a coder, and it is why a lab with almost no foundation-model track record showed up with numbers worth checking.

A weaver bird perched in the intricate hanging nest it built from woven grass. — The metaphor DeepReinforce chose: a bird builds its own structure before it settles in. Ornith writes its own training scaffold before it solves. Photo by Ali Kazal on Unsplash.

Four sizes, and only two fit on your desk

Ornith-1.0 ships as a family of four, and the split maps cleanly onto the hardware you have. Two are plain dense models: a 9B that runs on a laptop and a 31B that wants a mid-range GPU or a well-specced Mac. Two are Mixture-of-Experts models, where the total parameter count is large but only a slice fires on each token: a 35B that behaves like something much smaller to run, and a 397B flagship built for a server. All four carry the same 262,144-token context window, roughly a small codebase held in memory at once.

The Ornith-1.0 family, smallest to largest

Model

Params

Ctx

Min hardware

What it's for

Ornith 9B

9B dense

262K

Laptop, ~19 GB

The smallest: agentic coding on modest hardware

Ornith 31B

31B dense

262K

24–32 GB GPU/Mac

The dense middle: more headroom, still one machine

Ornith 35B

35B / ~3B active

262K

Single consumer GPU

The local sweet spot: MoE speed, one-card footprint

Ornith 397B

397B MoE

262K

Multi-GPU server

The flagship: frontier-class scores, server hardware

Sizes and the shared 262K-token context from the Hugging Face model cards. The 35B is a Mixture-of-Experts model: only about 3B parameters fire per token, which is why it runs at the speed of a much smaller model.

For almost everyone reading this, the choice is between the 9B and the 35B. The 9B is the one that fits an ordinary laptop with room to spare. The 35B MoE is the sweet spot if you have a single decent graphics card or an Apple Silicon machine with enough memory: it fires only about 3 billion parameters per token, so it answers at the speed of a small model while holding the knowledge of a bigger one. The 31B dense model is the pick if you want dense-model consistency and have the memory for it. The 397B is a different project entirely, one you stand up on a multi-GPU box for a whole team, not something you install on a Tuesday.

The scores are strong, with an honest asterisk

Take the benchmarks at face value first, then add the caveats. On SWE-bench Verified, a set of real bugs from open-source projects where a model only passes an item when its patch makes the project’s own tests go green, the 397B flagship scores 82.4. On Terminal-Bench 2.1, which checks whether a model can drive a command line to finish a task, it hits 77.5. Both numbers beat Claude Opus 4.7 and the open DeepSeek V4-Pro, per DeepReinforce’s reported figures. For an open-weight model from an unknown lab, that is a real result.

Two coding benchmarks, and where each model runs

Model

SWE-bench Ver.

Terminal-Bench

Runs where

Claude Opus 4.8

87.6

cloud, closedclosed

Ornith 397B

82.4

77.5

your serveropen

Claude Opus 4.7

80.8

70.3

cloud, closedclosed

DeepSeek V4-Pro

80.6

server, openopen

Ornith 9B

69.4

43.1

your laptopopen

Scores as reported by DeepReinforce, collected by MarkTechPost. The flagship 397B slots between two Claude generations; the laptop-run 9B sits lower but is the one you actually install.

Now the asterisks, because a fair post owes you both. The frontier-matching numbers belong to the 397B, which needs server-class hardware; they are not what your laptop runs. Claude Opus 4.8, released since, already scores higher at 87.6 and 85, so “matches the frontier” means last quarter’s frontier, not this week’s. And the scores are self-reported, not yet independently replicated across the board. GLM-5.2, another open coder that shipped within a day of Ornith, edges the flagship on Terminal-Bench at 81.0, though it is nearly twice the size and sold mostly as a hosted API.

The laptop-class 9B scores lower, 69.4 and 43.1, and that is the honest number for what most people will run. It is still useful: the independent developer Simon Willison ran the 35B build locally through LM Studio, wired it into an agent, and reported that it correctly found specific code in a real repository, with “initial impressions very good.” A model that finds the right file and proposes a working change, entirely on your machine, is worth more than a leaderboard row.

Running it takes one command

The fastest path is Ollama, the same one-command runner the rest of the local-model guides lean on. DeepReinforce publishes GGUF builds directly, so you pull the model straight from Hugging Face:

ollama run hf.co/deepreinforce-ai/Ornith-1.0-35B-GGUF:Q4_K_M

That default 4-bit build is a 21 GB download. Higher-precision tags trade disk and memory for sharper answers, and the 9B repository follows the same pattern if you want the lighter model.

Ornith 35B, pick the precision your memory allows

Ollama tag

Size

What you get

:Q4_K_M

21.2 GB

4-bit: the everyday pick for one consumer GPU or a 32 GB Mac

:Q5_K_M

24.7 GB

5-bit: a step sharper if the memory is there

:Q6_K

28.5 GB

6-bit: close to full quality

:Q8_0

36.9 GB

8-bit: near-lossless, workstation memory

:bf16

69.4 GB

Full precision: Mac Studio or multi-GPU territory

GGUF file sizes from the official 35B GGUF repository. Every tag carries the full 262K context; leave a few gigabytes free for the code you feed it.

Prefer a different runner? The same weights work with llama.cpp (llama serve -hf deepreinforce-ai/Ornith-1.0-35B-GGUF:Q4_K_M), and DeepReinforce ships full weights for vLLM and SGLang, which is the route you take to serve the big 397B to a team. One practical note: the model card recommends a temperature of 0.6 with top-p 0.95, and you set --max-model-len 262144 if you actually need the full context. Once it is serving on a local port, point an editor extension or an agent framework at that endpoint and you have a private coding assistant that never sends a line of your source anywhere.

The 35B MoE is aimed squarely at a single consumer GPU, which is what makes it the family’s local default. Photo by Mariia Shalabaieva on Unsplash.

It is a narrow specialist, and that is the point

Now the thing Ornith is bad at, stated plainly because the model card states it plainly: it may underperform on anything outside agentic coding. This is not a general assistant. Ask it to draft an email, explain a tax form, or brainstorm names, and a smaller general model will very likely do better. Ornith spent its whole training budget learning to fix code the way an agent fixes code, and the flip side of that focus is a model with little to say about the rest of your day.

Read the right way, that is a feature. You do not need one model to do everything, you need the right few. Ornith is a strong candidate for the coding slot: an MIT-licensed, open-weight, agent-shaped coder that runs on your own hardware. The license matters as much as the scores here. MIT means you can use it commercially, fine-tune it on your private repositories, and ship what it writes with no royalty and no regional restriction. It sits on top of Gemma 4 and Qwen 3.5, both Apache-2.0 base models, so the licensing chain is clean the whole way down.

The takeaway is short. If you want one AI that chats, plans, and codes, Ornith is the wrong tool and you should reach for a generalist. If you want a coding model that behaves like an agent, keeps your source on your own machine, and costs nothing to run all day, download the 9B on a laptop or the 35B on a GPU and point your tools at it. The bird built its own nest. You just get to move in.

Run Ornith locally: the open coding model built for agents, not you

Ornith is built to act, not to chat

It learned by writing its own coaching notes

Four sizes, and only two fit on your desk

The scores are strong, with an honest asterisk

Running it takes one command

It is a narrow specialist, and that is the point

Run Poolside Laguna locally: a coding model your source never leaves

The best on-device AI apps for Linux (2026)

The best on-device AI apps for Windows (2026)

One-time payment. Yours forever.