Run Ornith locally: the open coding model built for agents, not you
It scores next to Claude Opus 4.7 at fixing real bugs, and it's a free MIT download. The strange part: it wrote its own training guide.
A new open coding model just scored 82.4 on a benchmark of real GitHub bug fixes, roughly where Claude Opus 4.7 lands, and a smaller version of it runs on a gaming PC. The whole family is free, the license is MIT, and you can download it today. The odd part is not the score. It is how the model learned to code: nobody handed it a training harness, so it built its own.
The model is Ornith-1.0, released on June 25, 2026 by DeepReinforce, a small research lab better known until now for a CUDA optimization project than for foundation models. Ornith is not a general chatbot that also writes code. It is a specialist, tuned to work like a software agent: read a repository, run the tests, fix what breaks, and repeat. This is the lineup, the strange training trick behind it, which size fits your machine, how to run it, and the one thing it is genuinely bad at.
Ornith is built to act, not to chat
Most AI coding help is a conversation. You describe a problem, the model replies with a block of code, you paste it in, and if it fails you go back and ask again. That loop keeps a human in the middle of every step. Ornith is built for the other shape of the job, the one where the model takes a task and drives it to done on its own: opening files, running the test suite, reading the failure, editing a patch, and running the tests again until they pass.
That shape has a name now, and a whole category behind it. An agent is a model in a loop with tools, and agentic coding is that loop pointed at a codebase. DeepReinforce is blunt that this is the whole design: Ornith is, in the words of one write-up, built for agents, not humans. It is meant to sit inside a coding tool and work, not to hold a friendly conversation about your weekend.
Why does a specialist beat a generalist here? Budget. A model that has to be good at poetry, trivia, translation, and small talk spends its training on all of it. A model that only has to fix code spends every bit of that budget on the one job. That focus is how a family topping out at 397 billion parameters trades punches with far larger general models on coding, and it is the same bet Poolside made with its code-native Laguna models. The catch, which we will come back to, is that the focus cuts both ways.

It learned by writing its own coaching notes
Here is the part that makes Ornith worth a blog post rather than a line in a roundup. Training a coding model with reinforcement learning means putting it in a harness: a fixed set of instructions and tools that tell the model how to approach a problem before it attempts an answer. Engineers usually design that harness by hand, and its quality caps how well the model can learn. A clumsy harness teaches clumsy habits.
Ornith removes the human from that step. At each round of training, the model does two things in sequence. First it reads the task and its own previous approach, then proposes a better one: this is the scaffold, the plan it writes for itself. Second it uses that fresh plan to actually attempt the fix. The reward from whether the code worked flows back into both steps, so the model is graded not only on the patch it wrote but on the plan it drew up first. DeepReinforce describes the mechanics, a token-level GRPO objective over the two stages, in its release write-up.
The name is the tell. Ornith comes from the Greek for bird, and the pitch is that, like a bird building its own nest, the model constructs its own scaffolding before it does the work. Repeat that a few million times and the model does not inherit someone else’s playbook for solving bugs. It writes, tests, and sharpens its own. That is a genuinely different way to train a coder, and it is why a lab with almost no foundation-model track record showed up with numbers worth checking.

Four sizes, and only two fit on your desk
Ornith-1.0 ships as a family of four, and the split maps cleanly onto the hardware you have. Two are plain dense models: a 9B that runs on a laptop and a 31B that wants a mid-range GPU or a well-specced Mac. Two are Mixture-of-Experts models, where the total parameter count is large but only a slice fires on each token: a 35B that behaves like something much smaller to run, and a 397B flagship built for a server. All four carry the same 262,144-token context window, roughly a small codebase held in memory at once.
For almost everyone reading this, the choice is between the 9B and the 35B. The 9B is the one that fits an ordinary laptop with room to spare. The 35B MoE is the sweet spot if you have a single decent graphics card or an Apple Silicon machine with enough memory: it fires only about 3 billion parameters per token, so it answers at the speed of a small model while holding the knowledge of a bigger one. The 31B dense model is the pick if you want dense-model consistency and have the memory for it. The 397B is a different project entirely, one you stand up on a multi-GPU box for a whole team, not something you install on a Tuesday.
The scores are strong, with an honest asterisk
Take the benchmarks at face value first, then add the caveats. On SWE-bench Verified, a set of real bugs from open-source projects where a model only passes an item when its patch makes the project’s own tests go green, the 397B flagship scores 82.4. On Terminal-Bench 2.1, which checks whether a model can drive a command line to finish a task, it hits 77.5. Both numbers beat Claude Opus 4.7 and the open DeepSeek V4-Pro, per DeepReinforce’s reported figures. For an open-weight model from an unknown lab, that is a real result.
Now the asterisks, because a fair post owes you both. The frontier-matching numbers belong to the 397B, which needs server-class hardware; they are not what your laptop runs. Claude Opus 4.8, released since, already scores higher at 87.6 and 85, so “matches the frontier” means last quarter’s frontier, not this week’s. And the scores are self-reported, not yet independently replicated across the board. GLM-5.2, another open coder that shipped within a day of Ornith, edges the flagship on Terminal-Bench at 81.0, though it is nearly twice the size and sold mostly as a hosted API.
The laptop-class 9B scores lower, 69.4 and 43.1, and that is the honest number for what most people will run. It is still useful: the independent developer Simon Willison ran the 35B build locally through LM Studio, wired it into an agent, and reported that it correctly found specific code in a real repository, with “initial impressions very good.” A model that finds the right file and proposes a working change, entirely on your machine, is worth more than a leaderboard row.
Running it takes one command
The fastest path is Ollama, the same one-command runner the rest of the local-model guides lean on. DeepReinforce publishes GGUF builds directly, so you pull the model straight from Hugging Face:
ollama run hf.co/deepreinforce-ai/Ornith-1.0-35B-GGUF:Q4_K_MThat default 4-bit build is a 21 GB download. Higher-precision tags trade disk and memory for sharper answers, and the 9B repository follows the same pattern if you want the lighter model.
Prefer a different runner? The same weights work with llama.cpp (llama serve -hf deepreinforce-ai/Ornith-1.0-35B-GGUF:Q4_K_M), and DeepReinforce ships full weights for vLLM and SGLang, which is the route you take to serve the big 397B to a team. One practical note: the model card recommends a temperature of 0.6 with top-p 0.95, and you set --max-model-len 262144 if you actually need the full context. Once it is serving on a local port, point an editor extension or an agent framework at that endpoint and you have a private coding assistant that never sends a line of your source anywhere.

It is a narrow specialist, and that is the point
Now the thing Ornith is bad at, stated plainly because the model card states it plainly: it may underperform on anything outside agentic coding. This is not a general assistant. Ask it to draft an email, explain a tax form, or brainstorm names, and a smaller general model will very likely do better. Ornith spent its whole training budget learning to fix code the way an agent fixes code, and the flip side of that focus is a model with little to say about the rest of your day.
Read the right way, that is a feature. You do not need one model to do everything, you need the right few. Ornith is a strong candidate for the coding slot: an MIT-licensed, open-weight, agent-shaped coder that runs on your own hardware. The license matters as much as the scores here. MIT means you can use it commercially, fine-tune it on your private repositories, and ship what it writes with no royalty and no regional restriction. It sits on top of Gemma 4 and Qwen 3.5, both Apache-2.0 base models, so the licensing chain is clean the whole way down.
The takeaway is short. If you want one AI that chats, plans, and codes, Ornith is the wrong tool and you should reach for a generalist. If you want a coding model that behaves like an agent, keeps your source on your own machine, and costs nothing to run all day, download the 9B on a laptop or the 35B on a GPU and point your tools at it. The bird built its own nest. You just get to move in.


