The best open AI models you can run locally right now
The model that nearly tied the closed coding leaders is a free download — if you pick the right one for your RAM, and dodge the license traps.
The open-weight model that sat within two points of the closed coding leaders this spring is a file you can download tonight and run with your Wi-Fi switched off. That is genuinely new. For most of the AI era, the good models lived on someone else’s servers and metered every question. In 2026 the best ones you can own are good enough for the daily eighty percent of real work (writing, coding, transcription, image generation, searching your own documents), and they cost nothing per use.
The catch is no longer capability. It’s choice. There are roughly forty open models worth installing, the good lists rot in a fortnight, and some of the licenses are traps that read as “free” right up until a lawyer looks. So here is a map. Tell me how much memory your computer has and what you want to do, and there’s one model you should download. The map is organized by hardware tier, because that’s the only thing that decides what will actually load, and it includes the license column most guides skip. This was written in June 2026; treat every name as a placeholder for “whatever is current in that slot.”

Start with your RAM, not the leaderboard
Every “best local model” list gets this backwards. The model that tops a benchmark this week is useless to you if it won’t fit in your memory — and memory is the wall almost everyone hits first. On a Mac, the number that matters is the unified memory on the spec sheet you bought. On a PC with a graphics card, it’s the VRAM on the card, not the system RAM — a tower with 64 GB of DDR5 and an 8 GB GPU is a small tier, not a big one.
The arithmetic is simple. A model needs roughly its parameter count times the bytes used per weight. At the 16-bit precision most weights are published in, that’s 2 bytes each, so a 7-billion-parameter model wants ~14 GB, too much for a normal laptop. Quantization is the trick that fixes it: rewrite each weight at fewer bits. At 4-bit, the common “Q4”, the rule of thumb is about 0.5 GB per billion parameters plus some overhead, so that same 7B model drops to ~4–5 GB and runs comfortably. The llama.cpp quantization tables spell out the exact sizes.
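The rule of thumb above fits in a few lines of Python. This is a rough sketch, not a sizing tool: the function name `estimate_weight_gb` is made up for illustration, it counts the weights only, and it assumes a flat number of bits per weight. Real Q4 files and the runtime's working memory add overhead on top, which is why the 7B figure in practice lands nearer 4–5 GB than the bare 3.5 GB computed here.

```python
def estimate_weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes).

    Hypothetical helper: parameter count times bytes per weight.
    Ignores context memory and per-format overhead.
    """
    bytes_per_weight = bits_per_weight / 8
    return params_billion * bytes_per_weight


if __name__ == "__main__":
    for bits in (16, 8, 4):
        size = estimate_weight_gb(7, bits)
        print(f"7B at {bits}-bit: ~{size:.1f} GB")
    # 7B at 16-bit: ~14.0 GB
    # 7B at 8-bit: ~7.0 GB
    # 7B at 4-bit: ~3.5 GB
```

Treat the result as a floor: if the bare number already exceeds your memory, the model will not load, and if it only just fits, expect trouble once the context grows.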
The worry used to be that 4-bit wrecked quality. It mostly doesn’t anymore. Q4_K_M, the popular setting, keeps roughly 95% of full-precision quality at a quarter of the bytes; Q8 is near-lossless; below Q4 the drop becomes visible, and reasoning and math are the first things to suffer. Practical reading: install at Q4, and only walk up to Q5 or Q6 for code that has to compile or proofs that have to hold.

That gives four tiers most readers live in: 8 GB (any modern laptop, 1–4B models), 16 GB (a decent laptop, 7–9B comfortably), 32 GB (a developer laptop, 12–14B daily and a ~30B at the edge), and 64 GB or more (a loaded Mac or a desktop with a 24 GB GPU, where 27–32B models breathe and a 70B runs if you’re patient). If you only care about text and code, the five-by-five flowchart drills into those tiers in more detail; this post goes wider, across every modality.
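As a sketch, the four tiers can be written as a small lookup. The names `TIERS` and `tier_for` are hypothetical, and the input is assumed to be the tier number as this guide uses it: unified memory on a Mac, or the tier a GPU desktop is placed in (the text puts a 24 GB card in the top tier, so raw VRAM does not map one-to-one onto these labels).

```python
from typing import Optional, Tuple

# (minimum GB, tier label, model sizes the text associates with the tier).
# Ordered from largest to smallest so the first match wins.
TIERS = [
    (64, "64 GB or more", "27-32B comfortably, 70B if you're patient"),
    (32, "32 GB", "12-14B daily, ~30B at the edge"),
    (16, "16 GB", "7-9B comfortably"),
    (8, "8 GB", "1-4B"),
]


def tier_for(memory_gb: float) -> Optional[Tuple[str, str]]:
    """Return (tier label, typical model sizes), or None below the 8 GB tier."""
    for floor, label, models in TIERS:
        if memory_gb >= floor:
            return label, models
    return None


if __name__ == "__main__":
    for gb in (8, 18, 36, 96):
        print(gb, "->", tier_for(gb))
```

Rounding down is deliberate: an 18 GB machine is treated as a 16 GB tier, since the operating system and other apps also need room.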
The jobs that scale with your memory
Four jobs grow with the model, and the model grows with your RAM: chatting and writing, hard reasoning, coding, and reading images. Here is the pick in each square — the model I’d install myself, at the quant that fits.
The grid shrinks to a handful of families because the open world has consolidated. Qwen (Alibaba, Apache 2.0) is the best-rounded family right now: strong at code, multilingual by default, and its sparse mixture-of-experts builds like the 35B-A3B activate only ~3B parameters per token, so they run at small-model speed from big-model weights. DeepSeek (MIT) leans into reasoning; you can’t fit the 1.6-trillion-parameter V4 on a laptop, but the R1-Distill models bake its long “thinking” into 7B, 14B, and 32B weights you can. Gemma (Google) is small, tidy, and ships vision across every size with a generous context window. Mistral (mostly Apache) is the efficient European workhorse, and Phi-4 (Microsoft, MIT) punches above its weight on math at the bottom of the size range.
One name is conspicuously fading: Llama. Meta shipped its first closed model this spring and walked away from open releases, so Llama 3.3 and 4 are now legacy picks rather than the safe default they were a year ago. The open frontier is carried by Qwen, DeepSeek, and Moonshot’s Kimi today — and notably, all three are Chinese labs.
Image, voice, and search fit almost anywhere
The other half of local AI barely touches the tier chart. Speech models, text-to-speech, and embedding models are small enough to run on almost any machine; image generation is the lone exception, leaning on the GPU rather than system RAM. Here’s the short list, with the one job that needs real hardware flagged.
The standouts are worth naming. Kokoro is an 82-million-parameter text-to-speech model, small enough to run faster than real time on a plain CPU, which makes it the default for local narration. For transcription, Whisper remains the accuracy benchmark, while Moonshine is the one to run on a Raspberry Pi or inside a live voice agent. On the image side, Qwen-Image-2.0 and FLUX.2 [klein] are the local stars, the latter fast enough for near-interactive generation on a consumer card. And the least glamorous pick on the page, Qwen3-Embedding 0.6B, is the model that turns a folder of PDFs into something you can actually ask questions of: the quiet engine behind every “chat with your documents” feature.
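To make the embedding step concrete, here is a minimal sketch of the retrieval behind a "chat with your documents" feature. An embedding model turns each document and each question into a vector, and the closest vectors win. The three-number vectors and file names below are invented for illustration; they are not output from Qwen3-Embedding, whose real vectors are far longer.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank(query_vec: List[float], docs: Dict[str, List[float]]) -> List[str]:
    """Return document names ordered from most to least similar to the query."""
    return sorted(docs, key=lambda name: cosine(query_vec, docs[name]), reverse=True)


# Made-up embeddings standing in for what an embedding model would produce.
DOCS = {
    "lease.pdf": [0.9, 0.1, 0.0],
    "recipe.pdf": [0.0, 0.2, 0.9],
    "invoice.pdf": [0.7, 0.6, 0.1],
}

if __name__ == "__main__":
    query = [1.0, 0.2, 0.0]  # hypothetical embedding of a question about rent
    print(rank(query, DOCS))
    # ['lease.pdf', 'invoice.pdf', 'recipe.pdf']
```

In a real setup the top-ranked passages would then be handed to a chat model as context; only the ranking step is shown here.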

“Open” is not one word — and it can get you sued
Here is the column nobody prints, and the one most likely to cost you. “Open” covers three different permissions, and the gap between downloading a model and shipping a product on it is where people get hurt.
The truly open bucket — Apache 2.0 and MIT — means what you hope it does: run it, fine-tune it, ship it, sell it, no fee. Qwen, DeepSeek, Mistral’s open lineup, Phi-4, Kokoro, Whisper, and Qwen-Image all live here. The middle bucket is “open weights with strings.” Meta’s Llama Community License is free for commercial use unless your product crosses 700 million monthly active users — a cap no real open-source license contains. Google’s Gemma terms allow commercial use too, but bind you to a prohibited-use policy and reserve Google’s right to restrict usage.
The third bucket is the actual trap. FLUX.1 [dev] and its sharper successors are non-commercial — gorgeous, freely downloadable, and forbidden in anything you sell — while the sibling FLUX.1 [schnell] is Apache and fine. F5-TTS and a few others sit in the same spot: the weights are right there, the license says no. If you’re building a business, read the license before you fall in love with the demo. Owning the weights is half the own-it-versus-rent-it argument; the license is the other half.
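The three buckets can be sketched as a lookup that a build script might consult before a model goes into a product. Everything here is illustrative: the dictionary and the function `bucket_for` are hypothetical, the entries encode only the claims made in this section, and anything not listed deliberately comes back as "unknown" so that nobody ships on a guess.

```python
# Bucket per license family, as described in the text above.
LICENSE_BUCKET = {
    "Apache-2.0": "open",
    "MIT": "open",
    "Llama Community License": "strings",
    "Gemma terms": "strings",
    "non-commercial": "trap",
}

# Model to license, limited to pairings this article states explicitly.
MODEL_LICENSE = {
    "Qwen": "Apache-2.0",
    "DeepSeek": "MIT",
    "Phi-4": "MIT",
    "FLUX.1 [schnell]": "Apache-2.0",
    "Llama": "Llama Community License",
    "Gemma": "Gemma terms",
    "FLUX.1 [dev]": "non-commercial",
    "F5-TTS": "non-commercial",
}


def bucket_for(model: str) -> str:
    """Return 'open', 'strings', 'trap', or 'unknown' for a model name."""
    license_name = MODEL_LICENSE.get(model)
    if license_name is None:
        return "unknown"  # not in the table: go read the license itself
    return LICENSE_BUCKET[license_name]


if __name__ == "__main__":
    for name in ("Qwen", "Gemma", "FLUX.1 [dev]", "SomeNewModel"):
        print(f"{name}: {bucket_for(name)}")
```

A table like this goes stale as fast as the model list does, so the license file that ships with the weights remains the source of truth.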
Where local still loses
This is a field guide, not a sales pitch, so the honest caveats matter. Frontier closed models still win the hardest tasks — an independent US government evaluation this spring put the best open model roughly eight months behind the closed frontier on broad capability work. The best open models, the trillion-parameter Kimi and DeepSeek builds, need a server, not a laptop. Quantization that fits a model into your RAM does cost a little quality. And setup is a real tax: a runtime to install, a multi-gigabyte download, the occasional model that won’t load.
Which is why the honest recommendation is hybrid, not absolutist. Run the local model for the daily eighty percent — the drafting, the transcription, the quick code, the private documents — and reach for a frontier cloud model on the few hard problems that earn it. That balance is the whole case for personal compute, and the practical engineering of it — runtimes, speed, what actually fits — is laid out in running GPT-4-class models on your laptop.
Tell me your RAM
Strip away the model names, which will turn over by autumn, and the method survives. Check your memory. Pick the job you do most. Read the cell. Install at Q4. If you’re unsure whether your laptop is decent or developer-grade, install the smaller pick first — disk space is cheap and the model loads in a minute.
Two years ago, the model now sitting in a file on your laptop would have been frontier-class and gated behind an API key. Today it’s a download, a license you should actually read, and a tier you already own. The leaderboard will keep churning; the map won’t. Tell me your RAM and the job, and the answer is one line in a terminal away.


