Local AI · Field Guide · Open Source · June 15, 2026 · 10 min read

The best open AI models you can run locally right now

The model that nearly tied the closed coding leaders is a free download, if you pick the right one for your RAM and dodge the license traps.

What fits at each tier (June 2026)

Tell me your memory. I’ll tell you what to download.

8 GB (any modern laptop): Gemma 3 4B. Drafts, rewrites, and reads a screenshot, on a base MacBook Air.

16 GB (decent laptop): Qwen 3.5 9B. A genuinely good daily chat-and-writing model, fully offline.

32 GB (developer laptop): Devstral Small 2. A 24B coding model that edits files and runs agentic loops.

64 GB+ (loaded Mac / GPU box): Qwen 3.5 35B-A3B. Frontier-adjacent reasoning at the speed of a 3B model.

Memory is the wall. The leaderboard tells you which model is smartest; your RAM tells you which one will actually load.

The open-weight model that sat within two points of the closed coding leaders this spring is a file you can download tonight and run with your Wi-Fi switched off. That is genuinely new. For most of the AI era, the good models lived on someone else’s servers and metered every question. In 2026 the best ones you can own are good enough for the daily eighty percent of real work (writing, coding, transcription, image generation, searching your own documents), and they cost nothing per use.

The catch is no longer capability. It’s choice. There are roughly forty open models worth installing, the good lists rot in a fortnight, and some of the licenses are traps that read as “free” right up until a lawyer looks. So here is a map. Tell me how much memory your computer has and what you want to do, and there’s one model you should download. The map is organized by hardware tier, because memory is the only thing that decides what will actually load, and it carries the license column most guides skip. This was written in June 2026; treat every name as a placeholder for “whatever is current in that slot.”

A laptop on a desk showing code in a terminal. — The frontier moved onto the laptop. The hard part now is picking which file to download. Photo by Goran Ivos on Unsplash.

Start with your RAM, not the leaderboard

Every “best local model” list gets this backwards. The model that tops a benchmark this week is useless to you if it won’t fit in your memory, and memory is the wall almost everyone hits first. On a Mac, the number that matters is the unified memory on the spec sheet of the machine you bought. On a PC with a graphics card, it’s the VRAM on the card, not the system RAM: a tower with 64 GB of DDR5 and an 8 GB GPU belongs in the small tier, not the big one.

The arithmetic is simple. A model needs roughly its parameter count times the bytes used per weight. At the 16-bit precision most weights ship in, that’s 2 bytes each, so a 7-billion-parameter model wants ~14 GB, too much for a normal laptop. Quantization is the trick that fixes it: rewrite each weight at fewer bits. At 4-bit (the common “Q4”) the rule of thumb is about 0.5 GB per billion parameters; real Q4 files run a little larger because they keep some tensors at higher precision, so that same 7B model lands at ~4–5 GB and runs comfortably. The llama.cpp quantization tables spell out the exact sizes.
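The rule of thumb is easy to check by hand, but a few lines make it reusable. This is a minimal sketch, not a sizing tool: the function name and the precision labels are mine, and the 4-bit figure is a floor for the weights alone, before format overhead or working memory.

```python
# Rule-of-thumb footprint: parameter count x bytes per weight.
# The labels below are illustrative, not llama.cpp quantization names.
BYTES_PER_WEIGHT = {
    "fp16": 2.0,  # 16 bits per weight
    "q8": 1.0,    # 8 bits per weight
    "q4": 0.5,    # 4 bits per weight, before any format overhead
}


def weights_gb(params_billions: float, precision: str) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * BYTES_PER_WEIGHT[precision]


for precision in ("fp16", "q8", "q4"):
    print(f"7B at {precision}: ~{weights_gb(7, precision):.1f} GB")
```

For a 7B model this prints roughly 14, 7, and 3.5 GB. Treat the last number as the lower bound and check the actual file size before you download.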

The worry used to be that 4-bit wrecked quality. It mostly doesn’t anymore. Q4_K_M, the popular setting, keeps roughly 95% of full-precision quality at a quarter of the bytes; Q8 is near-lossless; below Q4 the drop becomes visible, and reasoning and math are the first things to suffer. Practical reading: install at Q4, and only walk up to Q5 or Q6 for code that has to compile or proofs that have to hold.

A macro photograph of a green circuit board with densely packed components. — Quantization trades a sliver of precision for a quarter of the memory. That trade is what puts a serious model on a thin-and-light. Photo by Alexandre Debiève on Unsplash.

That gives four tiers most readers live in: 8 GB (any modern laptop, 1–4B models), 16 GB (a decent laptop, 7–9B comfortably), 32 GB (a developer laptop, 12–14B daily and a ~30B at the edge), and 64 GB or more (a loaded Mac or a desktop with a 24 GB GPU, where 27–32B models breathe and a 70B runs if you’re patient). If you only care about text and code, the five-by-five flowchart drills into those tiers in more detail; this post goes wider, across every modality.

The jobs that scale with your memory

Four jobs grow with the model, and the model grows with your RAM: chatting and writing, hard reasoning, coding, and reading images. Here is the pick in each square: the model I’d install myself, at the quant that fits.

Pick the row, pick your memory, install the cell

| Job | 8 GB (any laptop) | 16 GB (decent laptop) | 32 GB (developer laptop) | 64 GB+ (loaded Mac / GPU) |
| --- | --- | --- | --- | --- |
| Chat & writing (drafts, rewrites, summaries) | Gemma 3 4B, Q4 · ~3 GB | Qwen 3.5 9B, Q4 · ~6 GB | Gemma 3 12B, Q4 · ~8 GB | Gemma 3 27B, Q4 · ~16 GB |
| Reasoning & math (multi-step, proofs, hard bugs) | Move up a tier (starts ~7B) | R1-Distill-Qwen 7B, Q4 · ~5 GB | R1-Distill-Qwen 14B, Q4 · ~9 GB | R1-Distill 32B, Q4 · ~20 GB |
| Code (autocomplete, file edits) | Qwen2.5-Coder 3B, Q4 · ~2 GB | Qwen2.5-Coder 7B, Q4 · ~5 GB | Devstral Small 2, Q4 · ~14 GB | Qwen3-Coder 30B-A3B, Q4 · ~20 GB |
| Vision, image in (screenshots, charts, photos) | Moondream 3, Q4 · ~2 GB | Qwen3-VL 8B, Q4 · ~6 GB | Gemma 3 12B, Q4 · ~8 GB | Qwen3-VL 32B, Q4 · ~20 GB |

Sizes are approximate Q4 footprints. Add 1–4 GB for working memory and longer context. Every pick is Apache- or MIT-licensed except Gemma (Gemma Terms of Use); the R1-Distills inherit the license of their Qwen or Llama base.
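If you would rather query the grid than read it, here is a minimal sketch of it as a lookup. The job keys, tier labels, and the `pick` function are my own naming; the model names and Q4 sizes are copied from the grid, and the 1–4 GB headroom comes from the note above. Swap in whatever is current when you read this.

```python
# The grid as data: job -> tier -> (model, approximate Q4 footprint in GB).
GRID = {
    "chat": {
        "8 GB": ("Gemma 3 4B", 3), "16 GB": ("Qwen 3.5 9B", 6),
        "32 GB": ("Gemma 3 12B", 8), "64 GB+": ("Gemma 3 27B", 16),
    },
    "reasoning": {  # no 8 GB pick: the grid says to move up a tier
        "16 GB": ("R1-Distill-Qwen 7B", 5), "32 GB": ("R1-Distill-Qwen 14B", 9),
        "64 GB+": ("R1-Distill 32B", 20),
    },
    "code": {
        "8 GB": ("Qwen2.5-Coder 3B", 2), "16 GB": ("Qwen2.5-Coder 7B", 5),
        "32 GB": ("Devstral Small 2", 14), "64 GB+": ("Qwen3-Coder 30B-A3B", 20),
    },
    "vision": {
        "8 GB": ("Moondream 3", 2), "16 GB": ("Qwen3-VL 8B", 6),
        "32 GB": ("Gemma 3 12B", 8), "64 GB+": ("Qwen3-VL 32B", 20),
    },
}


def pick(job: str, tier: str):
    """Return (model, low_gb, high_gb) with 1-4 GB of headroom, or None."""
    cell = GRID[job].get(tier)
    if cell is None:
        return None
    name, footprint_gb = cell
    return name, footprint_gb + 1, footprint_gb + 4


name, low, high = pick("code", "32 GB")
print(f"{name}: plan for ~{low}-{high} GB once it is loaded")
```

A `None` result means the grid has no pick for that square, which today only happens for reasoning at 8 GB.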

The grid shrinks to a handful of families because the open world has consolidated. Qwen (Alibaba, Apache 2.0) is the best-rounded family right now: strong at code, multilingual by default, and its sparse mixture-of-experts builds like the 35B-A3B activate only ~3B parameters per token, so they run at small-model speed from big-model weights. DeepSeek (MIT) leans into reasoning; the 1.6-trillion-parameter V4 won’t fit on a laptop, but the R1-Distill models bake DeepSeek’s long “thinking” into 7B, 14B, and 32B weights that will. Gemma (Google) is small, tidy, and ships vision across every size with a generous context window. Mistral (mostly Apache) is the efficient European workhorse, and Phi-4 (Microsoft, MIT) punches above its weight on math at the bottom of the size range.
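The mixture-of-experts point is worth a back-of-envelope check: memory scales with total parameters, while speed scales with the parameters touched per token. The sketch below assumes token generation is limited by how fast weights can be read from memory, a common approximation rather than a measurement, and it ignores the KV cache and compute. The 100 GB/s bandwidth figure is hypothetical, and the function names are mine.

```python
def weights_gb(total_params_b: float, bytes_per_weight: float = 0.5) -> float:
    """Memory needed to hold every weight, active or not (Q4 rule of thumb)."""
    return total_params_b * bytes_per_weight


def decode_ceiling_tok_s(active_params_b: float, bandwidth_gb_s: float,
                         bytes_per_weight: float = 0.5) -> float:
    """Upper bound on tokens/sec if each token reads only the active weights."""
    gb_read_per_token = active_params_b * bytes_per_weight
    return bandwidth_gb_s / gb_read_per_token


BANDWIDTH_GB_S = 100.0  # hypothetical memory bandwidth, not a real machine

dense_35b = decode_ceiling_tok_s(35, BANDWIDTH_GB_S)  # all 35B read per token
moe_35b_a3b = decode_ceiling_tok_s(3, BANDWIDTH_GB_S)  # ~3B read per token

print(f"Both need ~{weights_gb(35):.1f} GB to load")
print(f"Dense 35B ceiling:   ~{dense_35b:.0f} tok/s")
print(f"35B-A3B MoE ceiling: ~{moe_35b_a3b:.0f} tok/s")
```

The absolute numbers are ceilings, not predictions. The ratio is the point: the same file size, roughly an order of magnitude less data read per token.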

One name is conspicuously fading: Llama. Meta shipped its first closed model this spring and walked away from open releases, so Llama 3.3 and 4 are now legacy picks rather than the safe default they were a year ago. The open frontier is carried by Qwen, DeepSeek, and Moonshot’s Kimi today, and notably, all three are Chinese labs.

Image, voice, and search fit almost anywhere

The other half of local AI barely touches the tier chart. Speech models, text-to-speech, and embedding models are small enough to run on almost any machine; image generation is the lone exception, leaning on the GPU rather than system RAM. Here’s the short list, with the one job that needs real hardware flagged.

Four jobs that don’t need a big machine

| Job | License | Models | Notes |
| --- | --- | --- | --- |
| Make a picture | Apache / OpenRAIL++ | FLUX.2 [klein] 4B · Qwen-Image-2.0 · SDXL | The one job that wants a real GPU or a recent Mac. FLUX.1 [dev] and FLUX.1.2 look free but are non-commercial. |
| Speech → text | Apache / MIT | Whisper large-v3 · distil-Whisper · Moonshine | The small Whisper and Moonshine sizes transcribe faster than real time on a plain CPU. |
| Text → speech | Apache / MIT | Kokoro 82M · Chatterbox | Kokoro is 82M parameters and runs faster than real time on a CPU. F5-TTS sounds better but its weights are non-commercial. |
| Search your own files | Apache / MIT | Qwen3-Embedding 0.6B · BGE-M3 | The unglamorous model that makes local RAG work. The 0.6B runs on a CPU; BGE-M3 handles 8K-token chunks. |

The standouts are worth naming. Kokoro is an 82-million-parameter text-to-speech model: small enough to run faster than real time on a plain CPU, which makes it the default for local narration. For transcription, Whisper remains the accuracy benchmark, while Moonshine is the one to run on a Raspberry Pi or inside a live voice agent. On the image side, Qwen-Image-2.0 and FLUX.2 [klein] are the local stars, the latter fast enough for near-interactive generation on a consumer card. And the least glamorous pick on the page, Qwen3-Embedding 0.6B, is the model that turns a folder of PDFs into something you can actually ask questions of: the quiet engine behind every “chat with your documents” feature.
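What an embedding model does for “chat with your documents” is simple to sketch: turn every chunk of text into a vector, turn the question into a vector, and return the closest chunks. The `embed` function below is a toy stand-in that counts words, not a call to Qwen3-Embedding or BGE-M3; a real embedding model returns dense vectors that also match on meaning, not just shared words. Everything named here is illustrative.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query: str, chunks: list, top_k: int = 1) -> list:
    """Rank chunks by similarity to the query and keep the best top_k."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda chunk: cosine(q, embed(chunk)), reverse=True)
    return ranked[:top_k]


chunks = [
    "The invoice payment is due on Friday.",
    "Trail map for the weekend hike.",
    "Sourdough starter feeding schedule.",
]
print(search("when is the invoice payment due", chunks))
```

In a real pipeline the retrieved chunks are pasted into the chat model’s prompt, which is all “RAG” means at this scale.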

A studio condenser microphone in soft light. — The best local voice model is faster than real time on a laptop CPU: no GPU, no API key, no upload. Photo by Leo Wieling on Unsplash.

“Open” is not one word, and it can get you sued

Here is the column nobody prints, and the one most likely to cost you. “Open” covers three different permissions, and the gap between downloading a model and shipping a product on it is where people get hurt.

Three buckets the word “open” hides

| License bucket | What it means | Models |
| --- | --- | --- |
| Apache 2.0 / MIT | Do anything | Qwen · DeepSeek · Mistral (open line) · Phi-4 · Kokoro · Whisper · Qwen-Image-2.0 · FLUX.2 klein 4B · BGE-M3 |
| Custom community licenses | Open weights, strings attached | Llama (free unless you have 700M+ users) · Gemma (Gemma Terms + a use policy Google can enforce) |
| Non-commercial only | Look, don't ship | FLUX.1 [dev] / FLUX.1.2 · F5-TTS weights · Pixtral Large (research license) |

Downloading a model and shipping a product on it are two different permissions. The middle and right columns let you do the first, not always the second.

The truly open bucket (Apache 2.0 and MIT) means what you hope it does: run it, fine-tune it, ship it, sell it, no fee. Qwen, DeepSeek, Mistral’s open lineup, Phi-4, Kokoro, Whisper, and Qwen-Image all live here. The middle bucket is “open weights with strings.” Meta’s Llama Community License is free for commercial use unless your product crosses 700 million monthly active users: a cap no real open-source license contains. Google’s Gemma terms allow commercial use too, but bind you to a prohibited-use policy and reserve Google’s right to restrict usage.

The third bucket is the actual trap. FLUX.1 [dev] and its sharper successors are non-commercial (gorgeous, freely downloadable, and forbidden in anything you sell), while the sibling FLUX.1 [schnell] is Apache and fine. F5-TTS and a few others sit in the same spot: the weights are right there, the license says no. If you’re building a business, read the license before you fall in love with the demo. Owning the weights is half the own-it-versus-rent-it argument; the license is the other half.
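If you keep a list of the models a project depends on, the three buckets can live next to it as data. This is a sketch of this article’s table only: the bucket names, the `advice_for` function, and the wording of the advice are mine, the entries reflect June 2026 as described above, and none of it replaces reading the license text.

```python
# Model family -> bucket, copied from the three-bucket table in this article.
LICENSE_BUCKET = {
    "Qwen": "permissive", "DeepSeek": "permissive", "Phi-4": "permissive",
    "Kokoro": "permissive", "Whisper": "permissive", "BGE-M3": "permissive",
    "Qwen-Image-2.0": "permissive", "FLUX.2 klein 4B": "permissive",
    "Llama": "strings", "Gemma": "strings",
    "FLUX.1 [dev]": "non_commercial", "F5-TTS": "non_commercial",
    "Pixtral Large": "non_commercial",
}

ADVICE = {
    "permissive": "run it, fine-tune it, ship it",
    "strings": "commercial use allowed with conditions: read them first",
    "non_commercial": "look, don't ship",
}


def advice_for(model: str) -> str:
    """Plain-English summary for a model, or a prompt to go read the license."""
    bucket = LICENSE_BUCKET.get(model)
    if bucket is None:
        return "not in this table: read the license"
    return ADVICE[bucket]


for model in ("Qwen", "Gemma", "FLUX.1 [dev]", "SomeNewModel"):
    print(f"{model}: {advice_for(model)}")
```

The useful part is the fallback: a model that is not in your table should fail loudly rather than default to “fine.”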

Where local still loses

This is a field guide, not a sales pitch, so the honest caveats matter. Frontier closed models still win the hardest tasks: an independent US government evaluation this spring put the best open model roughly eight months behind the closed frontier on broad capability work. The best open models, the trillion-parameter Kimi and DeepSeek builds, need a server, not a laptop. Quantization that fits a model into your RAM does cost a little quality. And setup is a real tax: a runtime to install, a multi-gigabyte download, the occasional model that won’t load.

Which is why the honest recommendation is hybrid, not absolutist. Run the local model for the daily eighty percent (the drafting, the transcription, the quick code, the private documents) and reach for a frontier cloud model on the few hard problems that earn it. That balance is the whole case for personal compute, and the practical engineering of it (runtimes, speed, what actually fits) is laid out in running GPT-4-class models on your laptop.

Tell me your RAM

Strip away the model names, which will turn over by autumn, and the method survives. Check your memory. Pick the job you do most. Read the cell. Install at Q4. If you’re unsure whether your laptop is decent or developer-grade, install the smaller pick first: disk space is cheap and the model loads in a minute.

From zero to a running model, in two lines

    # install Ollama, then:
    ollama pull gemma3:12b   # download once (~8 GB)
    ollama run gemma3:12b    # chat, fully offline

Prefer a window to a terminal? LM Studio gives you the same thing with a model browser and a download button.
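Once the model is pulled, you can also call it from your own scripts. The sketch below assumes Ollama is running locally on its default port (11434) and uses its `/api/generate` endpoint with streaming turned off; the helper names `build_request` and `ask` are mine. It uses only the Python standard library.

```python
import json
import urllib.request

# Ollama's default local endpoint; change it if your install listens elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(model: str, prompt: str) -> str:
    """Send the prompt and return the generated text (needs Ollama running)."""
    with urllib.request.urlopen(build_request(model, prompt)) as resp:
        return json.loads(resp.read())["response"]
```

With the server up, `ask("gemma3:12b", "Summarize quantization in one sentence.")` returns the reply as a string, and nothing leaves your machine.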

Two years ago, the model now sitting in a file on your laptop would have been frontier-class and gated behind an API key. Today it’s a download, a license you should actually read, and a tier you already own. The leaderboard will keep churning; the map won’t. Tell me your RAM and the job, and the answer is one line in a terminal away.

Disclaimer: The license descriptions in this post are plain-English summaries, not legal advice. Model licenses change between releases, and what counts as permitted use depends on your situation. Read the license text that ships with the model, and consult counsel before relying on it in a product. Details reflect sources available as of June 2026.
