Choosing a local model in 2026: a flowchart
Roughly forty open-weight models worth installing. You need one. Tell me what you want to do and what laptop you have — here's a five-by-five grid with the pick in each square.
There are roughly forty open-weight models worth installing in 2026. Most articles that try to rank them look like Olympic medal tables, get out of date in a fortnight, and start with three paragraphs of Mixture-of-Experts jargon before they get anywhere near a recommendation. If you just want to know what to download on a Tuesday night, that’s not very useful.
Here’s a flowchart instead. Tell me what you want to do — chat, code, reason, see images, read long documents — and what kind of laptop you have. I’ll point at one model and one quant. If you change your mind next month, the grid is still there. Most of the post is the grid; everything else is just enough context to read it right.

The pick, in one grid
Rows are tasks — the five jobs people actually open a chat app to do. Columns are hardware tiers in plain English, from a thin-and-light laptop to a workstation. Each cell is the model I’d install myself, at the quantization level that fits, with no hand-wringing. Cells get faded when they don’t fit; you’ll see “move up a tier” in plain language instead of an apologetic note.
Two notes before you read the matrix. First, every model in it is a real release with weights on Hugging Face and a permissive license you can actually use for work — no “sign here, agree to let us train on your data, region-locked” nonsense. Second, the quants assume the widely supported GGUF / llama.cpp Q4_K_M family, or Apple’s MLX 4-bit on Macs — if you’re running 8-bit or full precision, walk one column to the left and you’ll be fine. Quantization gets its own section below because it earns it.
The five tasks, in plain English
The matrix only works if the row labels mean the same thing to you and to me. Here’s what I mean by each task — in normal sentences, not benchmark names.
The daily driver. Drafting emails, rewriting paragraphs, brainstorming, summarizing what you just read, holding a casual back-and-forth.
Pick this row if 80% of what you do is just type and read.
The coding assistant. Inline completions in your editor, whole-function generation, refactoring across a file, explaining what a chunk of unfamiliar code does.
Pick this row if you spend more time in an IDE than a chat window.
The careful reasoner. Hard multi-step problems where the model needs to plan, check itself, and not skip a step. Proofs, system-design questions, gnarly debugging.
Pick this row when you’d wait a minute for a correct answer.
The image reader. Drop in a screenshot, a chart, a receipt, or a handwritten page and get the text or an explanation back. Multimodal in, text out.
Pick this row if your inputs aren’t always text.
The long-document reader. Stuff a contract, a manual, a codebase, or a 200-page PDF into the prompt and ask questions against it. Needs context length and headroom for the KV cache.
Pick this row when the source material is bigger than a chat thread.
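That “headroom for the KV cache” line is worth making concrete, because the cache grows linearly with context length and can dwarf the weights. The sketch below does the arithmetic with Llama-3.1-8B-shaped numbers (32 layers, 8 KV heads, head dimension 128, fp16 cache), used purely as an illustration:

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_tokens: int, bytes_per_value: int = 2) -> float:
    """KV cache size on top of the weights: keys and values (the 2x),
    one per layer and KV head, per token, at fp16 by default."""
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value / 1024**3

# Llama-3.1-8B-shaped numbers (32 layers, 8 KV heads, head dim 128),
# used purely as an illustration:
print(kv_cache_gib(32, 8, 128, 8_192))    # 1.0  -> a normal chat costs about 1 GiB
print(kv_cache_gib(32, 8, 128, 131_072))  # 16.0 -> a full 128K prompt costs about 16 GiB
```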
Agents — the whole “model with tools that browses the web and edits files for you” category — isn’t on this list. That’s a separate post, partly because the bottleneck is almost never the model and partly because the honest answer for most people is still “use a cloud frontier model for that one.”
The five hardware tiers
Memory is the wall. CPUs and GPUs matter for how fast tokens come out; memory matters for whether the model loads at all. For Apple Silicon Macs, the relevant number is the unified-memory size that you saw on the spec sheet when you bought it. For x86 boxes with a separate GPU, what matters is the VRAM on the card, not the system RAM — a machine with 64 GB of DDR5 and an 8 GB GTX is a small tier, not a big one.
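If you don’t know your number off-hand, it takes a few lines to read it off the machine. A minimal sketch, assuming either macOS (unified memory via sysctl) or an NVIDIA card with nvidia-smi on the PATH; it looks at the first GPU only and ignores multi-GPU rigs:

```python
import platform
import subprocess

def memory_wall_gib() -> float:
    """The number that decides what loads: unified memory on Apple Silicon,
    VRAM on a machine with a discrete NVIDIA card. Returns 0.0 if unknown."""
    if platform.system() == "Darwin":
        # Unified memory: the spec-sheet number, reported in bytes.
        out = subprocess.run(["sysctl", "-n", "hw.memsize"],
                             capture_output=True, text=True, check=True)
        return int(out.stdout.strip()) / 1024**3
    try:
        # VRAM on the card, not system RAM; first GPU only.
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.total",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True)
        return int(out.stdout.splitlines()[0]) / 1024  # MiB -> GiB
    except (FileNotFoundError, subprocess.CalledProcessError):
        return 0.0

print(f"{memory_wall_gib():.0f} GiB")
```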
Base M1/M2/M3 MacBook Air, most ultrabooks, a $700 work laptop from last year.
Models up to about 4B parameters at 4-bit. Real, useful, fits.
The default M-series MacBook Pro, a Framework 13 with a decent spec, mid-range gaming laptops.
7–9B models comfortably; 12B with care. Where most readers live.
Mid-spec MacBook Pro, ThinkPad P-series, gaming laptops with an 8 GB RTX 4060/4070 (the VRAM, not the system RAM, is the ceiling, so this is still tight for the bigger picks in this column).
12–14B daily; 27–32B at the edge with a tight context.
M4 Pro/Max with 64 GB unified, a desktop with an RTX 4090 (24 GB) plus 64 GB system, Strix Halo machines.
27B with room to breathe; 32B Coder at Q4–Q6; 70B if you’re patient.
M4 Max 128 GB, Mac Studio Ultra (up to 512 GB), a dual-3090 / dual-4090 rig, Strix Halo with the full unified pool.
70B class comfortably; the door opens to MoE distillations and very long contexts.
The two columns most readers actually live in are “decent laptop” and “developer laptop.” That’s where the picks below get the most attention.

The families, one line each
The matrix shrinks to five names because the open-weight world has consolidated around five families that ship a meaningful new version every quarter or two. Here’s the honest one-line read on each, independent of which model happens to be leading the leaderboard this afternoon.
Llama. The safe default. Most-tested, best ecosystem support, runs first on every new runtime.
Best for: chat, writing, and long-doc Q&A at 8B and 70B.
Qwen. Surprisingly strong at code and at every language that isn’t English. The dense Coder variants punch above their weight.
Best for: coding (Qwen2.5-Coder) and vision (Qwen2.5-VL).
Mistral. Efficient, permissively licensed, European. Mistral Small 3 is the quiet workhorse nobody talks about enough.
Best for: a 32K-context daily driver in the 16–32 GB band.
DeepSeek. Reasoning-leaning. The R1-Distill family bakes long internal “thinking” into smaller weights you can actually run.
Best for: multi-step reasoning, math, and hard debugging at 7–70B.
Gemma. Small and tidy. Vision is built in across sizes. Generous 128K context. Lovely fit for tight memory budgets.
Best for: vision and long-context at every tier.
A few names are conspicuously not on this list. Phi (Microsoft) is interesting at the very bottom of the size range, and Phi-4 is worth a try if you’re short on memory, but it gets out-competed by Gemma and Llama 3.2 in the 1–4B class on general use. Yi and Command R were big in 2024 and haven’t shipped a 2026-relevant flagship. Kimi and GLM are excellent, but their flagship checkpoints are too large for any laptop. Everyone else: pick from the five.
A note on quantization
Quantization is the trick that makes any of this work on a laptop. The original weights live in 16-bit floating point; quantization rewrites them at 4 or 5 bits per weight, cutting the memory footprint to roughly a quarter. The catch used to be quality loss. As of 2026, with the modern K-quant and IQ schemes in llama.cpp, the catch is mostly gone — Q4_K_M gives you roughly 95% of the full-precision quality at a quarter of the bytes, and that ratio is now widely reproduced across the major model families. Apple’s MLX 4-bit, the equivalent on Macs, lands in the same neighborhood.
Practically, this means every cell in the matrix is sized for a 4-bit memory footprint. If you have the headroom to run 5-bit or 8-bit, you’ll get marginally better quality at noticeably slower speed and a bigger footprint (8-bit roughly doubles it). For most chat and code tasks it isn’t worth it. For high-stakes reasoning or code that has to compile the first time, walk up to 5-bit or 6-bit and accept the cost. There’s a diminishing-returns curve here, and Q4 is the knee.
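If you want to see that in numbers, the weight footprint is just parameter count times bits per weight. A sketch; the effective bits-per-weight figures for Q8_0 and Q4_K_M below are approximations (block scales add a little), and the KV cache and runtime overhead come on top:

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone; KV cache and runtime
    overhead come on top of this."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1024**3

# Effective bits-per-weight are approximations, not exact format specs.
for label, bpw in [("FP16", 16.0), ("Q8_0", 8.5), ("Q4_K_M", 4.8)]:
    print(f"{label:7s}  8B: {weight_gib(8, bpw):5.1f} GiB   70B: {weight_gib(70, bpw):5.1f} GiB")
```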
What the flowchart doesn’t tell you
A flowchart is a compression. Compressions throw things away. Three things worth saying out loud before you go install.
- The model is not the runtime. Which one of llama.cpp, Ollama, LM Studio, MLX, vLLM, or ExLlama you pick decides your tokens-per-second more than the model does, sometimes by 2×. That’s its own post. The short version: on Apple Silicon, MLX and LM Studio’s MLX backend beat llama.cpp by 15–30% on most models; on NVIDIA, vLLM and ExLlama beat everything else by a similar margin; otherwise, llama.cpp is the “just works” default. (A quick way to measure this on your own machine is sketched just after this list.)
- Coding models are their own category. Qwen2.5-Coder and DeepSeek-Coder are tuned heavily for code in a way the base models aren’t. If coding is more than half of what you’ll do with the model, switch the cell to the Coder variant at the same size — the matrix would otherwise need a second pick in every cell, and that’s a different post.
- Reasoning models think out loud and that’s the point. The DeepSeek-R1-Distill family writes thousands of tokens of internal “thinking” before the final answer. That’s a feature for hard problems and a tax on easy ones. If you’re mostly chatting, don’t pick the reasoning row.
- Frontier cloud is still better. GPT-, Claude-, and Gemini-class frontier models still beat the local picks on the hardest tasks. The honest 2026 workflow is local for the daily 80% and cloud for the heavy hitters when the local model gives up — and BYOK keeps the cloud bill honest when you reach for it.
- The frontier moves under you. Llama 4 lands, Qwen drops a new SKU, DeepSeek surprises everyone again. The shape of the matrix won’t change much; the cell labels will swap. Reread this post a couple of times a year, or read the family one-liners above instead of the model names.
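On the runtime point: the easiest probe is whatever you already have installed. If that’s Ollama, its /api/generate endpoint reports token counts and timings in the response, so measuring tokens-per-second is a few lines. A sketch, assuming Ollama is running on its default port and the model name is one you’ve already pulled:

```python
import json
import urllib.request

def tokens_per_second(model: str, prompt: str) -> float:
    """One completion against a locally running Ollama (default port 11434);
    speed comes from the eval_count / eval_duration fields in the response."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return data["eval_count"] / (data["eval_duration"] / 1e9)  # duration is in ns

# "llama3.1:8b" is a placeholder; use any model you've already pulled.
print(f"{tokens_per_second('llama3.1:8b', 'One paragraph on KV caches.'):.1f} tok/s")
```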
The 60-second answer
If you skip the rest of this post:
1. Pick your task: chat, code, reason, see images, or read long docs.
2. Pick your tier: read the tier list and choose the feel-word that matches. The GB number is a hint, not a rule.
3. Install the model in that cell at Q4. Walk up to Q5/Q6 only if quality matters more than speed.
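If you prefer your flowcharts executable, the whole thing collapses to a lookup. The entries below are illustrative stand-ins assembled from the family one-liners above, not the actual cells of the grid; swap in your own picks:

```python
# Task -> family to reach for, assembled from the one-liners above.
# Illustrative stand-ins, not the grid's actual cells.
PICKS = {
    "chat":      "Llama (8B on a laptop, 70B on a workstation)",
    "code":      "Qwen2.5-Coder (7B / 14B / 32B, whatever fits)",
    "reason":    "DeepSeek-R1-Distill (7B up to 70B)",
    "images":    "Qwen2.5-VL, or Gemma with vision, at the size that fits",
    "long_docs": "Gemma or Llama, with context length as the deciding factor",
}

def pick(task: str) -> str:
    return f"{PICKS[task]} at Q4; bump to Q5/Q6 only if quality beats speed"

print(pick("code"))
```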
That’s the entire flowchart. The matrix above just spells out what fits in each box. If you’re not sure whether your laptop is decent or developer, read the tier descriptions and trust the feel-words — the GB cutoff is a hint, not a rule. And if you’re still not sure, install the smaller pick first. Disk space is cheap. The actual cost of running models locally is the time you spend not picking.


