Choosing a local model in 2026: a flowchart
Roughly forty open-weight models worth installing. You need one. Tell me what you want to do and what laptop you have — here's a five-by-five grid with the pick in each square.
There are roughly forty open-weight models worth installing in 2026. Most articles that try to rank them look like Olympic medal tables, get out of date in a fortnight, and start with three paragraphs of Mixture-of-Experts jargon before they get anywhere near a recommendation. If you just want to know what to download on a Tuesday night, that’s not very useful.
Here’s a flowchart instead. Tell me what you want to do — chat, code, reason, see images, read long documents — and what kind of laptop you have. I’ll point at one model and one quant. If you change your mind next month, the grid is still there. Most of the post is the grid; everything else is just enough context to read it right.

The pick, in one grid
Rows are tasks — the five jobs people actually open a chat app to do. Columns are hardware tiers in plain English, from a thin-and-light laptop to a workstation. Each cell is the model I’d install myself, at the quantization level that fits, with no hand-wringing. Cells get faded when they don’t fit; you’ll see “move up a tier” in plain language instead of an apologetic note.
Two notes before you read the matrix. First, every model in it is a real release with weights on Hugging Face and a permissive license you can actually use for work — no “sign here, agree to let us train on your data, region-locked” nonsense. Second, the quants assume the widely supported GGUF / llama.cpp Q4_K_M family, or Apple’s MLX 4-bit on Macs — if you’re running 8-bit or full precision, walk one column to the left and you’ll be fine. Quantization gets its own section below because it earns it.
The five tasks, in plain English
The matrix only works if the row labels mean the same thing to you and to me. Here’s what I mean by each task — in normal sentences, not benchmark names.
The daily driver. Drafting emails, rewriting paragraphs, brainstorming, summarizing what you just read, holding a casual back-and-forth.
Pick this row if 80% of what you do is just type and read.
The coding assistant. Inline completions in your editor, whole-function generation, refactoring across a file, explaining what a chunk of unfamiliar code does.
Pick this row if you spend more time in an IDE than a chat window.
The careful reasoner. Hard multi-step problems where the model needs to plan, check itself, and not skip a step. Proofs, system-design questions, gnarly debugging.
Pick this row when you’d wait a minute for a correct answer.
The image reader. Drop in a screenshot, a chart, a receipt, or a handwritten page and get the text or an explanation back. Multimodal in, text out.
Pick this row if your inputs aren’t always text.
The long-document reader. Stuff a contract, a manual, a codebase, or a 200-page PDF into the prompt and ask questions against it. Needs context length and headroom for the KV cache.
Pick this row when the source material is bigger than a chat thread.
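That “headroom for the KV cache” line is worth making concrete, because the cache grows linearly with context length and can dwarf the weights. The sketch below does the arithmetic with Llama-3.1-8B-shaped numbers (32 layers, 8 KV heads, head dimension 128, fp16 cache), used purely as an illustration:

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_tokens: int, bytes_per_value: int = 2) -> float:
    """KV cache size on top of the weights: keys and values (the 2x),
    one per layer and KV head, per token, at fp16 by default."""
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value / 1024**3

# Llama-3.1-8B-shaped numbers (32 layers, 8 KV heads, head dim 128),
# used purely as an illustration:
print(kv_cache_gib(32, 8, 128, 8_192))    # 1.0  -> a normal chat costs about 1 GiB
print(kv_cache_gib(32, 8, 128, 131_072))  # 16.0 -> a full 128K prompt costs about 16 GiB
```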
Agents — the whole “model with tools that browses the web and edits files for you” category — isn’t on this list. That’s a separate post, partly because the bottleneck is almost never the model and partly because the honest answer for most people is still “use a cloud frontier model for that one.”
The five hardware tiers
Memory is the wall. CPUs and GPUs matter for how fast tokens come out; memory matters for whether the model loads at all. For Apple Silicon Macs, the relevant number is the unified-memory size that you saw on the spec sheet when you bought it. For x86 boxes with a separate GPU, what matters is the VRAM on the card, not the system RAM — a machine with 64 GB of DDR5 and an 8 GB GTX is a small tier, not a big one.
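If you don’t know your number off-hand, it takes a few lines to read it off the machine. A minimal sketch, assuming either macOS (unified memory via sysctl) or an NVIDIA card with nvidia-smi on the PATH; it looks at the first GPU only and ignores multi-GPU rigs:

```python
import platform
import subprocess

def memory_wall_gib() -> float:
    """The number that decides what loads: unified memory on Apple Silicon,
    VRAM on a machine with a discrete NVIDIA card. Returns 0.0 if unknown."""
    if platform.system() == "Darwin":
        # Unified memory: the spec-sheet number, reported in bytes.
        out = subprocess.run(["sysctl", "-n", "hw.memsize"],
                             capture_output=True, text=True, check=True)
        return int(out.stdout.strip()) / 1024**3
    try:
        # VRAM on the card, not system RAM; first GPU only.
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.total",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True)
        return int(out.stdout.splitlines()[0]) / 1024  # MiB -> GiB
    except (FileNotFoundError, subprocess.CalledProcessError):
        return 0.0

print(f"{memory_wall_gib():.0f} GiB")
```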
Base M1/M2/M3 MacBook Air, most ultrabooks, a $700 work laptop from last year.
Models up to about 4B parameters at 4-bit. Real, useful, fits.
The default M-series MacBook Pro, a Framework 13 with a decent spec, mid-range gaming laptops.
7–9B models comfortably; 12B with care. Where most readers live.
Mid-spec MacBook Pro, ThinkPad P-series, gaming laptops with an 8 GB RTX 4060/4070 (the VRAM, not the system RAM, is the ceiling, so this is still tight for the bigger picks in this column).
12–14B daily; 27–32B at the edge with a tight context.
M4 Pro/Max with 64 GB unified, a desktop with an RTX 4090 (24 GB) plus 64 GB system, Strix Halo machines.
27B with room to breathe; 32B Coder at Q4–Q6; 70B if you’re patient.
M4 Max 128 GB, Mac Studio Ultra (up to 512 GB), a dual-3090 / dual-4090 rig, Strix Halo with the full unified pool.
70B class comfortably; the door opens to MoE distillations and very long contexts.
The two columns most readers actually live in are “decent laptop” and “developer laptop.” That’s where the picks below get the most attention.

The families, one line each
The matrix shrinks to five names because the open-weight world has consolidated around five families that ship a meaningful new version every quarter or two. Here’s the honest one-line read on each, independent of which model happens to be leading the leaderboard this afternoon.
Llama. The safe default. Most-tested, best ecosystem support, runs first on every new runtime.
Best for: chat, writing, and long-doc Q&A at 8B and 70B.
Qwen. Surprisingly strong at code and at every language that isn’t English. The dense Coder variants punch above their weight.
Best for: coding (Qwen2.5-Coder) and vision (Qwen2.5-VL).
Mistral. Efficient, permissively licensed, European. Mistral Small 3 is the quiet workhorse nobody talks about enough.
Best for: a 32K-context daily driver in the 16–32 GB band.
DeepSeek. Reasoning-leaning. The R1-Distill family bakes long internal “thinking” into smaller weights you can actually run.
Best for: multi-step reasoning, math, and hard debugging at 7–70B.
Gemma. Small and tidy. Vision is built in across sizes. Generous 128K context. Lovely fit for tight memory budgets.
Best for: vision and long-context at every tier.
A few names are conspicuously not on this list. Phi (Microsoft) is interesting at the very bottom of the size range, and Phi-4 is worth a try if you’re short on memory, but it gets out-competed by Gemma and Llama 3.2 in the 1–4B class on general use. Yi and Command R were big in 2024 and haven’t shipped a 2026-relevant flagship. Kimi and GLM are excellent, but their flagship checkpoints are too large for any laptop. Everyone else: pick from the five.
A note on quantization
Quantization is the trick that makes any of this work on a laptop. The original weights live in 16-bit floating point; quantization rewrites them at 4 or 5 bits per weight, cutting the memory footprint to roughly a quarter. The catch used to be quality loss. As of 2026, with the modern K-quant and IQ schemes in llama.cpp, the catch is mostly gone — Q4_K_M gives you roughly 95% of the full-precision quality at a quarter of the bytes, and that ratio is now widely reproduced across the major model families. Apple’s MLX 4-bit, the equivalent on Macs, lands in the same neighborhood.
Practically, this means every cell in the matrix is sized for a 4-bit memory footprint. If you have the headroom to run 5-bit or 8-bit, you’ll get marginally better quality at noticeably slower speed and a bigger footprint (8-bit roughly doubles it). For most chat and code tasks it isn’t worth it. For high-stakes reasoning or code that has to compile the first time, walk up to 5-bit or 6-bit and accept the cost. There’s a diminishing-returns curve here, and Q4 is the knee.
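If you want to see that in numbers, the weight footprint is just parameter count times bits per weight. A sketch; the effective bits-per-weight figures for Q8_0 and Q4_K_M below are approximations (block scales add a little), and the KV cache and runtime overhead come on top:

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone; KV cache and runtime
    overhead come on top of this."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1024**3

# Effective bits-per-weight are approximations, not exact format specs.
for label, bpw in [("FP16", 16.0), ("Q8_0", 8.5), ("Q4_K_M", 4.8)]:
    print(f"{label:7s}  8B: {weight_gib(8, bpw):5.1f} GiB   70B: {weight_gib(70, bpw):5.1f} GiB")
```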
What the flowchart doesn’t tell you
A flowchart is a compression. Compressions throw things away. Three things worth saying out loud before you go install.
- The model is not the runtime. Which one of llama.cpp, Ollama, LM Studio, MLX, vLLM, or ExLlama you pick decides your tokens-per-second more than the model does, sometimes by 2×. That’s its own post. The short version: on Apple Silicon, MLX and LM Studio’s MLX backend beat llama.cpp by 15–30% on most models; on NVIDIA, vLLM and ExLlama beat everything else by a similar margin; otherwise, llama.cpp is the “just works” default. (A quick way to measure this on your own machine is sketched just after this list.)
- Coding models are their own category. Qwen2.5-Coder and DeepSeek-Coder are tuned heavily for code in a way the base models aren’t. If coding is more than half of what you’ll do with the model, switch the cell to the Coder variant at the same size — the matrix would otherwise need a second pick in every cell, and that’s a different post.
- Reasoning models think out loud and that’s the point. The DeepSeek-R1-Distill family writes thousands of tokens of internal “thinking” before the final answer. That’s a feature for hard problems and a tax on easy ones. If you’re mostly chatting, don’t pick the reasoning row.
- Frontier cloud is still better. GPT-, Claude-, and Gemini-class frontier models still beat the local picks on the hardest tasks. The honest 2026 workflow is local for the daily 80% and cloud for the heavy hitters when the local model gives up — and BYOK keeps the cloud bill honest when you reach for it.
- The frontier moves under you. Llama 4 lands, Qwen drops a new SKU, DeepSeek surprises everyone again. The shape of the matrix won’t change much; the cell labels will swap. Reread this post a couple of times a year, or read the family one-liners above instead of the model names.
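On the runtime point: the easiest probe is whatever you already have installed. If that’s Ollama, its /api/generate endpoint reports token counts and timings in the response, so measuring tokens-per-second is a few lines. A sketch, assuming Ollama is running on its default port and the model name is one you’ve already pulled:

```python
import json
import urllib.request

def tokens_per_second(model: str, prompt: str) -> float:
    """One completion against a locally running Ollama (default port 11434);
    speed comes from the eval_count / eval_duration fields in the response."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return data["eval_count"] / (data["eval_duration"] / 1e9)  # duration is in ns

# "llama3.1:8b" is a placeholder; use any model you've already pulled.
print(f"{tokens_per_second('llama3.1:8b', 'One paragraph on KV caches.'):.1f} tok/s")
```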
The 60-second answer
If you skip the rest of this post:
1. Pick your task: chat, code, reason, see images, or read long docs.
2. Pick your tier: read the tier list and choose the feel-word that matches. The GB number is a hint, not a rule.
3. Install the model in that cell at Q4. Walk up to Q5/Q6 only if quality matters more than speed.
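If you prefer your flowcharts executable, the whole thing collapses to a lookup. The entries below are illustrative stand-ins assembled from the family one-liners above, not the actual cells of the grid; swap in your own picks:

```python
# Task -> family to reach for, assembled from the one-liners above.
# Illustrative stand-ins, not the grid's actual cells.
PICKS = {
    "chat":      "Llama (8B on a laptop, 70B on a workstation)",
    "code":      "Qwen2.5-Coder (7B / 14B / 32B, whatever fits)",
    "reason":    "DeepSeek-R1-Distill (7B up to 70B)",
    "images":    "Qwen2.5-VL, or Gemma with vision, at the size that fits",
    "long_docs": "Gemma or Llama, with context length as the deciding factor",
}

def pick(task: str) -> str:
    return f"{PICKS[task]} at Q4; bump to Q5/Q6 only if quality beats speed"

print(pick("code"))
```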
That’s the entire flowchart. The matrix above just spells out what fits in each box. If you’re not sure whether your laptop is decent or developer, read the tier descriptions and trust the feel-words — the GB cutoff is a hint, not a rule. And if you’re still not sure, install the smaller pick first. Disk space is cheap. The actual cost of running models locally is the time you spend not picking.


