Personal compute is back: AI is moving off rented GPUs
Open weights caught up. Unified memory hit 128 GB. Quantization stopped lying. The honest case for running AI on your own machine in 2026 — with the cost-crossover math, the hardware floor, and where it still hurts.
For the last three years, the assumed shape of an AI workload was: prompt leaves the laptop, lands on someone else’s GPU, output comes back. The data center was the only place a serious model could live. Anything you could fit on a laptop was a toy.
That assumption is quietly breaking. Open-weight models have caught up to where the closed frontier was eighteen months ago. Apple shipped 128 GB of unified memory at 546 GB/s in a laptop you can buy. AMD answered with Strix Halo — 128 GB unified memory in an x86 box. Quantization stopped lying about quality. The runtimes grew up. And the cloud bills stopped feeling small.
This post is the honest version of where personal compute sits in 2026 — what you actually get, where the cost crossover lives, what the hardware floor is for each model class, and where running locally still hurts. No pitch. The cloud still wins on a few axes that matter; the point is that it no longer wins by default.

Why now
Four things compounded in roughly the same eighteen months. None of them alone would have moved the needle. Together they shifted the default.
- Open weights closed the gap. Llama 3.3 70B scores 86.0 on MMLU — within noise of the original GPT-4. DeepSeek-V3 and Qwen 3 sit in the same neighborhood. Epoch AI’s tracking puts the gap between the closed frontier and what runs on a sub-$2,500 consumer machine at roughly six months, and shrinking.
- Unified memory got serious. The M4 Max ships up to 128 GB at 546 GB/s. The M3 Ultra Mac Studio scales to 512 GB. Strix Halo brings the same architecture — single LPDDR5X pool shared between CPU and GPU — to AMD machines, with up to 96 GB allocatable as VRAM. Discrete cards still rule on tokens-per-second, but they max out at 32 GB on consumer SKUs — which means a 70B at Q4 doesn’t even fit.
- Quantization stopped lying. Q4_K_M, Q4_K_S, and the newer IQ4 schemes in llama.cpp give you ~95% of the full-precision quality at a quarter of the bytes. Two years ago this was a research claim. Now it’s the assumed default and the leaderboards look the same with or without it.
- The runtimes finished growing up. Ollama, LM Studio, MLX, and llama.cpp ship as one-line installs. LM Studio added an MLX engine to take advantage of Apple’s native framework. Tool-calling, structured outputs, vision — they all work locally now. The era of “you need a PhD to run a model offline” ended sometime in 2024; a minimal sketch of what local inference looks like in code follows this list.
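For the skeptical, here is roughly what “grown up” looks like in code: a minimal sketch using the llama-cpp-python bindings. The model path, context size, and prompt are placeholders, and Ollama or LM Studio will get you the same result without writing any Python at all.

```python
# Minimal local inference with the llama-cpp-python bindings
# (pip install llama-cpp-python). The model path is a placeholder; any
# Q4_K_M GGUF pulled from Hugging Face works the same way.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-3.3-70b-instruct.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,   # offload every layer to the GPU / unified memory
    n_ctx=8192,        # context window; longer context means a bigger KV cache in RAM
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain Q4_K_M quantization in two sentences."}],
    max_tokens=200,
)
print(out["choices"][0]["message"]["content"])
```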
The fifth, harder-to-name driver is BYOK fatigue. Anyone who has spent a year handing keys to four cloud providers, watching pricing tables shift quarterly, and re-reading data-retention clauses is primed to hear the alternative. Andrej Karpathy’s 2025 LLM year in review mentions in passing that he’s buying his next laptop with enough unified memory to fit 2026’s open frontier. That’s the signal.
The cost crossover
The cost argument is real but it’s subtle. Local inference is free at the margin — your laptop draws 40–80 W under sustained inference, which is electricity in the rounding-error range. The hardware is not free. The question is how many tokens you have to push through before the laptop pays for itself.
Three rough user profiles, all priced against Claude Sonnet 4.6 at $3 / $15 per million input / output tokens, assuming a roughly 1:5 input-to-output mix:
For light users, the laptop never pays back — you’re better off renting. For medium users (somebody using AI in their daily workflow, running a few thousand prompts a month with chunky outputs) it crosses over inside two years. For heavy users — agentic loops, code review pipelines, batch document analysis — the M4 Max pays for itself in months, not years, and from then on every token is essentially free.
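For anyone who wants to run their own numbers, here is the back-of-the-envelope version. The monthly token volumes per profile are illustrative assumptions, not measurements, and the hardware figure is a placeholder standing in for a 128 GB M4 Max configuration; swap in your own.

```python
# Back-of-the-envelope payback calculator. Monthly token volumes per profile are
# illustrative assumptions; prices are the $3 / $15 per million Sonnet rates above.
HARDWARE_COST = 4_000               # placeholder: ~128 GB M4 Max configuration
PRICE_IN, PRICE_OUT = 3.0, 15.0     # USD per million tokens, input / output

def monthly_cloud_cost(million_tokens: float, in_to_out=(1, 5)) -> float:
    """Cloud spend for one month of traffic at the assumed 1:5 input:output mix."""
    i, o = in_to_out
    return million_tokens * (i * PRICE_IN + o * PRICE_OUT) / (i + o)

# Hypothetical monthly volumes, in millions of tokens, for the three profiles.
profiles = {"light": 1, "medium": 15, "heavy": 150}

for name, volume in profiles.items():
    monthly = monthly_cloud_cost(volume)
    print(f"{name:>6}: ~${monthly:,.0f}/mo in cloud spend "
          f"-> hardware pays back in ~{HARDWARE_COST / monthly:.0f} months")
```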
Three caveats this picture hides. First, prompt caching changes the math the other way: if your workload is mostly cache-friendly, cloud effective rates can drop 50–90%, and the local payback gets longer. Second, the laptop is also a laptop — the marginal cost of buying it is much lower than the sticker price if you’d be buying one anyway. Third, the cloud bill scales with the team; a single $4,000 desktop AI box amortizes across whoever sits at it.
The honest framing is: cost crossover is real for power users, weak for casual users, and almost irrelevant for teams that want per-employee billing. The privacy and offline arguments — covered below — are the ones that hold across all three cases.
The hardware floor
The single most useful thing you can know before buying hardware is the floor — how much RAM each model class actually requires, and how fast it runs on the machines you’d realistically own. The relationship is non-linear and full of gotchas.
Two patterns dominate. The first: RAM is the wall, not flops. Llama 3.3 70B at Q4_K_M is ~40 GB of weights before context. An RTX 4090 has 24 GB. An RTX 5090 has 32 GB. Neither fits. To run a 70B on a discrete consumer card you have to offload layers to system RAM, and tokens-per-second collapses by 4–10x. Apple Silicon doesn’t have this problem because the GPU addresses the same memory the CPU does — that’s the entire reason a $4,000 MacBook outruns a $2,000 GPU on 70B models.
The second pattern: 8B is everywhere, 70B is the sweet spot, and anything bigger is desktop-only. An 8 GB MacBook Air runs an 8B model fine. A 32 GB machine is the comfortable floor for production work. For 70B-class quality you need 64 GB minimum (Apple Silicon) or Strix Halo’s 128 GB unified pool. For DeepSeek-V3-class 600B+ models, you’re in M3 Ultra territory at $4,000–$10,000 for a desk machine.
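If you want to sanity-check a purchase against a specific model, the arithmetic fits in a few lines. This is a sketch under stated assumptions: ~4.8 effective bits per weight approximates Q4_K_M, and the KV-cache figures assume a Llama-3-70B-like architecture, not exact numbers for any particular GGUF.

```python
# Rough memory estimate for a local model: quantized weights plus KV cache.
# ~4.8 effective bits per weight approximates Q4_K_M (scales and zero-points
# included); the KV-cache figures assume a Llama-3-70B-like architecture
# (80 layers, 8 KV heads, head dim 128, fp16 cache).

def weights_gb(params_billion: float, bits_per_weight: float = 4.8) -> float:
    return params_billion * bits_per_weight / 8          # billions of bytes ~= GB

def kv_cache_gb(context_tokens: int, layers: int = 80, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_value: int = 2) -> float:
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value   # keys + values
    return context_tokens * per_token / 1e9

print(f"70B @ Q4: ~{weights_gb(70):.0f} GB of weights")        # ~42 GB, before context
print(f"8K context adds ~{kv_cache_gb(8192):.1f} GB of KV cache")
print(f"8B  @ Q4: ~{weights_gb(8):.0f} GB of weights")         # ~5 GB, why an 8 GB Air copes
```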
Where it still hurts
The case for personal compute would be a pitch if it didn’t admit the costs. Five things still hurt, ranked by how often they catch people:
- The frontier-model gap is real. A local 70B is GPT-4-class. It is not Claude Opus 4.7 or GPT-5 Pro. On long-horizon reasoning, multi-step planning, or genuinely novel coding problems, the frontier closed models still win and it’s not close. If your work routinely needs the top-of-the-stack model, local is a supplement, not a replacement.
- Tokens-per-second on the heavy models. Llama 3.3 70B at Q4 runs at 10–18 tok/s on an M3/M4 Max. Cloud GPT-4o serves over 100 tok/s reliably. For interactive use you adapt — you read at the speed it generates and it’s fine — but if you’re piping the output into another model in a chain, every step is felt.
- Setup tax outside the Apple lane. Linux + AMD ROCm, Windows + WSL + CUDA, BIOS settings on Strix Halo to expose enough VRAM — the configuration paths are still rough. The Apple side is one click; the rest of the world is a weekend project.
- Power and heat on laptops. Sustained inference pushes the M4 Max to ~110 W. That’s well within thermal limits and there’s no throttling — but the fans run, the battery drains in three hours, and you notice. A 450 W RTX 4090 in a desktop is a different conversation.
- Tools don’t come for free. ChatGPT and Claude.ai bundle web search, code execution, file analysis, image generation. A local 70B is just a model. You — or your desktop app — have to wire those tools yourself. Worth it for many workloads; something to know going in. A sketch of that wiring follows this list.
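Here is what that wiring looks like in practice, sketched with Ollama’s Python client. The weather function and model tag are placeholders; the shape (declare a tool, let the model request it, run it, hand the result back) is the general pattern, not any particular app’s implementation.

```python
# Tool wiring with a local model through Ollama's Python client (pip install ollama).
# The weather function is a stand-in for whatever capability you actually need;
# the model tag assumes `ollama pull llama3.3` has already run.
import ollama

def get_weather(city: str) -> str:
    """Placeholder tool. A real app would call an actual weather API here."""
    return f"18 degrees C and clear in {city}"

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

messages = [{"role": "user", "content": "What's the weather in Lisbon right now?"}]
response = ollama.chat(model="llama3.3", messages=messages, tools=tools)

# If the model asked for the tool, run it and feed the result back for a final answer.
for call in response.message.tool_calls or []:
    if call.function.name == "get_weather":
        messages.append(response.message)
        messages.append({"role": "tool", "name": call.function.name,
                         "content": get_weather(**call.function.arguments)})

print(ollama.chat(model="llama3.3", messages=messages).message.content)
```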
The flip side of every one of these is the privacy story. Local inference doesn’t make data leakage harder — it makes it impossible. There is no log to subpoena, no third-party vendor to breach, no quiet policy change that retroactively retains the conversation you thought was deleted. For regulated work or anything genuinely confidential, that’s the only argument that matters and the cloud has no answer to it.
What this means
For most people the right pattern in 2026 is hybrid. Run the small models locally for everything that doesn’t need a frontier brain — chat, summarization, formatting, code edits, the long tail of prompts that fire dozens of times a day. Reach for cloud when you need the absolute best reasoning, when you’re generating an hour of video, or when a workload genuinely needs hundreds of parallel GPUs you don’t want to own.
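In code, the hybrid pattern is a routing decision more than an architecture. A sketch, assuming Ollama on the local side and an Anthropic key for the frontier; the routing flag, model tags, and cloud model name are placeholders you would replace with whatever signal fits your workload.

```python
# Hybrid dispatch: a local model for the everyday long tail, a BYOK frontier model
# for the hard cases. The routing flag, model tags, and cloud model name are
# placeholder assumptions; route on whatever signal fits your workload.
import ollama
import anthropic

cloud = anthropic.Anthropic()   # reads ANTHROPIC_API_KEY from the environment

def ask(prompt: str, needs_frontier: bool = False) -> str:
    if needs_frontier:
        reply = cloud.messages.create(
            model="claude-sonnet-4-5",   # placeholder frontier model name
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        return reply.content[0].text
    reply = ollama.chat(model="llama3.3",   # local 70B, stays on your disk
                        messages=[{"role": "user", "content": prompt}])
    return reply.message.content

print(ask("Tidy up this paragraph: ..."))                                     # stays local
print(ask("Plan a multi-step refactor of this repo.", needs_frontier=True))   # goes to cloud
```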
The buy-now framing for hardware: 32 GB unified memory is the floor for serious work, 64 GB is the comfortable middle, and if you can swing 96–128 GB you’re good through the next generation of open weights. A used M3 Max 64 GB MacBook is probably the highest cost-performance ratio on the market right now; a Mac Studio M4 Max with 64–128 GB is the desktop answer; Strix Halo mini-PCs are the x86 alternative if Linux is your shop.
CSuite is built on the assumption this shift is real and ongoing. Bring your own keys for the cloud frontier when you need it, run Llama / Qwen / Gemma / Mistral locally for everything else, and keep every artifact on your disk. Whether you use CSuite or not, the point is the same: the data center is no longer the only place a serious model can live, and that changes the shape of what an AI workflow looks like. The desk is back in the picture.


