Engineering · Local AI · Privacy · May 5, 2026 · 9 min read

Run GPT-4 class models on your laptop without sending a single byte to the cloud

Open weights now match GPT-4 quality. Here's how CSuite runs them on your machine — no proxy, no logging, no tokens billed.

By Atul
  • MMLU (higher is better): GPT-4 (Mar 2023, cloud) 86.4 · Llama 3.3 70B (2026, your laptop) 86.0
  • 0.4 pts behind GPT-4
  • ~50 ms time to first token
  • 0 bytes leaving the laptop

Somewhere in the last twelve months, the open-weight models quietly caught up. Llama 3.3 70B scores 86.0 on MMLU and 88.4 on HumanEval: within half a point of the original GPT-4 from March 2023 on MMLU (86.4) and more than twenty points ahead of it on HumanEval (67.0). It runs at roughly 10–18 tokens per second on a 64 GB MacBook Pro M3/M4 Max. That’s faster than you read.

CSuite is built on this fact. It runs frontier-quality text, image, audio, and video models on your machine, behind your own keys, with zero proxying through anything we operate. This post is a hands-on tour of how that works under the hood: which runtimes ship with the app, what gets downloaded on first run, where models live on disk, and how local actually stacks up against cloud APIs on speed, cost, and privacy.

What “GPT-4 class” means in 2026

The “GPT-4 class” bar is whatever the original GPT-4 achieved on standard benchmarks. As of mid-2025, DeepSeek-V3, Llama 3.3 70B, Qwen 2.5 72B, Mistral Large 2 and Llama 3.1 405B all clear it cleanly — and several clear GPT-4o and Claude 3.5 Sonnet on at least one axis.

Standard benchmark scores · higher is better
Closed-weight models run in the cloud; open-weight models are runnable locally with enough RAM.

Model                 MMLU    HumanEval   Weights
GPT-4 (Mar 2023)      86.4    67.0        closed · cloud
GPT-4o                87.2    90.2        closed · cloud
Claude 3.5 Sonnet     88.3    92.0        closed · cloud
Llama 3.3 70B         86.0    88.4        open · local
DeepSeek-V3           88.5    82.6        open · local
Qwen 2.5 72B          85.3    77.3        open · local

Numbers come from each model’s official release notes and the comparison table in DeepSeek-V3’s model card. Treat differences inside ±2 points as noise — eval harnesses vary and most published deltas reflect prompting more than capability.

Honest caveat about CSuite’s curated catalog. The models the app pulls from its built-in picker are tuned for moderate hardware (8–32 GB RAM): Gemma 4 (up to 31B), Llama 3.1 8B, Llama 3.2 3B, Mistral 7B, Qwen 3 4B. If you have a 64+ GB Mac and want Llama 3.3 70B or DeepSeek-V3, point CSuite at your own Ollama install (the app supports BYO via a base-URL setting) and ollama pull llama3.3:70b — CSuite will see the model and let you select it like any other. No catalog gating, no special treatment.
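Nothing about that path is special to CSuite. Once a model is pulled, any client pointed at your Ollama base URL can see it through the standard /api/tags endpoint. Here is a minimal TypeScript sketch of that check, assuming your own install listens on Ollama’s stock port 11434; substitute whatever base URL you set in the app.

```ts
// Minimal sketch: after `ollama pull llama3.3:70b`, list what the server knows
// about via Ollama's standard /api/tags endpoint. The base URL is an assumption
// (stock Ollama default), not anything CSuite-specific.
const OLLAMA_BASE_URL = "http://127.0.0.1:11434";

interface OllamaTag {
  name: string; // e.g. "llama3.3:70b"
  size: number; // bytes on disk
}

async function listLocalModels(baseUrl: string = OLLAMA_BASE_URL): Promise<OllamaTag[]> {
  const res = await fetch(`${baseUrl}/api/tags`);
  if (!res.ok) throw new Error(`Ollama not reachable at ${baseUrl}: HTTP ${res.status}`);
  const body = (await res.json()) as { models: OllamaTag[] };
  return body.models;
}

// Usage: confirm the 70B is visible before selecting it in the app.
const models = await listLocalModels();
for (const m of models) {
  console.log(`${m.name}  (${(m.size / 1e9).toFixed(1)} GB)`);
}
```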

CSuite’s two local runtimes

Two engines do all the local work. Both are deliberate choices, not accidents of history.

  • Ollama — the easiest way to run a wide catalog of quantized text models. CSuite ships a bundled copy that listens on port 11435 (not the default 11434), so it never collides with an Ollama you already have. If you do have your own install running, set the base URL once and CSuite uses that instead — same binary, your own model cache, no double-downloads. Models pulled via either path live in the standard ~/.ollama/models/.
  • HuggingFace transformers.js + ONNX — a pure-JS inference path with native ONNX bindings, used for ONNX-exported variants of Gemma 4 and Qwen 3. Lower memory footprint than Ollama for the same model size and useful for image+audio+text multimodal flows that don’t fit GGUF cleanly. Runs in a long-lived utility process so it doesn’t block the UI.

The split exists because the two ecosystems handle multimodality differently. Ollama is the gold standard for text + tool calls and increasingly good at images; the transformers.js / ONNX path is what makes Gemma 4’s text+image+audio chat work without three separate runtimes. Picking a model in CSuite picks the runtime — you don’t manage them by hand.
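To make the Ollama side concrete, here is roughly what the text path looks like from the outside: a localhost call to the bundled instance on port 11435 that streams newline-delimited JSON chunks back. This is the standard Ollama generate API, not internal CSuite code, and the model tag is an assumption (whichever one you pulled during onboarding).

```ts
// Stream tokens from the bundled Ollama instance over localhost. Port 11435 is
// the bundled instance described above; "llama3.1:8b" is an assumed model tag.
const BUNDLED_OLLAMA = "http://127.0.0.1:11435";

async function* generate(prompt: string, model = "llama3.1:8b"): AsyncGenerator<string> {
  const res = await fetch(`${BUNDLED_OLLAMA}/api/generate`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, prompt, stream: true }),
  });
  if (!res.ok || !res.body) throw new Error(`generate failed: HTTP ${res.status}`);

  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  let buffered = "";
  while (true) {
    const { value, done } = await reader.read();
    if (done) break;
    buffered += decoder.decode(value, { stream: true });
    // Ollama streams one JSON object per line, each carrying a chunk of text.
    let newline: number;
    while ((newline = buffered.indexOf("\n")) >= 0) {
      const line = buffered.slice(0, newline).trim();
      buffered = buffered.slice(newline + 1);
      if (!line) continue;
      const chunk = JSON.parse(line) as { response: string; done: boolean };
      if (chunk.response) yield chunk.response;
      if (chunk.done) return;
    }
  }
}

// Usage: tokens print as they arrive; nothing leaves the machine.
for await (const token of generate("Summarize this paragraph in one line: ...")) {
  process.stdout.write(token);
}
```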

[Photo: a MacBook Pro open on a wooden desk in a quiet, minimal workspace. The whole stack — runtime, model, KV cache — fits on the desk in front of you. Photo by Martin Katler on Unsplash.]

What gets downloaded on first run

Open the app for the first time and a six-step wizard runs. Nothing important is hidden:

  1. Project folder. Pick a directory on your disk where everything you generate lands as a real file.
  2. FFmpeg. Bundled binary, dropped in place. Used locally for every audio and video edit (crop, trim, transcode). Nothing leaves your machine for these.
  3. Ollama. Either the bundled copy is downloaded (~50 MB) and started on port 11435, or you point at an existing install. Either way, no model is pulled yet.
  4. First Ollama model. Pick one from the catalog. Llama 3.1 8B (~4.9 GB) is the recommended starting point on 8+ GB machines. The picker greys out models your machine doesn’t have RAM for and tells you why.
  5. HuggingFace runtime. Downloads the pinned transformers.js + ONNX packages from npm (~300 MB) into the app’s data folder, with SRI integrity verification (see the sketch after this list). This makes local image and audio models work; skip it and the text-only path still works.
  6. Cloud provider keys (optional). Paste an API key for Replicate, Runware, etc. if you want frontier cloud models alongside local ones. Keys are encrypted at rest with the OS keychain (Keychain on macOS, DPAPI on Windows, Secret Service on Linux).
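A note on step 5: the SRI check is the same mechanism browsers use for subresource integrity, a pinned sha384 digest of the exact bytes the runtime package should contain. The sketch below is an illustration only, not CSuite’s installer code, and the file name and digest are hypothetical.

```ts
import { createHash } from "node:crypto";
import { readFile } from "node:fs/promises";

// Illustration of an SRI-style check: an integrity string like
// "sha384-<base64 digest>" pins the exact bytes of a downloaded package,
// so the install is rejected if even one byte differs.
async function verifySri(filePath: string, expected: string): Promise<boolean> {
  const sep = expected.indexOf("-");
  const algorithm = expected.slice(0, sep);       // e.g. "sha384"
  const expectedDigest = expected.slice(sep + 1); // base64 of the hash
  const bytes = await readFile(filePath);
  const actualDigest = createHash(algorithm).update(bytes).digest("base64");
  return actualDigest === expectedDigest;
}

// Hypothetical package and pinned value, purely for illustration.
const ok = await verifySri(
  "transformers-runtime.tgz",
  "sha384-<base64 digest pinned at build time>",
);
if (!ok) throw new Error("integrity check failed; refusing to install runtime");
```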
Disk footprint after a typical first-run install
Approximate sizes. Most of the bytes are model weights; runtimes themselves are tiny.
  • Ollama binary: 50 MB
  • HuggingFace runtime: 300 MB
  • Llama 3.1 8B (Q4): 4.8 GB
  • Gemma 4 31B (Q4): 19.5 GB
  • Total: ~24.7 GB

Everything except your Ollama models lives under the app’s data directory: ~/Library/Application Support/csuite/ on macOS, %APPDATA%\csuite\ on Windows, ~/.config/csuite/ on Linux. Delete that folder and the app is reset to first-run state — your project files on disk stay untouched.
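If you want to find that directory programmatically, the per-OS resolution is simple. This sketch just mirrors the paths listed above; it is not a CSuite API.

```ts
import { homedir } from "node:os";
import { join } from "node:path";

// Resolve the app data directory per platform, mirroring the paths above.
function csuiteDataDir(): string {
  switch (process.platform) {
    case "darwin":
      return join(homedir(), "Library", "Application Support", "csuite");
    case "win32":
      // %APPDATA% normally points at ...\AppData\Roaming
      return join(process.env.APPDATA ?? join(homedir(), "AppData", "Roaming"), "csuite");
    default:
      // Linux and the BSDs: respect XDG, fall back to ~/.config
      return join(process.env.XDG_CONFIG_HOME ?? join(homedir(), ".config"), "csuite");
  }
}

console.log(csuiteDataDir());
```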

Local vs cloud, side by side

The interesting comparison isn’t local vs cloud on raw quality — the benchmarks above show they’re close. It’s the rest of the picture.

Local vs cloud — same prompt, different paths

                           Local                                     Cloud
Time to first token        ~50–100 ms (Llama 8B Q4 on M3 Max)        0.4–1.0 s (GPT-4o P50 via API)
Output speed               ~50 t/s (Llama 8B, M3 Max 40-core)        ~135 t/s (GPT-4o)
Cost per million tokens    Electricity to keep the laptop awake      $2.50 input / $10.00 output (GPT-4o)
What leaves your machine   Nothing. Localhost or in-process only.    Every prompt, every output, every file.
Works offline              Yes                                       No
Subject to rate limits     No                                        Yes

The latency story is the most underrated. A local 8B model emits its first token in tens of milliseconds because there’s no DNS lookup, no TLS handshake, no provider queue, no rate-limit gate between you and the GPU. GPT-4o’s P50 time-to-first-token via API hovers between 0.4 and 1.0 seconds depending on the provider. For interactive use — typing in a chat, triggering a quick rewrite — local feels different. It feels like a local app instead of a network app.
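You don’t have to take those numbers on faith; time-to-first-token is easy to measure yourself against the local endpoint. A rough sketch, reusing the bundled port and an assumed model tag from earlier (a cold model adds one-time load time to the first request):

```ts
// Measure time until the first streamed chunk arrives from the local endpoint.
// Port and model tag follow the setup described earlier and are assumptions.
async function timeToFirstToken(prompt: string, model = "llama3.1:8b"): Promise<number> {
  const start = performance.now();
  const res = await fetch("http://127.0.0.1:11435/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, prompt, stream: true }),
  });
  if (!res.ok || !res.body) throw new Error(`HTTP ${res.status}`);

  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  while (true) {
    const { value, done } = await reader.read();
    if (done) break;
    // The first non-empty chunk marks the first token off the model.
    if (decoder.decode(value, { stream: true }).trim().length > 0) {
      const elapsed = performance.now() - start;
      await reader.cancel(); // stop generation; we only wanted the first chunk
      return elapsed;
    }
  }
  throw new Error("stream ended before any token arrived");
}

console.log(`TTFT: ${(await timeToFirstToken("Say hi.")).toFixed(0)} ms`);
```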

Privacy you can actually verify

“We don’t store your prompts” only goes as far as you trust the provider not to silently change their mind, get hacked, or get court-ordered. The last few years have produced a thick file of examples:

  • Samsung, 2023. Within twenty days of allowing ChatGPT internally, Samsung engineers leaked source code and confidential meeting transcripts into OpenAI’s training-eligible inputs across three separate incidents. Samsung subsequently banned generative AI on company devices.
  • Italy, 2023–2024. The Garante banned ChatGPT for GDPR violations in March 2023 and fined OpenAI €15 million in December 2024 for the original conduct.
  • NYT v. OpenAI preservation order, May 2025. A magistrate judge ordered OpenAI to retain all ChatGPT logs — including conversations users had explicitly deleted — for the 400+ million users on Free, Plus, Pro, Team, and standard API tiers. OpenAI publicly objected; Enterprise and zero-data-retention API endpoints were exempt.
  • Mixpanel breach, November 2025. OpenAI confirmed a third-party analytics vendor was compromised and limited business-customer data exfiltrated. Chat content wasn’t in scope, but the incident underscores supply-chain risk for any cloud-hosted AI.

Local inference doesn’t shrink any of those failure modes; it removes them. There is no log to retain. There is no third-party vendor in the path. There is no court that can compel data that doesn’t exist. CSuite’s local generation runs in-process or in a localhost utility process; the network calls it does make are the ones you can see in the model picker, the downloads you trigger yourself.

Where this is going

Epoch AI publishes the cleanest version of the trajectory: the gap between the cloud frontier and what runs on a sub-$2,500 consumer GPU is now around six months, and shrinking — open models gain ~125 ELO per year on the LMArena leaderboard versus ~80 for closed ones. Andrej Karpathy publicly noted in his 2025 year-in-review that his next laptop will have ≥128 GB of unified memory specifically so 2026’s open-weight frontier fits.

What that adds up to: if you’re buying hardware in 2026, the sensible default is enough RAM to fit a 70B at Q4 — 64 GB on a Mac is the floor, 96–128 GB if you can afford it. A year from now that machine runs the models that ship today, and probably whatever ships next year too. Cloud is still useful for spikes — generating an hour of video, or running an enormous reasoning model that no laptop can hold — but it stops being the default delivery mechanism for the 90% of work that doesn’t need it.
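The 64 GB floor isn’t arbitrary; rough arithmetic on a 70B at Q4 gets you there. The constants below are approximations (effective bits per weight for a Q4 K-quant, Llama-3-style attention geometry), not measurements:

```ts
// Back-of-the-envelope memory check for a 70B model at Q4. Rough numbers only.
const params = 70e9;        // parameter count
const bitsPerWeight = 4.8;  // a typical effective rate for Q4 K-quants
const weightsGB = (params * bitsPerWeight) / 8 / 1e9; // ~42 GB of weights

// KV cache for a Llama-3-style 70B: 80 layers, grouped-query attention with
// 8 KV heads of dimension 128, fp16 keys and values.
const kvBytesPerToken = 2 /* K and V */ * 80 * 8 * 128 * 2 /* fp16 */; // ~0.33 MB
const kvGBAt8k = (kvBytesPerToken * 8192) / 1e9; // ~2.7 GB at an 8k context

console.log(`weights ~${weightsGB.toFixed(0)} GB, KV cache at 8k ~${kvGBAt8k.toFixed(1)} GB`);
// Roughly 45 GB before the OS, the app, and everything else you have open:
// comfortable inside 64 GB of unified memory, impossible on 32 GB.
```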

That’s the bet CSuite is built around. Try it: pick the smallest model in the catalog, watch your laptop fan barely move, and notice that nothing in your task manager talked to the network during the run.
