CSuite
Explainer · Local AI · Models · July 1, 2026 · 9 min read

What is quantization? How a giant AI model fits on your laptop

A 70-billion-parameter model is a 141 GB download. You run the 43 GB copy and can't tell them apart. That's quantization.

By Atul
One model, four sizes · Llama 3.3 70B
Download size
Same model. A fifth of the file.
F16: 141 GB
Q8_0: 75 GB
Q4_K_M: 43 GB
Q2_K: 26 GB
The 4-bit version is 30% of the full file and scores within half a percent of it on standard benchmarks. Sizes from Ollama.

Meta’s Llama 3.3 70B, in the form its makers released it, is a 141 GB download. That will not fit on any laptop you can buy. Yet people run this model on MacBooks every day. They just download a different copy of it: 43 GB, less than a third the size, and in ordinary use they cannot tell the two apart.

The shrinking trick has a name, and it is quietly the most important idea in running AI on your own hardware. Nearly every “run it locally” guide you have read depends on it without stopping to explain it. This post stops to explain it. The stakes are simple: get it right and a frontier-scale model lives on the machine you already own, with no API bill and no data leaving your desk. Get it wrong and you either can’t fit the model or you cripple it.

Quantization lowers the numerical precision of a model’s weights so the same model takes far less memory. Done well, it cuts the file to roughly a quarter of its size while leaving quality almost untouched. Once you see how it works, one practical decision falls out of it, and it’s a decision you’ll make every time you download a model.

A model is a giant pile of numbers

Strip away the mystique and a language model is a giant list of numbers called weights. A 70-billion-parameter model has 70 billion of them. Each weight is a small decimal (something like 0.0413) that the model learned during training, and running the model means multiplying your input through all of them.

Every one of those numbers has to be stored, and how you store a number is a choice. A computer can write a decimal with a lot of precision or a little. High precision means more digits after the point and more bits of storage per number. Low precision means fewer. The number 0.0413 might become 0.04. Close enough for most purposes, and much cheaper to keep.

That is the whole trick. Multiply a tiny saving per number by billions of numbers and it turns into tens of gigabytes. Quantization is the careful version of rounding every weight in the model to fewer bits.
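The rounding described above can be sketched in a few lines. This is a minimal illustration, assuming the simplest symmetric scheme with one shared scale for the whole list; the function names `quantize` and `dequantize` and the sample weights are made up for this example, and real formats are more elaborate.

```python
def quantize(weights, bits):
    """Round each weight to a signed integer that fits in `bits` bits.

    Assumes one shared scale for the whole list (illustrative only).
    """
    levels = 2 ** (bits - 1) - 1          # 7 for 4-bit, 127 for 8-bit
    scale = max(abs(w) for w in weights) / levels
    if scale == 0:                        # all-zero input: nothing to scale
        return [0] * len(weights), 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(codes, scale):
    """Turn the stored integers back into approximate weights."""
    return [c * scale for c in codes]


# Made-up sample weights, small decimals like the ones a model stores.
weights = [0.0413, -0.0127, 0.0871, -0.0650, 0.0032]
codes, scale = quantize(weights, bits=4)
restored = dequantize(codes, scale)

for original, code, approx in zip(weights, codes, restored):
    print(f"{original:+.4f} -> code {code:+d} -> {approx:+.4f}")
```

Only the small integer codes and the single scale need to be stored. Each restored weight lands within half a step of the original, which is the rounding error the rest of this post is about.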

A close-up of a computer memory module showing its chips.
Every weight has to live in memory to run. Quantization is the difference between a model that fits in the RAM you own and one that doesn’t. Photo by William Warby on Unsplash.

Fewer bits per number, a much smaller file

Models are usually trained at 32-bit or 16-bit precision, meaning 4 bytes or 2 bytes of storage for each weight. Quantizing drops that to 8 bits (1 byte), or 4 bits (half a byte), sometimes lower. Because the file size is just bits-per-weight multiplied by the number of weights, the arithmetic is refreshingly blunt: halve the bits, halve the file.

Hugging Face lays out the ladder in its bitsandbytes guide: a 7-billion-parameter model needs about 28 GB at 32-bit, 14 GB at 16-bit, and 3.5 GB at 4-bit. Same model, same 7 billion weights. The only thing that changed is how many bits each number gets.

Bits per number set the whole bill
Memory for a 7-billion-parameter model at each precision. Halve the bits, halve the file.
| Precision | Per number | 7B model | What it is |
| --- | --- | --- | --- |
| FP32 | 4 bytes | 28 GB | Original training precision |
| FP16 | 2 bytes | 14 GB | Standard “full” release |
| Int8 | 1 byte | 7 GB | Halved again |
| 4-bit | 0.5 bytes | 3.5 GB | Runs on a phone |
Per-number sizes from Hugging Face’s bitsandbytes guide. A weight is just a number; store it in fewer bits and the model gets smaller in direct proportion.
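The arithmetic behind that ladder is short enough to check yourself. The sketch below assumes decimal gigabytes and counts only the weights, ignoring file metadata and the working memory a model needs while running; `model_size_gb` is a name invented for this example.

```python
def model_size_gb(num_weights, bits_per_weight):
    """Approximate storage for the weights alone, in decimal gigabytes."""
    total_bytes = num_weights * bits_per_weight / 8
    return total_bytes / 1e9


for label, bits in [("FP32", 32), ("FP16", 16), ("Int8", 8), ("4-bit", 4)]:
    print(f"{label:>5}: {model_size_gb(7e9, bits):.1f} GB")
```

Swap `7e9` for another parameter count to estimate a different model at the same precisions.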

This is why memory, not raw speed, is the gate on local AI. A weight has to sit in RAM to be used, so the quantized size, plus some working memory for the conversation itself, is what decides whether a model fits on your machine at all. It is the same math behind the 128 GB laptops that can now hold a 70B model, and the reason a phone can run a small one. Shrink the numbers and the hardware requirement shrinks with them.

One clarification, because the terms get muddled. Quantization keeps every one of the model’s weights. It only stores each in fewer bits. That makes it different from two other shrinking methods: pruning, which deletes weights the model can spare, and distillation, which trains a smaller student model to imitate a larger one. Those change what the model is. Quantization changes only how precisely it is written down, which is why the same model at 4-bit still behaves like itself.

Four-bit costs almost nothing in quality

The obvious worry: if you round off every number, doesn’t the model get dumber? A little. Far less than intuition suggests. The reason is that a neural network never relied on those trailing digits in the first place. Its answers come from billions of numbers voting together, and the vote barely moves when each vote is rounded slightly.
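One way to see why the vote barely moves is that individual rounding errors point in different directions and mostly cancel. The toy below uses random numbers rather than a real model, and the sizes (`N`, `BLOCK`) and the 4-bit scheme are illustrative assumptions. It compares how far one output actually shifts against how far it would shift if every error pushed the same way.

```python
import random

random.seed(0)    # fixed seed so the run is repeatable

N = 4096          # weights feeding one output (illustrative size)
BLOCK = 32        # weights that share one scale (illustrative choice)
LEVELS = 7        # signed 4-bit range: -7..7

weights = [random.gauss(0, 0.05) for _ in range(N)]
inputs = [random.gauss(0, 1) for _ in range(N)]

# Round each block of weights to 4 bits using that block's own scale.
rounded = []
for start in range(0, N, BLOCK):
    block = weights[start:start + BLOCK]
    scale = max(abs(w) for w in block) / LEVELS
    rounded.extend(round(w / scale) * scale for w in block)

# Contribution of each weight's rounding error to the output.
errors = [(r - w) * x for w, r, x in zip(weights, rounded, inputs)]

actual_shift = abs(sum(errors))            # how far the output really moved
worst_case = sum(abs(e) for e in errors)   # if every error pushed the same way

print(f"worst-case shift: {worst_case:.3f}")
print(f"actual shift:     {actual_shift:.3f}")
print(f"ratio:            {actual_shift / worst_case:.3f}")
```

The actual shift comes out as a small fraction of the worst case. That covers only the cancellation part of the argument; how a full network tolerates the remaining error is what the benchmarks measure.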

The measurements bear this out. A 2026 evaluation of llama.cpp quantization on Llama-3.1-8B found the popular 4-bit setting scored 69.15 on an aggregate benchmark against 69.47 for the full-precision model. That is a loss of under half a percent, in exchange for cutting the model to under a third of its size. In blind chat, nobody notices.

A white plastic caliper with fine millimeter gradations on a blue surface.
A ruler marked to a tenth of a millimeter and one marked to the centimeter measure the same table equally well. Most of a model’s precision is detail nobody needs. Photo by Ag PIC on Unsplash.

Think of it like measuring tools. A caliper reading to a hundredth of a millimeter and a ruler marked in centimeters will both tell you the same table is about a meter wide. The extra precision is real, and for this job it is wasted. Most of a model’s 16-bit precision is detail the task never asks for, which is why throwing most of it away barely registers.

Below four bits, quality falls off a cliff

The catch is that this generosity runs out. Compression is nearly free down to 4 bits, then the losses stop being rounding errors and start being real. The llama.cpp project’s own measurements show the pattern cleanly on a 7B model: the extra perplexity (a standard error measure, where lower is better) is a negligible +0.05 at 4-bit, but jumps to +0.24 at 3-bit and +0.87 at 2-bit.

The quality cost, and where it cliffs
Extra perplexity vs full precision on a 7B model (lower is better; longer bar is worse).
Q8_0: +0.0004
Q6_K: +0.004
Q5_K_M: +0.014
Q4_K_M: +0.054
Q3_K_M: +0.244
Q2_K: +0.87
Perplexity deltas from the llama.cpp k-quants measurements. From 8-bit down to 4-bit the loss is a rounding error. Below 4-bit it rises to +0.24 and then +0.87, roughly a fourfold jump at each step.

That is roughly a fourfold jump at each step below 4-bit, and the absolute losses are no longer negligible. The same evaluation paper watched a 3-bit setting knock the model’s grade-school math score from 77.6 to 68.3, a nine-point drop that would be obvious to any user. The lesson is not “smaller is always fine.” It is that 4-bit sits right at the elbow of the curve: almost all the size savings, almost none of the quality cost.

The floor has been creeping down. A newer generation of 2-bit and 3-bit methods, the ones labeled with an IQ prefix, use an “importance matrix” that studies which weights the model leans on most and protects them, recovering some of the quality the old low-bit builds threw away. They are genuinely better than the naive versions. The cliff is gentler than it was, but it is still a cliff: for everyday use, 4-bit remains the line most people should not cross.

There is a bonus most people miss. A quantized model is usually not just smaller but faster. Generating text is limited less by raw math than by how many bytes the chip has to haul out of memory for each word. A 4-bit model moves roughly a quarter of the bytes a 16-bit one does, so on the same laptop it tends to produce tokens noticeably quicker. You shrink the file and speed up the replies in the same move.
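A back-of-the-envelope version of that speed argument: if every weight must be read from memory once per token, memory bandwidth divided by model size gives a ceiling on tokens per second. The 400 GB/s figure below is a hypothetical laptop bandwidth, not a measurement, and real throughput will sit below this bound.

```python
def max_tokens_per_second(model_gb, bandwidth_gb_per_s):
    """Upper bound if every weight is read from memory once per token."""
    return bandwidth_gb_per_s / model_gb


BANDWIDTH = 400  # GB/s, a hypothetical figure chosen for illustration

# Llama 3.3 70B sizes quoted at the top of this post.
for label, size_gb in [("F16", 141), ("Q8_0", 75), ("Q4_K_M", 43)]:
    ceiling = max_tokens_per_second(size_gb, BANDWIDTH)
    print(f"{label:>7}: at most {ceiling:.1f} tokens/s")
```

Whatever bandwidth you plug in, the ratio between rows stays the same: the 43 GB file has a ceiling a little over three times that of the 141 GB one.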

What Q4_K_M actually means

Download a model and you meet a wall of cryptic labels: Q4_K_M, Q5_K_S, Q8_0, IQ2_XXS. They look like part numbers. They are readable once you know the parts.

The number is the headline bit count. Q4 is roughly 4 bits per weight, Q8 is 8, Q2 is 2. The K marks a smarter family called k-quants that do not round every weight equally. They spend more bits on the layers that matter most to the output and fewer on the rest, which is why a “4-bit” model averages between 4.5 and 5 bits per weight. The trailing S, M, or L is small, medium, or large: how generous that mixing is. Q4_K_M is the medium, balanced 4-bit build, and it is the one most people should take.
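As a rough cross-check on those labels, you can divide a file's size by the number of weights to get the average bits actually spent per weight. The sketch uses the Llama 3.3 70B sizes quoted at the top of this post and assumes a nominal 70 billion weights and decimal gigabytes, so the results are approximate.

```python
PARAMS = 70e9  # nominal weight count, taken from the "70B" in the model name

# File sizes in GB as quoted earlier for Llama 3.3 70B.
FILE_SIZES_GB = {"F16": 141, "Q8_0": 75, "Q4_K_M": 43, "Q2_K": 26}


def bits_per_weight(size_gb, params=PARAMS):
    """Average bits stored per weight, treating the whole file as weights."""
    return size_gb * 1e9 * 8 / params


for label, size_gb in FILE_SIZES_GB.items():
    print(f"{label:>7}: {bits_per_weight(size_gb):.1f} bits per weight")
```

Every quantized row comes out somewhat above its headline number, because the file also carries scale factors and keeps some tensors at higher precision.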

There is real engineering behind the “spend bits where they matter” idea. The influential QLoRA paper introduced a 4-bit format called NF4, designed to be, in its authors’ words, “information-theoretically optimal” for the way neural network weights actually cluster around zero. It let researchers fine-tune a 65B model on a single 48 GB GPU, and their quantized model reached 99.3% of ChatGPT’s score on a common benchmark. Careful 4-bit is not a hack. It is a well-studied sweet spot.

The one decision: which quant to download

Here is the practical payoff. When a model page offers a dozen quant options, you are really choosing one point on the size-versus-quality curve, and the curve tells you where to stand.

In practice you rarely quantize a model yourself. Someone has already done it and published the results as a file, usually in a container format called GGUF, the standard the popular llama.cpp engine reads. A tool like Ollama or LM Studio downloads one of those files for you. That is what all the labels are: pre-built copies at different bit depths, sitting on a download page, waiting for you to pick a row.

Which quant to download
Q8_0: You have RAM to burn and want an archival, near-lossless copy.
Q6_K / Q5_K_M: A little headroom to spare and you want the safest quality margin.
Q4_K_M: The default. Best size-to-quality trade for almost everyone.
Q3_K_M: Only if Q4 won't fit. You start to feel it on hard tasks.
Q2_K / IQ2: Last resort to squeeze a model in at all. Expect wobble.
When in doubt, take Q4_K_M. It is what Ollama hands you by default, and the reason is everything above.

The rule of thumb is short. Take Q4_K_M unless you have a specific reason not to. If you have spare memory and want the safest margin, step up to Q5 or Q6. Only drop to 3-bit or 2-bit when a model would not otherwise fit, and know you are trading real quality to do it. This is also the quiet logic behind most “best local model” roundups: the size they quote is almost always the 4-bit one.
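That rule of thumb can be written as a small helper. This sketch covers only the three quantized Llama 3.3 70B sizes quoted in this post; the function name `pick_quant`, the `want_extra_margin` flag, and the 8 GB headroom for the operating system and context are all assumptions made for illustration.

```python
# Llama 3.3 70B download sizes in GB, largest first (figures quoted above).
LADDER = [("Q8_0", 75), ("Q4_K_M", 43), ("Q2_K", 26)]


def pick_quant(ram_gb, headroom_gb=8, want_extra_margin=False):
    """Apply the rule of thumb: Q4_K_M by default, lower only if forced.

    `headroom_gb` is a hypothetical allowance for the OS and context memory.
    Returns None when nothing on the ladder fits.
    """
    budget = ram_gb - headroom_gb
    start = 0 if want_extra_margin else 1   # skip Q8_0 unless asked for
    for label, size_gb in LADDER[start:]:
        if size_gb <= budget:
            return label
    return None


for ram in (128, 64, 48, 16):
    print(f"{ram:>3} GB RAM -> {pick_quant(ram)}")
```

With these assumptions, a 64 GB machine gets Q4_K_M, a 48 GB machine falls back to Q2_K, and a 16 GB machine cannot hold this model at any listed size.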

A MacBook Pro on a dark office desk.
A frontier-scale model, quantized to 4-bit, runs on the machine already on your desk. No data center, no API bill. Photo by Julian Hochgesang on Unsplash.

Shrink it, then run it

Quantization is the bridge between the models labs release and the hardware you own. It is why a 141 GB frontier model becomes a 43 GB download that runs on a laptop, and why that laptop’s answers are nearly indistinguishable from the full-precision original. The mechanism is nothing fancier than rounding billions of numbers with care.

Keep three things. A model is a pile of numbers, and quantization stores each in fewer bits. Down to 4 bits the quality cost is a rounding error; below it, the floor drops away. And when you download a model, Q4_K_M is the default for a reason worth trusting. Quantization sits alongside the token as one of the few pieces of plumbing that, once you understand it, makes the whole machine feel less like magic and more like something you can size, budget, and run yourself.
