What is quantization? How a giant AI model fits on your laptop
A 70-billion-parameter model is a 141 GB download. You run the 43 GB copy and can't tell them apart. That's quantization.
Meta’s Llama 3.3 70B, in the form its makers released it, is a 141 GB download. That will not fit on any laptop you can buy. Yet people run this model on MacBooks every day. They just download a different copy of it: 43 GB, less than a third the size, and in ordinary use they cannot tell the two apart.
The shrinking trick has a name, and it is quietly the most important idea in running AI on your own hardware. Nearly every “run it locally” guide you have read depends on it without stopping to explain it. This post stops to explain it. The stakes are simple: get it right and a frontier-scale model lives on the machine you already own, with no API bill and no data leaving your desk. Get it wrong and you either can’t fit the model or you cripple it.
Quantization lowers the numerical precision of a model’s weights so the same model takes far less memory. Done well, it cuts the file to roughly a quarter to a third of its size while leaving quality almost untouched. Once you see how it works, one practical decision falls out of it, and it’s a decision you’ll make every time you download a model.
A model is a giant pile of numbers
Strip away the mystique and a language model is a giant list of numbers called weights. A 70-billion-parameter model has 70 billion of them. Each weight is a small decimal (something like 0.0413) that the model learned during training, and running the model means multiplying your input through all of them.
Every one of those numbers has to be stored, and how you store a number is a choice. A computer can write a decimal with a lot of precision or a little. High precision means more digits after the point and more bits of storage per number. Low precision means fewer. The number 0.0413 might become 0.04. Close enough for most purposes, and much cheaper to keep.
That is the whole trick. Multiply a tiny saving per number by billions of numbers and it turns into tens of gigabytes. Quantization is the careful version of rounding every weight in the model to fewer bits.
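To make the rounding concrete, here is a minimal Python sketch of one simple scheme: every weight is divided by a shared scale factor and rounded to a small integer code. It is an illustration only, not the method any particular tool uses; real quantizers work on small blocks of weights and are considerably more careful. The function names and the sample weights are invented for the example.

```python
def quantize(weights, bits=4):
    """Map floats to small signed integer codes that share one scale factor."""
    qmax = 2 ** (bits - 1) - 1                          # 7 for 4-bit codes
    scale = max(abs(w) for w in weights) / qmax or 1.0  # avoid a zero scale
    codes = [round(w / scale) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


# Hypothetical weights, invented for illustration.
weights = [0.0413, -0.0127, 0.0876, -0.0954, 0.0031, 0.0588]
codes, scale = quantize(weights, bits=4)
restored = dequantize(codes, scale)
worst_error = max(abs(w - r) for w, r in zip(weights, restored))

print(codes)        # integers in the range -7..7
print(worst_error)  # at most half of one step (scale / 2)
```

Only the integer codes and the one scale factor need to be stored, which is where the saving comes from. The price is the small gap between each original weight and its restored value.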

Fewer bits per number, a much smaller file
Models are usually trained at 32-bit or 16-bit precision, meaning 4 bytes or 2 bytes of storage for each weight. Quantizing drops that to 8 bits (1 byte), or 4 bits (half a byte), sometimes lower. Because the file size is just bits-per-weight multiplied by the number of weights, the arithmetic is refreshingly blunt: halve the bits, halve the file.
Hugging Face lays out the ladder in its bitsandbytes guide: a 7-billion-parameter model needs about 28 GB at 32-bit, 14 GB at 16-bit, and 3.5 GB at 4-bit. Same model, same 7 billion weights. The only thing that changed is how many bits each number gets.
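The ladder is easy to reproduce yourself. The short Python sketch below assumes the file is nothing but weights, all stored at the same bit width; real files carry a little metadata and often mix precisions, so treat the result as an estimate. The function name is made up for this post.

```python
def model_size_gb(num_weights, bits_per_weight):
    """Estimated size in GB: weights x bits, 8 bits per byte, 1e9 bytes per GB."""
    return num_weights * bits_per_weight / 8 / 1e9


# A 7-billion-parameter model at each rung of the ladder.
for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {model_size_gb(7e9, bits):.1f} GB")
# prints 28.0, 14.0, 7.0 and 3.5 GB
```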
This is why memory, not raw speed, is the gate on local AI. A weight has to sit in RAM to be used, so the quantized size, plus a little working room, is what decides whether a model fits on your machine at all. It is the same math behind the 128 GB laptops that can now hold a 70B model, and the reason a phone can run a small one. Shrink the numbers and the hardware requirement shrinks with them.
One clarification, because the terms get muddled. Quantization keeps every one of the model’s weights. It only stores each in fewer bits. That makes it different from two other shrinking methods: pruning, which deletes weights the model can spare, and distillation, which trains a smaller student model to imitate a larger one. Those change what the model is. Quantization changes only how precisely it is written down, which is why the same model at 4-bit still behaves like itself.
Four-bit costs almost nothing in quality
The obvious worry: if you round off every number, doesn’t the model get dumber? A little. Far less than intuition suggests. The reason is that a neural network never relied on those trailing digits in the first place. Its answers come from billions of numbers voting together, and the vote barely moves when each vote is rounded slightly.
The measurements bear this out. A 2026 evaluation of llama.cpp quantization on Llama-3.1-8B found the popular 4-bit setting scored 69.15 on an aggregate benchmark against 69.47 for the full-precision model. That is a loss of under half a percent, in exchange for cutting the model to under a third of its size. In blind chat, nobody notices.

Think of it like measuring tools. A caliper reading to a hundredth of a millimeter and a ruler marked in centimeters will both tell you the same table is about a meter wide. The extra precision is real, and for this job it is wasted. Most of a model’s 16-bit precision is detail the task never asks for, which is why throwing most of it away barely registers.
Below four bits, quality falls off a cliff
The catch is that this generosity runs out. Compression is nearly free down to 4 bits, then the losses stop being rounding errors and start being real. The llama.cpp project’s own measurements show the pattern cleanly on a 7B model: the extra perplexity (a standard error measure, where lower is better) is a negligible +0.05 at 4-bit, but jumps to +0.24 at 3-bit and +0.87 at 2-bit.
That is a penalty roughly five times larger at 3-bit than at 4-bit, and about seventeen times larger at 2-bit. The same evaluation paper watched a 3-bit setting knock the model’s grade-school math score from 77.6 to 68.3, a nine-point drop that would be obvious to any user. The lesson is not “smaller is always fine.” It is that 4-bit sits right at the elbow of the curve: almost all the size savings, almost none of the quality cost.
The floor has been creeping down. A newer generation of 2-bit and 3-bit methods, the ones labeled with an IQ prefix, uses an “importance matrix” that studies which weights the model leans on most and protects them, recovering some of the quality the old low-bit builds threw away. They are genuinely better than the naive versions. The cliff is gentler than it was, but it is still a cliff: for everyday use, 4-bit remains the line most people should not cross.
There is a bonus most people miss. A quantized model is usually not just smaller but faster. Generating text is limited less by raw math than by how many bytes the chip has to haul out of memory for each word. A 4-bit model moves roughly a quarter of the bytes a 16-bit one does, so on the same laptop it tends to produce tokens noticeably quicker. You shrink the file and speed up the replies in the same move.
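A back-of-envelope version of that argument, sketched in Python. It assumes generation is purely memory-bound and that every weight is read exactly once per token, which is a simplification. The 400 GB/s bandwidth is a hypothetical figure chosen for illustration, and the function name is invented; the two file sizes are the ones quoted in this post.

```python
def max_tokens_per_second(model_size_gb, bandwidth_gb_per_s):
    """Upper bound on generation speed if every weight is read once per token."""
    return bandwidth_gb_per_s / model_size_gb


BANDWIDTH = 400.0  # GB/s, a hypothetical figure; check your own hardware

full = max_tokens_per_second(141, BANDWIDTH)   # 16-bit download from this post
quant = max_tokens_per_second(43, BANDWIDTH)   # 4-bit download from this post

print(f"16-bit ceiling: {full:.1f} tokens/s")
print(f" 4-bit ceiling: {quant:.1f} tokens/s")
print(f"speedup ceiling: {quant / full:.1f}x")
```

The absolute numbers move with the bandwidth you plug in, but the ratio depends only on the two file sizes. Real speedups will usually land below this ceiling, since unpacking the compressed weights costs some compute.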
What Q4_K_M actually means
Download a model and you meet a wall of cryptic labels: Q4_K_M, Q5_K_S, Q8_0, IQ2_XXS. They look like part numbers. They are readable once you know the parts.
The number is the headline bit count. Q4 is roughly 4 bits per weight, Q8 is 8, Q2 is 2. The K marks a smarter family called k-quants that do not round every weight equally. They spend more bits on the layers that matter most to the output and fewer on the rest, which is why a “4-bit” model averages closer to 4.5. The trailing S, M, or L is small, medium, or large: how generous that mixing is. Q4_K_M is the medium, balanced 4-bit build, and it is the one most people should take.
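If it helps to see the naming scheme as a rule, here is a small Python sketch that splits a label into the parts described above. It is a reading aid built for this post, not part of llama.cpp or any other tool. The class and function names are hypothetical, and anything the rule of thumb does not cover (the 0 in Q8_0, the XXS in IQ2_XXS) is kept as raw text rather than interpreted.

```python
import re
from dataclasses import dataclass
from typing import Optional

SIZES = {"S": "small", "M": "medium", "L": "large"}


@dataclass
class QuantLabel:
    bits: int            # headline bit count, e.g. 4 for Q4
    iq_family: bool      # True for labels with the IQ prefix
    k_quant: bool        # True when the label carries the K marker
    mix: Optional[str]   # "small", "medium", "large", or None if not S/M/L
    suffix: str          # everything after the bit count, left unparsed


def parse_quant_label(label: str) -> QuantLabel:
    """Split a label such as Q4_K_M into its readable parts."""
    match = re.fullmatch(r"(I?Q)(\d+)_(\w+)", label)
    if match is None:
        raise ValueError(f"unrecognized quant label: {label!r}")
    prefix, bits, suffix = match.groups()
    parts = suffix.split("_")
    return QuantLabel(
        bits=int(bits),
        iq_family=(prefix == "IQ"),
        k_quant=(parts[0] == "K"),
        mix=SIZES.get(parts[-1]),
        suffix=suffix,
    )


for label in ["Q4_K_M", "Q5_K_S", "Q8_0", "IQ2_XXS"]:
    print(label, "->", parse_quant_label(label))
```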
There is real engineering behind the “spend bits where they matter” idea. The influential QLoRA paper introduced a 4-bit format called NF4, designed to be, in its authors’ words, “information-theoretically optimal” for the way neural network weights actually cluster around zero. It let researchers fine-tune a 65B model on a single 48 GB GPU, and their quantized model reached 99.3% of ChatGPT’s score on a common benchmark. Careful 4-bit is not a hack. It is a well-studied sweet spot.
The one decision: which quant to download
Here is the practical payoff. When a model page offers a dozen quant options, you are really choosing one point on the size-versus-quality curve, and the curve tells you where to stand.
In practice you rarely quantize a model yourself. Someone has already done it and published the results as a file, usually in a container format called GGUF, the standard the popular llama.cpp engine reads. A tool like Ollama or LM Studio downloads one of those files for you. That is what all the labels are: pre-built copies at different bit depths, sitting on a download page, waiting for you to pick a row.
The rule of thumb is short. Take Q4_K_M unless you have a specific reason not to. If you have spare memory and want the safest margin, step up to Q5 or Q6. Only drop to 3-bit or 2-bit when a model would not otherwise fit, and know you are trading real quality to do it. This is also the quiet logic behind most “best local model” roundups: the size they quote is almost always the 4-bit one.
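The same rule of thumb, written as a tiny Python helper. This is a sketch of the decision, not a feature of Ollama or LM Studio; the function name is hypothetical. You supply the file sizes from the download page and the memory you can spare for the model, and it treats "fits" as simply "file size no larger than that memory", which ignores the extra room a running model needs. In the example, only the 43 GB figure comes from this post; the other sizes are placeholders.

```python
DEFAULT_ORDER = ["Q4_K_M", "Q3_K_M", "Q2_K"]   # default first, then last resorts
STEP_UP = ["Q6_K", "Q5_K_M"]                   # tried first if you want margin


def pick_quant(sizes_gb, memory_gb, want_margin=False):
    """Return the first label in preference order whose file fits, else None."""
    order = (STEP_UP if want_margin else []) + DEFAULT_ORDER
    for label in order:
        size = sizes_gb.get(label)
        if size is not None and size <= memory_gb:
            return label
    return None


# Placeholder sizes for a 70B model; only Q4_K_M's 43 GB is from this post.
sizes = {"Q6_K": 58, "Q5_K_M": 50, "Q4_K_M": 43, "Q3_K_M": 34, "Q2_K": 26}

print(pick_quant(sizes, memory_gb=48))                    # Q4_K_M
print(pick_quant(sizes, memory_gb=64, want_margin=True))  # Q6_K
print(pick_quant(sizes, memory_gb=36))                    # Q3_K_M
print(pick_quant(sizes, memory_gb=16))                    # None
```

A result of None means no listed build fits, and the honest answer is a smaller model rather than a harsher quant.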

Shrink it, then run it
Quantization is the bridge between the models labs release and the hardware you own. It is why a 141 GB frontier model becomes a 43 GB download that runs on a laptop, and why that laptop’s answers are nearly indistinguishable from the full-precision original. The mechanism is nothing fancier than rounding billions of numbers with care.
Keep three things. A model is a pile of numbers, and quantization stores each in fewer bits. Down to 4 bits the quality cost is a rounding error; below it, the floor drops away. And when you download a model, Q4_K_M is the default for a reason worth trusting. Quantization sits alongside the token as one of the few pieces of plumbing that, once you understand it, makes the whole machine feel less like magic and more like something you can size, budget, and run yourself.


