Meta Llama, explained: which model is for what, and how to run it
It started the open-weights era and still sets the baseline everyone benchmarks against. Now its creator is hinting it might stop. The map, before that happens.
Read any review of an open AI model — Qwen, DeepSeek, Gemma, Mistral — and you’ll find the same yardstick buried in the benchmark tables: how it does against Llama. Meta’s family is the reference point the whole field measures itself against. It is the model that made “download the weights and run them yourself” a normal thing to do.
Here’s the awkward part. On a head-to-head leaderboard today, a same-sized Qwen or DeepSeek often beats Llama outright. And in July 2025, the company that gave the world open weights started hinting it might stop. So the honest question isn’t “is Llama the best open model?” — it usually isn’t anymore. It’s “why is Llama still the answer most teams should pick?” This post is the map: which Llama is for what, how to run one, the license line you can’t cross, and the cloud now hanging over the whole family.
Llama started the open era — and still sets the baseline
To understand why Llama matters more than its benchmark scores suggest, rewind to early 2023. Capable language models were things you rented from OpenAI through an API. Then Meta released Llama 1 to researchers, the weights leaked within a week, and a generation of developers discovered they could run a real model on their own machine. Llama 2 made it official in July 2023 with a commercial license. The open-weights era has a birthday, and that’s it.
Every release since widened the road. Llama 3.1, in July 2024, shipped a 405-billion-parameter model — the first time an open download stood at genuine frontier scale. Llama 3.2 added vision. The sizes most people actually run today — the dense 8B and 70B — come from this 3.x line, and they’re still the workhorses behind a huge share of local AI. When a new lab wants to prove its model is good, it shows you the Llama column. That gravitational pull is the asset, and no benchmark captures it.

A herd of three, and only two of them shipped
The current generation, Llama 4, landed on April 5, 2025 and broke from everything before it. For the first time the models are mixture-of-experts: they hold a large pool of specialist sub-networks but activate only a few per token. They’re also natively multimodal, trained on text and images together rather than having vision bolted on later. Meta announced a herd of three, with animal names.
Scout is the lightweight: 17 billion active parameters, 109 billion total across 16 experts, and a headline 10-million-token context window. It’s built to fit on a single H100 GPU. Maverick is the one Meta calls its “product workhorse”: the same 17B active, but 400B total across 128 experts, tuned for general assistant and chat work. Then there’s Behemoth (288B active, nearly two trillion total), which Meta previewed as “still training” and, as far as the public is concerned, never shipped. It exists mainly as a teacher used to train the smaller two. A herd of three was promised; two arrived.
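To make "a large pool of experts, only a few active per token" concrete, here is a toy sketch of top-k expert routing. It is not Meta's implementation: the experts are plain scalar functions, the router scores are made up, and the names route and moe_layer are hypothetical. The number of experts chosen per token (k) is also an assumption for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    weights = softmax([router_logits[i] for i in chosen])
    return list(zip(chosen, weights))


def moe_layer(x, experts, router_logits, k=1):
    """Run only the chosen experts and mix their outputs by router weight."""
    output = 0.0
    for index, weight in route(router_logits, k):
        output += weight * experts[index](x)
    return output


# Toy setup: 16 experts (Scout's expert count), each just scales its input.
experts = [lambda x, scale=scale: scale * x for scale in range(16)]
router_logits = [0.0] * 16
router_logits[3] = 5.0  # made-up scores: the router prefers expert 3

print(moe_layer(2.0, experts, router_logits, k=1))
```

Only the selected expert does any work for this token, which is why the "active" parameter count is so much smaller than the total. All 16 experts still have to exist in memory, though, because the next token may be routed elsewhere.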
That gap matters, because Llama 4’s reception was muted. The MoE models landed against a wall of strong, smaller, Apache-licensed competitors, and the launch was dogged by questions about how its leaderboard scores were obtained. If you want the quarter-by-quarter standings, the text-model roundup tracks where Llama sits against the field. Short version: respectable, not dominant.
Its real edge: reach, and a safety stack nobody else ships
So why pick it? Because “best on a benchmark” and “safest to build on” are different questions, and Llama wins the second one decisively. It is the most-supported open family on Earth. Every runtime loads it — Ollama, llama.cpp, vLLM, MLX, Transformers. Every cloud hosts it — Bedrock, Together, Fireworks, Azure, Vertex. The largest universe of fine-tunes, LoRA adapters, and how-to guides is built around it. When something breaks at 2 a.m., the answer is already on a forum. That depth is worth more than two points on a coding benchmark to most teams.
It’s also the broadest official lineup under one roof. Text models from a 1B you can run on a phone up to the 405B frontier build; vision models (the 11B and 90B from Llama 3.2) that read images and documents; and a full safety sub-family on top. You don’t have to mix vendors to cover chat, multimodal, and moderation — they’re all Llama, all sharing one license and one set of prompt formats. For a team that wants one family it understands deeply rather than five it knows shallowly, that coherence is the quiet selling point.

The other genuine differentiator is safety. Meta ships a whole second family of guard models that no rival matches. Llama Guard 4 is a classifier that scores another model’s inputs and outputs against 14 safety categories, across text and images. Prompt Guard 2 — in 86M and 22M sizes — is a tiny model that sniffs for prompt injection and jailbreak attempts before they reach your main model. Code Shield checks generated code for insecure patterns. If you’re wrapping a model in a product, that ready-made moderation layer is a real reason to stay inside the Llama ecosystem even when a competitor’s chat model scores higher.
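A rough sketch of how those guard models slot around a chat model: screen the prompt, generate, then screen the reply. Everything here is a stand-in. The function guarded_reply and the callables is_injection, is_unsafe, and generate are hypothetical placeholders for whatever wrappers you build around Prompt Guard, Llama Guard, and your main model; none of them is an actual Meta API.

```python
from typing import Callable

BLOCKED_REPLY = "Sorry, I can't help with that."


def guarded_reply(
    user_message: str,
    generate: Callable[[str], str],        # stand-in for the main chat model
    is_injection: Callable[[str], bool],   # stand-in for a Prompt Guard wrapper
    is_unsafe: Callable[[str], bool],      # stand-in for a Llama Guard wrapper
) -> str:
    # 1. Cheap screen for prompt injection before the main model sees the text.
    if is_injection(user_message):
        return BLOCKED_REPLY
    # 2. Safety check on the input itself.
    if is_unsafe(user_message):
        return BLOCKED_REPLY
    # 3. Only now spend compute on the main model.
    reply = generate(user_message)
    # 4. Safety check on the output before it reaches the user.
    if is_unsafe(reply):
        return BLOCKED_REPLY
    return reply
```

Putting the small injection classifier first is a design choice rather than a requirement: it is the cheapest check, so it rejects obvious attacks before any larger model runs.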
“Open source” is the wrong term: read the license
Here’s where most write-ups get lazy. They call Llama “open source.” It isn’t, at least not in the sense the phrase carries for Linux or, for that matter, for the genuinely Apache-licensed releases from Qwen and Mistral. Llama ships under the Llama Community License, a custom Meta agreement. The Open Source Initiative and the Free Software Foundation have both said plainly that it does not meet their definitions of open source or free software. “Source-available” is the accurate term.
What you can do:
- Use it commercially, in production
- Modify and fine-tune the weights
- Self-host it, offline, on your own hardware
- Redistribute it, with the license attached
The strings attached:
- Over 700M monthly users? Ask Meta’s permission
- An acceptable-use policy you’re bound to
- You must display “Built with Llama”
- You can’t use its output to train a rival model
For most readers this is fine: you can use Llama commercially, fine-tune it, self-host it, and redistribute it for free. But read the strings. There’s a famous clause: if your product has more than 700 million monthly active users, you must request a separate license that Meta grants “in its sole discretion.” There’s an acceptable-use policy you’re contractually bound to. You must display “Built with Llama.” And you can’t use Llama’s outputs to train a non-Llama model — a clause aimed squarely at competitors distilling its knowledge. None of this blocks a normal business. All of it is the difference between downloading a model and building a company on one. The sibling Gemma write-up walks the same license seam from Google’s side.
Match the model to your machine — and mind the size
This is where the herd’s grandeur becomes a problem. Llama 4 Scout, the “small” one, is a 67 GB download that wants a data-center GPU. Even the cleverness of mixture-of-experts doesn’t shrink the memory you need to hold all those weights. For anyone running on a laptop, the practical Llama is still the dense 3.x line — and that’s not a knock, it’s exactly why those models stay at the top of every download chart.
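The arithmetic behind that is simple enough to sketch. The helper below, weights_gb, is a hypothetical name, and it estimates only the raw weights: parameters times bits per weight, divided by eight. Note that it takes the total parameter count, not the active one, because every expert has to be held in memory.

```python
def weights_gb(total_params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return total_params_billion * bits_per_weight / 8


# Total parameter counts in billions, as quoted in this post.
models = {
    "Llama 3.1 8B": 8,
    "Llama 3.3 70B": 70,
    "Llama 4 Scout (109B total, 17B active)": 109,
}

for name, params in models.items():
    four_bit = weights_gb(params, 4)
    sixteen_bit = weights_gb(params, 16)
    print(f"{name}: ~{four_bit:.1f} GB at 4-bit, ~{sixteen_bit:.0f} GB at 16-bit")
```

Treat these as lower bounds. Real quantized files usually come out larger than a flat 4 bits per weight would suggest, since some tensors are typically stored at higher precision, and a running model also needs room for its context cache.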
The decision is mostly a memory question. On a normal 8–16 GB laptop, run llama3.1:8b: it installs in one command and answers offline forever after. With a 24 GB GPU or a loaded Mac, the 70B Llama 3.3 at 4-bit is a genuinely strong local assistant, though at roughly 35 GB of weights it fits entirely in memory only on a high-memory Mac; a 24 GB card has to offload part of the model to system RAM and runs slower for it. The full Llama 4 herd is for servers and serious GPUs. If you want this reasoned across every open family at once, the local-model flowchart maps RAM tiers to picks, and the open-models field guide ranks the alternatives Llama now competes with.
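Once the model is on disk, talking to it from code is a short HTTP call. The sketch below assumes Ollama is installed, its local server is running on the default port 11434, and llama3.1:8b has already been pulled. The helper names build_request and ask are made up for this example; the endpoint and the model, prompt, and stream fields are from Ollama's REST API.

```python
import json
import urllib.request

# Default address of a locally running Ollama server (an assumption about your setup).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "llama3.1:8b") -> urllib.request.Request:
    """Build the POST request for a single, non-streaming completion."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, model: str = "llama3.1:8b") -> str:
    """Send the prompt to the local server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, model)) as response:
        return json.loads(response.read())["response"]


# Example (needs a running Ollama server):
# print(ask("Explain mixture-of-experts in one sentence."))
```

Nothing in this round trip leaves your machine, which is the practical meaning of "answers offline."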
The real question: will there be a Llama 5?
Every other section of this post would be enough for a tidy explainer. This one is why it had to be written now. On July 30, 2025, Mark Zuckerberg published an essay, “Personal Superintelligence,” that quietly reversed the argument he’d made for years. Where he once said open models were safer, he now wrote: “we’ll need to be rigorous about mitigating these risks and careful about what we choose to open source.” For a company whose entire AI brand was openness, that sentence is a swerve.

The context makes the swerve legible. Meta spun up a new Superintelligence Labs unit and went on a hiring spree across the industry. Its capital spending for 2026 is guided at $115–135 billion. A model built with that kind of money, handed free to rivals, no longer explains itself to a finance department. Behemoth never shipped; the next flagship has no confirmed date and no promise it’ll be open. The pattern — beloved AI product, strategic pivot, doors quietly closing — is the one the “AI products are mortal” argument warns about, and it now applies to the family that defined the category.
So, should you build on Llama?
Yes — with your eyes open. For most teams the choice is still easy: if you don’t know which open model to pick, Llama is the low-risk default, because the ecosystem, the tooling, the fine-tunes, and the safety stack outweigh a couple of benchmark points. Run llama3.1:8b on a laptop, step up to 3.3 70B on a workstation, and reach for the Llama 4 herd only when you have the GPUs for it. Add Llama Guard and Prompt Guard if you’re shipping to users. Read the license before you scale.
But notice what carried the weight in that recommendation: none of it was “because it’s the best model.” It was ecosystem and safety-of-default — and ecosystems can be left to age. The smartest move is the one that holds whichever way Meta jumps: own the weights you run, keep your workflow on a model you’ve already downloaded, and treat any single vendor’s roadmap as theirs to change. Llama started the open era and handed you the tools to outlast its own second thoughts. Use them.


