Local AI · Open Source · Tutorial · June 24, 2026 · 10 min read

Meta Llama, explained: which model is for what, and how to run it

It started the open-weights era and still sets the baseline everyone benchmarks against. Now its creator is hinting it might stop. The map, before that happens.

By Atul

The open-weights era, by Llama release

2023 → ?

Llama didn’t join the open era. It started it — and may end it.

Feb 2023: Llama 1. Weights leak, research-only. The era starts by accident.

Jul 2023: Llama 2. First commercial license. Open weights go mainstream.

Jul 2024: Llama 3.1 405B. The first open model at frontier scale.

Apr 2025: Llama 4. The herd goes mixture-of-experts. Reception: mixed.

Jul 2025: The essay. “Careful about what we choose to open source.”

2026: Llama 5? No date. No confirmation it’s open at all.

Every open model you’ve read about benchmarks itself against this family. The last two milestones are the reason this post needed writing now.

Read any review of an open AI model — Qwen, DeepSeek, Gemma, Mistral — and you’ll find the same yardstick buried in the benchmark tables: how it does against Llama. Meta’s family is the reference point the whole field measures itself against. It is the model that made “download the weights and run them yourself” a normal thing to do.

Here’s the awkward part. On a head-to-head leaderboard today, a same-sized Qwen or DeepSeek often beats Llama outright. And in July 2025, the company that gave the world open weights started hinting it might stop. So the honest question isn’t “is Llama the best open model?” — it usually isn’t anymore. It’s “why is Llama still the answer most teams should pick?” This post is the map: which Llama is for what, how to run one, the license line you can’t cross, and the cloud now hanging over the whole family.

Llama started the open era — and still sets the baseline

To understand why Llama matters more than its benchmark scores suggest, rewind to early 2023. Capable language models were things you rented from OpenAI through an API. Then Meta released Llama 1 to researchers, the weights leaked within a week, and a generation of developers discovered they could run a real model on their own machine. Llama 2 made it official in July 2023 with a commercial license. The open-weights era has a birthday, and that’s it.

Every release since widened the road. Llama 3.1, in July 2024, shipped a 405-billion-parameter model — the first time an open download stood at genuine frontier scale. Llama 3.2 added vision. The sizes most people actually run today — the dense 8B and 70B — come from this 3.x line, and they’re still the workhorses behind a huge share of local AI. When a new lab wants to prove its model is good, it shows you the Llama column. That gravitational pull is the asset, and no benchmark captures it.

[Image: Two brown llamas standing in a field.] The name stuck because the thing it named became the default. Photo by Josiah Farrow on Unsplash.

A herd of three, and only two of them shipped

The current generation, Llama 4, landed on April 5, 2025 and broke from everything before it. For the first time the models are mixture-of-experts — they hold a large pool of specialist sub-networks but activate only a few per token — and they’re natively multimodal, trained on text and images together rather than bolting vision on later. Meta announced a herd of three, with animal names.

The Llama 4 herd, April 2025

| Model | Active | Total · experts | Ctx | Status |
| --- | --- | --- | --- | --- |
| Scout | 17B active | 109B · 16 experts | 10M | Shipped; fits one H100 |
| Maverick | 17B active | 400B · 128 experts | | Shipped; the workhorse |
| Behemoth | 288B active | ~2T · 16 experts | n/a | Never shipped; teacher only |

Mixture-of-experts: a model holds many “experts” but fires only a slice per token, so Maverick runs at 17B speed while carrying 400B of weights. Behemoth was previewed “still training” and was last reported shelved.

Scout is the lightweight: 17 billion active parameters, 109 billion total across 16 experts, and a headline 10-million-token context window. It is built to fit on a single H100 GPU with 4-bit quantization. Maverick is the one Meta calls its “product workhorse”: the same 17B active, but 400B total across 128 experts, tuned for general assistant and chat work. Then there’s Behemoth (288B active, nearly two trillion total), which Meta previewed as “still training” and, as far as the public is concerned, never shipped. It exists mainly as a teacher used to train the smaller two. A herd of three was promised; two arrived.
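To make the active-versus-total distinction concrete, here is a toy sketch of top-k expert routing. It is not Meta's implementation: the gate scores are random, each "expert" is a one-line function, and the names `top_k_route` and `moe_layer` are invented for illustration. The only figures taken from the text are the active and total parameter counts.

```python
import math
import random


def top_k_route(gate_scores, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    # Keep only the k experts with the largest gate scores.
    chosen = sorted(range(len(gate_scores)),
                    key=lambda i: gate_scores[i], reverse=True)[:k]
    # Softmax over the chosen scores only, so the mixing weights sum to 1.
    peak = max(gate_scores[i] for i in chosen)
    exps = [math.exp(gate_scores[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


def moe_layer(x, experts, gate_scores, k):
    """Run only the routed experts and mix their outputs."""
    routed = top_k_route(gate_scores, k)
    output = sum(weight * experts[i](x) for i, weight in routed)
    return output, [i for i, _ in routed]


# Toy setup: 16 stored "experts", each scaling its input by a different factor.
experts = [lambda x, s=s: x * s for s in range(1, 17)]
random.seed(0)
gate_scores = [random.gauss(0.0, 1.0) for _ in experts]

output, used = moe_layer(2.0, experts, gate_scores, k=1)
print(f"experts stored: {len(experts)}, experts run for this token: {len(used)}")

# Parameter counts quoted in this article, in billions.
for name, active, total in [("Scout", 17, 109), ("Maverick", 17, 400)]:
    print(f"{name}: {active / total:.1%} of weights active per token")
```

All 16 experts have to sit in memory, but only the routed ones do any work for a given token. That is the trade the herd makes: compute per token tracks the active count, while storage tracks the total.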

That gap matters, because Llama 4’s reception was muted. The MoE models landed against a wall of strong, smaller, Apache-licensed competitors, and the launch was dogged by questions about its leaderboard scores once it emerged that the Maverick variant submitted to LMArena was an experimental chat-tuned build rather than the released model. If you want the quarter-by-quarter standings, the text-model roundup tracks where Llama sits against the field. Short version: respectable, not dominant.

Its real edge: reach, and a safety stack nobody else ships

So why pick it? Because “best on a benchmark” and “safest to build on” are different questions, and Llama wins the second one decisively. It is the most-supported open family on Earth. Every runtime loads it — Ollama, llama.cpp, vLLM, MLX, Transformers. Every cloud hosts it — Bedrock, Together, Fireworks, Azure, Vertex. The largest universe of fine-tunes, LoRA adapters, and how-to guides is built around it. When something breaks at 2 a.m., the answer is already on a forum. That depth is worth more than two points on a coding benchmark to most teams.

It’s also the broadest official lineup under one roof. Text models from a 1B you can run on a phone up to the 405B frontier build; vision models (the 11B and 90B from Llama 3.2) that read images and documents; and a full safety sub-family on top. You don’t have to mix vendors to cover chat, multimodal, and moderation — they’re all Llama, all sharing one license and one set of prompt formats. For a team that wants one family it understands deeply rather than five it knows shallowly, that coherence is the quiet selling point.

[Image: A dense wall of network cables and servers in a data center.] Llama is the model already wired into every runtime and every cloud: the boring infrastructure advantage benchmarks never show. Photo by Taylor Vick on Unsplash.

The other genuine differentiator is safety. Meta ships a whole second family of guard models that no rival matches. Llama Guard 4 is a classifier that scores another model’s inputs and outputs against 14 safety categories, across text and images. Prompt Guard 2 — in 86M and 22M sizes — is a tiny model that sniffs for prompt injection and jailbreak attempts before they reach your main model. Code Shield checks generated code for insecure patterns. If you’re wrapping a model in a product, that ready-made moderation layer is a real reason to stay inside the Llama ecosystem even when a competitor’s chat model scores higher.
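The way those pieces fit around a chat model can be sketched in a few lines. Everything below is a hypothetical stand-in: the three `toy_*` functions use keyword matching purely so the example runs, and in a real product they would be replaced by calls to Prompt Guard, Llama Guard and your main model. The `Verdict` type and `guarded_reply` are names invented here, not part of any Meta API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    safe: bool
    reason: str = ""


def toy_injection_check(text: str) -> Verdict:
    # Placeholder for a Prompt Guard style classifier (hypothetical logic).
    flagged = "ignore previous instructions" in text.lower()
    return Verdict(not flagged, "possible prompt injection" if flagged else "")


def toy_content_check(text: str) -> Verdict:
    # Placeholder for a Llama Guard style classifier (hypothetical logic).
    flagged = "build a weapon" in text.lower()
    return Verdict(not flagged, "unsafe content" if flagged else "")


def toy_chat_model(prompt: str) -> str:
    # Placeholder for the main model call.
    return f"You asked: {prompt}"


def guarded_reply(
    prompt: str,
    injection_check: Callable[[str], Verdict],
    content_check: Callable[[str], Verdict],
    chat_model: Callable[[str], str],
) -> str:
    # 1. Screen the raw input before the main model ever sees it.
    for check in (injection_check, content_check):
        verdict = check(prompt)
        if not verdict.safe:
            return f"[input blocked: {verdict.reason}]"
    # 2. Generate, then screen the output before it reaches the user.
    reply = chat_model(prompt)
    verdict = content_check(reply)
    if not verdict.safe:
        return f"[output blocked: {verdict.reason}]"
    return reply


print(guarded_reply("What is a llama?",
                    toy_injection_check, toy_content_check, toy_chat_model))
```

The design point is that the small, cheap classifier runs first, so an obvious injection attempt never costs a call to the large model.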

“Open source” is the wrong term: read the license

Here’s where most write-ups get lazy. They call Llama “open source.” It isn’t — not in the way that phrase means for Linux or, for that matter, for Qwen and Mistral’s genuinely Apache-licensed releases. Llama ships under the Llama Community License, a custom Meta agreement. The Open Source Initiative and the Free Software Foundation have both said plainly that it does not meet the definition. “Source-available” is the accurate term.

What you can do, free

Use it commercially, in production
Modify and fine-tune the weights
Self-host it, offline, on your own hardware
Redistribute it, with the license attached

The strings attached

Over 700M monthly users? Ask Meta’s permission
An acceptable-use policy you’re bound to
You must display “Built with Llama”
Train a model on its output? Meta’s terms follow you

For most readers this is fine: you can use Llama commercially, fine-tune it, self-host it, and redistribute it for free. But read the strings. There’s a famous clause: if your product has more than 700 million monthly active users, you must request a separate license that Meta grants “in its sole discretion.” There’s an acceptable-use policy you’re contractually bound to. You must display “Built with Llama.” And the license reaches into what you do with Llama’s outputs. The Llama 2 and Llama 3 licenses barred using them to improve any non-Llama model, a clause aimed squarely at competitors distilling its knowledge; the licenses from Llama 3.1 onward allow it, but any model you distribute that was trained on those outputs must carry “Llama” at the start of its name. None of this blocks a normal business. All of it is the difference between downloading a model and building a company on one. The sibling Gemma write-up walks the same license seam from Google’s side.

Match the model to your machine — and mind the size

This is where the herd’s grandeur becomes a problem. Llama 4 Scout, the “small” one, is a 67 GB download that wants a data-center GPU. Even the cleverness of mixture-of-experts doesn’t shrink the memory you need to hold all those weights. For anyone running on a laptop, the practical Llama is still the dense 3.x line — and that’s not a knock, it’s exactly why those models stay at the top of every download chart.
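The arithmetic behind that is simple enough to sketch. `weight_size_gb` is a hypothetical helper that counts the weights only; the KV cache, activations and runtime overhead come on top. Real quantized downloads also tend to run larger than a flat 4 bits per weight would suggest, because quantization formats store scaling factors and often keep some tensors at higher precision.

```python
def weight_size_gb(total_params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    # params * bits / 8 gives bytes; the two factors of 1e9 cancel out.
    return total_params_billion * bits_per_weight / 8


# Total parameter counts in billions. For a mixture-of-experts model the
# total count is what must be held in memory, not the active count.
models = {
    "Llama 3.1 8B (dense)": 8,
    "Llama 3.3 70B (dense)": 70,
    "Llama 4 Scout (109B total, 17B active)": 109,
}

for name, params in models.items():
    at_16 = weight_size_gb(params, 16)
    at_4 = weight_size_gb(params, 4)
    print(f"{name}: ~{at_16:.0f} GB at 16-bit, ~{at_4:.1f} GB at a flat 4-bit")
```

Swapping 109 for 17 in the Scout row shows what the model would need if only the active parameters had to be stored, which is the saving mixture-of-experts does not give you.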

What actually fits on what you own

# On a laptop:

ollama run llama3.1:8b # ~5 GB, runs on 8 GB of RAM

# On a workstation / big GPU:

ollama run llama3.3:70b # ~40 GB at 4-bit

# The Llama 4 herd — server territory:

ollama run llama4:scout # 67 GB download

The headline models are mixture-of-experts giants. For a laptop you still reach for the dense Llama 3.x models — which is why they remain the most-downloaded Llamas on every runtime.
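Once a model is pulled, Ollama also serves it over a local HTTP API, by default on port 11434. Here is a minimal sketch using only the Python standard library, assuming a default install with the server running and `llama3.1:8b` already downloaded. `build_request` and `generate` are names made up for this example.

```python
import json
import urllib.request

# Ollama's default local endpoint; adjust if your install uses another host or port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for a locally served model."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt and return the generated text. Needs a running server."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

With the server up, `generate("llama3.1:8b", "Explain mixture-of-experts in one sentence.")` should return the model's answer as a string. Nothing leaves the machine, which is the point of running locally.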

The decision is mostly a memory question. On a normal 8–16 GB laptop, run llama3.1:8b: it installs in one command and answers offline forever after. The 70B Llama 3.3 at 4-bit is a genuinely strong local assistant, but its roughly 40 GB of weights will not fit on a single 24 GB GPU without offloading layers to much slower system RAM, so plan on about 48 GB of GPU memory or a Mac with 64 GB of unified memory. The full Llama 4 herd is for servers and serious GPUs. If you want this reasoned across every open family at once, the local-model flowchart maps RAM tiers to picks, and the open-models field guide ranks the alternatives Llama now competes with.
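As a rough sketch, that memory question can be written as a lookup. The download sizes are the ones quoted in this article; the 1.2 headroom factor is an assumption to leave room for the KV cache and runtime, and `pick_llama` is a hypothetical helper. It only counts a model as fitting when the whole thing sits in memory, with no offloading, and it returns the largest such model, which is not always the best one for a given task.

```python
from typing import Optional

# (Ollama tag, approximate download size in GB), largest first.
CANDIDATES = [
    ("llama4:scout", 67),
    ("llama3.3:70b", 40),
    ("llama3.1:8b", 5),
]

# Assumed safety margin on top of the weights; tune for your context length.
HEADROOM = 1.2


def pick_llama(memory_gb: float) -> Optional[str]:
    """Return the largest listed model that fits entirely in memory, or None."""
    for tag, size_gb in CANDIDATES:
        if memory_gb >= size_gb * HEADROOM:
            return tag
    return None


for memory in (8, 24, 64, 96):
    print(f"{memory} GB -> {pick_llama(memory)}")
```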

The real question: will there be a Llama 5?

Every other section of this post would be enough for a tidy explainer. This one is why it had to be written now. On July 30, 2025, Mark Zuckerberg published an essay, “Personal Superintelligence,” that quietly reversed the argument he’d made for years. Where he once said open models were safer, he now wrote: “we’ll need to be rigorous about mitigating these risks and careful about what we choose to open source.” For a company whose entire AI brand was openness, that sentence is a swerve.

[Image: A path through a forest splitting into two directions.] The fork in Meta’s strategy: keep handing frontier models to competitors for free, or keep the best ones in-house. Photo by Jens Lelie on Unsplash.

The context makes the swerve legible. Meta spun up a new Superintelligence Labs unit and went on a hiring spree across the industry. Its capital spending for 2026 is guided at $115–135 billion. A model built with that kind of money, handed free to rivals, no longer explains itself to a finance department. Behemoth never shipped; the next flagship has no confirmed date and no promise it’ll be open. The pattern — beloved AI product, strategic pivot, doors quietly closing — is the one the “AI products are mortal” argument warns about, and it now applies to the family that defined the category.

So, should you build on Llama?

Yes — with your eyes open. For most teams the choice is still easy: if you don’t know which open model to pick, Llama is the low-risk default, because the ecosystem, the tooling, the fine-tunes, and the safety stack outweigh a couple of benchmark points. Run llama3.1:8b on a laptop, step up to 3.3 70B on a workstation, and reach for the Llama 4 herd only when you have the GPUs for it. Add Llama Guard and Prompt Guard if you’re shipping to users. Read the license before you scale.

But notice what carried the weight in that recommendation: none of it was “because it’s the best model.” It was ecosystem and safety-of-default — and ecosystems can be left to age. The smartest move is the one that holds whichever way Meta jumps: own the weights you run, keep your workflow on a model you’ve already downloaded, and treat any single vendor’s roadmap as theirs to change. Llama started the open era and handed you the tools to outlast its own second thoughts. Use them.
