Google Gemma, explained: which model is for what
Type “gemma” into your runtime and you get a wall of models — a phone-sized one, a workstation one, and odd cousins. Here’s which one you actually want.
Open your model runtime — Ollama, LM Studio, whatever you use — and type “gemma.” You will not get one result. You will get a wall of them: a 270-million-parameter model that fits on a phone, a 31-billion one that wants a workstation, and a scatter of oddly named cousins — EmbeddingGemma, ShieldGemma, MedGemma, even one trained on dolphin sounds. People keep telling you Gemma is good. Nobody tells you which Gemma.
This is the map. Gemma is Google’s family of open-weight models, built from the same research as its closed Gemini models and handed to you as files you can download and run with the Wi-Fi off. The reason it’s worth a whole post — rather than a line in a roundup — is that it isn’t one model. It spans from a 2 GB phone footprint to a model that ranks in the top three open models in the world, plus a set of specialists each built for a single job. By the end you’ll know which one to run for your task and your hardware, and have it running in front of you.
There isn’t one Gemma
Start with the shape of the family, because the names hide it. There are two layers. The core models are the general-purpose chat-and-reasoning models — the ones you’d actually talk to — and they come in a ladder of sizes so that one of them fits whatever machine you own. The latest generation, Gemma 4, landed on April 2, 2026 in five sizes that run from a phone to a desktop.
The second layer is the specialists: models that take a core Gemma and bend it toward one task. EmbeddingGemma turns documents into something searchable. ShieldGemma scores content for safety. MedGemma reads medical scans. They are not chatbots, and using them as one is the most common beginner mistake. Get the two layers straight and the wall of names resolves into a simple question: are you talking to the model, or putting it to work on one narrow job?

Gemma 4 dropped the custom license
The biggest thing about Gemma 4 isn’t a benchmark. It’s the license. For every version up to and including Gemma 3, the weights shipped under the Gemma Terms of Use — a custom Google license that allowed commercial use but tied you to a prohibited-use policy the company could revise and enforce. It was more generous than most people assumed, and less clean than a real open-source license. You needed a lawyer to be sure.
Gemma 4 ships under plain Apache 2.0, the same permissive terms as Qwen and Mistral. No monthly-user cap, no acceptable-use policy the vendor polices, no bespoke clauses to read. Run it, fine-tune it, embed it in a product, sell that product. For anyone who was nervous about building a business on a custom license, that one change closes the gap.
One caveat keeps this honest: only the Gemma 4 core moved. The specialists in the next section are still built on Gemma 3 and still carry the older terms. The headline is real — Gemma’s flagship is now genuinely open-source — but “Gemma is Apache now” is a half-truth until the specialists catch up. The difference between downloading a model and shipping a product on it is exactly the gap the open-models field guide warns about, and it still applies here.
The core models: phone to workstation
Here is the spine of the whole family. Five core sizes, smallest to largest, with the one fact that decides each pick: how much memory it needs to load.
The two E-models, E2B and E4B, are the on-device tier, and they’re the cleverest engineering in the lineup. A trick called Per-Layer Embeddings lets E4B, an 8-billion-parameter model, keep most of its weights in slow storage and load only what each layer needs, so it runs in the memory budget of a 4B model. The E-models are also the only Gemmas that take audio input natively: speech recognition and translation, on the phone, offline. This tier descends from the Gemma 3n line that started the on-device push in 2025.
The 12B is the one most laptop users should reach for: full vision, a 256K context window, real reasoning, and an ~8 GB footprint at 4-bit. The 26B is a mixture-of-experts build — 26B of weights but only ~4B active per token, so it answers at the speed of a much smaller model. And the 31B is the ceiling: Google places it among the top three open models on the public LMArena leaderboard, and claims it outcompetes models twenty times its size. All of them read images and video; this is multimodal by default, not as an add-on.
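The 4-bit footprints quoted above follow from simple arithmetic: 4 bits is half a byte per weight. The sketch below is a rough estimate of weight memory only. The gap between its 6 GB result for the 12B and the ~8 GB quoted above is presumably runtime overhead such as the context cache, which is an assumption here, not a figure from Google. The function name is illustrative.

```python
def weight_memory_gb(total_params_billion, bits_per_weight=4):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return total_params_billion * bits_per_weight / 8


# Parameter counts as named in the article. The E-models are left out
# because Per-Layer Embeddings change their memory math.
CORE_SIZES_BILLION = {"12B": 12, "26B": 26, "31B": 31}

for name, params in CORE_SIZES_BILLION.items():
    print(f"{name}: ~{weight_memory_gb(params):.1f} GB of weights at 4-bit")

# Mixture-of-experts note: the 26B still has to hold all 26B weights in
# memory. Only the per-token compute scales with the ~4B active
# parameters, which is why it is fast but not small.
```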

The specialized Gemmas, each for one job
This is where Gemma is genuinely different from its peers. Qwen and Mistral give you excellent general models; Google ships a small fleet of official specialists under one consistent family. Each takes a Gemma base and tunes it hard for a single task, and using any of them as a chatbot will only disappoint you.
Two are worth singling out for builders. EmbeddingGemma is a 308M-parameter model that does no chatting at all: it converts text into vectors, the unglamorous step that makes “search your own files” work offline. It tops the public MTEB embedding benchmark among models under 500M parameters and runs in about 300 MB of RAM, which makes private, on-device retrieval practical on a phone. ShieldGemma is the moderator: point it at another model’s inputs and outputs and it flags policy violations, so your safety layer can be local too.
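To make the "text into vectors" step concrete, here is a minimal sketch of what happens after embedding: rank documents by cosine similarity to a query vector. The three-dimensional vectors and file names are made up for illustration. Real embeddings would come from an embedding model and have hundreds of dimensions, and nothing here is EmbeddingGemma's actual API.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the k document names most similar to the query vector."""
    scored = [(cosine(query_vec, vec), name) for name, vec in doc_vecs.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]


# Hypothetical toy vectors standing in for real embeddings.
DOCS = {
    "tax-2024.txt": [0.9, 0.1, 0.0],
    "recipe.txt": [0.0, 0.2, 0.9],
    "invoice.txt": [0.8, 0.3, 0.1],
}

query = [1.0, 0.2, 0.0]  # pretend this is the embedded search query
print(top_k(query, DOCS))
```

In a real setup the vectors would be computed once per file and stored, so a search costs one query embedding plus this comparison.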
The roster keeps widening. There’s TranslateGemma for translation across 140-plus languages, FunctionGemma as a 270M tool-router that decides which function to call, and — the one that makes the point that this is a research family, not a product line — DolphinGemma, a ~400M model trained with marine biologists to model dolphin vocalizations. You will not run that one. But it tells you how broad the bet is.
Match the model to your memory
The pick is almost entirely a memory question. A model needs roughly half a gigabyte per billion parameters at the common 4-bit quantization, so the size that fits is the size your RAM allows — not the one at the top of a leaderboard. Four tiers cover most people.
On 8 GB — any modern laptop — run E4B; it’s the best thing that fits in 3 GB and leaves room to work. On 16 GB, step up to the 12B at 4-bit and you have a genuinely capable daily model. At 32 GB, the 26B mixture-of-experts is the smart choice: big-model answers at small-model speed. And at 48 GB or more — a loaded Mac or a desktop with a 24 GB GPU — the 31B breathes. If you want this reasoned out across every open family, not just Gemma, the local-model flowchart maps RAM tiers to picks square by square.
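The four tiers above reduce to a small lookup. This sketch encodes only the picks named in this section. The function name is illustrative, and it returns None below 8 GB because no pick is given there.

```python
# (minimum RAM in GB, core model), largest tier first, as listed above.
TIERS = [
    (48, "31B"),
    (32, "26B"),
    (16, "12B"),
    (8, "E4B"),
]


def pick_gemma(ram_gb):
    """Return the largest core Gemma the article recommends for this much RAM."""
    for min_ram, model in TIERS:
        if ram_gb >= min_ram:
            return model
    return None  # below 8 GB: no recommendation in this section


for ram in (8, 16, 24, 32, 64):
    print(f"{ram} GB -> {pick_gemma(ram)}")
```

A machine between tiers, say 24 GB, falls back to the tier below it, which matches the advice to pick by what fits rather than by leaderboard rank.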
The fastest path is Ollama: install it, run one line, and the model downloads once and answers offline forever after. LM Studio wraps the same thing in a window with a download button. On Apple Silicon, the MLX runtime squeezes out another 10–20% but tends to lag a launch by a few buggy weeks. And if you’d rather not run anything locally, the identical weights are hosted in Google AI Studio and on Vertex AI — the open license means the model is the same wherever it runs.
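Once Ollama is running, it also answers over a local HTTP API, which is how you would call the model from a script. The sketch below assumes Ollama's default address (localhost, port 11434) and its /api/generate endpoint. The model tag gemma4:12b is the one this article uses in its closing section, so check the tag your runtime actually lists. The helper names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local address


def build_request(prompt, model="gemma4:12b"):
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, model="gemma4:12b", timeout=120):
    """Send the prompt and return the generated text.

    Requires a running Ollama server with the model already pulled.
    """
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


# Example, with the server running:
#   print(ask("Summarize the Apache 2.0 license in two sentences."))
```

Nothing leaves the machine: the request goes to localhost, so this works with the Wi-Fi off once the model has been downloaded.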

Where Gemma isn’t the answer
A single-family guide that pretended Gemma always wins would be useless, so here are the honest edges. The hardest frontier work still belongs to the big closed models and the trillion-parameter open builds that need a server, not a laptop — no 31B local model closes that gap. On a given benchmark, a same-size Qwen or Mistral may edge Gemma out; if the task is narrow and the score matters, test both rather than trusting the family name.
And mind the license seam. The Apache 2.0 headline is true for the Gemma 4 core and only the core. The specialists you might actually deploy — MedGemma in a clinic tool, ShieldGemma in a moderation pipeline — are still Gemma 3 underneath, still under the older terms, and MedGemma adds its own “not a medical device” line you cannot wave away. Owning the weights is half the case for running AI on your own machine; the license is the other half, and it isn’t uniform across this family yet.
Pick one and run it tonight
Strip away the dozen names and the decision is small. Talking to a model? Use the core: E4B on a phone or thin laptop, 12B on a normal one, 26B or 31B if your memory allows. Putting a model to work on one job? Reach for the specialist — EmbeddingGemma to search your files, ShieldGemma to moderate, MedGemma or PaliGemma for their domains — and read its card, because the open license stops at the core.
Then stop reading and run one. Install Ollama, type ollama run gemma4:12b, and in the time it takes to make coffee you’ll have a top-tier open model answering on your own machine, no key, no bill, no upload. The roster will keep growing; the two-layer map — core to talk to, specialist to deploy — won’t. That’s the whole of Gemma, and now it’s genuinely yours to keep.


