Local AI · Open Source · Tutorial · June 16, 2026 · 10 min read

Google Gemma, explained: which model is for what

Type “gemma” into your runtime and you get a wall of models: a phone-sized one, a workstation one, and odd cousins. Here’s which one you actually want.

One family, every machine

Apache 2.0

There isn’t one Gemma. There’s one for every machine you own.

In your pocket · Phone · IoT

Gemma 4 E2B / E4B

Runs in 2–3 GB. Hears audio, reads images, fully offline.

On your laptop · 8–16 GB RAM

Gemma 4 12B

The daily workhorse: 256K context, vision, reasoning.

On a workstation · 32 GB+ / GPU

Gemma 4 31B

A top-3 open model on the public arena. The family ceiling.

One family, scaled from a 2 GB phone footprint to a 31B workstation model, and, since April 2026, the whole core lineup under a plain open-source license.

Open your model runtime (Ollama, LM Studio, whatever you use) and type “gemma.” You will not get one result. You will get a wall of them: a 270-million-parameter model that fits on a phone, a 31-billion one that wants a workstation, and a scatter of oddly named cousins (EmbeddingGemma, ShieldGemma, MedGemma, even one trained on dolphin sounds). People keep telling you Gemma is good. Nobody tells you which Gemma.

This is the map. Gemma is Google’s family of open-weight models, built from the same research as its closed Gemini models and handed to you as files you can download and run with the Wi-Fi off. The reason it’s worth a whole post (rather than a line in a roundup) is that it isn’t one model. It spans everything from a 2 GB phone footprint to a model that Google places among the top three open models on the public LMArena leaderboard, plus a set of specialists each built for a single job. By the end you’ll know which one to run for your task and your hardware, and have it running in front of you.

There isn’t one Gemma

Start with the shape of the family, because the names hide it. There are two layers. The core models are the general-purpose chat-and-reasoning models (the ones you’d actually talk to), and they come in a ladder of sizes so that one of them fits whatever machine you own. The latest generation, Gemma 4, landed on April 2, 2026, in five sizes that run from a phone to a desktop.

The second layer is the specialists: models that take a core Gemma and bend it toward one task. EmbeddingGemma turns documents into something searchable. ShieldGemma scores content for safety. MedGemma reads medical scans. They are not chatbots, and treating one as a chatbot is the most common beginner mistake. Get the two layers straight and the wall of names resolves into a simple question: are you talking to the model, or putting it to work on one narrow job?

A hand holding a modern smartphone. — The smallest Gemma 4 runs in about 2 GB, small enough to live on the phone in your hand, hearing audio and reading images with no server in the loop. Photo by Nick Nice on Unsplash.

Gemma 4 dropped the custom license

The biggest thing about Gemma 4 isn’t a benchmark. It’s the license. For every version up to and including Gemma 3, the weights shipped under the Gemma Terms of Use: a custom Google license that allowed commercial use but tied you to a prohibited-use policy the company could revise and enforce. It was more generous than most people assumed, and less clean than a real open-source license. You needed a lawyer to be sure.

Gemma 4 ships under plain Apache 2.0: the same permissive license that many Qwen and Mistral releases use. No monthly-user cap, no acceptable-use policy the vendor polices, no bespoke clauses to read. Run it, fine-tune it, embed it in a product, sell that product. For anyone who was nervous about building a business on a custom license, that one change closes the gap.

Through Gemma 3 · Gemma Terms of Use

A custom Google license. Commercial use allowed, but bound to a prohibited-use policy Google could update and enforce. Lawyers had to read it.

Gemma 4 onward · Apache 2.0

The same permissive license that many Qwen and Mistral releases use. No user cap, no acceptable-use policy the vendor enforces, no custom clauses to interpret. Run it, change it, sell it.

One caveat keeps this honest: only the Gemma 4 core moved. The specialists covered later in this post are still built on Gemma 3 or earlier and still carry the older terms. The headline is real (Gemma’s flagship is now genuinely open-source), but “Gemma is Apache now” is a half-truth until the specialists catch up. The difference between downloading a model and shipping a product on it is exactly the gap the open-models field guide warns about, and it still applies here.

The core models: phone to workstation

Here is the spine of the whole family. Five core sizes, smallest to largest, with the one fact that decides each pick: how much memory it needs to load.

The Gemma 4 core, smallest to largest

| Model | Architecture | Runs in | Inputs | Context | Best for |
| --- | --- | --- | --- | --- | --- |
| E2B | ~2B effective (5B raw) | ~2 GB | text · image · audio · video | 128K | Phones, IoT, always-on assistants |
| E4B | ~4B effective (8B raw) | ~3 GB | text · image · audio · video | 128K | Best on-device model; laptops offline |
| 12B | 12B dense | ~8 GB (Q4) | text · image · video | 256K | The daily laptop workhorse |
| 26B | 26B MoE (~4B active) | ~16 GB (Q4) | text · image · video | 256K | Big-model quality at small-model speed |
| 31B | 31B dense | ~20 GB (Q4) | text · image · video | 256K | Top-3 open model; the ceiling |

“Effective” parameters: the edge models carry more weights than they ever load at once, so an 8B model fits a 3 GB budget. Q4 footprints are approximate; native audio input is limited to E2B and E4B.

The two E-models (E2B and E4B) are the on-device tier, and they’re the cleverest engineering in the lineup. A trick called Per-Layer Embeddings lets an 8-billion-parameter model keep most of its weights in slow storage and load only what each layer needs, so E4B runs in the memory budget of a 4B model. They’re the only Gemmas that take audio input natively: speech recognition and translation, on the phone, offline. Both descend from the Gemma 3n line that started the on-device push in 2025.

The 12B is the one most laptop users should reach for: full vision, a 256K context window, real reasoning, and an ~8 GB footprint at 4-bit. The 26B is a mixture-of-experts build: 26B of weights but only ~4B active per token, so it answers at the speed of a much smaller model. And the 31B is the ceiling: Google places it among the top three open models on the public LMArena leaderboard, and claims it outcompetes models twenty times its size. All of them read images and video; this is multimodal by default, not as an add-on.
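The memory-versus-speed split in the mixture-of-experts build comes down to simple arithmetic. Here is a minimal sketch using the round figures above (26B total, ~4B active). The helper name active_share is made up for illustration, and the percentages are only as precise as those rounded counts.

```shell
# Percentage of a model's weights that take part in each token.
# $1 = active parameters (billions), $2 = total parameters (billions)
active_share() {
  awk -v a="$1" -v t="$2" 'BEGIN { printf "%.0f\n", 100 * a / t }'
}

echo "26B MoE:   $(active_share 4 26)% of weights active per token"
echo "31B dense: $(active_share 31 31)% of weights active per token"
```

All 26B of weights still have to sit in memory, which is why the 26B needs far more room than a 4B model. Only the per-token work shrinks.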

A laptop on a desk showing lines of code. — The 12B is the sweet spot for a working laptop: frontier-adjacent, fully local, and small enough to leave room for everything else you have open. Photo by Christopher Gower on Unsplash.

The specialized Gemmas, each for one job

This is where Gemma is genuinely different from its peers. Qwen and Mistral give you excellent general models; Google ships a small fleet of official specialists under one consistent family. Each takes a Gemma base and tunes it hard for a single task, and using any of them as a chatbot will only disappoint you.

Six Gemmas that do one job each

EmbeddingGemma · 308M · built on Gemma 3

Turns your files into searchable vectors: the quiet engine behind local “chat with your documents.” Runs in ~300 MB.

CodeGemma · 2B · 7B · built on Gemma

In-editor code completion and fill-in-the-middle. The oldest of the bunch, but still the lightweight autocomplete pick.

PaliGemma 2 · 3B · 10B · 28B · built on Gemma 2

Vision-language specialist: captioning, OCR, document Q&A, and fine-tunable for one narrow seeing task.

ShieldGemma 2 · built on Gemma 3

A safety classifier, not a chatbot. Scores another model’s image inputs and outputs against a policy you define.

MedGemma 1.5 · 4B · 27B · built on Gemma 3

Medical text and imaging: radiology, pathology, clinical notes. A research tool, explicitly not a medical device.

FunctionGemma · 270M · built on Gemma 3

A tiny tool-router: takes a request, emits the function call to make. Cheap enough to sit in front of bigger models.

These still ride Gemma 3 (or earlier) and carry the older Gemma Terms of Use, not the Apache 2.0 license of the Gemma 4 core. Read the card before you ship.

Two are worth singling out for builders. EmbeddingGemma is a 308M model that does no chatting at all: it converts text into vectors, the unglamorous step that makes “search your own files” work offline. It tops the public MTEB embedding benchmark among models under 500M and runs in about 300 MB of RAM, which makes private, on-device retrieval practical on a phone. ShieldGemma is the moderator: point it at another model’s inputs and outputs and it flags policy violations, so your safety layer can be local too.
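To see why vectors make files searchable, it helps to look at the comparison step on its own. The sketch below scores how closely two vectors point in the same direction (cosine similarity), which is a common way to rank documents against a query. The three-number vectors are invented for illustration and are not EmbeddingGemma output, whose vectors are far longer. The helper name cosine is hypothetical.

```shell
# Cosine similarity between two space-separated vectors of equal length.
# Close to 1.00 = pointing the same way (similar); close to 0.00 = unrelated.
cosine() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    n = split(a, x, " "); split(b, y, " ")
    for (i = 1; i <= n; i++) {
      dot += x[i] * y[i]; na += x[i] * x[i]; nb += y[i] * y[i]
    }
    printf "%.2f\n", dot / (sqrt(na) * sqrt(nb))
  }'
}

# Toy vectors, invented for this sketch (not real model output).
query="0.9 0.1 0.0"
doc_a="0.8 0.2 0.1"   # points roughly the same way as the query
doc_b="0.0 0.2 0.9"   # points somewhere else entirely

echo "query vs doc_a: $(cosine "$query" "$doc_a")"
echo "query vs doc_b: $(cosine "$query" "$doc_b")"
```

In a real setup the embedding model produces the vectors and a search step like this ranks them: doc_a scores far higher than doc_b, so it would be returned first.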

The roster keeps widening. There’s TranslateGemma for translation across 140-plus languages, FunctionGemma as a 270M tool-router that decides which function to call, and DolphinGemma, a ~400M model trained with marine biologists to model dolphin vocalizations. You will not run that last one, but it makes the point that this is a research family, not a product line, and it tells you how broad the bet is.

Match the model to your memory

The pick is almost entirely a memory question. At the common 4-bit quantization, a model’s weights need roughly half a gigabyte per billion parameters, plus headroom for context and the runtime itself, so the size that fits is the size your RAM allows, not the one at the top of a leaderboard. Four tiers cover most people.

On 8 GB (any modern laptop), run E4B; it’s the best thing that fits in 3 GB and leaves room to work. On 16 GB, step up to the 12B at 4-bit and you have a genuinely capable daily model. At 32 GB, the 26B mixture-of-experts is the smart choice: big-model answers at small-model speed. And at 48 GB or more (a loaded Mac or a desktop with a 24 GB GPU) the 31B breathes. If you want this reasoned out across every open family, not just Gemma, the local-model flowchart maps RAM tiers to picks square by square.
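The half-gigabyte rule and the four tiers can be sketched as two small shell functions. Treat this as a rough aid, not a sizing tool. The function names are made up, the thresholds are the ones in this post, and the gemma4:26b and gemma4:31b tags are an assumption that the naming follows the gemma4:12b and gemma4:e4b tags used here.

```shell
# Rough size of the weights alone at 4-bit: ~0.5 GB per billion parameters.
# Leaves out context cache and runtime overhead, so real usage is higher.
q4_weights_gb() {
  awk -v b="$1" 'BEGIN { printf "%.1f\n", b * 0.5 }'
}

# Map installed RAM (whole GB) to this post's four tiers.
# ASSUMPTION: the 26b and 31b tags mirror the gemma4:12b naming.
pick_gemma() {
  if   [ "$1" -ge 48 ]; then echo "gemma4:31b"
  elif [ "$1" -ge 32 ]; then echo "gemma4:26b"
  elif [ "$1" -ge 16 ]; then echo "gemma4:12b"
  else                       echo "gemma4:e4b"
  fi
}

echo "12B weights at 4-bit: ~$(q4_weights_gb 12) GB"
echo "16 GB machine: $(pick_gemma 16)"
```

The weights-only estimate comes out below the footprints in the table, which presumably include that extra headroom. To act on the pick, something like ollama run "$(pick_gemma 16)" would do, provided the tag exists in your runtime.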

From zero to a running Gemma, four ways

# Ollama, easiest:

ollama run gemma4:12b # ~8 GB download, then chat offline

# Smaller machine? Drop to the edge model:

ollama run gemma4:e4b # runs in ~3 GB

Prefer a window? LM Studio gives you a model browser and a download button. On Apple Silicon, MLX is ~10–20% faster than Ollama, once the launch-week bugs settle. Don’t want local at all? The same weights are a click away in Google AI Studio and on Vertex AI.

The fastest path is Ollama: install it, run one line, and the model downloads once and answers offline forever after. LM Studio wraps the same thing in a window with a download button. On Apple Silicon, the MLX runtime squeezes out another 10–20% but tends to lag a launch by a few buggy weeks. And if you’d rather not run anything locally, the identical weights are hosted in Google AI Studio and on Vertex AI: the weights are the same wherever they run.
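Once a model is running under Ollama, scripts can talk to it over Ollama’s local HTTP API instead of the chat prompt. The sketch below assumes a default install listening on localhost:11434 and a model that has already been pulled. The helper names generate_body and ask_gemma are made up, and the escaping only covers single-line prompts.

```shell
# Build the JSON body for Ollama's /api/generate endpoint.
# Single-line prompts only: escapes backslashes and double quotes.
generate_body() {
  model=$1
  prompt=$(printf '%s' "$2" | sed 's/\\/\\\\/g; s/"/\\"/g')
  printf '{"model":"%s","prompt":"%s","stream":false}\n' "$model" "$prompt"
}

# Send one prompt to a locally running Ollama server (default port assumed).
ask_gemma() {
  generate_body "$1" "$2" |
    curl -s http://localhost:11434/api/generate -d @-
}

# Example (needs Ollama running and the model pulled):
#   ask_gemma gemma4:12b "Summarize Apache 2.0 in one sentence."
```

With streaming turned off, the reply comes back as a single JSON object, which is easier to handle in a script than a token-by-token stream.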

A desktop computer setup with a large monitor on a desk. — The 31B wants 20 GB of memory free: comfortable on a loaded desktop, tight on a thin-and-light. Your RAM, not the benchmark, picks the model. Photo by Jaime Marrero on Unsplash.

Where Gemma isn’t the answer

A single-family guide that pretended Gemma always wins would be useless, so here are the honest edges. The hardest frontier work still belongs to the big closed models and the trillion-parameter open builds that need a server, not a laptop. No 31B local model closes that gap. On a given benchmark, a same-size Qwen or Mistral may edge Gemma out; if the task is narrow and the score matters, test both rather than trusting the family name.

And mind the license seam. The Apache 2.0 headline is true for the Gemma 4 core and only the core. The specialists you might actually deploy (MedGemma in a clinic tool, ShieldGemma in a moderation pipeline) are still Gemma 3 underneath, still under the older terms, and MedGemma adds its own “not a medical device” line you cannot wave away. Owning the weights is half the case for running AI on your own machine; the license is the other half, and it isn’t uniform across this family yet.

Pick one and run it tonight

Strip away the dozen names and the decision is small. Talking to a model? Use the core: E4B on a phone or thin laptop, 12B on a normal one, 26B or 31B if your memory allows. Putting a model to work on one job? Reach for the specialist (EmbeddingGemma to search your files, ShieldGemma to moderate, MedGemma or PaliGemma for their domains) and read its card, because the open license stops at the core.
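The two-layer decision is small enough to write down as a lookup. This is only a sketch of the map above: the task keywords and the function name gemma_for are invented, and the values are family names, not runtime tags.

```shell
# Map a task keyword to the family member this post suggests.
# Keywords are made up for the sketch; results are family names, not tags.
gemma_for() {
  case "$1" in
    chat)     echo "Gemma 4 core" ;;
    search)   echo "EmbeddingGemma" ;;
    moderate) echo "ShieldGemma" ;;
    medical)  echo "MedGemma" ;;
    vision)   echo "PaliGemma" ;;
    *)        echo "unknown task: $1" >&2; return 1 ;;
  esac
}

gemma_for search
```

Anything that maps to a specialist is a cue to check that model’s card for its license before shipping.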

Then stop reading and run one. Install Ollama, type ollama run gemma4:12b, and in the time it takes to make coffee you’ll have a top-tier open model answering on your own machine, no key, no bill, no upload. The roster will keep growing; the two-layer map (core to talk to, specialist to deploy) won’t. That’s the whole of Gemma, and now it’s genuinely yours to keep.

Disclaimer: The license descriptions in this post are plain-English summaries, not legal advice. Model licenses change between releases, and what counts as permitted use depends on your situation. Read the license text that ships with the model, and consult counsel before relying on it in a product. Details reflect sources available as of June 2026.
