Local AIOpen SourceTutorialJune 25, 202610 min read

Run Qwen locally: one open family for chat, code, vision, and audio

Most labs hand you one good model. Alibaba's Qwen hands you a toolbox — chat, code, vision, audio, search — and almost all of it is a free download.

By Atul

One family. Every job. Mostly one command.

Apache 2.0

Most labs give you one good model. Qwen gives you a whole toolbox.

$ ollama run qwen3# chat, reasoning, tool calls

$ ollama run qwen3-coder# write & fix code

$ ollama run qwen3-vl# read images, screenshots, docs

$ ollama run qwen3-embedding# power search & RAG

# audio → Qwen3-Omni, via Transformers / vLLM# speech in, speech out

Chat, code, vision, and search are each a single install. Speech is the one that still wants a Python runtime — the only seam in an otherwise seamless family.

Pick almost any other open-weight AI lab and you get one thing done well. Meta gives you a strong chat model. DeepSeek gives you a reasoner. Mistral gives you an efficient European workhorse. Alibaba’s Qwen gives you a chat model, a coding model, a vision model, an audio model, and an embedding model — all under one name, one prompt format, and one mostly-permissive license. It is less a model than a toolbox.

That breadth is the case for learning Qwen before any other open family: cover five jobs without mixing five vendors. The catch is that “Qwen” names dozens of models, and typing it into your runtime returns a wall of them. This post is the map — which Qwen variant does which job, how to run each on the machine you already own, the one license line to watch, and the two places the family honestly loses.

Qwen quietly became the most complete open family

Qwen started in 2023 as Alibaba Cloud’s in-house model and spent two years climbing. By 2026 it had become, by several measures, the center of gravity in open AI. The consumer Qwen app passed 234 million users by May 2026, and the open weights spawned over 200,000 community variants on Hugging Face — the largest derivative ecosystem of any non-Llama family.

Numbers aside, the reason to care is coverage. When Alibaba released Qwen3, the base family alone spanned six dense sizes — 0.6B, 1.7B, 4B, 8B, 14B, 32B — plus two mixture-of-experts models, and it spoke 119 languages and dialects. Then the specialists arrived: a coder, a vision model, an omni model that hears and speaks, and an embedding model for search. No other open lab ships that full a set under one roof. If you only have room in your head for one open family, this is the one that pays back the most.

A laptop on a desk displaying lines of programming code. — The whole toolbox installs on the machine in front of you — no API key, no per-token meter. Photo by Arnold Francisca on Unsplash.

One family covers five jobs other labs split up

Here is the whole point of Qwen in a single table. Each row is a different job; each variant is a member of the same family, so they share a prompt style and behave consistently. Cover these five with any other vendor and you’re gluing together four separate model lines.

One family, mapped to five jobs

Job

Variant

Sizes

Best for

Chat & reasoning

Qwen3 (dense + MoE)

0.6B → 32B dense · 30B-A3B, 235B-A22B MoE

General assistant, planning, the everyday workhorse

Code

Qwen3-Coder

30B-A3B · 480B-A35B

Agentic coding, refactors, repo-scale edits

Vision

Qwen3-VL

2B → 235B

Screenshots, documents, OCR, charts, video

Audio & speech

Qwen3-Omni

30B-A3B

Transcription, voice chat, speech generation

Search & RAG

Qwen3-Embedding

0.6B · 4B · 8B

Vector search, reranking, retrieval pipelines

Every row is the same family, the same prompt format, the same license. Covering these five with anyone else means stitching together four vendors.

Start with chat. The base Qwen3 models are the everyday workhorse — and they ship a clever trick: a hybrid “thinking” switch. Add /think to a prompt and the model reasons step by step before answering; add /no_think and it replies fast. One model, two speeds, no model-swap. The flagship 235B-A22B trades blows with DeepSeek-R1 and Gemini 2.5 Pro on Alibaba’s own benchmark tables — though see the case for writing your own eval before trusting any vendor’s scoreboard.

Code is where Qwen made its loudest claim. Qwen3-Coder was trained on 7.5 trillion tokens, 70% of them code, and its top 480B-A35B build set state-of-the-art results among open models on agentic coding — Alibaba puts it level with Claude Sonnet 4 on tool-use benchmarks. It handles 256K tokens of context natively, enough to hold a real repository in view. There’s also a 30B version that fits a laptop, which matters more for most readers than the giant. See the AI coding tool map for where a local coder sits next to Cursor and Claude Code.

Most of it is Apache 2.0 — and here’s the catch

License is where the open-model world hides its asterisks, so be precise. The open-weight Qwen models — the entire Qwen3 base family, plus Coder, VL, Omni, and Embedding — ship under Apache 2.0. That is the real thing: no 700-million-user ceiling like Llama’s Community License, no “Built with” attribution requirement, no acceptable-use contract bolted on. Download it, fine-tune it, ship it in a paid product, redistribute it — for free.

Open weights · Apache 2.0

Qwen3 dense (0.6B–32B) and MoE (30B / 235B)
Qwen3-Coder, Qwen3-VL, Qwen3-Omni, Qwen3-Embedding
Use commercially, fine-tune, self-host, redistribute
No user cap, no attribution clause, no usage policy

The catch: the flagship tier

Qwen-Max / Qwen3-Max / the “-Plus” builds
Hosted API only — weights are not released
You rent it; you can’t download or self-host it
Not what this post is about — skip it for local work

The catch is a tier you can’t run at all. Alibaba keeps its very top models — the Qwen-Max and “-Plus” flagships — as a hosted API, with the weights never released. They’re proprietary, and renting them is the opposite of the independence this post is about. The rule of thumb: if a Qwen model has a parameter count and a Hugging Face page, it’s yours to keep; if it’s only reachable through a hosted endpoint, it isn’t. For local work you simply ignore the Max tier — the open lineup already covers every job in that table. It’s the cleanest license story of any major family, with one clearly fenced exception.

Close-up of a consumer GPU graphics card. — Apache weights mean the only ceiling is your hardware — not a usage clause. Photo by Christian Wiediger on Unsplash.

Match the variant — and its size — to your machine

The wall of models is intimidating until you sort it by memory. Almost everything here installs with one Ollama command, and the download size is a fair proxy for the RAM you’ll need. The pattern is the same as the broader local-models field guide: pick the biggest model that fits, leave headroom for context.

What fits on what you own

# 8–16 GB laptop — the daily driver:

ollama run qwen3:8b # 5.2 GB · chat, reasoning, tools

# 24–32 GB Mac or GPU — the efficient powerhouse:

ollama run qwen3:30b # 19 GB MoE · 3B active, runs fast

ollama run qwen3-vl:8b # 6.1 GB · reads images & docs

# Workstation / server — the heavy hitters:

ollama run qwen3-coder:480b # 290 GB · server territory

The 30B mixture-of-experts model is the sweet spot: it holds 30B of knowledge but fires only ~3B per token, so it answers at small-model speed while fitting a loaded laptop.

On a normal 8–16 GB laptop, qwen3:8b is the daily driver — it answers offline forever after one download. Step up to 24–32 GB of memory and the qwen3:30b mixture-of-experts model is the standout: it carries 30B of weights but activates only about 3B per token, so it reasons like a big model at the speed of a small one. Pair it with qwen3-vl:8b for images and you have a private, multimodal assistant on a single Mac.

Two variants break the one-liner pattern, and it’s worth being honest about both. Qwen3-Coder’s 480B build is a 290 GB download — genuinely server hardware; reach for the 30B version on a laptop. And Qwen3-Omni, the audio model, isn’t a clean Ollama pull yet — you run it through Hugging Face Transformers or vLLM in Python. That’s the one rough edge in an otherwise frictionless family.

A studio condenser microphone beside a pop filter. — Qwen3-Omni hears and speaks — 19 spoken input languages, 10 output — but it’s the one variant that still wants a Python runtime. Photo by Leo Wieling on Unsplash.

On vision: Qwen3-VL runs from a 1.9 GB 2B model up to a 143 GB giant, all with a 256K context window. It reads screenshots, converts design mockups to HTML, does OCR in 32 languages, and can follow video up to two hours long. On audio, Qwen3-Omni transcribes and converses across 19 spoken-input languages and replies in real-time speech, with quality Alibaba benchmarks against Gemini 2.5 Pro. On search, Qwen3-Embedding topped the MTEB multilingual leaderboard at launch and comes in a 639 MB 0.6B size — small enough to run alongside your main model for retrieval.

Where Qwen actually loses

Enthusiasm needs a counterweight, so here are the two honest gaps. The first is ecosystem depth. Llama is still the model every tutorial assumes, every fine-tuning script defaults to, and every cloud wires up first. When a deployment breaks at 2 a.m., the Llama answer is already on a forum; the Qwen one might not be. Qwen’s community is huge and growing, but Llama’s gravity — the reason it remains the safe default — is real, and Qwen hasn’t fully matched it.

The second is the reasoning crown. For the hardest chains of math and logic, DeepSeek’s R-series and its distilled small reasoners are often the sharper open pick, and they ship under an even cleaner MIT license. Qwen3’s thinking mode is strong and far more convenient — one switch, no second model — but “strong and convenient” isn’t “best at the frontier of reasoning.” If a single hard task is your whole job, benchmark Qwen against a dedicated reasoner before committing.

One more thing to set expectations: the family moves fast. Qwen3 gave way to a 3.5 line in early 2026 and a 3.6 line that spring, each still Apache 2.0. That cadence is mostly a gift — your ollama run qwen3 habit keeps working as the weights underneath improve — but it means any specific size or score in this post is a snapshot. Check the current roster before you build something load-bearing on it.

So should you standardize on Qwen?

For most people building with open models in 2026, yes — make Qwen the family you learn first. The logic is simple: one name covers chat, code, vision, audio, and search; almost all of it is genuinely Apache-licensed; and four of those five jobs are a single install command. That combination of breadth and permissiveness is unmatched, and it spares you the tax of learning four vendors’ quirks.

Concretely: put qwen3:8b on your laptop today, step up to the qwen3:30b MoE if you have the memory, add qwen3-vl when you need to read images and qwen3-embedding when you build search. Keep DeepSeek bookmarked for the hardest reasoning and Llama in mind for when ecosystem depth matters more than raw quality. Then notice what you’ve actually done: assembled a private, offline, multi-skill AI stack you own outright — the kind of independence the “AI products are mortal” argument keeps insisting you’ll be glad to have. Qwen just makes it the easy choice instead of the principled one.

Run Qwen locally: one open family for chat, code, vision, and audio

Qwen quietly became the most complete open family

One family covers five jobs other labs split up

Most of it is Apache 2.0 — and here’s the catch

Match the variant — and its size — to your machine

Where Qwen actually loses

So should you standardize on Qwen?

Meta Llama, explained: which model is for what, and how to run it

What is an embedding? How AI turns meaning into numbers

What is prompt injection? The flaw every AI agent ships with

One-time payment. Yours forever.