Local AI · Open Source · Tutorial · June 29, 2026 · 9 min read

Run Phi locally: Microsoft's small models that beat far bigger ones

A 14B model you can fit on a laptop beat a 671B one at competition math. For Phi, small isn't a compromise. It's the design.

By Atul

AIME 2024 · competition math, % solved · MIT licensed

The smallest model on this chart posts the highest score.

| Model | Size | AIME 2024 |
|---|---|---|
| Phi-4-reasoning-plus | 14B · 9 GB · your laptop | 81.3 |
| DeepSeek-R1 | 671B · 404 GB · datacenter | 78.7 |
| o1-mini | closed · API only | 63.6 |

Phi-4-reasoning-plus is roughly 48x smaller than DeepSeek-R1 and still edges it on this exam. Small, for Phi, isn’t a compromise. It’s the design.

In April 2025, Microsoft published a benchmark table that looked like a typo. A 14-billion-parameter model, small enough to download in under ten gigabytes and run on a laptop, scored higher on a competition math exam than DeepSeek-R1, a model with 671 billion parameters that needs a rack of server GPUs just to load. The small one won, 81.3 to 78.7.

That model is Phi-4-reasoning-plus, part of the Phi-4 generation of Microsoft’s Phi family, and the upset isn’t an accident. Phi is the one major model line built from the start around a single bet: a small model trained on carefully chosen data can do most of a big model’s daily work. If you have a modest laptop (8 to 16 GB of memory, no fancy GPU) and you’ve assumed local AI wasn’t for you, this is the family that says otherwise. Here is the map: which Phi does what, what it costs you in disk and memory, and the one command that runs it.

Phi bet on data quality, not raw size

The story starts in 2023 with a paper whose title was a thesis: “Textbooks Are All You Need.” Microsoft researchers trained a 1.3-billion-parameter coding model, phi-1, on just 7 billion tokens of what they called “textbook-quality” data: cleaned web content plus synthetically generated exercises. It scored 50.6% on the HumanEval coding benchmark, rivaling models many times its size trained on far more data. The lesson: what you feed a model matters as much as how big you make it.

Every Phi model since has run on that idea. While most of the industry chased scale (the biggest models trained on the most tokens), Microsoft chased curation. The result is a family with an unusual shape: the models are small, the training data is hand-picked, and the capability per gigabyte is among the strongest of any open model family. Phi answers a specific question, “what is the most a small model can do?”, rather than “how big can we go?”

The industry now has a name for this category: the small language model, or SLM. A frontier model with hundreds of billions of parameters is a general-purpose engine you rent by the token. An SLM is something else: small enough to live on your own machine, fast enough to answer instantly, cheap enough to run all day. Phi is the family that made the case that the SLM idea was worth taking seriously, and the Phi-4 generation is where the bet paid off. Even the flagship 14B model was trained on 9.8 trillion tokens using roughly 1,900 GPUs over three weeks, then squeezed into a file you can hold on a laptop.

A stack of open books, evoking curated textbook-quality training data. — The founding idea of Phi: feed a small model textbook-quality data instead of the whole internet. Photo by Patrick Tomasso on Unsplash.

That focus has a cost, and it is worth saying up front. A small model curated for reasoning and code knows less about the world’s long tail than a giant trained on the whole web. Ask Phi about an obscure historical figure and it is likelier to stumble than a frontier model. But for the work most people actually do (summarizing, drafting, reasoning through a problem, writing and fixing code) the gap is smaller than the size difference suggests. That is the whole pitch.

A 14B model beat a 671B one at competition math

Back to that benchmark table. Phi-4-reasoning-plus is a 14-billion-parameter model fine-tuned for step-by-step reasoning. On AIME 2024, a hard math-olympiad qualifier, it scored 81.3%. DeepSeek-R1, at 671 billion parameters, scored 78.7%. On AIME 2025 the gap widened: 78.0% for Phi, 70.4% for R1, per the Phi-4-reasoning-plus model card. The Phi model is roughly 48 times smaller and still came out ahead on both.

AIME is the American Invitational Mathematics Examination, a qualifier for the US Math Olympiad, and it is a deliberately brutal test: no multiple choice, no partial credit, just hard problems that reward genuine reasoning over recall. That is the kind of task Phi was tuned to win. The “plus” variant earns its name by spending more tokens at inference time, thinking through a problem in longer chains before it answers, which is how it edges out a model nearly fifty times its size on these exams.

A 14B model against the giants, win and lose

| Benchmark | Phi-4-r-plus · 14B | R1 · 671B | o1-mini | o3-mini |
|---|---|---|---|---|
| AIME 2024 | 81.3 | 78.7 | 63.6 | |
| AIME 2025 | 78.0 | 70.4 | 54.8 | |
| GPQA-Diamond | 68.9 | 73.0 | | 77.7 |
| OmniMath | 81.9 | | — | 74.6 |

The 14B Phi beats the 671B DeepSeek-R1 on both AIME math exams. It trails on GPQA science and OmniMath. Small Phi closes the gap fastest on technique, not on raw breadth. Scores from the Phi-4-reasoning model card.

Be honest about the rest of the table, because the family’s value depends on it. On GPQA-Diamond, a set of PhD-level science questions, Phi-4-reasoning-plus scores 68.9% against R1’s 73.0%, and it trails on the OmniMath set too. The pattern holds across the family: small Phi models close the gap fastest on structured math and reasoning, where curated practice data pays off, and trail on broad knowledge and the hardest frontier problems. They win the tests that reward technique, not the ones that reward sheer memorized breadth.
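The win-and-lose pattern is easy to check by hand. A minimal sketch of the arithmetic follows, using only the Phi-4-reasoning-plus and DeepSeek-R1 scores quoted in this article; the function and variable names are illustrative, not from any library.

```python
# Scores quoted in this article (percent solved):
# (Phi-4-reasoning-plus, DeepSeek-R1) per benchmark.
SCORES = {
    "AIME 2024": (81.3, 78.7),
    "AIME 2025": (78.0, 70.4),
    "GPQA-Diamond": (68.9, 73.0),
}

# Parameter counts in billions, as stated in the article.
PHI_PARAMS_B = 14
R1_PARAMS_B = 671


def margins(scores):
    """Return Phi's lead over R1 per benchmark, in percentage points."""
    return {name: round(phi - r1, 1) for name, (phi, r1) in scores.items()}


def size_ratio(big, small):
    """Return how many times larger the big model is than the small one."""
    return big / small


if __name__ == "__main__":
    for name, delta in margins(SCORES).items():
        verdict = "ahead" if delta > 0 else "behind"
        print(f"{name}: Phi is {abs(delta):.1f} points {verdict}")
    print(f"R1 is about {size_ratio(R1_PARAMS_B, PHI_PARAMS_B):.0f}x larger")
```

Positive margins land on the two AIME exams and the negative one on GPQA-Diamond, which is the technique-versus-breadth split described above.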

The same trick scales down. Phi-4-mini-reasoning, a 3.8-billion-parameter model that downloads in about 3 GB, scored 57.5% on AIME 2024, beating a 7B distilled version of DeepSeek-R1 and approaching OpenAI’s o1-mini, by Microsoft’s own reporting. A model small enough for a phone, doing competition math at a level that would have embarrassed far larger ones two years ago.

The whole lineup is MIT, with no asterisks

Here is where Phi gets easy in a way its rivals don’t. Every model in the current family ships under the MIT license: Phi-4, Phi-4-mini, Phi-4-multimodal, the reasoning variants, and the newest Phi-4-reasoning-vision released in March 2026. MIT is about as permissive as software licenses get. Run it, modify it, fine-tune it, ship it in a commercial product, and sell that product, with no revenue cap, no user ceiling, and no special permission.

That clean story is rarer than it should be. The Llama license adds a 700-million-user ceiling and attribution rules; parts of Mistral’s catalog are research-only or carry a revenue tripwire. Phi has none of it. The contrast matters most for the exact readers Phi targets: a solo developer or a small team can build on any Phi model without a lawyer in the loop.

The practical upshot is that “open” means the same thing here as it does for the everyday software libraries you already depend on. You can fine-tune Phi on your own data, bundle it inside an app you sell, run it behind a paywall, or fork it entirely, and Microsoft has no claim on what you make. For anyone who has read the fine print on other “open” models and come away unsure what they were actually allowed to ship, Phi is the family with nothing to re-read.

There’s a Phi for every job, and most fit a thin laptop

The Phi-4 roster: every row is MIT

| Model | Params | Ctx | What it's for |
|---|---|---|---|
| Phi-4-mini | 3.8B | 128K | Everyday pick for weak hardware: chat, summarize, classify |
| Phi-4 | 14B | 16K | The laptop step-up: strong general reasoning and code |
| Phi-4-multimodal | 5.6B | 128K | Vision + audio + text in one; #1 open speech recognition |
| Phi-4-mini-reasoning | 3.8B | 128K | Tiny model that thinks step by step on math and logic |
| Phi-4-reasoning(-plus) | 14B | 32K | The benchmark winner: deep chain-of-thought reasoning |
| Phi-4-reasoning-vision | 15B | 16K | Newest (Mar 2026): reasoning over images and charts |

Params and context windows from the Hugging Face model cards. The whole current generation ships under the MIT license, including the March 2026 reasoning-vision model.

The roster splits cleanly by job. Phi-4-mini (3.8B) is the everyday pick for weak hardware: a 2.5 GB download with a 128K-token context window, enough to chat, summarize, classify, and call functions on an 8 GB machine. Phi-4 (14B) is the step up, a 9.1 GB download that fits a 16 GB laptop and posts strong general scores (84.8 on MMLU, 80.4 on the MATH benchmark).

Then the specialists. Phi-4-multimodal (5.6B) folds vision and audio into one model; its speech component, just 460 million parameters, topped the OpenASR speech-recognition leaderboard with a 6.14% word error rate, ahead of Whisper-v3. The reasoning variants trade a smaller context window for the step-by-step thinking that won the math benchmarks, and March 2026’s reasoning-vision-15B extends that thinking to images and charts. All MIT, all runnable on hardware most readers own.

A minimal desk with a thin laptop, the kind of modest hardware Phi targets. — Phi shines at the bottom of the hardware ladder: a thin laptop with no dedicated GPU is enough for the mini tier. Photo by Dhony Koswara on Unsplash.

You may already run Phi without knowing it

A close-up of a processor chip on a circuit board, standing in for the NPU. — On Copilot+ PCs, Phi Silica runs on the NPU, a chip built for AI, so the model never touches the cloud. Photo by He Junhui on Unsplash.

If you bought a Windows laptop in the last year and a half, there is a good chance Phi is already on it. Microsoft ships a version called Phi Silica built into Windows 11 on Copilot+ PCs, the class of machines with a neural processing unit (NPU) rated at 40 trillion operations per second or more. Phi Silica runs entirely on that chip, not in the cloud, and powers on-device features like text summarization and rewriting.

The on-device numbers are the interesting part. At launch Microsoft reported Phi Silica generating up to 20 tokens per second with a 230-millisecond time to first token, while the context-processing step drew just 4.8 milliwatt-hours of energy. Running the model on the NPU instead of the CPU cut power use by 56%. That is the difference between an AI feature that drains your battery and one you forget is running.
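To get a feel for what those two figures mean in practice, here is a rough best-case estimate of how long a reply takes. It is a simplification, assuming the reported time to first token plus generation at the reported peak rate for every token; real throughput will often be lower, and the function name is illustrative.

```python
# Figures reported for Phi Silica in this article.
TIME_TO_FIRST_TOKEN_S = 0.230  # seconds before the first token appears
MAX_TOKENS_PER_S = 20.0        # peak generation rate ("up to")


def best_case_seconds(output_tokens: int) -> float:
    """Estimate the fastest possible reply time for a given output length.

    Simplified model (an assumption, not Microsoft's method):
    time to first token plus every token generated at the peak rate.
    """
    if output_tokens < 0:
        raise ValueError("output_tokens must be non-negative")
    return TIME_TO_FIRST_TOKEN_S + output_tokens / MAX_TOKENS_PER_S


if __name__ == "__main__":
    for n in (50, 100, 300):
        print(f"{n} tokens: about {best_case_seconds(n):.1f} s at best")
```

Under these assumptions a 100-token summary takes a little over five seconds, which is why short rewrites and summaries are the natural fit for an on-device model.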

This is the same logic behind Apple keeping the personal layer of Siri on the device. The most private, most always-available AI is the kind that never leaves your hardware, and a small, efficient model is what makes that possible. Phi is Microsoft’s version of that bet, shipping at the scale of every new Windows PC.

Pick by your RAM, then run one command

The fastest way to try Phi yourself is Ollama, the same one-command runner the rest of the local-models guide uses. Install it, then match the model to the memory you have. Download size is a fair proxy for the RAM you will need.

Match the model to the memory you have

| RAM | Command | Size | What you get |
|---|---|---|---|
| 8 GB | ollama run phi4-mini | 2.5 GB | 3.8B with 128K context: the modest-laptop default |
| 8 GB | ollama run phi4-mini-reasoning | 3.2 GB | Tiny model that thinks step by step on math |
| 16 GB | ollama run phi4 | 9.1 GB | The 14B workhorse: strong general assistant |

Sizes are Ollama’s 4-bit builds; phi4-mini needs Ollama 0.5.13 or newer. Pick the largest that fits, then leave a few gigabytes of headroom for context.

For most people the answer is phi4-mini on a light machine or phi4 on a 16 GB one. Want local reasoning for math and logic? phi4-mini-reasoning fits in about 3 GB and thinks step by step. Every command lands on an MIT model, so whatever you build on top is yours to keep. Prefer a graphical app? LM Studio pulls the same models with RAM-aware quantization hints, and on a Mac, Apple’s MLX runs them fastest. Microsoft also ships pre-converted ONNX builds and a small runtime called Foundry Local that auto-detects your NPU or GPU.
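Once a model is running, you can also call it from your own code. The sketch below assumes Ollama is running on the same machine at its default local address (port 11434) and that phi4-mini has already been downloaded with the command above. It uses Ollama's /api/generate endpoint with streaming turned off. The prompt and the helper names are illustrative.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt and return the model's full reply as text."""
    request = build_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    return body["response"]


if __name__ == "__main__":
    try:
        print(ask("phi4-mini", "In one sentence, why do small language models matter?"))
    except OSError as exc:
        # Covers connection failures and HTTP errors (e.g. model not pulled).
        print(f"Could not get a reply from Ollama at {OLLAMA_URL}: {exc}")
```

Swapping "phi4-mini" for "phi4" or "phi4-mini-reasoning" is the only change needed to target the other models in the table.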

One sizing note worth keeping in mind: the number on the model card is the parameter count, but the number that decides whether it runs is the download size, because that is roughly how much memory the weights occupy. A 9.1 GB model wants a machine with comfortably more than 9 GB free, since the operating system, your browser, and the model’s working context all compete for the same RAM. When in doubt, drop a tier. A model that fits and runs smoothly beats a bigger one that swaps to disk and crawls.
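That rule of thumb can be written down as a small helper. This is a sketch, not an official sizing tool: the download sizes are the ones listed in this article, and the 4 GB default headroom is an assumed stand-in for "a few gigabytes" that you should adjust for your own machine.

```python
# Download sizes in GB for Ollama's 4-bit builds, as listed in this article.
MODEL_SIZES_GB = {
    "phi4-mini": 2.5,
    "phi4-mini-reasoning": 3.2,
    "phi4": 9.1,
}


def models_that_fit(total_ram_gb: float, headroom_gb: float = 4.0) -> list[str]:
    """Return models whose weights fit after reserving headroom, largest first.

    headroom_gb is an assumption covering the operating system, other apps,
    and the model's working context.
    """
    budget = total_ram_gb - headroom_gb
    fitting = [(size, name) for name, size in MODEL_SIZES_GB.items() if size <= budget]
    return [name for _size, name in sorted(fitting, reverse=True)]


if __name__ == "__main__":
    for ram in (8, 16):
        print(f"{ram} GB RAM: {models_that_fit(ram)}")
```

With the default headroom, an 8 GB machine gets the two mini models and a 16 GB machine adds phi4, matching the table. An empty list is the signal to drop a tier or free up memory first.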

Where Phi stops, and which one to install

Phi is not a frontier-model replacement, and Microsoft doesn’t pretend otherwise. For the hardest research questions, the broadest world knowledge, or production work that needs the single best answer, a large frontier model still wins. Phi’s multilingual coverage is thin too: only about 8% of Phi-4’s training data was non-English, and the multimodal model’s vision is English-only. Know the ceiling before you build.

But that ceiling sits higher than the size suggests, and the floor is remarkably low. If you take one thing away: you do not need a big GPU to run genuinely useful AI. Install phi4-mini if your laptop is modest, phi4 if it has 16 GB, and phi4-mini-reasoning if you want a tiny model that reasons. All three are one command and a few gigabytes away, all MIT, all yours. The family that bet on small just keeps proving small was enough.
