CSuite
ListicleLocal AIWindowsJuly 2, 202610 min read

The best on-device AI apps for Windows (2026)

Your PC hides an idle NPU or a gaming GPU. Which one you have decides your best local AI app. The picks, split by silicon.

By Atul
On-device AI · Windows 11
Three machines · one list
Your PC has an NPU or a gaming GPU sitting idle. Which one you have decides your best local AI app.
Copilot+ NPU laptop
40+ TOPS · Snapdragon / Core Ultra / Ryzen AI
LM Studio
Sips battery. Great built-ins, thin for your own LLMs.
NVIDIA RTX tower
CUDA · 8–24 GB VRAM
ComfyUI + Ollama
The fastest local anything. Also the loudest and thirstiest.
AMD / Intel / CPU
ROCm · Vulkan · DirectML
Amuse + LM Studio
It all works. It just takes more setup and homework.
Same jobs, different silicon. The picks below name the hardware each one assumes.

Open Task Manager on a laptop bought in the last year, click the Performance tab, and look for a row that did not exist three years ago: NPU. On a gaming tower, the interesting hardware is the graphics card, with 8 to 24 GB of memory that spends most of its life idle between frames. Both chips can run capable AI models without touching the internet. Almost nobody uses them for it.

That is the gap this post is aimed at. There is a good, growing set of Windows apps that run AI on your own machine: private by default, free or close to it, and available on a plane or in a hospital with the Wi-Fi off. The catch is that Windows is not one machine. Its local-AI story splits three ways, and the app that is perfect on one setup is useless on another.

This is the Windows companion to the Mac list. Same job-by-job format, but tuned to a hardware reality the Mac simply does not have: a thin Copilot+ laptop with a neural chip, an NVIDIA tower where CUDA rules, a pile of AMD and Intel and CPU-only machines in between, and a quiet x86-versus-ARM split that decides whether an app will even install. Every pick below names the hardware it assumes.

A GeForce RTX graphics card mounted inside a desktop PC.
On Windows, the fastest local AI still runs on an NVIDIA GPU and its VRAM. Photo by Christian Wiediger on Unsplash.

Your hardware picks the app before you do

On a Mac, the only spec that matters for local AI is how much unified memory you bought. On Windows, you have to answer a harder question first: what is doing the math? There are four possible answers, and they rank cleanly by how much of the ecosystem targets them.

The NVIDIA RTX GPUis the throne. CUDA has been the default target for AI code for a decade, so nearly every local tool runs best, or only, on NVIDIA. If your machine has an RTX card with 8 GB+ of VRAM, you have the fastest local-AI setup Windows offers, and the widest choice of apps.

The NPUis the newcomer. Microsoft’s Copilot+ PC badge requires a neural chip that hits 40 trillion operations per second, plus 16 GB of RAM. Three silicon lines clear that bar: Qualcomm’s Snapdragon X Elite at 45 TOPS, Intel’s Core Ultra 200V at 48, and AMD’s Ryzen AI 300 with an XDNA 2 engine at 50 TOPS. The NPU is superb at sipping power through the AI features baked into Windows. It is not yet where you run your own chat model: almost no third-party app targets it directly.

Then come AMD Radeon GPUs (compute through ROCm, Vulkan, or DirectML, depending on the card and the app) and, at the floor, CPU-only machines, where the llama.cpp engine inside every local app still gives you a usable 7B model. Here is how each chip actually gets used.

Four accelerators · how a local app actually reaches each one
Chip
What it is
How apps use it
The catch
NPU (Copilot+)
A 40+ TOPS AI chip that barely touches the battery
Windows ML and ONNX Runtime; a handful of built-in features
Almost no third-party chat app targets it yet
NVIDIA RTX GPU
CUDA cores plus 8–24 GB of fast VRAM
First-class in nearly every tool: CUDA, TensorRT-LLM
Power-hungry; the model has to fit in VRAM
AMD Radeon GPU
Compute via ROCm, Vulkan, or DirectML
Ollama (ROCm/Vulkan), Amuse and ComfyUI (DirectML)
ROCm on Windows is patchy—check your exact card
CPU only
The processor you already have
The llama.cpp fallback baked into every local app
Fine for a 7B model; slow once you go bigger

One more Windows-only wrinkle: the x86-versus-ARM64 split. A Snapdragon Copilot+ laptop runs a different instruction set than a normal PC. Windows papers over it with an emulator called Prism, which is genuinely good now and even exposes modern instructions like AVX2 to emulated apps. But emulation costs battery and speed, and it cannot load the kernel-mode drivers some GPU tools need. If you own a Snapdragon machine, an ARM64-native build always beats an emulated x64 one. Watch for it.

Chat: LM Studio runs on everything, and that wins

The everyday chat assistant is the category most people want first, and the pick is easy because one app covers every Windows machine. LM Studio ships x64 and ARM64 builds, runs on Snapdragon, Intel, AMD, and NVIDIA alike, and dropped its commercial-use restriction in July 2025, so it is free at work too. It handles model discovery, chat, and an OpenAI-compatible local server in one window. On an RTX card it uses the GPU; on other hardware it falls back to Vulkan or the CPU. A 7B model at 4-bit idles around 5 GB of memory; a 30B mixture-of-experts model wants closer to 24 GB. Start there.

The runner-up is Ollama’s native Windows app, which now runs as a real desktop program with NVIDIA and AMD Radeon support built in. Its edge is the local server on port 11434 that every other tool in this post can point at. Pair it with Jan or Open WebUI for a friendlier chat window, or with NVIDIA’s ChatRTX if you have an RTX card and want a TensorRT-accelerated demo to chat with your own documents.

The honest caveat is for Snapdragon owners. LM Studio installs and runs, but its GPU and NPU acceleration on those chips is still a work in progress: for now you are mostly running on the ARM CPU cores, which is fine for a small model and slow for a large one. The NPU that makes your laptop a Copilot+ PC is barely used by any chat app yet. That is the single biggest gap in the Windows local-AI story, and it is worth knowing before you buy a machine for this.

Image generation: your GPU brand decides the winner

This is the category where the answer changes most with your hardware. On NVIDIA, the serious tool is ComfyUI, a node-based canvas that runs SDXL, FLUX, and video models on CUDA with more control than anything else. If a graph is too much, the friendliest front end is Krita AI Diffusion, which drives a ComfyUI backend from inside a real painting app and wants a card with 6 GB+ of VRAM. Both are free.

On AMD, the standout is Amuse, a free .NET app built with AMD that runs Stable Diffusion through DirectML with no Python setup at all. It leans on your Radeon GPU (the top FLUX model wants a 24 GB card like the 7900 XTX) and is the least painful way for an AMD owner to make images locally. ComfyUI works on Radeon too, through DirectML or a ZLUDA shim, but expect setup and a speed penalty versus the same card’s CUDA-equipped rival.

Be honest about the gap: for the same dollar spent on a GPU, the CUDA path is faster and better supported than the DirectML one, and most new techniques land on NVIDIA first. AMD image generation is real and improving, not yet effortless. Real-time video generation on a local Windows box remains a render, not a live experience, on any brand.

A podcasting setup with a condenser microphone and headphones on a desk.
Transcription is the local category with the cleanest answer on any Windows PC. Photo by Jonathan Farber on Unsplash.

Transcription: the one job solved on every PC

If your NPU makes you nervous and your GPU is modest, here is the category that just works. OpenAI’s Whisper models are small, open, and accurate enough for interviews, lectures, and legal audio, and they run happily on a plain CPU. There is no reason to upload a recording to anyone.

The pick is Buzz, a free open-source app that runs Whisper locally with three engines to choose from, live microphone capture, and export to TXT, SRT, or VTT. Nothing leaves your machine unless you export it. Vibe is the polished alternative: 90-plus languages, a clean interface, and recent Vulkan GPU support that makes even large models run near real-time on a modest laptop.

Copilot+ owners get a bonus that needs no install. Windows’s Live Captions translate audio from 40-plus languages into English captions in any app, on-device and even offline. It is one of the few Windows AI features that runs entirely local and genuinely earns the NPU.

Coding and notes: aim your tools at a local server

Two of the most valuable local-AI jobs share one trick: point a familiar app at the Ollama or LM Studio server already running on your PC, and no keystroke leaves the machine.

For coding, install Continue in VS Code for inline autocomplete and chat, or Cline for an agent that edits files and runs commands with your approval. Both detect your local models automatically; a 7B coder model such as Qwen2.5-Coder is the sweet spot on most hardware. The power move for Windows: run the server on the RTX desktop in the closet, then code from a thin laptop by pointing the extension at that box over your LAN. One GPU, every device.

For notes and documents, the private-first picks are AnythingLLM, which chats across PDFs, Word files, and whole codebases with a local-model backend, and GPT4All, whose LocalDocs feature is the simplest way to ask questions of a folder of files. Both build their search index on your disk. For a heavier corpus, these are the tools that keep your most sensitive material (client work, contracts, medical notes) off every server but your own.

A thin modern laptop open on a desk.
A Copilot+ laptop hides a 40+ TOPS NPU that most apps still don’t touch. Photo by Jaime Marrero on Unsplash.

What Windows ships (and where “on-device” bends)

Some of the best local AI on Windows is already installed. Microsoft put real on-device models into the operating system, and on a Copilot+ PC a few of them are excellent. A few others quietly reach for the cloud, and the marketing does not always make the difference obvious.

The genuinely local ones: Live Captions, covered above; the Click to Do screen actions that run on the NPU; and Windows Recall, the searchable timeline of your screen. Recall drew heavy fire at launch and Microsoft rebuilt it: it is now opt-in, its snapshot database is encrypted in a secure enclave, and it unlocks only with Windows Hello. It is local, but filtering of passwords and card numbers is still imperfect, so treat it as powerful and slightly leaky.

The one that bends the label: Cocreator in Paint, the text-to-image feature, is marketed as an on-device NPU experience, yet it requires a Microsoft account and an internet connection for content moderation in the cloud. Useful, but not the private, offline tool the framing implies. If you want a fully local image generator, the picks above still stand.

For the technically curious, Microsoft’s Windows AI Foundry (announced at Build 2025) is the plumbing underneath all of this: awinget-installable runtime called Foundry Local, an inbox small model named Phi Silica you can even fine-tune with LoRA, and Windows ML as the layer that steers a model to the NPU, GPU, or CPU without the app having to care. It is a developer story today, but it is why the built-in features work across four vendors’ chips.

Where it’s not ready, and how to choose

Three gaps keep this from being a solved problem. Frontier reasoning still belongs to the cloud; a local 30B model is excellent but not GPT-5. The NPU, for all its battery magic, is not yet a place you run your own chat model, so a Copilot+ laptop with no discrete GPU is the weakest machine here for heavy local AI. And Snapdragon owners face the thinnest app catalog, since some tools ship x64-only and lean on emulation. None of that is permanent, but all of it is true today.

With that said, the choice comes down to your silicon. Answer four questions and the picks fall out.

Four questions · a decision tree for Windows hardware
The question
If yes
If no
Do you have an NVIDIA RTX GPU with 8 GB+ of VRAM?
ComfyUI for images, Ollama or LM Studio for chat: everything is fastest here
You are on the NPU / AMD / CPU path—lean on LM Studio and Amuse
Is it a thin Copilot+ laptop on Snapdragon (ARM64)?
Install ARM64-native builds: LM Studio ships one; skip x64-only apps
Any x64 app runs—you have the widest catalog of the three
Do you ever need the absolute frontier of reasoning?
Keep one cloud subscription for that task; run the other 80% locally
A 7B–30B open model on your machine covers the daily work
Are you in a locked-down or offline environment?
Prefer Store or winget installs; verify no cloud calls before trusting it
Direct downloads open up ComfyUI, Ollama, and the full toolbox

The short version: if you have an RTX card, ComfyUI and Ollama make your machine the best local-AI box in this guide. If you have a Copilot+ laptop, LM Studio for chat and Buzz for transcription cover the daily work, and the built-in features earn the NPU while the third-party world catches up. On AMD or a plain CPU, Amuse and LM Studio still get you there with a little more homework. Most of the list is free; the rest costs less than a single year of a cloud subscription.

The frame outlasts the picks, which will move as fast as the hardware: name the accelerator you actually have, prefer the app with a native build for it, and verify that anything sold as “private” stays offline before you trust it. If you want to go deeper on which weights to download, the roundup of open models worth running is the next stop, and the case for personal compute explains why this whole category is moving back onto machines you own.

More reading
Launch offer · 50% off

One-time payment. Yours forever.

No subscriptions. No seats. No renewals. Buy CSuite once, future updates included.

$98$49
Pricing

Secure checkout via Stripe. Already have a license? Download the app