News · Local AI · Industry · June 13, 2026 · 10 min read

Siri just proved the point: the most personal AI runs on your device

Apple rebuilt Siri and rented a $1B Gemini brain, yet kept the part that reads your life on your device. The line it drew is the lesson.

The new Siri, split in two

WWDC 2026

Apple kept the personal layer on your device, and rented a brain that can’t remember.

Stays on your device

Your semantic index: texts, photos, mail, indexed on the device
What's on your screen: on-screen awareness, read locally
Actions inside your apps: App Intents, run on-device
AFM 3 Core · 3B + 20B sparse: on the Neural Engine

Leaves, but can’t be kept

A sanitized query: the question, not your data trove
A borrowed reasoning brain: a 1.2T Gemini model, rented
Stateless compute: deleted the moment it answers
No retention, by contract: nothing kept, nothing trained on

The line down the middle is the whole story: the parts of Siri that read your life run locally; the parts that don’t need your raw data leave, wrapped so they can’t hold on to it.

When Apple rebuilt Siri, the most newsworthy thing wasn’t what it could finally do. It was where the new Siri runs. The headline feature was an address: your device, first. At WWDC on June 8, 2026, Apple showed an assistant that understands what’s on your screen, acts inside your apps, and holds a real conversation. And it built that assistant around a single architectural decision that’s easy to miss under the demos.

Here is the decision, in one sentence: the parts of Siri that read your actual life run on your device, and the parts that don’t need your raw data leave only as a stripped query, sent to a cloud brain Apple is contractually forbidden to let remember anything. That’s not a privacy footnote bolted onto a cloud product. It’s the shape of the product. And it’s the same argument people building local-first AI have been making for years, now stamped by the most consumer-facing company on earth.

[Image: An iPhone resting face-up on a wooden desk surface.] The most personal computer most people own now runs an assistant whose private layer never leaves the glass. Photo by Tyler Lastovich on Unsplash.

What Apple actually shipped, and what’s still coming

This product has a long history of delays, so precision matters. The “more personal Siri” was first shown in 2024, then delayed in March 2025 to fix quality problems. It finally arrived in stages. With iOS 26.4 in spring 2026, Siri gained on-device personal context and on-screen awareness: it can use information stored on your phone and act on what’s currently displayed.

The full conversational overhaul (the version powered by a much larger cloud model) was announced at WWDC and lands with iOS 27 alongside the iPhone 18 in the fall. So as of mid-2026, the personal-context machinery is shipping and on-device; the heavy conversational brain is announced and arriving. The split between those two halves is not a rollout accident. It is the design.

The architecture is the headline, not the features

Apple disclosed its third-generation foundation models the same week. On the device sit two models: AFM 3 Core, a 3-billion parameter dense model, and AFM 3 Core Advanced, a 20-billion parameter sparse model that activates just 1 to 4 billion parameters at a time and is stored in flash memory, swapped into active memory only when needed. Both run on Apple silicon’s Neural Engine. Both are quantized with training-aware compression so they fit and stay fast.
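To make those sizes concrete, here is a back-of-the-envelope sketch of the weight footprint. The parameter counts come from the figures above; the 4-bit weight width is purely an assumption for illustration (the real bit-widths are not stated here), and the helper name `weights_gb` is hypothetical.

```python
# Rough weight-footprint arithmetic for the on-device models.
# ASSUMPTION: 4 bits per weight. The article does not give the real bit-width.
BITS_PER_WEIGHT = 4


def weights_gb(params_billions: float, bits: int = BITS_PER_WEIGHT) -> float:
    """Approximate size of the weights in gigabytes (1 GB = 1e9 bytes)."""
    total_bytes = params_billions * 1e9 * bits / 8
    return total_bytes / 1e9


rows = [
    ("AFM 3 Core, 3B dense", 3),
    ("AFM 3 Core Advanced, all 20B (kept in flash)", 20),
    ("AFM 3 Core Advanced, 1B active", 1),
    ("AFM 3 Core Advanced, 4B active", 4),
]

for label, billions in rows:
    print(f"{label}: about {weights_gb(billions):.1f} GB")
```

Under that assumed bit-width, the sparse model would occupy roughly 10 GB in flash while only 0.5 to 2 GB of weights are active at once, which is the kind of gap that makes swapping from flash worth the trouble.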

These on-device models do the work that touches your data. Reading your screen. Indexing your messages and photos so Siri can find “the recipe my sister sent.” Filling a form from your details. Running an action inside an app. None of that requires a frontier-scale model; it requires proximity to your information, and the fastest, safest place to be near your data is on the silicon already holding it.

A 3-billion-parameter model would be a poor oracle of world knowledge. But that’s not the job. The on-device job is orchestration: understand the request, search a private index you already carry, match it to an app action, and either finish the task or hand a clean, minimal instruction to a bigger model. That kind of route-and-retrieve work fits a small model well, and pinning it to the device buys three things the cloud can’t: it answers in milliseconds with no network hop, it works on a plane or in a dead zone, and it never meters a bill against your own questions. The personal layer wants to be local on the merits, before privacy even enters the argument.

[Image: A macro photograph of a computer processor chip with a large letter A etched on its surface.] The Neural Engine matured into a real inference target. A ~3B model on your own chip is fast, free, and offline. Photo by Igor Omilaev on Unsplash.

For the hard reasoning (planning a multi-step task, answering a knowledge question, agentic tool use), Apple escalates. Some of that goes to its own Private Cloud Compute servers running AFM 3 Cloud. The heaviest queries route further, to a much larger model that Apple custom-built with Google: reportedly a roughly 1.2-trillion-parameter Gemini model that Apple licenses for around $1 billion a year, an arrangement confirmed publicly by Google Cloud chief Thomas Kurian at Google Cloud Next in April 2026. Why this matters is the next section.

Where each request actually runs

Apple’s third-generation models span three locations. Notice the sort key: not “how smart,” but “how close to your raw data.”

On your device
Models: AFM 3 Core (3B dense) · AFM 3 Core Advanced (20B sparse)
Work: Personal context, on-screen awareness, app actions, dictation, quick rewrites: anything that touches your indexed data.

Apple's Private Cloud Compute
Models: AFM 3 Cloud · ADM 3 Cloud (Image)
Work: Heavier summarization, image generation and editing, on Apple-silicon servers that delete your data on completion.

Google Cloud GPUs (under PCC)
Models: AFM 3 Cloud Pro · the licensed Gemini model
Work: Complex reasoning, planning, agentic tool use: the heaviest queries, on NVIDIA GPUs, still inside Apple's privacy envelope.

Model lineup from Apple’s Machine Learning Research. The personal layer never leaves the top row.
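The sort key in that lineup can be sketched as a tiny routing function. This is an illustration of the idea only, not Apple's routing logic: the `Request` fields, the `Workload` levels, and the `route` function are all hypothetical names invented for the sketch.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    ON_DEVICE = "on-device models"
    PRIVATE_CLOUD = "Apple's Private Cloud Compute"
    LICENSED_MODEL = "Google Cloud GPUs under PCC"


class Workload(Enum):
    LIGHT = 1    # dictation, quick rewrites
    MEDIUM = 2   # heavier summarization, image generation
    HEAVY = 3    # complex reasoning, planning, agentic tool use


@dataclass(frozen=True)
class Request:
    text: str
    touches_personal_data: bool  # needs the index, the screen, or app actions
    workload: Workload


def route(req: Request) -> Tier:
    """Hypothetical router: proximity to raw data is checked before difficulty."""
    if req.touches_personal_data or req.workload is Workload.LIGHT:
        return Tier.ON_DEVICE
    if req.workload is Workload.MEDIUM:
        return Tier.PRIVATE_CLOUD
    return Tier.LICENSED_MODEL


if __name__ == "__main__":
    examples = [
        Request("find the recipe my sister sent", True, Workload.HEAVY),
        Request("summarize this long public report", False, Workload.MEDIUM),
        Request("plan a three-city trip itinerary", False, Workload.HEAVY),
    ]
    for req in examples:
        print(f"{req.text!r} -> {route(req).value}")
```

Note the order of the checks: a request that touches personal data stays local even when it is hard. In the architecture described here, the hard part of such a request would leave only as an abstracted query, which this sketch does not model.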

Apple drew the line at your data, not the smartest model

Read that routing table again and the surprise lands. The smartest model in the stack is not on your device; it’s a rented Gemini brain in a data center. If Apple were optimizing purely for capability, everything would route there. It doesn’t. The on-device models keep the work that touches your indexed life, and only an abstracted query (the question, not the trove behind it) is allowed to leave.

Picture the round trip. You ask, “text Maya the address from the invite that’s on my screen.” The on-screen reading, the lookup of who Maya is, the pull of the address from your mail: all of that happens on the device, against an index that never leaves. If the request needs heavier reasoning, what travels upward is a tidy instruction, not your inbox. The cloud brain gets the shape of the task; your data stays behind. That is the difference between renting intelligence and surrendering your life to it.
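One way to picture that round trip in code is below. Everything in it is an assumption made for illustration: the placeholder scheme (`{CONTACT}`, `{ADDRESS}`), the toy address parser, the sample data, and the stand-in `fake_cloud_model` are invented here, and nothing in the source says Apple abstracts queries this way.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for data that lives only on the device.
LOCAL_CONTACTS = {"Maya": "+1-555-0100"}
SCREEN_TEXT = "You're invited! Saturday 7pm, 12 Example Street"


@dataclass(frozen=True)
class CloudQuery:
    """What is allowed to leave: the shape of the task, with placeholders."""
    instruction: str


def extract_address(screen_text: str) -> str:
    # Toy parser: treat everything after the last comma as the address.
    return screen_text.rsplit(",", 1)[-1].strip()


def fake_cloud_model(query: CloudQuery) -> str:
    # Stand-in for the remote model. It only ever sees placeholders.
    assert "Example Street" not in query.instruction
    return "Hi {CONTACT}, here's the address: {ADDRESS}."


def handle_request(contact_name: str) -> tuple[str, str]:
    # Steps 1 and 2 run locally: read the screen and resolve the contact.
    address = extract_address(SCREEN_TEXT)
    number = LOCAL_CONTACTS[contact_name]
    # Step 3 is the only part that leaves: an abstracted instruction.
    query = CloudQuery("Draft a short text sharing {ADDRESS} with {CONTACT}.")
    draft = fake_cloud_model(query)
    # Step 4 runs locally again: fill the placeholders with the real values.
    message = draft.format(CONTACT=contact_name, ADDRESS=address)
    return number, message


if __name__ == "__main__":
    print(handle_request("Maya"))
```

The property worth noticing is that the real address and phone number appear only in the local steps; the remote call receives a template and returns a template.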

That’s the lesson, and it’s sharper than “Apple went local.” Apple didn’t go local out of purism; it went local exactly where the data is most personal, and reached for the cloud only where the reasoning genuinely needed scale your phone can’t provide. The boundary isn’t drawn at “how smart.” It’s drawn at “how close to your raw life.”

Developers get to enforce that boundary directly. With the new framework, Apple replaced SiriKit with App Intents and added a privacy manifest that lets an app declare, per interaction, whether a Siri request is allowed to reach the cloud or must stay on-device: a hard switch built for healthcare and enterprise apps where “just send it up” is a compliance violation. That control only makes sense in a system where on-device is the assumed default and cloud is the deliberate exception.
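The per-interaction switch can be sketched in a few lines. This is not the App Intents API or Apple's manifest format, neither of which is shown in this article; the `MANIFEST` dictionary, the interaction names, and the functions are hypothetical and exist only to show what "on-device by default, cloud by declaration" could look like.

```python
from enum import Enum


class CloudPolicy(Enum):
    ON_DEVICE_ONLY = "on_device_only"
    CLOUD_ALLOWED = "cloud_allowed"


# Hypothetical per-interaction declarations for an imaginary health app.
MANIFEST = {
    "log_symptom": CloudPolicy.ON_DEVICE_ONLY,
    "explain_medical_term": CloudPolicy.CLOUD_ALLOWED,
}


class EscalationBlocked(Exception):
    """Raised when a request needs the cloud but the app forbids it."""


def may_escalate(interaction: str) -> bool:
    # Undeclared interactions fall back to on-device only: local is the default.
    policy = MANIFEST.get(interaction, CloudPolicy.ON_DEVICE_ONLY)
    return policy is CloudPolicy.CLOUD_ALLOWED


def run(interaction: str, needs_cloud: bool) -> str:
    if not needs_cloud:
        return "on-device"
    if may_escalate(interaction):
        return "cloud"
    raise EscalationBlocked(f"{interaction!r} must stay on-device")


if __name__ == "__main__":
    print(run("explain_medical_term", needs_cloud=True))
    print(run("log_symptom", needs_cloud=False))
    try:
        run("log_symptom", needs_cloud=True)
    except EscalationBlocked as err:
        print("blocked:", err)
```

The design choice to notice is the fallback in `may_escalate`: an interaction the app never declared is treated as on-device only, so forgetting an entry fails closed rather than open.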

Even the cloud half is built to forget

The honest part of this story is that Apple is hybrid, not absolutist. Plenty of Siri’s reasoning leaves your phone. So the second pillar carries real weight: when data does leave, the cloud is engineered so it physically can’t keep it. Private Cloud Compute runs stateless: your data is processed and then deleted on completion, and the compute nodes have had the ability to write to storage removed, so nothing survives the response, not even in logs or debugging tools.

It’s also verifiable. Apple publishes the measurements of all code running on PCC to an append-only, cryptographically tamper-proof transparency log, so security researchers can confirm the software in production is the software they audited. When Apple extended PCC to third-party infrastructure (the Google GPUs), it kept those guarantees, encrypting the input, the model weights, and the result inside GPU memory while computation runs. And the licensing terms reportedly bar Google from training future Gemini versions on Apple users’ queries. Stateless, verifiable, contractually amnesiac. That is what “cloud, but private” has to mean to be worth saying.
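To show why an append-only log makes tampering detectable, here is a deliberately simplified sketch using a SHA-256 hash chain. It is not Apple's implementation: production transparency logs commonly use Merkle trees rather than a plain chain, and the class, function names, and sample measurements below are hypothetical.

```python
import hashlib

GENESIS = "0" * 64  # fixed starting value for the chain


def entry_hash(prev_hash: str, measurement: str) -> str:
    """Bind each entry to everything before it by hashing the previous hash in."""
    return hashlib.sha256((prev_hash + measurement).encode()).hexdigest()


class TransparencyLog:
    def __init__(self) -> None:
        self._entries: list[tuple[str, str]] = []  # (measurement, chained hash)

    def append(self, measurement: str) -> str:
        prev = self._entries[-1][1] if self._entries else GENESIS
        chained = entry_hash(prev, measurement)
        self._entries.append((measurement, chained))
        return chained

    def entries(self) -> list[tuple[str, str]]:
        return list(self._entries)  # a copy, so callers cannot edit the log


def verify(entries: list[tuple[str, str]]) -> bool:
    """Recompute the chain; any edited or reordered entry breaks it."""
    prev = GENESIS
    for measurement, chained in entries:
        if entry_hash(prev, measurement) != chained:
            return False
        prev = chained
    return True


if __name__ == "__main__":
    log = TransparencyLog()
    log.append("build-1:measurement-aaa")
    log.append("build-2:measurement-bbb")
    print("untouched log verifies:", verify(log.entries()))

    tampered = log.entries()
    tampered[0] = ("build-1:measurement-evil", tampered[0][1])
    print("tampered log verifies:", verify(tampered))
```

A researcher holding the latest chained hash can recompute the chain from the published entries; if any earlier measurement was swapped, the recomputed hashes stop matching.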

[Image: Rows of server racks with dense bundles of network cabling in a data center.] The cloud half is real, but built to delete on completion and prove it did. Photo by Taylor Vick on Unsplash.

This isn’t an Apple story. It’s the direction.

If only Apple did this, you could call it a brand quirk. But the personal layer is migrating onto the device across the whole industry. Google ships Gemini Nano, an on-device model on Pixel and across Android and ChromeOS, for features that need to be instant, offline, and private. Microsoft ships Phi Silica, a small local model behind Windows AI on Copilot+ PCs, and at Build 2026 it widened local AI beyond NPU-only machines to ordinary GPUs, pushing on-device inference onto far more hardware.

The reason the three converged is not coordination; it’s physics and product sense. The personal layer is the part that has to be fastest (an assistant that stutters is worse than none), the part that should survive a lost signal, and the part holding the data a user would least want logged on a stranger’s server. Every one of those pressures points to the same place: the chip in your hand. The heavy reasoning, which is occasional and tolerates a beat of latency, is the part worth sending out. Split the assistant along that seam and you get Apple’s architecture, Google’s, and Microsoft’s, independently arrived at, because the seam is real.

The personal layer is going local everywhere

Apple: AFM 3 Core (Neural Engine, iPhone & Mac)
~3B-parameter model handles the personal layer on the device.

Google: Gemini Nano (Pixel, Android, ChromeOS)
On-device model for instant, offline, private features.

Microsoft: Phi Silica (Copilot+ NPUs, now RTX GPUs)
Small local model behind Windows AI, no cloud round-trip.

One company doing this is a choice. All three doing it is a direction.

The pattern is consistent: a small, fast model lives next to your data for the personal, instant, private work; a big model in the cloud handles the heavy reasoning. That’s the same division personal compute has been moving toward as unified memory, quantization, and open weights matured, and it’s why serious software has always kept its core on the machine. Apple just put the most recognizable assistant on earth on the local side of that line.

The honest caveats: hybrid, not absolutist

Don’t over-read this. Apple did not make Siri fully local, and any post that claims it did is wrong. The most capable model Siri reaches for is someone else’s, running in a data center, on rented GPUs. Private Cloud Compute is genuinely cloud: extremely well-secured cloud, but cloud. Apple’s on-device models are small by frontier standards, and hard questions still route up. The ChatGPT integration, when you opt in, sends some queries off-device entirely, with your consent.

The point was never “Apple went fully local.” The point is the default. Apple defaulted the personal layer to the device and treated the cloud as the escalation, then spent real engineering (stateless compute, transparency logs, confidential computing, contractual training bans) making the escalation safe to take. A company that ships to more than a billion phones chose that shape on purpose. The default is the argument.

The question Siri hands the rest of your stack

So here is the question to carry out of the WWDC noise. The most personal assistant most people own now keeps its private layer on the device they already hold, and only lets an abstracted query reach a cloud built to forget it. If that’s the right shape for your texts, photos, screen, and schedule, why does the rest of your personal AI stack still phone everything home?

Your chat history, your documents, your half-finished creative work, the files you feed an assistant every day: most of that still rides to someone else’s servers by default, with none of PCC’s guarantees and no transparency log to check. Apple just demonstrated, at the largest possible scale, that personal AI can keep its personal half at home. The tools that don’t are the ones now out of step, and software, as a rule, keeps drifting back to the machine you own. Siri just proved the point on the biggest stage there is.
