NewsAI ModelsLLMsMay 20, 202612 min read

The last three months in AI: Text models

Six flagships in eight weeks, an open-weight model that tied GPT-5.5 on coding, and Meta walked away from Llama. The Feb–May 2026 LLM catalog.

By Atul

Spring 2026 · Text models

Feb 20 – May 19

Six new flagships in eight weeks, the open-weight tier finally tied the closed leaders on coding, and Meta walked away from Llama.

Feb 16
Alibaba
Qwen 3.5: native multimodal agent
Mar 30
Alibaba
Qwen 3.5-Omni: text + audio + video
Apr 8
Meta
Muse Spark: Meta's first closed model
Apr 16
Anthropic
Claude Opus 4.7
Apr 20
Moonshot AI
Kimi K2.6: 1T open-weight
Apr 23
OpenAI
GPT-5.5 / Pro / Thinking
Apr 24
DeepSeek
V4-Pro / V4-Flash: MIT, 1M context
May 19
Google
Gemini 3.5 Flash + Gemini Omni

Eight days in April set the shape of the quarter. On the 8th, Meta shipped its first proprietary, closed-weight model and announced that the Llama era was over. Twelve days later, a 1-trillion-parameter open-weight model from a Beijing startup called Moonshot AI matched OpenAI’s brand-new GPT-5.5 on coding benchmarks. Two days after that, DeepSeek dropped a 1.6T MIT-licensed model with a million-token context. The biggest LLM story of Q2 2026 isn’t any one flagship: it’s that the open and closed tracks finally crossed in the middle, and the lab that built the open one was no longer Meta.

The rest of the quarter rhymes. Six new flagship models shipped between February 20 and May 20, plus a handful of consequential point upgrades. The closed tier ratcheted prices up; the open tier ratcheted them down. If you only have ten minutes for an LLM update this quarter, spend it with Kimi K2.6, GPT-5.5, and the Gemini 3.5 Flash pricing card. Those three explain almost everything else.

A black and white close-up of a vintage Triumph typewriter, keys catching the light. — Quarterly roundup, wire-service style. Photo by Markus Spiske on Unsplash.

April was the cluster

Quarter-to-quarter, you usually get one or two headline launches and a long tail. This one had a thirty-day stretch with six flagships back to back. Meta’s Muse Spark on April 8 was the first: Meta’s first closed-weight model and the first output of the Superintelligence Labs reorg under Alexandr Wang. Eight days later, Anthropic shipped Claude Opus 4.7. On the 20th, Moonshot AI released Kimi K2.6, a 1T-param MoE under a Modified MIT license. On the 23rd, OpenAI released GPT-5.5 and GPT-5.5 Pro. The 24th brought DeepSeek-V4-Pro and V4-Flash, both MIT-licensed with a 1M-token context. The cluster closed three weeks later at Google I/O on May 19, with Gemini 3.5 Flash and Gemini Omni Flash.

The compression tells you something the individual launches don’t. Every major lab now ships on the same schedule because they’re all racing the same one (agentic coding revenue from coding agents and IDEs), and the eight-week cycle of training-evals-release has locked into spring and autumn windows. Expect another cluster between Labor Day and Halloween.

What shipped: three frontier flagships, three bets

Three labs shipped new top-of-line models in the window, and the three bets almost don’t overlap.

Three frontier flagships · Apr 16 to May 19

Model

Lab

Date

The bet

Claude Opus 4.7

Anthropic

Apr 16

Long-horizon agentic coding: 13% lift over 4.6, 98.5% on the computer-use visual-acuity benchmark.

GPT-5.5

OpenAI

Apr 23

Reasoning as a dial: five effort levels including xhigh, 82.7% Terminal-Bench, double the input price of 5.4.

Gemini 3.5 Flash

Google

May 19

Flash that beats last quarter's Pro: 1M context, $1.50/$9 per M tokens, the new default in the Gemini app.

Anthropic went deepest on agentic coding. Opus 4.7 landed with a reported 13% lift on coding benchmarks over 4.6 and a striking 98.5% on the visual-acuity test that gates computer-use agents (4.6 sat at 54.5%). Pricing stayed at $5/$25 per million tokens, but a new tokenizer means the same English text uses up to 35% more tokens than 4.6, so the per-request bill rose without the per-token number changing. The 4.7 release also bumped vision to 2,576-pixel inputs and added file-system memory for multi-session work.

OpenAI turned reasoning into a dial. GPT-5.5 exposes five effort levels (none, low, medium, high, xhigh) and the gap between the top two is wide enough that most comparison tables list them as separate models. It scored 82.7% on Terminal-Bench 2.0 and 35.4% on FrontierMath Tier 4 at top effort, both records at launch. Pricing doubled to $5/$30 per million tokens for the standard model and $30/$180 for the Pro variant, with a 1.1M-token context. GPT-5.5 Instant landed as the new ChatGPT default for free users on May 5.

Google shipped neither a new Pro nor a new Ultra. It shipped a Flash that beats last quarter’s Pro. Gemini 3.5 Flash outperforms Gemini 3.1 Pro on Terminal-Bench (76.2% vs 70.3%) and on the GDPval-AA agentic suite (1656 Elo vs 1317), at $1.50/$9 per million tokens and a 1M-token context. It is, as of writing, the new default model in the Gemini app and in AI Mode. Gemini Omni Flash, announced the same day, is a sibling model: a native any-to-any multimodal that turns text, image, audio or short video into a 10-second video clip with synchronised audio. The 3.5 Pro tier is slated for June.

A pile of folded newspapers stacked on top of each other. — Six flagships in eight weeks: more catalog than any one editor can carry. Photo by Greg Bulla on Unsplash.

What got better: the open-weight tier closed the gap

Below the closed flagships, the more interesting story is on the open-weight side, where four releases collectively moved the open floor from “a step behind” to “inside two points on coding.”

Qwen 3.5 opened the quarter on February 16 with a 397B-parameter sparse MoE that activates 17B per token, trained natively on text, images, video and UI screenshots. Alibaba followed with the smaller dense models (0.8B/2B/4B/9B) on March 2, then Qwen 3.5-Omni on March 30 with a Thinker-Talker architecture for unified text-audio-video streaming. The Qwen 3.6-Plus refresh on April 2 targeted agentic coding specifically. By May, Qwen was the open-weight family running on the most laptops outside China.

Kimi K2.6 is the headline. Released April 20 under a Modified MIT license, it’s a 1T-parameter MoE with 32B active, a 262K context, INT4-native quantisation, and four serving variants up to a 300-agent parallel swarm. On SWE-Bench Pro it scored 58.6, ahead of GPT-5.4 at 57.7 and Claude Opus 4.6 at max effort (53.4). It is, by that benchmark, the first open-weight model to walk into a closed-frontier benchmark cluster and beat the closed leader.

DeepSeek V4-Pro and V4-Flash arrived four days later with a different bet: efficiency at scale. The V4-Pro is a 1.6T MoE with 49B active, MIT-licensed, 1M-token context. Its hybrid Compressed Sparse Attention plus Heavily Compressed Attention scheme uses 27% of V3.2’s inference FLOPs and 10% of its KV cache at the million-token setting. On SWE-Bench Verified, V4-Pro-Max scored 80.6%, within 0.2 points of Claude Opus 4.6.

Two points worth noting under any “open caught up” headline: the US AI Safety Institute’s independent eval of V4-Pro put it about eight months behind the closed frontier on broader capability work, and Meta’s walk-away from open weights means there’s no longer an American major committed to open release at the frontier. Closing the coding gap is real; closing the whole-model gap is more contested. The case for running these on your own machine is now stronger than at any point in the past year, but the sustainability of that supply chain is now a Chinese-lab question, not an American one.

Under the hood: four convergences

Across the catalog, the underlying recipe converged on four moves. None of them is unique to one lab. That’s the point.

Reasoning effort is now a dial, not a model. GPT-5.5 ships with five effort levels. Opus 4.7 has “adaptive reasoning” with a max-effort knob. Gemini 3.5 Flash has dynamic thinking on by default. Kimi K2.6 exposes the same. A year ago a separate “o-series” reasoning model existed alongside the base; this quarter, every major lab folded thinking into the same endpoint and exposed effort as a parameter. The implication for application builders is that “which model” is now an incomplete spec: you also need to record which effort level.

Mixture-of-Experts is the default architecture. Every one of the six flagships above is sparse. Qwen 3.5 activates 17B of 397B; Kimi K2.6 activates 32B of 1T; DeepSeek V4-Pro activates 49B of 1.6T; Muse Spark is sparse; GPT-5.5 and Opus 4.7 are (per leaks and third-party teardowns) sparse too. Dense transformers at the frontier are over.

Long-context efficiency tricks reached production. Million-token context is now table-stakes, but real use is finally cheaper than it was at launch. DeepSeek V4’s Compressed/Heavily Compressed Attention pair, Qwen 3.5’s Gated Delta Network linear-attention hybrid, and Gemini 3.5 Flash’s $0.15 cached-input pricing all push toward the same outcome: long context that doesn’t bankrupt the user. The companion piece on what those million-token numbers actually deliver is still the right read (the marketing claims continue to outrun the practical reality), but the trajectory is closing the gap rather than widening it.

Native multimodal moved from research to product. Qwen 3.5-Omni, Gemini Omni, and Muse Spark were each trained from scratch on text, vision, audio and (in the case of Gemini Omni and Qwen-Omni) video, instead of bolting a vision tower onto a text backbone. Real-time voice with cross-modal grounding became a normal endpoint, not a demo. The text installment of this roundup probably won’t exist as a separate post in two years: the line between text and the other modalities is dissolving.

A rotary printing press at work, rolls of paper feeding through inked cylinders. — Under the hood, the technique stack converged faster than the catalog. Photo by Bank Phrom on Unsplash.

Trend lines: four patterns across the quarter

Across the catalog, four things rhyme. None of them are obvious from any single launch.

1. Closed prices doubled at the top. GPT-5.5 ($5/$30) is twice GPT-5.4 ($2.50/$15). Opus 4.7 nominally held at $5/$25 but the new tokenizer adds 0–35% to the bill. Gemini 3.5 Flash is 3× the input price of Gemini 3 Flash, though still well under Gemini 3.1 Pro. Frontier intelligence is getting more expensive in absolute terms, even as cost-per-task can drop on the same workload: the lever to use is prompt caching and batch tiers, not the sticker price.

2. Open prices kept falling. Kimi K2.6’s commercial license is free below 100M MAU; DeepSeek V4-Flash on Hugging Face is MIT and 13B-active: it runs on a single H100 node. Local-first inference for serious work is more credible this quarter than last.

3. Meta walked away. Muse Spark is closed-weight, has no Llama branding, and runs only inside Meta AI and a private API preview. Meta Superintelligence Labs has not committed to any future open release. Three years of Llama as the open-source anchor of US LLM research is over. The replacement at the frontier of open is Chinese-lab work: Qwen, DeepSeek, Kimi. That’s a structural shift, not a one-quarter blip.

4. The leaderboard is now a configuration. The top five on the Artificial Analysis Intelligence Index as of May 20 are all the same models people were running in March, with different effort settings. GPT-5.5 at xhigh and GPT-5.5 at high are treated as separate entries. Opus 4.7 with adaptive reasoning at max effort sits in the cluster with Gemini 3.1 Pro Preview. Picking “the best model” in 2026 means picking a model and an effort setting; the field has compressed enough that the second choice often matters more than the first.

Quiet quarter for

Three places were unusually silent for a window this active. xAI spent the quarter on Grok point releases and on the Imagine multimodal endpoint; Grok 5, originally announced for Q1, slipped to a May–June public beta. Mistral shipped Medium 3.5 and Codestral updates but no new frontier-tier flagship, and Mistral Large 3 from December 2025 remains the open-source MoE flagship. Apple Intelligence’s long-promised LLM-Siri refresh slipped from iOS 26.4 in March into the iOS 26.5/27 window. An April update did expand language support, but the major on-device Siri was deferred. The pattern looks like a slower cohort and a faster cohort. Anthropic, OpenAI, Google, Alibaba, DeepSeek, Moonshot, and Meta are now on roughly the same release cadence; xAI, Mistral, and Apple are at least one cycle behind.

What to watch next quarter

Four things are queued up between May and August and worth marking on a calendar.

Gemini 3.5 Pro. Google has confirmed a 3.5 Pro tier is in internal use and slated for June. It is the obvious answer to GPT-5.5 Pro and Opus 4.7 at max effort, and the question is whether Google ships an Ultra alongside.

Claude Sonnet 4.8. Anthropic’s release cadence puts a Sonnet refresh one to four weeks after an Opus, meaning a Sonnet 4.8 should land between mid-May and June. As of publication, it hasn’t shown up in the model list, but the leaks point to it. Sonnet, not Opus, is the workhorse most teams deploy in production, so the upgrade matters more than the headline.

EU AI Act enforcement begins August 2. EU-level fines for general-purpose AI providers apply from August 2, and the bulk of the high-risk system rules enter into force the same day. The Code of Practice gives most existing models a two-year transition window to 2027, but the gun goes off in early August. For US teams shipping into the EU, the legal exposure argument for on-device or BYOK designs gets sharper this summer.

Open-weight reasoning at top effort. Kimi K2.6 is the proof of concept; the next obvious step is a Chinese-lab model with an explicit max-effort dial that beats GPT-5.5 at xhigh on a public benchmark cluster. If it happens, it happens in the next 90 days.

The leaderboard, as of May 20

For calibration. The closed top is tight: everyone inside the top five sits within three points of each other on the Intelligence Index. The first open-weight entry is two more points behind, which is closer than it’s ever been.

Artificial Analysis Intelligence Index · top of board, May 20 2026

Model

Lab

Index

GPT-5.5 (xhigh)

OpenAI

GPT-5.5 (high)

OpenAI

Claude Opus 4.7 (max effort)

Anthropic

Gemini 3.1 Pro Preview

Google

GPT-5.4 (xhigh)

OpenAI

Kimi K2.6 (open weights)

Moonshot AI

Open vs closed, coding benchmarks · May 2026

DeepSeek V4-Pro-Max

DeepSeek · MIT

80.6%

Claude Opus 4.6 (max effort)

Anthropic · Closed

80.8%

Kimi K2.6

Moonshot AI · Modified MIT

58.6 (SWE-Pro)

GPT-5.4 (xhigh)

OpenAI · Closed

57.7 (SWE-Pro)

Top two rows: SWE-Bench Verified. Bottom two: SWE-Bench Pro. Open-weight bars in violet; closed in grey. Both gaps are inside two points.

The same answer to a hard question, from any of these five, would be more or less the same answer. The reason to choose between them is no longer raw intelligence: it’s effort dial, latency, tokenizer cost, eval shape, licensing, and what tool-use harness the model was actually trained against. That makes choosing a default a curation problem more than a capability one, which is the case made in you don’t need every AI model. The companion piece this quarter, The last three months in AI: Image models, covers the same window on the other side of the modality wall.

Next installment in this series: The last three months in AI, August 2026.