News · AI Models · Image AI · May 19, 2026 · 10 min read

The last three months in AI: Image models

Reasoning came to image generation, native typography stopped being a tell, and the leaderboard turned over twice. The Feb–May 2026 catalog.

By Atul

Spring 2026 · Image models

Feb 10 – May 19

Seven launches, one new #1 on the arena, and the day image models started reasoning.

Feb 10 · Alibaba · Qwen-Image-2.0 (7B, native 2K)
Feb · Black Forest Labs · FLUX.1.2 Pro Ultra (4 MP)
Mar 17 · Midjourney · V8 Alpha (5× faster, 2K)
Mar 19 · Adobe · Firefly Image 5 (native 4 MP)
Apr 14 · Microsoft · MAI-Image-2-Efficient
Apr 21 · OpenAI · ChatGPT Images 2.0 / gpt-image-2
May 4 · xAI · Grok Imagine Quality Mode

The biggest image-AI story of the quarter happened on a Tuesday in April and lasted twelve hours. OpenAI shipped ChatGPT Images 2.0 on April 21, 2026. Inside half a day the underlying model, gpt-image-2, had taken the top spot on the Artificial Analysis Image Arena by 242 Elo points, a margin that, on a leaderboard usually decided by single digits, reads less like a launch than an eviction. The new feature wasn’t a better diffusion sampler. It was a reasoning loop bolted to the front of one.

That detail is the whole quarter in miniature. Between February 10 and May 19, image models stopped being just pretty-picture engines. They started to plan, search, verify, and edit in the same call. If you skipped the last three months and want the short version: spend an hour with Qwen-Image-2.0, gpt-image-2, and Nano Banana Pro. Those three are the new floor, and a year-old subscription to any one model is now a much riskier default than it used to be.

A set of six 35mm film strips arranged like a contact sheet on a flat surface, warm orange tones. — Quarterly roundup, contact-sheet style. Photo by Shannia Christanty on Unsplash.

April 21 was the day image models started thinking

Reasoning showed up in image models the way reasoning showed up in text models in late 2024: a slower, more deliberate mode that thinks before it produces. OpenAI’s launch post describes two modes (Instant for fast output, Thinking for the deliberate one), with Thinking restricted to paid tiers. Under the hood, Thinking runs a planning step, can pull web references, generates candidates, and re-checks the result against the prompt before returning. The headline win is character and object consistency across frames, where every previous image model was weakest. The secondary win is text inside images, which finally reads like type and not like a hallucinated guess at letterforms.

Two days before that launch, the arena’s top model was GPT Image 1.5. Two days after, it was GPT Image 2 at 1338 Elo, with the next four seats split between Google’s Nano Banana family, the previous-generation OpenAI model, and Microsoft. The thing that beat all of them was, by their own benchmarks, the same family with a reasoning loop on top.

What shipped: three new flagships, three bets

Three labs shipped genuinely new flagship image models in the quarter, not point updates. Each one made a different bet, and the three bets almost don’t overlap.

Three new flagships · Feb 10 to Apr 21

| Model | Lab | Date | The bet |
| --- | --- | --- | --- |
| Qwen-Image-2.0 | Alibaba | Feb 10 | Cut the parameters (20B → 7B), keep the typography crown, ship native 2K. |
| MAI-Image-2-Efficient | Microsoft AI | Apr 14 | Same family, leaner inference: 22% faster, 4× the GPU throughput on H100s. |
| gpt-image-2 | OpenAI | Apr 21 | Stop drawing first. Plan, search, generate candidates, verify against the prompt. |

Alibaba opened the quarter on February 10 with Qwen-Image-2.0, a 7-billion-parameter rebuild of last year’s 20B model that beat the older one on DPG-Bench (88.32 vs 83.84) at a third of the weight class, with native 2K resolution and a unified generation-and-editing path in a single model. For most of the spring it sat at #1 on the LM Arena image leaderboard for both text-to-image and editing. The API is on Alibaba Cloud’s BaiLian platform; open-weight release was still pending as of mid-May.

Microsoft’s April 14 release, MAI-Image-2-Efficient, was the quieter of the three. It’s a distilled, leaner sibling of the existing MAI-Image-2: same family, 22% faster generation and 4× the GPU throughput on H100s. The interesting part isn’t the model; it’s the pricing, which lands at roughly $19.50 per million image-output tokens. Image generation by the token, billed alongside text, is now a normal API shape. It used to be per-image.
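To make per-token billing concrete, here is a minimal sketch of the arithmetic. The rate is the rough figure quoted above; the token counts per image are hypothetical placeholders, since the real count depends on resolution and quality settings that this article does not specify.

```python
# Rough rate quoted in the text, in USD per million image-output tokens.
PRICE_PER_MILLION_OUTPUT_TOKENS = 19.50


def cost_per_image(output_tokens: int,
                   price_per_million: float = PRICE_PER_MILLION_OUTPUT_TOKENS) -> float:
    """Return the USD cost of one image billed by its output-token count."""
    return output_tokens * price_per_million / 1_000_000


if __name__ == "__main__":
    # Hypothetical token counts, chosen only to show how the bill scales.
    for tokens in (1_000, 4_000, 16_000):
        print(f"{tokens:>6} tokens -> ${cost_per_image(tokens):.4f} per image")
```

The practical difference from per-image pricing is that cost now scales with output size, so a budget estimate needs the token count per image as an input.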

And then OpenAI on the 21st with the reasoning model. The order matters: by the time gpt-image-2 hit the arena, the rest of the field had already been compressed inside about 200 Elo. The reasoning loop opened a fresh gap.

What got better: everyone’s old model also leveled up

Below the headline launches, four shipping products got meaningfully better in the same window.

Adobe Firefly Image 5 went generally available on March 19, with a native 4MP output resolution (around 2240×1792), better-behaved hands and anatomy, and a private beta of Layered Image Editing: the model decomposes a generated image into discrete, editable layers instead of returning a flat raster. Custom-model training, where the model learns a personal or brand visual style from a small set of references, also expanded out of private beta.

Midjourney V8 Alpha opened on March 17. The headline numbers are roughly 5× faster generation, native 2K via --hd, a new --q 4 quality mode aimed at complex scenes, and better text rendering: V7 was the textless one; V8 catches up to the field. It’s alpha; V7 remains the default until V8 stabilises.

Black Forest Labs rolled out FLUX.1.2 Pro Ultra in February as the new high end of the FLUX family: 4-megapixel photoreal at, in their numbers, ten times the speed of the previous Ultra tier. The bigger architectural shift, FLUX.2 [klein], landed on January 15 (just before the window) in 4B and 9B variants, with FP8 and NVFP4 quantised builds that cut VRAM by up to 55%. That’s the most consequential open-weight news in months.
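For a sense of why quantised builds matter on local hardware, a back-of-the-envelope sketch of weight memory by number format follows. It counts weights only and assumes 2 bytes per parameter for BF16, 1 for FP8 and 0.5 for a 4-bit format, ignoring the scale-factor overhead that block formats such as NVFP4 carry. Text encoders, the VAE and activations are not included, so these figures will not line up exactly with an end-to-end VRAM number like the one quoted above.

```python
# Approximate bytes per parameter. The 4-bit entry ignores per-block
# scale factors, so real NVFP4 checkpoints are slightly larger.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "nvfp4": 0.5}


def weight_gib(params_billions: float, fmt: str) -> float:
    """Estimate the memory for model weights alone, in GiB."""
    return params_billions * 1e9 * BYTES_PER_PARAM[fmt] / 2**30


if __name__ == "__main__":
    for size in (4, 9):  # the two variant sizes mentioned in the text
        for fmt in BYTES_PER_PARAM:
            print(f"{size}B @ {fmt:<5}: {weight_gib(size, fmt):5.1f} GiB of weights")
```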

xAI’s Grok Imagine Quality Mode went live for enterprise developers on May 4, unifying image and video under one Imagine endpoint and pricing image generation at sub-cent per call. Around the same time, Gemini 2.5 Flash Image moved from preview to general availability, and Google’s November-2025 release Nano Banana Pro kept extending: a 14-image reference context, up to 4K, and the strongest in-image multilingual typography in the field.

A salon-style wall of assorted framed pictures in different sizes, hung edge to edge. — The frontier looks more like a salon wall than a single masterpiece. Photo by Andrew Neel on Unsplash.

Under the hood: the technique stack converged

Catalog aside, the more interesting story is how much the underlying techniques converged across the labs in three months.

Reasoning before drawing is the headline. OpenAI shipped it first in production, but the underlying recipe (plan, retrieve, sample, verify) has been kicking around the literature since 2024 and is exactly the same loop that turned base LLMs into reasoning models the year before. Expect at least two labs to match it by August.
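The plan, retrieve, sample, verify recipe can be sketched as a simple control loop. This is an illustration of the general pattern only, not OpenAI's implementation: every function name, the candidate count, the score threshold and the stub components below are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Candidate:
    image: str    # stand-in for real image data
    score: float  # verifier's prompt-adherence score in [0, 1]


def reasoning_generate(
    prompt: str,
    plan: Callable[[str], List[str]],
    retrieve: Callable[[List[str]], List[str]],
    sample: Callable[..., str],
    verify: Callable[[str, str], float],
    n_candidates: int = 4,
    threshold: float = 0.8,
    max_rounds: int = 3,
) -> Candidate:
    """Plan once, fetch references, then sample and verify until good enough."""
    steps = plan(prompt)          # 1. break the prompt into requirements
    references = retrieve(steps)  # 2. pull supporting references
    best: Optional[Candidate] = None
    for round_idx in range(max_rounds):
        for i in range(n_candidates):
            seed = round_idx * n_candidates + i
            image = sample(prompt, steps, references, seed=seed)  # 3. draw
            candidate = Candidate(image, verify(prompt, image))   # 4. check
            if best is None or candidate.score > best.score:
                best = candidate
        if best is not None and best.score >= threshold:
            break  # a candidate passed verification; stop spending compute
    assert best is not None  # at least one candidate is always sampled
    return best


# Deterministic stubs so the sketch runs without any model behind it.
def stub_plan(prompt: str) -> List[str]:
    return [part.strip() for part in prompt.split(",")]


def stub_retrieve(steps: List[str]) -> List[str]:
    return [f"ref:{step}" for step in steps]


def stub_sample(prompt: str, steps: List[str], refs: List[str], seed: int) -> str:
    return f"img-{seed}"


def stub_verify(prompt: str, image: str) -> float:
    seed = int(image.split("-")[1])
    return min(10, 5 + seed) / 10  # later seeds score higher, capped at 1.0


if __name__ == "__main__":
    result = reasoning_generate(
        "red bicycle, shop sign reading OPEN",
        stub_plan, stub_retrieve, stub_sample, stub_verify,
    )
    print(result)
```

The design point is that the verifier, not the sampler, decides when to stop, which is why this mode is slower and more expensive than a single forward pass.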

Unified generation and editing is the second. Qwen-2.0 and Nano Banana Pro both ship as one model that handles text-to-image, image-to-image, and inpainting through the same forward pass: no second “edit” model, no separate ControlNet chain. The old division of labour between “generation backbone” and “editing adapter” is on the way out for new releases.

Native resolution at 2K and 4K is the third. Firefly 5, Qwen-2.0, Midjourney V8, FLUX.1.2 Pro Ultra and Nano Banana Pro all ship with native multi-megapixel output (no upscaler in the chain) with detail preserved at the texture level (skin pores, fabric weave, fine type). Upscalers have moved from default to fallback.

Mixture-of-Experts at image scale is the fourth, and the most architectural. Tencent’s HunyuanImage 3.0-Instruct, the 80B-parameter open-source MoE released January 26, sits just outside our window but defines the scale: image generation is following the same MoE-and-routing trajectory as text. The dense-transformer assumption is loosening across the field.

A black camera lens lying on a brown wooden table, side view, shallow depth of field. — Under the hood, the technique stack converged faster than the catalog. Photo by Jason Leung on Unsplash.

Trend lines: four patterns visible across Q2

Across the catalog, four things rhyme. None of them are obvious from any single launch.

1. The frontier compressed. Through most of 2025 there was a clear top three with the rest of the field a clear step behind. On the May 19 arena snapshot, models four through ten sit inside about 100 Elo of each other, close enough that for any given prompt, which one wins is more about taste and prompt-fit than capability. The only model with a real lead is gpt-image-2, and only because it reasons.

2. Per-image cost crashed. A year ago a Midjourney V6 run cost an effective $0.05–$0.10 of compute. Grok Imagine Image now sits at roughly $0.02 per call, Microsoft is billing by the token, and the FLUX quantised builds will run on a local M4 Mac in under a second per image. The economics of image generation are now closer to thumbnail generation than to a creative render.

3. Typography stopped being a tell. The classic “is this AI” giveaway (garbled letterforms in signage, posters, infographics) mostly went away. Qwen-2.0 accepts thousand-token prompts for text-heavy outputs. Nano Banana Pro renders legible multilingual type. Firefly 5 and gpt-image-2 both produce poster-grade letterforms. If you ship marketing visuals, the 2025-era prompt of “please don’t add text” is now an anti-pattern.

4. The picker problem got worse. Choosing among five top-tier contenders that are all good enough is harder than choosing between one obvious leader and a long tail. This is the case made in “You don’t need every AI model”, and the practical answer hasn’t changed: pick one editing flagship and one generation flagship, hard-code defaults at the application layer, and don’t expose a 12-model dropdown to your users.
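One way to hard-code defaults at the application layer is a small lookup with a guarded override, sketched below. The two model identifier strings are illustrative placeholders, not recommendations or confirmed API model names, and `resolve_model` is a hypothetical helper.

```python
from enum import Enum
from typing import Optional


class Task(Enum):
    GENERATE = "generate"
    EDIT = "edit"


# One default per task. These identifiers are placeholders for whichever
# generation and editing flagships a team actually picks.
DEFAULT_MODEL = {
    Task.GENERATE: "gpt-image-2",
    Task.EDIT: "qwen-image-2.0",
}

# Overrides are limited to the vetted defaults, so no 12-model dropdown.
ALLOWED_MODELS = frozenset(DEFAULT_MODEL.values())


def resolve_model(task: Task, override: Optional[str] = None) -> str:
    """Return the model for a task, honouring only vetted overrides."""
    if override is None:
        return DEFAULT_MODEL[task]
    if override not in ALLOWED_MODELS:
        raise ValueError(f"model {override!r} is not on the vetted list")
    return override


if __name__ == "__main__":
    print(resolve_model(Task.GENERATE))
    print(resolve_model(Task.EDIT))
```

Keeping the mapping in one place means a leaderboard change becomes a one-line config edit, not a user-facing decision.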

Quiet quarter for the categories that weren’t

Three places were unusually silent for a quarter this active. Stability AI did not ship a new headline SD model in the window. Its public 2026 work was concentrated in 4D video, the AMD-optimised ONNX builds, and a Virtual Camera research preview: useful, not category-defining. On-device flagship image generation remained mostly absent; FLUX.2 [klein] runs locally and Apple’s Image Playground continues to improve, but no major lab shipped a 4MP, sub-second local model in the window. Open-weight at the frontier still trails closed by a noticeable margin: HunyuanImage 3.0-Instruct is the only true open-source contender on the arena top ten, and it sits at #5 for editing, not text-to-image.

What to watch May to August

Three things are queued up for next quarter and worth marking on a calendar.

Open-weight reasoning image models. Now that OpenAI has shipped one, the open-weight labs almost certainly have a recipe in flight. The technique is conceptually portable; the question is whether a Hunyuan or a Qwen ships it before Imagen or Midjourney does.

A FLUX text-to-video model. Black Forest Labs confirmed in February that video is under development. Given how quickly FLUX has eaten the open-weight image lane, a FLUX video model would put real pressure on Veo, Sora, Kling, and the Chinese video front-runners. No date.

Native edit-anything in Photoshop and Premiere. Adobe’s Layered Image Editing beta is the start of a much bigger retrofit; expect it to land across Creative Cloud surfaces in the next few months, and to put pressure on every standalone editing tool whose differentiator was layered output.

The leaderboard, as of May 19

For calibration. The top five are tight; everyone outside the top five is, for most jobs, within rounding error of each other.

Artificial Analysis Text-to-Image Arena · top 5, May 19 2026

| Model | Lab | Elo |
| --- | --- | --- |
| GPT Image 2 (high) | OpenAI | 1338 |
| GPT Image 1.5 (high) | OpenAI | 1267 |
| Nano Banana 2 (Gemini 3.1 Flash Image) | Google | 1263 |
| Nano Banana Pro (Gemini 3 Pro Image) | Google | 1222 |
| MAI-Image-2 | Microsoft | 1196 |
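To read these gaps as head-to-head odds, the standard logistic Elo formula converts a rating difference into an expected preference rate. The sketch below assumes the arena follows the conventional 400-point scale, which this article does not confirm; under that assumption the 71-point gap between the top two works out to roughly a 60% expected win rate for the leader.

```python
def expected_win_rate(elo_a: float, elo_b: float, scale: float = 400.0) -> float:
    """Expected share of pairwise votes won by A over B under logistic Elo."""
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / scale))


# Ratings copied from the May 19 snapshot above.
TOP_FIVE = [
    ("GPT Image 2 (high)", 1338),
    ("GPT Image 1.5 (high)", 1267),
    ("Nano Banana 2", 1263),
    ("Nano Banana Pro", 1222),
    ("MAI-Image-2", 1196),
]

if __name__ == "__main__":
    leader_name, leader_elo = TOP_FIVE[0]
    for name, elo in TOP_FIVE[1:]:
        rate = expected_win_rate(leader_elo, elo)
        print(f"{leader_name} vs {name}: {rate:.1%} expected win rate")
```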

Run the same prompt through any of these five and the result would look roughly fine. The reason to choose between them is no longer raw quality: it’s shape of editor, typography behaviour, latency, licensing, and how much of the work the model will do for you before you have to step in. That makes this a curation question more than a capability one, which is the same shift the text-model world went through two years ago. The companion piece on which modality to reach for is the next read if you’re putting image generation into a real product.

Next installment in this series: The last three months in AI: Text models. Aug 2026.