May 22, 2026 · 13 min read

The last three months in AI: Audio models

Realtime voice latency fell under 200ms in four weeks. Music spent the quarter in court. Five sub-modalities, five clocks.

By Atul

Spring 2026 · Audio models · Feb 20 to May 20

Voice sped up. Music slowed down in court. Five sub-modalities, five clocks.

Realtime voice: first-audio latency under 200ms became the default
Text-to-speech: open weights crossed the CPU line on a MacBook Air
Speech recognition: streaming STT at $0.003 a minute, sub-150ms latency
Music: labels stopped suing and started shipping, except Sony
Foley & dub: YouTube auto-dub went global to every creator

Two stories from the same quarter set the frame. On May 7, 2026, OpenAI took its Realtime API out of beta and shipped three new audio models at once: gpt-realtime-2 with GPT-5-class reasoning over speech, gpt-realtime-translate at $0.034 a minute, and gpt-realtime-whisper at $0.017 a minute. The same week, the floor on a streaming voice agent (first audio out, full duplex, GPT-grade understanding) landed below 200ms. A year ago it was closer to 800ms.

Six days before that, on May 1, Udio admitted in a court filing that it had used yt-dlp to scrape audio from YouTube for training. That single filing reframes the AI-music lawsuits from a fair-use argument into a DMCA §1201 circumvention argument, which leaves a much narrower defense. Suno is fighting the same battle in Massachusetts, with a summary-judgment hearing scheduled for July. That is the quarter in two sentences: realtime voice consolidated and got cheaper, and music spent it in court. If you only have an hour for an audio-AI update, spend it with the OpenAI Realtime API GA, the Sony briefings, and the Stable Audio 3.0 launch post. Those three explain almost everything else.

[Image: A person at the faders of an analogue mixing console.] Five sub-modalities, one console. Photo by Drew Patrick Miller on Unsplash.

Two clocks: voice on a sprint, music on hold

Audio is not one market. Music generation, text-to-speech, speech-to-text, real-time conversational voice, sound effects, and dubbing barely share customers, much less roadmaps. Treating “AI audio” as one bucket is the mistake every flat industry tracker makes. The hero on this page is a five-lane mixer because that is what the catalog actually looks like in May 2026: five separate clocks, four of them running fast.

Voice sped up because the labs collectively agreed on a recipe: a streaming audio tokenizer in front, a transformer in the middle, a streaming decoder out the back, and the whole thing exposed through one bidirectional WebSocket API. Cartesia, ElevenLabs, AssemblyAI, OpenAI, and Kyutai all shipped variants of that pattern between January and May. Music slowed down because the recipe is the same as it was twelve months ago and the legal exposure is now an order of magnitude higher. Suno’s v6, promised by Warner’s November 2025 press release, still has not shipped. What did ship from Suno in the window was v5.5 (March 26), a Studio update with multi-stem export, and a Series D rumored at over $5 billion. The model itself did not move.

What shipped: the five biggest releases

Five launches define the window. Each is in a different sub-modality. Read them as a tour of the lanes on the mixer, not as a ranked list.

OpenAI Realtime API GA, May 7. The headline is gpt-realtime-2: a single model that takes audio in, reasons over it with GPT-5-class effort levels, and returns streamed audio, with native function calling and image-input support. It has a 128K context window and posts a 15-point lift on the Audio MultiChallenge benchmark at high reasoning effort. Pricing is $32 per million audio-input tokens ($0.40 per million cached) and $64 per million audio-output tokens. Alongside it, gpt-realtime-translate runs live speech translation across 70+ input languages into 13 output languages at $0.034/min, and gpt-realtime-whisper is streaming STT at $0.017/min. One API endpoint, three jobs, all production-grade. The contemporaneous coverage called it the moment voice stopped being a feature toggle. It is.
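
The quoted rates turn into a session bill with simple arithmetic. A minimal Python sketch follows, using only the prices stated in this article. The function names and the token counts in the usage note are hypothetical, and the number of tokens a minute of audio consumes is not stated here, so token counts are left as inputs rather than derived from duration.

```python
# Prices as quoted in this article (USD). Everything else is illustrative.
AUDIO_IN_PER_M = 32.00         # gpt-realtime-2, per million audio-input tokens
AUDIO_IN_CACHED_PER_M = 0.40   # per million cached audio-input tokens
AUDIO_OUT_PER_M = 64.00        # per million audio-output tokens
TRANSLATE_PER_MIN = 0.034      # gpt-realtime-translate, per minute
WHISPER_PER_MIN = 0.017        # gpt-realtime-whisper, per minute


def realtime_cost(input_tokens: int, cached_tokens: int, output_tokens: int) -> float:
    """Token-metered cost in USD for one gpt-realtime-2 session.

    input_tokens counts uncached audio-input tokens only; cached tokens
    are passed separately because they are billed at a different rate.
    """
    micro_usd = (
        input_tokens * AUDIO_IN_PER_M
        + cached_tokens * AUDIO_IN_CACHED_PER_M
        + output_tokens * AUDIO_OUT_PER_M
    )
    return round(micro_usd / 1_000_000, 4)


def per_minute_cost(minutes: float, rate_per_min: float) -> float:
    """Duration-metered cost in USD for the translate and whisper models."""
    return round(minutes * rate_per_min, 4)
```

With hypothetical counts of 100,000 uncached input tokens, 50,000 cached input tokens, and 80,000 output tokens, `realtime_cost(100_000, 50_000, 80_000)` comes to $8.34, almost all of it from output audio. An hour of live translation is `per_minute_cost(60, TRANSLATE_PER_MIN)`, or $2.04, and an hour of streaming transcription is $1.02.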

Cartesia Sonic-3 on AWS SageMaker JumpStart, February. Sonic-3 was announced in late 2025 alongside Cartesia’s $100M Series; the SageMaker drop made it deployable in a VPC with one click. The headline number is roughly 40ms time-to-first-audio, ~190ms end-to-end, across 42 languages, on a state-space-model backbone rather than a transformer. Enterprise voice agents that need to stay inside their own deployment boundary have a credible option with sub-100ms time-to-first-audio for the first time.

ElevenLabs Eleven v3 GA, February 2. ElevenLabs spent Q1 hardening its TTS flagship: 70+ languages, audio-tag emotion control markup (whisper, laugh, frustrated), a claimed 68% drop in error rate on complex text, and explicit positioning as the batch model. Flash v2.5 remains the realtime variant at roughly 75ms. Two days later the company closed a $500M Series D at an $11B valuation. ElevenLabs also shipped Scribe v2 Realtime (Jan 6), Sound Effects v2, and a fully licensed music platform on April 30 with Kobalt and Merlin deals. The cross-section of customers ElevenLabs now competes for (podcasters, voice-agent builders, dubbers, music creators) explains the valuation multiple.

Stable Audio 3.0, May 20. Stability shipped a four-model family on the last day of the window: a 459M Small SFX model, a 459M Small music model, a 1.4B Medium, and a 2.7B Large. Medium and Large generate structured compositions up to 6 minutes 20 seconds. Small, Small SFX, and Medium are open weights; Large is API-only. The corpus is licensed through Warner/UMG agreements, which makes this the first long-form open-weight music model with a clean commercial-use story. Studios above the $1M revenue threshold need an enterprise license.

Google Lyria 3 and Lyria 3 Pro. Lyria 3 landed in the Gemini app on February 18 with 30-second clips; Lyria 3 Pro followed on March 25 with 3-minute tracks, structural prompts (intro, verse, chorus, bridge), 44.1 kHz stereo, and SynthID watermarking. It is available in Vertex AI, AI Studio, the Gemini API, Google Vids, and ProducerAI, the Riffusion rebrand Google quietly acquired in late February. Google is now the only frontier lab with a music model, a video model, a text model, and a real-time voice model all on the same Gemini-app surface. None of the others can say the same.

[Image: Close-up of a vintage broadcast microphone with a backlit silhouette behind it.] The TTS uncanny line moved enough that newsrooms can hand a script to a voice. Photo by Elijah Crouch on Unsplash.

What got better: the workflow stack caught up to the models

Below the headline launches, the supporting cast levelled up. The pattern: tools that used to feel like demos started shipping like software.

AssemblyAI Universal-3 Pro launched February 3 as a promptable STT model: the first to instruction-tune transcription behavior. Native code-switching across six languages, with Whisper fallback to 99 more. The streaming variant arrived March 3 with P50 ~150ms after voice activity detection. List price is $0.21 an hour on the base tier.

OpenAI gpt-4o-transcribe and gpt-4o-mini-transcribe. Released April 27 and May 1 respectively, at $0.006/min and $0.003/min. Whisper Large-v3 is still the public OpenAI baseline; these are the models OpenAI actually recommends now.
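
The STT prices in this article are quoted in mixed units, some per hour and some per minute, which makes them hard to compare at a glance. A small Python sketch that normalizes the quoted list prices to dollars per hour follows. The helper names are hypothetical, and the comparison ignores tiering, volume discounts, and the batch-versus-streaming distinction, so treat the ordering as rough.

```python
# List prices as quoted in this article, in the unit each was quoted in.
PRICES = {
    "AssemblyAI Universal-3 Pro (base tier)": (0.21, "hour"),
    "gpt-4o-transcribe": (0.006, "minute"),
    "gpt-4o-mini-transcribe": (0.003, "minute"),
    "gpt-realtime-whisper": (0.017, "minute"),
}


def usd_per_hour(price: float, unit: str) -> float:
    """Convert a quoted price to USD per hour of audio."""
    if unit == "hour":
        return round(price, 4)
    if unit == "minute":
        return round(price * 60, 4)
    raise ValueError(f"unknown unit: {unit}")


def ranked_by_hourly_price(prices: dict) -> list:
    """Return (name, usd_per_hour) pairs, cheapest first."""
    hourly = {name: usd_per_hour(p, unit) for name, (p, unit) in prices.items()}
    return sorted(hourly.items(), key=lambda item: item[1])


if __name__ == "__main__":
    for name, price in ranked_by_hourly_price(PRICES):
        print(f"{name}: ${price:.2f}/hr")
```

On these figures gpt-4o-mini-transcribe works out to $0.18 an hour, just under AssemblyAI's $0.21 base tier, with gpt-4o-transcribe at $0.36 and gpt-realtime-whisper at $1.02.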

YouTube Auto-Dubbing went global to every creator on February 4: 27 languages, “Expressive Speech” pitch matching in 8 of them, and a lip-sync pilot underway. By mid-spring, more than 6 million daily viewers were watching more than 10 minutes of auto-dubbed content. This is the largest deployment of generative audio anywhere on the public internet, and it crossed the consumer-default line without a press tour.

iZotope RX 12 launched April 29 with Scene Rebalance, which separates a film or television scene into dialogue, music, and effects stems for individual remixing, previously a manual job for a re-recording mixer. Music Rebalance became a real-time plug-in. Adobe’s March Podcast update added downloadable stems to Enhance Speech and integrated multitrack import for Zoom and Riverside sessions. The audio-post stack is now AI-decomposable end-to-end.

Suno Studio 1.2 (February) added Warp Markers, quantize, Alternates take stacking, non-4/4 time signatures, and scaled stem separation to 12 lanes with MIDI export. Suno v5.5 shipped March 26 with personal voice capture, fine-tuning on your own tracks, and a taste model. The Suno catalog moved sideways while the studio around it moved forward: a deliberate strategy while the v6 successor is held back pending the Sony case.

Under the hood: four shifts that landed in one quarter

Realtime audio latency · the quarter’s floor

Model · Vendor · Metric · Date · Price · Latency
Sonic-3 (AWS) · Cartesia · Time-to-first-audio · Feb 2026 · $25 / M chars · 40 ms
Scribe v2 Realtime (STT) · ElevenLabs · P50 latency · Jan 6, 2026 · $0.22 / hr · 80 ms
Universal-3 Pro Streaming · AssemblyAI · P50 latency · Mar 3, 2026 · $0.45 / hr · 150 ms
gpt-realtime-2 · OpenAI · First-audio latency (typical) · May 7, 2026 · $64 / M audio-out · 190 ms
Claude Code voice (wrapper) · Anthropic · End-to-end (push-to-talk) · Mar 3, 2026 · Bundled with Code · 380 ms

Rows are ordered fastest to slowest. Numbers are vendor-reported on their own benchmark; treat as a rough ordering, not a head-to-head. Twelve months ago the same column for production voice agents was clustered around 500–800ms.

Audio tokenization is the new bottleneck. The practical difference between an 800ms voice agent and a 200ms one is not the LLM; it is the codec. Cartesia’s state-space model, Kyutai’s Mimi codec, and OpenAI’s in-house token path each turn audio into a stream of discrete tokens that the model can consume and emit as it generates, instead of paying an encode-decode round trip per chunk. Token-streaming audio is the equivalent of streaming text tokens for chat models, and it lands at roughly the same time in product terms.
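
The difference between streaming tokens and a per-chunk round trip can be shown with a toy pipeline. The Python sketch below is an illustration only, not any vendor's implementation: the tokenizer, model, and decoder are one-line stand-ins, the frame payloads are placeholders, and every name is hypothetical. It measures how many input frames must be buffered before the first output frame can be emitted.

```python
from typing import Iterable, Iterator, List


class CountingSource:
    """Yields placeholder audio frames and counts how many were pulled."""

    def __init__(self, n_frames: int) -> None:
        self.n_frames = n_frames
        self.pulled = 0

    def __iter__(self) -> Iterator[bytes]:
        for i in range(self.n_frames):
            self.pulled += 1
            yield bytes([i % 256]) * 4  # stand-in for a short PCM frame


def tokenize(frames: Iterable[bytes]) -> Iterator[int]:
    # Stand-in for a streaming codec: one discrete token per frame.
    for frame in frames:
        yield frame[0]


def respond(tokens: Iterable[int]) -> Iterator[int]:
    # Stand-in for the model: emits one output token per input token.
    for tok in tokens:
        yield (tok + 1) % 256


def decode(tokens: Iterable[int]) -> Iterator[bytes]:
    # Stand-in for a streaming decoder: one audio frame per token.
    for tok in tokens:
        yield bytes([tok]) * 4


def frames_needed_streaming(n_frames: int) -> int:
    """Input frames consumed before the first output frame, streaming."""
    source = CountingSource(n_frames)
    pipeline = decode(respond(tokenize(source)))
    next(pipeline)  # first audio frame out
    return source.pulled


def frames_needed_chunked(n_frames: int) -> int:
    """Input frames consumed before the first output frame, chunked."""
    source = CountingSource(n_frames)
    tokens: List[int] = list(tokenize(source))  # encode the whole chunk first
    output = list(decode(respond(tokens)))      # then decode the whole chunk
    _first = output[0]
    return source.pulled
```

For a 50-frame chunk, `frames_needed_streaming(50)` returns 1 and `frames_needed_chunked(50)` returns 50. If a frame were 20ms of audio (an assumed figure), that is 20ms of buffering against a full second before any compute time is counted, which is the kind of gap the codec design removes.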

One API, three audio jobs. The OpenAI Realtime API GA consolidates conversational voice, translation, and transcription behind a single bidirectional endpoint. Cartesia’s Voice Platform, ElevenLabs Agents, and Hume EVI are all moving the same way. The era of plumbing together a separate STT, LLM, and TTS for every voice product is over for the default case. Specialized stacks still win on price and latency in narrow contexts, but the default reach moved up one layer.

Stems are first-class output. Suno Studio splits a track into twelve stems, Stable Audio Small can emit foley separately, LALAL.AI shipped an on-device DAW plug-in in February, and iZotope’s RX 12 made scene-stem decomposition a one-click operation. The unit of generated audio is no longer the mixed track; it is the multitrack. Anyone shipping AI music or post-production tooling that does not treat stems as a first-class output is now a generation behind.

On-device crossed a real line. Kyutai Pocket TTS (100M parameters, 6× realtime on a MacBook Air CPU, multilingual in May), Moonshine v2’s 26–245MB streaming STT models, Stable Audio 3.0’s 459M music model, and Speechmatics’ on-device STT for Adobe Premiere all landed in the window. The companion piece on personal compute moving back to the desktop now reads with extra weight on the audio side, not just LLMs.

[Image: Closed-back studio headphones resting over a condenser microphone on a desk.] Stems, plug-ins, on-device transcription. The amateur audio pipeline now looks a lot like the pro one. Photo by Will Francis on Unsplash.

Trend lines: four patterns across the quarter

Read the catalog above and four threads recur. None are obvious from any single launch.

1. Realtime voice latency collapsed. The realtime-latency chart in the previous section is the headline: ~40ms (Cartesia), 80ms (ElevenLabs Scribe), 150ms (AssemblyAI), 190ms (OpenAI), all production-grade, all shipped between January 6 and May 7. Anything above 300ms now reads as legacy. The implication for product teams is large: latency is no longer an excuse for leaving voice agents out of existing software.

2. Labels stopped suing, started shipping. Warner settled with Suno in November 2025. UMG settled with Udio in October 2025 and is co-building the licensed platform now in staged rollout. Stable Audio 3.0 trains on a licensed corpus. ElevenLabs shipped ElevenMusic with Kobalt and Merlin licensing. Sony is the holdout. The July 2026 summary-judgment hearing in Sony v Suno will set fair-use precedent for the rest of the field. The same regulatory force is reshaping every other modality: the companion piece on regulations eating cloud AI applies to music with extra force.

3. ElevenLabs is the audio incumbent now. Series D at $11B, a flagship TTS GA, a realtime STT, a music platform, a sound effects model, and an agents product, all in one quarter, all on one credentialed surface. ElevenLabs now competes simultaneously with OpenAI, Suno, AssemblyAI, HeyGen, and Cartesia. No other audio company touches all five lanes.

4. Watermarking moved from research to default. By Google I/O 2026, SynthID had tagged 60,000 years of cumulative AI audio and was rolling into Google Search and Chrome, with OpenAI, ElevenLabs, and Kakao signed on as adopters. Spotify shipped an AI-credit disclosure beta on April 17 using DDEX fields. Voluntary today; the EU AI Act’s August deadline will make at least the disclosure mandatory.

Quiet quarter for

Four expected shoes did not drop. Suno v6 was promised by November 2025 press releases and has not shipped; v5.5 is the substitute. OpenAI released no Whisper v4: the new bets are gpt-4o-transcribe and the streaming whisper inside the Realtime API. Anthropic shipped a voice mode for Claude Code in early March but no standalone Claude voice model and no speech-to-speech API; mobile Claude voice remains English-only. Deepgram stayed on Nova-3 and shipped no Nova-4. The pattern: the players who were leading the audio conversation in 2024 and 2025 are the ones with the quietest Q2 2026 releases.

What to watch May to August

Four calendar items to keep an eye on.

Sony v Suno summary judgment, July. Suno’s transformative-fair-use motion, citing the 2024 Bartz v SoundAI precedent, gets argued in a Massachusetts district court. A win for Suno legitimises model-training on copyrighted recordings for the whole field. A loss collapses the open-weight music tail into the licensed-corpus tier. Either ruling reshapes the next quarter’s catalog.

The Udio – UMG walled garden, formal launch. The licensed platform is in staged rollout; the consumer launch is still pending, with user backlash already brewing over the no-download restriction that defines the “walled garden” design.

Suno v6 or a successor. The model that was supposed to ship under the Warner deal is still missing six months after the November 2025 announcement. When it arrives, watch the licensing language as much as the audio.

EU AI Act audio provisions, August 2. The same deadline that drives C2PA adoption on the video side requires synthetic-audio disclosure for any AI-generated content served to EU users. Voice-cloning consent obligations come along for the ride.

The leaderboard, by job

For the working operators. Picks reflect the May 20 snapshot: commercial-use posture, latency, language coverage, and pricing weighted, not just blind-test quality.

Current best for the job · May 20, 2026

Job · Pick · Why
Sub-200ms voice agent · OpenAI gpt-realtime-2 · GPT-5-class reasoning, 128K context, $64/M audio-out
Podcast / audiobook TTS · ElevenLabs Eleven v3 · 70+ languages, audio tags for emotion, batch-only
TTS on a laptop CPU · Kyutai Pocket TTS · 100M params, ~6× realtime on MacBook Air M4
Streaming STT (cheap) · gpt-4o-mini-transcribe · $0.003 / min, April–May 2026 release
Streaming STT (accurate) · AssemblyAI Universal-3 Pro · P50 ~150ms, real-time diarization, 99+ languages
Long-form music (commercial) · Stable Audio 3.0 Large · Up to 6:20, licensed corpus, API-only
Music with stems & DAW · Suno Studio 1.2 · 12-stem split, MIDI export, odd time signatures
Video-conditioned foley · Tencent HunyuanVideo-Foley · Open weights, SOTA temporal alignment
Multilingual dubbing · ElevenLabs Dubbing · 32 languages, speaker-preserved voices
One-click YouTube dub · YouTube Auto-Dubbing · 27 languages, free for every creator since Feb 4

Any one of the picks above would do a competent job today. The decisions that matter now are commercial-use posture (especially for music), language coverage, and whether the model can run inside your deployment boundary: the case made in “You don’t need every AI model”. Companion pieces this quarter: the text installment, the image installment, and the video installment; together they cover the same Feb 20 to May 20 window for the four modalities most teams now ship in every week.

Next installment in this series: The last three months in AI, August 2026.