The last three months in AI: Audio models
Realtime voice latency fell under 200ms in four weeks. Music spent the quarter in court. Five sub-modalities, five clocks.
Two stories from the same quarter set the frame. On May 7, 2026, OpenAI took its Realtime API out of beta and shipped three new audio models in one go: gpt-realtime-2 with GPT-5-class reasoning over speech, gpt-realtime-translate at $0.034 a minute, and gpt-realtime-whisper at $0.017 a minute. The same week, the floor on a streaming voice agent (first audio out, full duplex, GPT-grade understanding) landed below 200ms. A year ago it was closer to 800ms.
Six days before that, on May 1, Udio admitted in a court filing that it had used yt-dlp to scrape audio from YouTube for training. That single filing reframes the AI-music lawsuits from a fair-use argument into a DMCA §1201 circumvention argument — a much narrower defense. Suno is fighting the same battle in Massachusetts, with a summary-judgment hearing scheduled for July. That is the quarter in two sentences: realtime voice consolidated and got cheaper, and the music side spent it in court. If you only have an hour for an audio-AI update, spend it with the OpenAI Realtime API GA, the Sony briefings, and the Stable Audio 3.0 launch post. Those three explain almost everything else.

Two clocks: voice on a sprint, music on hold
Audio is not one market. Music generation, text-to-speech, speech-to-text, real-time conversational voice, sound effects, and dubbing barely share customers, much less roadmaps. Treating “AI audio” as one bucket is the mistake every flat industry tracker makes. The hero on this page is a five-lane mixer because that is what the catalog actually looks like in May 2026 — five separate clocks, four of them running fast.
Voice sped up because the labs collectively agreed on a recipe: a streaming audio tokenizer in front, a transformer in the middle, a streaming decoder out the back, and the whole thing exposed through one bidirectional WebSocket API. Cartesia, ElevenLabs, AssemblyAI, OpenAI, and Kyutai all shipped variants of that pattern between January and May. Music slowed down because the recipe is the same as it was twelve months ago and the legal exposure is now an order of magnitude higher. Suno’s v6, promised by Warner’s November 2025 press release, still has not shipped. What did ship from Suno in the window was v5.5 (March 26), a Studio update with multi-stem export, and a Series D rumored at over $5 billion. The model itself did not move.
What shipped: the five biggest releases
Five launches define the window. Each is in a different sub-modality. Read them as a tour of the lanes on the mixer, not as a ranked list.
OpenAI Realtime API GA, May 7. The headline is gpt-realtime-2: a single model that takes audio in, reasons over it with GPT-5-class effort levels, and returns streamed audio with native function calling and image-input support, against a 128K context window and a +15-point lift on the Audio MultiChallenge benchmark at high reasoning effort. Pricing is $32 per million audio-input tokens ($0.40/M cached) and $64 per million audio-output tokens. Alongside it, gpt-realtime-translate runs live speech translation across 70+ input languages into 13 output languages at $0.034/min, and gpt-realtime-whisper is streaming STT at $0.017/min. One API endpoint, three jobs, all production-grade. The contemporaneous coverage called it the moment voice stopped being a feature toggle. It is.
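To make the two pricing styles easier to compare, here is a minimal sketch that turns the figures quoted above into a session cost. The function names are hypothetical, the token counts in the usage lines are made-up inputs rather than measurements, and the sketch assumes cached tokens are billed as a subset of input tokens. How many audio tokens a minute of speech consumes is not stated here, so it is left as a parameter.

```python
# Prices as quoted in this article, in USD. Everything else is illustrative.
AUDIO_IN_PER_M = 32.00         # gpt-realtime-2, per million audio-input tokens
AUDIO_IN_CACHED_PER_M = 0.40   # gpt-realtime-2, per million cached audio-input tokens
AUDIO_OUT_PER_M = 64.00        # gpt-realtime-2, per million audio-output tokens
TRANSLATE_PER_MIN = 0.034      # gpt-realtime-translate, per minute
WHISPER_PER_MIN = 0.017        # gpt-realtime-whisper, per minute


def realtime_session_cost(input_tokens: int, output_tokens: int,
                          cached_tokens: int = 0) -> float:
    """Token-priced cost of one session (hypothetical helper).

    Assumption: cached_tokens is the part of input_tokens billed at the
    cached rate; the remainder is billed at the full input rate.
    """
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    fresh_tokens = input_tokens - cached_tokens
    total = (fresh_tokens * AUDIO_IN_PER_M
             + cached_tokens * AUDIO_IN_CACHED_PER_M
             + output_tokens * AUDIO_OUT_PER_M)
    return total / 1_000_000


def per_minute_cost(minutes: float, rate_per_minute: float) -> float:
    """Cost of a per-minute-priced job such as translation or streaming STT."""
    return minutes * rate_per_minute


if __name__ == "__main__":
    # Made-up session: 1M audio tokens in, 0.5M audio tokens out, nothing cached.
    print(f"realtime session: ${realtime_session_cost(1_000_000, 500_000):.2f}")
    # One hour of live translation and one hour of streaming transcription.
    print(f"translate, 60 min: ${per_minute_cost(60, TRANSLATE_PER_MIN):.2f}")
    print(f"whisper, 60 min: ${per_minute_cost(60, WHISPER_PER_MIN):.2f}")
```

The token-priced model and the per-minute models are not directly comparable without a tokens-per-minute figure, which is why the sketch keeps them as separate functions.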
Cartesia Sonic-3 on AWS SageMaker JumpStart, February. Sonic-3 was announced in late 2025 alongside Cartesia’s $100M Series; the SageMaker drop made it deployable in a VPC with one click. The headline number is roughly 40ms time-to-first-audio, ~190ms end-to-end, across 42 languages, on a state-space-model backbone rather than a transformer. Enterprise voice agents that need to stay on-premises have a credible option with sub-100ms time-to-first-audio for the first time.
ElevenLabs Eleven v3 GA, February 2. ElevenLabs spent Q1 hardening its TTS flagship: 70+ languages, audio-tag emotion control markup (whisper, laugh, frustrated), a claimed 68% drop in error rate on complex text, and explicit positioning as the batch model; Flash v2.5 remains the realtime variant at roughly 75ms. Two days later the company closed a $500M Series D at an $11B valuation. The same company also shipped Scribe v2 Realtime (Jan 6), Sound Effects v2, and a fully-licensed music platform on April 30 with Kobalt and Merlin deals. The cross-section of customers ElevenLabs now competes for (podcasters, voice-agent builders, dubbers, music creators) explains the valuation multiple.
Stable Audio 3.0, May 20. Stability shipped a four-model family on the last day of the window: a 459M Small SFX model, a 459M Small music model, a 1.4B Medium, and a 2.7B Large. Medium and Large generate structured compositions up to 6 minutes 20 seconds. Small, Small SFX, and Medium are open weights; Large is API-only. The corpus is licensed through Warner/UMG agreements, which makes this the first long-form open-weight music model with a clean commercial-use story. Studios over the $1M revenue threshold need an enterprise license.
Google Lyria 3 and Lyria 3 Pro. Lyria 3 landed in the Gemini app on February 18 with 30-second clips; Lyria 3 Pro followed on March 25 with 3-minute tracks, structural prompts (intro, verse, chorus, bridge), 44.1 kHz stereo, and SynthID watermarking. Available in Vertex AI, AI Studio, the Gemini API, Google Vids, and ProducerAI, the last being the Riffusion rebrand that Google quietly acquired in late February. Google is now the only frontier lab with a music model, a video model, a text model, and a real-time voice model all on the same Gemini-app surface. None of the others can say the same.

What got better: the workflow stack caught up to the models
Below the headline launches, the supporting cast levelled up. The pattern: tools that used to feel like demos started shipping like software.
AssemblyAI Universal-3 Pro launched February 3 as a promptable STT model — the first to instruction-tune transcription behavior. Native code-switching across six languages, with Whisper fallback to 99 more. The streaming variant arrived March 3 with P50 ~150ms after voice activity detection. List price is $0.21 an hour on the base tier.
OpenAI gpt-4o-transcribe and -mini. Released April 27 and May 1 respectively, at $0.006/min and $0.003/min. Whisper Large-v3 is still the public OpenAI baseline; these are the models OpenAI actually recommends now.
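The speech-to-text prices in this article are quoted in mixed units (per hour for AssemblyAI, per minute for OpenAI), so a quick normalization helps. The sketch below converts the list prices quoted in this article to USD per hour and sorts them. The dictionary labels and function names are hypothetical, and the comparison ignores volume tiers and the difference between batch and streaming products.

```python
# List prices as quoted in this article: (price in USD, billing unit).
STT_LIST_PRICES = {
    "AssemblyAI Universal-3 Pro (base tier)": (0.21, "hour"),
    "gpt-4o-transcribe": (0.006, "minute"),
    "gpt-4o-transcribe, mini variant": (0.003, "minute"),
    "gpt-realtime-whisper": (0.017, "minute"),
}


def usd_per_hour(price: float, unit: str) -> float:
    """Convert a per-minute or per-hour price to USD per hour of audio."""
    if unit == "hour":
        return price
    if unit == "minute":
        return price * 60
    raise ValueError(f"unknown billing unit: {unit!r}")


def ranked_by_hourly_cost(prices: dict) -> list:
    """Return (name, USD per hour) pairs, cheapest first."""
    hourly = [(name, round(usd_per_hour(price, unit), 4))
              for name, (price, unit) in prices.items()]
    return sorted(hourly, key=lambda pair: pair[1])


if __name__ == "__main__":
    for name, cost in ranked_by_hourly_cost(STT_LIST_PRICES):
        print(f"{name:<42} ${cost:.2f}/hour")
```

On these list prices alone, the per-minute figures work out to $0.18, $0.36, and $1.02 per hour, which puts the AssemblyAI base tier at $0.21 between the two gpt-4o-transcribe variants.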
YouTube Auto-Dubbing went global to every creator on February 4 — 27 languages, “Expressive Speech” pitch matching in 8 of them, and a lip-sync pilot underway. By mid-spring, more than 6 million daily viewers were watching more than 10 minutes of auto-dubbed content. This is the largest deployment of generative audio anywhere on the public internet, and it crossed the consumer-default line without a press tour.
iZotope RX 12 launched April 29 with Scene Rebalance, which separates a film or television scene into dialogue, music, and effects stems for individual remixing — previously a manual job for a re-recording mixer. Music Rebalance became a real-time plug-in. Adobe’s March Podcast update added downloadable stems to Enhance Speech and integrated multitrack import for Zoom and Riverside sessions. The audio-post stack is now AI-decomposable end-to-end.
Suno Studio 1.2 (February) added Warp Markers, quantize, Alternates take stacking, non-4/4 time signatures, and scaled stem separation to 12 lanes with MIDI export. Suno v5.5 shipped March 26 with personal voice capture, fine-tuning on your own tracks, and a taste model. The Suno catalog moved sideways while the studio around it moved forward — a deliberate strategy while the v6 successor is held back pending the Sony case.
Under the hood: four shifts that landed in one quarter
Audio tokenization is the new bottleneck. The practical difference between an 800ms voice agent and a 200ms one is not the LLM; it is the codec. Cartesia’s state-space model, Kyutai’s Mimi codec, and OpenAI’s in-house token path each unbundle audio into a stream of discrete tokens that a transformer can emit and consume as it generates, instead of an encode-decode round trip per chunk. Token-streaming audio is the equivalent of streaming text tokens for chat models, and it lands at roughly the same time in product terms.
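The difference between token streaming and a per-chunk round trip can be sketched with plain Python generators. This is a toy model, not any vendor's codec: the tokenizer, model, and decoder are stand-ins, the 20ms frame size and 25-frame chunk are assumptions chosen for illustration, and the only thing measured is how much input must be buffered before the first output frame can exist. Compute time is ignored.

```python
from itertools import islice
from typing import Iterable, Iterator, List

FRAME_MS = 20  # assumed frame duration, for illustration only


def tokenize_stream(frames: Iterable[bytes]) -> Iterator[int]:
    """Toy streaming tokenizer: one discrete token per incoming frame."""
    for frame in frames:
        yield sum(frame) % 1024  # stand-in for a codec codebook index


def respond_stream(tokens: Iterable[int]) -> Iterator[int]:
    """Toy model: emits one output token as each input token arrives."""
    for token in tokens:
        yield (token + 1) % 1024


def decode_stream(tokens: Iterable[int]) -> Iterator[bytes]:
    """Toy streaming decoder: one output frame per output token."""
    for token in tokens:
        yield token.to_bytes(2, "big")


def counting(frames: Iterable[bytes], counter: List[int]) -> Iterator[bytes]:
    """Pass frames through while counting how many have been consumed."""
    for frame in frames:
        counter[0] += 1
        yield frame


def frames_before_first_output_streaming(frames: Iterable[bytes]) -> int:
    """Frames consumed before the first output frame in the lazy pipeline."""
    consumed = [0]
    pipeline = decode_stream(respond_stream(tokenize_stream(counting(frames, consumed))))
    next(pipeline)  # pull exactly one output frame
    return consumed[0]


def frames_before_first_output_chunked(frames: Iterable[bytes], chunk_size: int) -> int:
    """Frames consumed when a whole chunk is buffered before any processing."""
    consumed = [0]
    chunk = list(islice(counting(frames, consumed), chunk_size))  # buffer the chunk
    tokens = list(tokenize_stream(chunk))        # encode the whole chunk
    replies = list(respond_stream(tokens))       # run the model on all of it
    audio = list(decode_stream(replies))         # decode the whole reply
    assert audio, "no input frames were available"
    return consumed[0]


if __name__ == "__main__":
    one_second = [bytes([i % 256]) * 4 for i in range(50)]  # 50 frames of fake audio
    streaming = frames_before_first_output_streaming(iter(one_second))
    chunked = frames_before_first_output_chunked(iter(one_second), chunk_size=25)
    print(f"streaming: first output after {streaming * FRAME_MS}ms of input")
    print(f"chunked:   first output after {chunked * FRAME_MS}ms of input")
```

In the toy, the streaming pipeline needs a single frame of input before it can emit audio, while the chunked version cannot emit anything until the whole chunk has arrived. Real systems add model and network time on top of this buffering delay.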
One API, three audio jobs. The OpenAI Realtime API GA consolidates conversational voice, translation, and transcription behind a single bidirectional endpoint. Cartesia’s Voice Platform, ElevenLabs Agents, and Hume EVI are all moving the same way. The era of plumbing together a separate STT, LLM, and TTS for every voice product is over for the default case. Specialized stacks still win on price and latency in narrow contexts, but the default reach moved up one layer.
Stems are first-class output. Suno Studio splits twelve, Stable Audio Small can emit foley separately, LALAL.AI shipped an on-device DAW plug-in in February, and iZotope’s RX 12 made scene-stem decomposition a one-click operation. The unit of generated audio is no longer the mixed track; it is the multitrack. Anyone shipping AI music or post-production tooling that does not treat stems as a first-class output is now a generation behind.
On-device crossed a real line. Kyutai Pocket TTS (100M parameters, 6× realtime on a MacBook Air CPU, multilingual in May), Moonshine v2’s 26–245MB streaming STT models, Stable Audio 3.0’s 459M music model, and Speechmatics’ on-device STT for Adobe Premiere all landed in the window. The companion piece on personal compute moving back to the desktop now reads with extra weight on the audio side — not just LLMs.

Trend lines: four patterns across the quarter
Read the catalog above and four threads recur. None are obvious from any single launch.
1. Realtime voice latency collapsed. The time-to-first-audio table at the top of this section is the headline: ~40ms (Cartesia), 80ms (Scribe), 150ms (AssemblyAI), 190ms (OpenAI), all production-grade, all shipped between January 6 and May 7. Anything above 300ms now reads as legacy. The implication for product is large: the latency excuses for not building voice agents into existing software have run out.
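For readers who want the comparison in one place, the short sketch below tabulates the latency figures quoted in this article against the 300ms line. The labels and function names are hypothetical, and the figures come from different kinds of products (TTS, STT, and speech-to-speech), so treat the table as a rough orientation rather than a like-for-like benchmark.

```python
LEGACY_THRESHOLD_MS = 300  # "anything above 300ms now reads as legacy"

# Latency figures as quoted in this article, in milliseconds.
QUOTED_LATENCIES_MS = {
    "Cartesia": 40,
    "Scribe": 80,
    "AssemblyAI": 150,
    "OpenAI": 190,
    "Voice agent floor, a year ago (approx.)": 800,
}


def classify(latency_ms: float, threshold_ms: float = LEGACY_THRESHOLD_MS) -> str:
    """Label a latency as 'current' or 'legacy' relative to the threshold."""
    return "legacy" if latency_ms > threshold_ms else "current"


def latency_table(latencies: dict) -> str:
    """Render a plain-text table, fastest first."""
    rows = sorted(latencies.items(), key=lambda item: item[1])
    lines = [f"{name:<42} {ms:>5}ms  {classify(ms)}" for name, ms in rows]
    return "\n".join(lines)


if __name__ == "__main__":
    print(latency_table(QUOTED_LATENCIES_MS))
```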
2. Labels stopped suing, started shipping. Warner settled with Suno in November 2025. UMG settled with Udio in October 2025 and is co-building the licensed platform now in staged rollout. Stable Audio 3.0 trains on a licensed corpus. ElevenLabs shipped ElevenMusic with Kobalt and Merlin licensing. Sony is the holdout. The July 2026 summary-judgment hearing in Sony v Suno will set fair-use precedent for the rest of the field. The same regulatory force is reshaping every other modality — the companion piece on regulations eating cloud AI applies to music with extra force.
3. ElevenLabs is the audio incumbent now. Series D at $11B, a flagship TTS GA, a realtime STT, a music platform, a sound effects model, and an agents product, all in one quarter, all on one credentialed surface. ElevenLabs now competes simultaneously with OpenAI, Suno, AssemblyAI, HeyGen, and Cartesia. No other audio company touches all five lanes.
4. Watermarking moved from research to default. At Google I/O 2026, SynthID had tagged 60,000 years of cumulative AI audio and was rolling into Google Search and Chrome, with OpenAI, ElevenLabs, and Kakao signed on as adopters. Spotify shipped an AI-credit disclosure beta on April 17 using DDEX fields. Voluntary today; the EU AI Act’s August deadline will make at least the disclosure mandatory.
Quiet quarter for
Four expected shoes did not drop. Suno v6 was promised by November 2025 press releases and has not shipped; v5.5 is the substitute. OpenAI released no Whisper v4; the new bets are gpt-4o-transcribe and the streaming whisper inside the Realtime API. Anthropic shipped a voice mode for Claude Code in early March but no standalone Claude voice model and no speech-to-speech API; mobile Claude voice remains English-only. Deepgram stayed on Nova-3 and shipped no Nova-4. The pattern: the players who were leading the audio conversation in 2024 and 2025 are the ones with the quietest Q2 2026 releases.
What to watch May to August
Four calendar items to keep an eye on.
Sony v Suno summary judgment, July. Suno’s transformative-fair-use motion, citing the 2024 Bartz v SoundAI precedent, gets argued in a Massachusetts district court. A win for Suno legitimises model-training on copyrighted recordings for the whole field. A loss collapses the open-weight music tail into the licensed-corpus tier. Either ruling reshapes the next quarter’s catalog.
The Udio – UMG walled garden, formal launch. The licensed platform is in staged rollout; the consumer launch is still pending, with user backlash already brewing over the no-download restriction that defines the “walled garden” design.
Suno v6 or a successor. The model that was supposed to ship under the Warner deal is now eight months overdue. When it arrives, watch the licensing language as much as the audio.
EU AI Act audio provisions, August 2. The same deadline that drives C2PA adoption on the video side requires synthetic-audio disclosure for any AI-generated content served to EU users. Voice-cloning consent obligations come along for the ride.
The leaderboard, by job
For the working operators. Picks reflect the May 20 snapshot — commercial-use posture, latency, language coverage, and pricing weighted, not just blind-test quality.
Any one of the picks above would do a competent job today. The decisions that matter now are commercial-use posture (especially for music), language coverage, and whether the model can run inside your deployment boundary, which is the case made in “you don’t need every AI model.” Companion pieces this quarter: the text installment, the image installment, and the video installment; together they cover the same Feb 20 to May 20 window across the four modalities most teams now ship across every week.
Next installment in this series: The last three months in AI — August 2026.


