Text, image, audio, video — when to reach for which model (and how to chain them)
A 2026 field guide — what each modality is good at, what it costs, and three ways to chain them together.
Text models are converging on three or four big APIs. Media — image, audio, video — is doing the opposite. Andreessen Horowitz’s State of Generative Media 2026 report finds a typical enterprise pipeline now uses a median of 14 different image/video models in production. The right answer to “which model do I use?” depends on the task, the budget, the latency tolerance, and which modality you’re even in. This post is a 2026 field guide across all four — what each is genuinely best at, what it costs, and three recipes for chaining them.
Text
The frontier picks (Artificial Analysis intelligence index) are GPT-5.5, Claude Opus 4.7, and Gemini 3.1 Pro. They overlap on capability and trade off on personality and cost.
- Best raw quality: Claude Opus 4.7 for adaptive reasoning + long-horizon agent work; GPT-5.5 (xhigh) for one-shot research and planning. Both have time-to-first-token in the 25–35 second range, so they’re slow for chat but worth it for structured outputs you don’t need to stream.
- Best price/intelligence: Gemini 3.1 Pro at $2 / $12 per million tokens and 109 t/s output — the fastest top-tier model and the cheapest. Default pick if you’re not optimizing for either extreme.
- Best for high volume: Gemini 2.5 Flash-Lite ($0.10 / $0.40), Grok 4.1 Fast ($0.20 / $0.50), or local Llama 3.3 70B (the electricity bill).
- Reasoning-mode pick: DeepSeek R1 gets within 4 points of o3 on AIME 2024 at roughly 1/18 the price. Use it behind a “think harder” button, never in a chat loop — a single hard math problem can burn 25,000 output tokens.
When to reach for local instead of cloud: if you’re looking at $500/month or more in API spend on a workload that a 70B model handles, hosted Llama 3.3 70B via Groq is ~$0.59–0.79 per million tokens. Local on a 64+ GB Mac is the electricity bill plus the up-front hardware. Both are 10–30× cheaper than frontier cloud at sustained volume.
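A minimal sketch of that break-even math. The per-million-token rates are the figures quoted above; the blended frontier rate and the monthly workload are assumptions for illustration only:

```python
# Rough break-even: frontier cloud API vs. hosted Llama 3.3 70B.
# Prices per million blended tokens; workload size is hypothetical.

def monthly_cost(tokens_millions: float, price_per_m: float) -> float:
    """Dollars per month at a blended $/1M-token rate."""
    return tokens_millions * price_per_m

workload_m = 800  # hypothetical: 800M blended tokens/month

frontier = monthly_cost(workload_m, 7.00)     # assumed blended frontier rate
hosted_70b = monthly_cost(workload_m, 0.69)   # Groq-hosted Llama 3.3 70B midpoint

print(f"frontier: ${frontier:,.0f}/mo, hosted 70B: ${hosted_70b:,.0f}/mo, "
      f"ratio: {frontier / hosted_70b:.1f}x")
```

At these assumed rates the ratio lands around 10×, the low end of the 10–30× spread; the exact multiple depends on your blended input/output mix.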
Image
The 2026 image market splits cleanly along four axes — pick by what the image is for, not by which model is “best.”
- Photoreal humans & physical scenes: Midjourney V7 won 23 of 30 standardized prompt tests for skin, fabric, and shadow. Imagen 4 Ultra is the closer second and has a real API. Cost: ~$0.04–$0.08/image.
- Stylized, brand, vector, and editable typography: Recraft V3 / V4 is the only image model that thinks in vectors and design grids. ~$0.04/image via fal.
- Text inside the image: GPT Image 2 (high) and Ideogram 3 are the only frontier picks that reliably render paragraphs without garbling. Even these break past ~80–100 characters of in-image copy.
- Speed and price: FLUX.2 [pro] ($0.03 first MP) or Imagen 4 Fast ($0.02/image) — sub-three-second generation at competitive quality. Workhorse choices.
- Conversational editing: Nano Banana 2 (Gemini Image) — cheap, fast, and built for “now make the background sunset, now add a hat” iterative loops.
The median frontier 1024×1024 image runs $0.03–$0.05 across the catalog. Capability ceiling: hands in cluttered scenes, multi-character scenes with consistent named identities, paragraphs of text, and precise spatial counting (“seven red apples in a 3×3 grid”) all still fail at meaningful rates.

Audio
Audio is really four sub-modalities, each with its own leaderboard.
- Text-to-speech (TTS). ElevenLabs v3 is the quality leader and the most expensive. Cartesia Sonic wins for real-time agents (sub-100 ms TTFB at roughly a fifth the per-character cost). Azure / Google neural voices win on price at $4–$16 per million characters. Hume is the pick when emotional inflection matters.
- Speech-to-text (STT). ElevenLabs Scribe v2 leads on accuracy (2.2% WER), with AssemblyAI Universal-3 Pro close behind (3.3%). Whisper Large v3 hosted on fal at $1.15 per 1,000 minutes is the price leader. Real-world WER on noisy multi-speaker audio runs 5–10× the benchmark — pad your accuracy expectations accordingly.
- Music. Suno v5 is the consumer leader at ~$0.012–$0.016 per song; MiniMax Music 2.0 at $0.03/generation is the cheapest API. Stable Audio 2.5 ($0.20/audio, up to 190 s) is the open-weights pick. Be aware: Suno/Udio still face active RIAA suits filed mid-2024 — if you’re building product, factor in the legal grey zone.
- Sound effects. ElevenLabs Sound Effects at 40 credits per second up to 30 seconds dominates here; Mirelo is the niche alternative.
Capability ceiling: TTS still struggles with proper-noun pronunciation in mixed-language sentences, mid-sentence laughs/coughs, and multi-speaker dialogue without explicit voice tags. Plan a human pass on anything destined for production audio.
Video
Video is where the cost story lives. The 30× spread across the 2026 catalog isn’t a typo — it’s the difference between $0.04/s on Hailuo Std and $0.60/s on Veo 3.1 at 4K with audio.
The picks divide cleanly:
- Best quality + native audio: Google Veo 3.1 for dialogue, lip-sync, and physics intuition; Sora 2 / Sora 2 Pro for narrative coherence over 15–25 second clips. Both produce synchronized speech and ambient sound at the model level — everything else in the table is silent and needs a separate audio step.
- Best price/quality: Kling 2.5 Turbo Pro / Kling 3.0 and Hailuo Std. Silent output, but at $0.04–$0.07/s a 10-second clip costs less than a coffee.
- Best image-to-video (the workflow that actually ships): Kling, Runway Gen-4, and Hailuo. Image-to-video gives you character control and scene continuity that text-to-video can’t, which is why a16z says enterprise pipelines reach for 14 different models — different I2V models for different shot types.
Capability ceiling: generation lengths still cap at 4–25 seconds per clip. Past ~25 seconds, every model drifts — you stitch shorter clips together. Multi-character dialogue with consistent faces is unsolved without I2V seeding. Hand interactions with objects (typing, writing, playing instruments) are still uncanny.
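Stitching past the ~25-second drift point is usually an ffmpeg concat job, not a model call. A minimal sketch that builds the command — clip filenames are hypothetical, and this only constructs the invocation rather than running it:

```python
# Build an ffmpeg concat-demuxer command to stitch short clips into one file.
# `-c copy` avoids re-encoding, which only works when every clip shares the
# same codec, resolution, and frame rate (typical when they come from one model).
import pathlib
import shlex

clips = ["shot_01.mp4", "shot_02.mp4", "shot_03.mp4"]  # hypothetical 5-10s clips

# The concat demuxer reads a text file listing one clip per line.
list_file = pathlib.Path("clips.txt")
list_file.write_text("".join(f"file {shlex.quote(c)}\n" for c in clips))

cmd = ["ffmpeg", "-f", "concat", "-safe", "0",
       "-i", str(list_file), "-c", "copy", "final.mp4"]
print(shlex.join(cmd))
```

If the clips come from different models with different encodes, drop `-c copy` and let ffmpeg re-encode to a common format.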
Three recipes for chaining
The interesting work is rarely a single model call. Three concrete recipes that compose modalities, with the actual cost budgets you should expect:
Recipe 1 — audio file to structured draft (~$0.29):
1. Audio file (.mp3) · input · —
2. Transcript · Whisper Large v3 (fal) · $0.07
3. Structured draft · GPT-5.5 medium · $0.22

Recipe 2 — source photo to short video (~$0.40):
1. Source photo + prompt · input · —
2. 5s clip (image-to-video) · Kling 2.5 Turbo · $0.35
3. 12s voiceover · ElevenLabs Flash · $0.05
4. Mux to MP4 · ffmpeg (local) · free

Recipe 3 — PDF to narrated video (~$12.14):
1. PDF text · input · —
2. Summary · Gemini 2.5 Flash · $0.001
3. Script · Claude Opus 4.7 · $0.04
4. Voiceover (~30s) · ElevenLabs v3 · $0.10
5. 30s video w/ audio · Veo 3.1 · $12.00
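The three budgets can be tallied in a few lines — the step costs are the ones listed above, and the point is just to see which line item dominates each chain:

```python
# Per-step costs (USD) for the three recipes; zeros are inputs / local steps.
recipes = {
    "audio -> draft":        [0.00, 0.07, 0.22],
    "photo -> short video":  [0.00, 0.35, 0.05, 0.00],
    "pdf -> narrated video": [0.00, 0.001, 0.04, 0.10, 12.00],
}

for name, steps in recipes.items():
    total = sum(steps)
    biggest = max(steps)
    print(f"{name}: ${total:.3f} total, "
          f"biggest step ${biggest:.2f} ({biggest / total:.0%} of the chain)")
```

In the third recipe the Veo 3.1 step is roughly 99% of the total, which is the gradient the next paragraph describes.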
Notice the cost gradient. A pure text+audio chain is cents. Adding image models stays under a dollar. Adding video — especially Veo-class native-audio video — is the single line item that dominates. Two cheap chains and one expensive one isn’t a coincidence; it’s the literal shape of the 2026 unit-economics landscape.
One place to compose them
The a16z report’s 14-model number is the case for an orchestrator. Text spend is concentrated — OpenAI, Anthropic, and Google together hold 89% of enterprise wallet share — but media spend isn’t. You’re going to call FLUX for one shot and Imagen for another, ElevenLabs for voiceover and Cartesia for the realtime agent, Veo for the hero video and Kling for the b-roll. The bottleneck isn’t access to any single model; it’s the glue.
That’s the gap CSuite is built to fill. The four modality workspaces share one model picker, one project folder, and one workflow engine with typed edges (text, image, audio, video — the edge type tells you what can connect to what). Six node types — input, model, edit, combine, file, note — let you wire any of the recipes above without writing glue code. Cloud providers run with your own API keys; local providers (Ollama + HuggingFace runtime) cover the long tail and the “run nothing through the cloud” case.
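The typed-edge idea reduces to a connectivity rule: an edge is valid only when the source node emits a modality the destination accepts. A toy sketch — the modality and node-type vocabulary follows the description above, but the class and function names are invented for illustration:

```python
# Toy typed-edge check: an edge carries the modalities both ends agree on.
from dataclasses import dataclass

MODALITIES = {"text", "image", "audio", "video"}

@dataclass
class Node:
    name: str
    inputs: frozenset   # modalities this node accepts
    outputs: frozenset  # modalities this node emits

def can_connect(src: Node, dst: Node) -> frozenset:
    """Modalities that could flow along an edge from src to dst (empty = invalid)."""
    return src.outputs & dst.inputs

# Hypothetical nodes mirroring recipe 2 above.
i2v = Node("Kling 2.5 Turbo", frozenset({"image", "text"}), frozenset({"video"}))
tts = Node("ElevenLabs Flash", frozenset({"text"}), frozenset({"audio"}))
mux = Node("combine (ffmpeg)", frozenset({"video", "audio"}), frozenset({"video"}))

print(can_connect(i2v, mux))  # video can feed the combine step
print(can_connect(tts, i2v))  # empty: audio can't feed an image-to-video node
```

The payoff of typing the edges rather than the nodes is that a workflow can be validated before any (possibly expensive) model call runs.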
We didn’t set out to build a one-tool-for-everything app. We set out to build the tool we wanted ourselves — one place where text, image, audio, and video share the same files on disk and the same canvas. The 14-model finding suggests that’s a tool a lot of people might want.


