Tutorial · Video AI · Marketing · May 26, 2026 · 13 min read

How to make a 30-second AI video ad, end-to-end

Forty-six minutes. Four dollars and fifty-eight cents. Every prompt, every tool, every cost: one vertical ad on a Sunday afternoon.

By Atul

Six beats · thirty seconds · one afternoon

$4.58 · 46 minutes

A polished 30-second ad, built on a laptop, on a Sunday afternoon, for less than the price of a coffee.

Beat 01 | 0–3s | Hook: hand opens mailbox, finds the bag | Sora 2 (i2v)
Beat 02 | 3–9s | Bag close-up, label rotates into focus | Veo 3.1 Fast
Beat 03 | 9–15s | Beans into grinder, slow-mo pour | Veo 3.1 Fast
Beat 04 | 15–21s | Cup on a kitchen counter, morning light | Sora 2 (i2v)
Beat 05 | 21–27s | Subscriber holds bag, smiles at camera | Veo 3.1 Fast
Beat 06 | 27–30s | Logo + CTA card, captioned | CapCut still

Forty-six minutes. Four dollars and fifty-eight cents. One thirty-second vertical ad for a single-origin coffee subscription, ready to upload to Meta, TikTok, and YouTube Shorts. No camera. No actor. No studio. No agency. Just a laptop, a kitchen window, and a stack of AI tools that did not exist as a usable workflow eighteen months ago.

This is the post that walks the whole thing. Every prompt is in here, verbatim. Every tool is named. Every minute is counted, every dollar is itemized. By the end you have a model for the workflow and a recipe you can run against your own product on a Sunday afternoon. If you have ever sat through a five-figure video shoot to produce something a 30-second AI workflow could match, this is the read.

A single-origin coffee bag on a wooden counter beside a paper filter and a small ceramic mug. — The hero product. The ad doesn’t exist without one. Photo by Lex Sirikiat on Unsplash.

One framing note before we start. AI ads are not magic and they are not free. They are cheap, fast, and surprisingly controllable, if you treat them as a craft. Most of the time savings show up because you skip the parts of a traditional shoot that don’t need a human: scouting, model release forms, scheduling, the second-camera angle that never gets used. The parts you keep are editorial: the brief, the cut, and the judgement about what to keep and what to throw away. AI is a cheap session crew. You are still the director.

The brief is the part nobody wants to write

A 30-second ad is a six-beat structure. Naming the beats before you open a single tool is the only way to keep the AI from ad-libbing for thirty seconds about nothing. The brief I wrote for this post fits in one paragraph. The product: Bayfield Coffee, a fictional single-origin bag-per-month subscription, $22 a month, ships on the first of each month, beans roasted three days before they leave the warehouse. The audience: home brewers who already own a grinder, who are mildly bored with their current grocery-store bag. The platform: Meta and TikTok first, 9:16 vertical, ≤30 seconds, captioned. The call to action: a promo code BAYFIELD10 for $10 off the first bag.

Three things in that paragraph save the next 45 minutes. The first is the platform: deciding 9:16 vertical now keeps every later step from accidentally drifting into 16:9. The second is the audience-already-owns-a-grinder constraint: it removes a beat (no need to explain what a grinder is) and unlocks the slow-motion-pour beat that actually sells the product. The third is the $10-off CTA: a specific number for the closing card is more useful than “learn more”. Write a brief like this for any product you are about to feed to an AI workflow. It is the highest-leverage twenty minutes in the whole process.

A 30-second ad is six beats. The LLM is good at the beats.

Open Claude, ChatGPT, or Gemini. The choice of LLM doesn’t matter much for this step; all three are competent. Paste in the brief from above, then ask for a beat sheet. Here is the prompt I used, verbatim:

You are writing a 30-second vertical video ad for Bayfield Coffee,
a $22/month single-origin coffee subscription. Audience: home brewers
who already own a grinder. Platform: Meta and TikTok, 9:16, captioned.
CTA: promo code BAYFIELD10 for $10 off the first bag.

Output: a 6-beat script with these columns for each beat:

1. timestamp range (e.g. 0-3s)
2. one-line visual description (must be filmable as image-to-video)
3. voiceover line (max 8 words, conversational)
4. caption text (max 5 words, key word emphasized)

Constraints:
- Open with a pattern interrupt in the first 1.5 seconds.
- Show the bag in the first 9 seconds.
- End on the promo code, not the brand name.
- No actor faces in close-up (we want product-led, not testimonial).
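If you plan to run the same recipe against several products, the opening of that prompt can be templated from the brief. The sketch below is one way to do it, not the tooling used for this post: the `Brief` class and `build_beat_sheet_prompt` function are hypothetical names, and the rendered text is a shortened version of the prompt above, without the constraints list.

```python
from dataclasses import dataclass


@dataclass
class Brief:
    """Hypothetical container for the one-paragraph brief."""
    product: str
    offer: str
    audience: str
    platform: str
    cta: str


def build_beat_sheet_prompt(brief: Brief, beats: int = 6, seconds: int = 30) -> str:
    """Render a beat-sheet request from the brief fields."""
    return (
        f"You are writing a {seconds}-second vertical video ad for {brief.product},\n"
        f"{brief.offer}. Audience: {brief.audience}.\n"
        f"Platform: {brief.platform}.\n"
        f"CTA: {brief.cta}.\n\n"
        f"Output: a {beats}-beat script with a timestamp range, a one-line visual,\n"
        "a voiceover line (max 8 words) and caption text (max 5 words) per beat."
    )


# Field values copied from the brief in this post.
bayfield = Brief(
    product="Bayfield Coffee",
    offer="a $22/month single-origin coffee subscription",
    audience="home brewers who already own a grinder",
    platform="Meta and TikTok, 9:16, captioned",
    cta="promo code BAYFIELD10 for $10 off the first bag",
)

prompt = build_beat_sheet_prompt(bayfield)
print(prompt)
```

Keeping the brief as data means the platform and CTA decisions get made once and are pasted identically into every later prompt.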

Two minutes of generation, four minutes of trimming. The first draft had eight beats and a script that was six words too long for a calm voiceover at 28 seconds. The second pass dropped the brand-history beat (nobody cares in three seconds) and tightened the VO. The cuts are still yours: an LLM will give you a competent draft, but it will always give you slightly too much. The right move is to cut beats, not cram them in.
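The trimming pass can be partly mechanised. The sketch below checks a beat sheet against the limits stated in the prompt: contiguous timestamps that end at 30 seconds, at most 8 voiceover words and at most 5 caption words per beat. The dictionary layout and the `check_beat_sheet` name are assumptions for illustration, and the voiceover and caption strings are placeholders, not the script from this run.

```python
import re

MAX_VO_WORDS = 8        # from the prompt: voiceover line, max 8 words
MAX_CAPTION_WORDS = 5   # from the prompt: caption text, max 5 words


def check_beat_sheet(beats, total_seconds=30):
    """Return a list of problems; an empty list means the sheet passes."""
    problems = []
    expected_start = 0
    for i, beat in enumerate(beats, start=1):
        match = re.fullmatch(r"(\d+)\s*[-–]\s*(\d+)s", beat["range"].strip())
        if not match:
            problems.append(f"beat {i}: unreadable range {beat['range']!r}")
            continue
        start, end = int(match.group(1)), int(match.group(2))
        if start != expected_start:
            problems.append(f"beat {i}: starts at {start}s, expected {expected_start}s")
        if end <= start:
            problems.append(f"beat {i}: range {beat['range']!r} has no duration")
        expected_start = end
        vo_words = len(beat["voiceover"].split())
        if vo_words > MAX_VO_WORDS:
            problems.append(f"beat {i}: voiceover has {vo_words} words (max {MAX_VO_WORDS})")
        caption_words = len(beat["caption"].split())
        if caption_words > MAX_CAPTION_WORDS:
            problems.append(f"beat {i}: caption has {caption_words} words (max {MAX_CAPTION_WORDS})")
    if expected_start != total_seconds:
        problems.append(f"sheet ends at {expected_start}s, expected {total_seconds}s")
    return problems


# Timestamp ranges from the beat card; the text fields are placeholders.
RANGES = ["0-3s", "3-9s", "9-15s", "15-21s", "21-27s", "27-30s"]
sheet = [
    {"range": r, "voiceover": "placeholder voiceover line", "caption": "placeholder caption"}
    for r in RANGES
]
print(check_beat_sheet(sheet))

# A draft with one over-long voiceover line is flagged instead of crammed in.
too_long = dict(sheet[0], voiceover="this line is far too long for one short beat")
print(check_beat_sheet([too_long] + sheet[1:]))
```

The checker only catches mechanical overruns. Deciding which beat to cut is still an editorial call.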

On hook structure: TikTok’s own creative team is explicit that 71% of whether a user keeps watching is decided in the first three seconds, and Sprout Social’s 2026 algorithm breakdown confirms the same binary pass-fail at the three-second mark. Beat one earns the rest of the ad. Spend more LLM time on it than on any other beat. The hand-opens-a-mailbox visual we ended up with is a pattern interrupt (ordinary frame, unexpected object), and it is the cheapest part of the whole workflow to iterate on.

Storyboard frames: one per beat, fidelity over creativity

Skip text-to-video on the first pass. Generate a still image for each beat, then animate the still in step three. There are two reasons. First, image generation is roughly thirty times cheaper than video generation per attempt, so you iterate without watching a meter. Second, an image-to-video pass gives you something the text-to-video path does not: a locked-in starting frame the model has to honor.

For product ads, fidelity beats creativity. The model has to render the coffee bag with the right label, the right color, the right typography, not invent something prettier. Three image models are credible in May 2026 for this job. Google’s Imagen 4 Fast at $0.010 per image is the budget pick and the one we used: six frames cost six cents. Black Forest Labs’ FLUX.2 Pro at roughly $0.055 per image is the texture pick: glass, fabric, and coffee crema look closer to photographic. Midjourney v7 is the aesthetic pick if you already have a subscription, but it has a more restricted commercial-license posture, so verify before you commit.
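The per-frame arithmetic is easy to sketch. The prices below are the per-image figures quoted above; the `storyboard_cost` helper and its `retries` parameter are hypothetical, and Midjourney is left out because it is priced by subscription, not per image.

```python
from decimal import Decimal

# Per-image prices as quoted in this post (USD).
PRICE_PER_IMAGE = {
    "Imagen 4 Fast": Decimal("0.010"),
    "FLUX.2 Pro": Decimal("0.055"),
}


def storyboard_cost(model, frames=6, retries=0):
    """Cost of one storyboard: one image per beat plus any re-generated frames."""
    return PRICE_PER_IMAGE[model] * (frames + retries)


for model in PRICE_PER_IMAGE:
    print(f"{model}: ${storyboard_cost(model):.2f} for six frames")
```

`Decimal` is used instead of floats so that cent-level sums stay exact.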

The frames are the contract you will hand to the video model in the next step. Treat them as such. Two takes that did not work, for honesty: the first pass of the mailbox shot had the hand entering frame from the wrong side (right, when the eye-track for a 9:16 frame goes left); the first pass of the slow-pour shot rendered the coffee as cocoa-brown with no crema. Both fixed with a single line added to the prompt (“left-handed insert, gold-ringed crema”) and another six cents of retries. Showing the failures is not optional. If the post pretends the prompts “just worked,” the recipe is not reproducible.

A pencil-sketched storyboard with six panels laid out on a desk next to a coffee cup. — The storyboard is the contract between the LLM and the video model. Photo by Nasim Keshmiri on Unsplash.

Video shots: image-to-video, every time

Three frontier video models matter for short-form work in May 2026, and they are not interchangeable. OpenAI’s Sora 2 lands at $0.10 per second for the standard 720p model and is the strongest at narrative continuity: if you tell it “hand reaches in, picks up the bag, turns to the camera,” it tracks the motion through a clean ten-second clip. Google’s Veo 3.1 Fast is $0.15 per second with native audio and is the strongest at consistency across multiple reference images: Veo accepts up to four. Kling 3.0 is the cheapest of the three at roughly $0.07 to $0.14 per second and is the budget option for B-roll where the camera is locked off.

Five shots, six seconds each, split across two of the three models, each chosen for a reason. The mailbox hook on Sora 2 because the motion arc has to land cleanly. The product close-ups on Veo 3.1 Fast with the storyboard frame as reference because Veo is the most faithful to a product label. The slow-pour B-roll on Sora 2 because the cinematic motion is its strongest suit. We did not use Kling in this run, but it is the right pick if you are iterating fifty hook variants and cost per second is what bites you.

Five video generations at six seconds each, averaging about $0.13 per second across the mix, land at $3.95 of credits. Two of those generations were re-rolls: the first slow-pour came out with the cup at the wrong angle, and the first mailbox shot had a delivery driver hallucinated into the back of frame. The honest cost-per-finished-clip figure includes the re-rolls. Anyone telling you their AI ad “cost sixty cents” is quoting the floor, not the median.
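A rough estimator for this line item, under stated assumptions: the per-second prices are the list prices quoted above, the model for each beat is taken from the beat card at the top of the post, and the `rerolls` mapping is a hypothetical way to count extra generations. At list prices the five clips come to about $3.90, slightly under the $3.95 reported here, so treat the output as an estimate, not a bill.

```python
# List prices quoted in this post, in cents per generated second.
CENTS_PER_SECOND = {"Sora 2": 10, "Veo 3.1 Fast": 15}

# (beat, model, generated seconds); models follow the beat card.
SHOTS = [
    ("01 mailbox hook", "Sora 2", 6),
    ("02 bag close-up", "Veo 3.1 Fast", 6),
    ("03 beans into grinder", "Veo 3.1 Fast", 6),
    ("04 cup on counter", "Sora 2", 6),
    ("05 subscriber holds bag", "Veo 3.1 Fast", 6),
]


def video_cost_cents(shots, rerolls=None):
    """Total cost in cents; `rerolls` maps a beat name to extra generations."""
    rerolls = rerolls or {}
    total = 0
    for beat, model, seconds in shots:
        generations = 1 + rerolls.get(beat, 0)
        total += CENTS_PER_SECOND[model] * seconds * generations
    return total


floor = video_cost_cents(SHOTS)
with_reroll = video_cost_cents(SHOTS, rerolls={"01 mailbox hook": 1})
print(f"list-price floor: ${floor / 100:.2f}")
print(f"with one hook re-roll: ${with_reroll / 100:.2f}")
```

Working in integer cents avoids float rounding, and the `rerolls` argument makes the gap between the floor and the median explicit.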

A phone held vertically, showing a 9:16 video preview with captions burned in. — The ad lives or dies in the first 1.5 seconds on a phone held vertically. Photo by charlesdeluvio on Unsplash.

Voiceover and music: TTS is ready. Music is legally tricky.

Voiceover went to ElevenLabs Flash v2.5, at half a credit per character on the multilingual model. The full 28-second script was 312 characters, about seven cents. Two prompts that earned their keep: a comma before the promo code (“the first bag, ten dollars off”) and a deliberately un-rushed pause between beats four and five via <break time="0.4s"/> in the SSML field. Both are the kind of detail a human narrator would give you for free in a booth, and that you have to remember to ask for in a TTS prompt.
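A small sketch of how the pause and the credit estimate might be scripted. Everything here is illustrative: `join_with_pauses` and `credits_for` are hypothetical helpers, the first five voiceover lines are placeholders (only the closing line is quoted from this post), and whether the break tag itself counts toward billed characters is not something this sketch assumes either way.

```python
def join_with_pauses(lines, pauses):
    """Join voiceover lines, adding a break tag after selected beats.

    `pauses` maps a 1-based beat number to a pause length in seconds.
    """
    parts = []
    for number, line in enumerate(lines, start=1):
        parts.append(line.strip())
        # No pause after the final line; there is nothing left to separate.
        if number in pauses and number < len(lines):
            parts.append(f'<break time="{pauses[number]}s"/>')
    return " ".join(parts)


def credits_for(characters, credits_per_character=0.5):
    """Credits consumed at the half-credit-per-character rate quoted above."""
    return characters * credits_per_character


# Placeholder lines, except the closing line, which is quoted from the post.
vo_lines = [
    "Line one.", "Line two.", "Line three.", "Line four.", "Line five.",
    "The first bag, ten dollars off.",
]
script = join_with_pauses(vo_lines, pauses={4: 0.4})
print(script)
print(credits_for(312))  # the 312-character script from this run
```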

Pick a voice that matches the brand, not the demo. The default ElevenLabs voices are excellent for podcasts and audiobooks; they are too polished, too radio-voiced, for a small-roastery brand. The right pick was a younger conversational voice with a hint of vocal fry. The wrong pick was the most-played-on-the-pricing-page voice. These two facts are almost always opposite.

Music is the part of this workflow that needs a lawyer’s hat, briefly. Suno and Udio both produce ad-quality beds in under a minute. Suno’s license explicitly grants commercial use on its Pro and Premier plans ($10 and $30 a month respectively) and is silent on the free tier in a way that means “no commercial use, full stop.” The catch worth knowing: Suno is still in active litigation with the major labels over training data, and Suno does not indemnify you against a third-party claim. For a small-roastery ad, the risk is nominal. For a brand that runs a million dollars a month in paid media, the call is “buy a stock-music license instead.” Name the trade-off in your brief; do not paper over it.

The cut is where AI ads still live or die

Open CapCut Desktop. It is free, it has a vertical preset, and it handles 1080×1920 H.264 export at 8-12 Mbps without a paywall. The other credible picks (DaVinci Resolve, Premiere Pro, Final Cut) are all overkill for a 30-second cut and will cost you either money or twenty minutes of project setup. CapCut won’t.

Eleven minutes in the editor breaks down as follows. Two minutes to drop the five video clips onto a 9:16 timeline at 1080×1920. Three minutes to time the voiceover to the cuts: this is the single most important editorial decision in the whole post, and the only place a software tool cannot help you. Two minutes to drop the Suno music bed under the voiceover, ducked to roughly −12 dB beneath the VO so the words sit on top of the music. Two minutes for auto-captions in CapCut’s built-in captioner, then a sanity pass to fix “Bayfield” (the auto-caption thought we said “bay-fielded”) and to bold the promo code in the closing card. Two minutes for export.
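Two of those numbers are worth translating. A level change of −12 dB is a linear amplitude multiplier of about 0.25, and a 30-second clip at 8 to 12 Mbps comes out at roughly 30 to 45 MB of video data. The sketch below shows the standard conversions; the function names are illustrative, and the size estimate ignores the audio track and container overhead.

```python
def db_to_gain(db):
    """Convert a level change in decibels to a linear amplitude multiplier."""
    return 10 ** (db / 20)


def export_size_mb(seconds, mbps):
    """Approximate video stream size in megabytes (megabits divided by 8)."""
    return seconds * mbps / 8


# Editor time per step, in minutes, as itemised above.
EDIT_MINUTES = {
    "drop clips on timeline": 2,
    "time voiceover to cuts": 3,
    "music bed and ducking": 2,
    "captions and sanity pass": 2,
    "export": 2,
}

music_gain = db_to_gain(-12)
print(f"music gain at -12 dB: {music_gain:.3f}")
print(f"30 s at 8 Mbps: about {export_size_mb(30, 8):.0f} MB")
print(f"30 s at 12 Mbps: about {export_size_mb(30, 12):.0f} MB")
print(f"editor total: {sum(EDIT_MINUTES.values())} minutes")
```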

Captions are not optional. The TikTok-and-Reels default is autoplay-muted on a vertical phone, and a vertical ad without captions is a vertical ad with the sound off, which means it is half-dead. The CapCut auto-captioner is the cheapest way to get them; the sanity pass is the part that earns them.

One disclosure detail worth flagging here, because it is the kind of thing that gets a paid ad rejected on Tuesday morning. Meta and TikTok have both rolled out AI-disclosure requirements in 2025 and 2026. The threshold across both platforms is now “label any significantly-AI-generated content”, and TikTok provides a one-click toggle in the upload flow that adds the standardized label. TikTok’s policy on synthetic and manipulated media is the strictest of the major platforms: it is mandatory for any ad whose imagery could be mistaken for real footage. Toggle the label. The reach hit is small. The policy violation hit is large.

The receipt

The whole point of writing this kind of post is to leave a receipt the reader can mentally hold. Forty-six minutes is faster than booking a studio. $4.58 is less than the price of a single stock-photo download. Both numbers are the median across three runs, not the cherry-picked floor. Your own numbers will move; the shape of the spend will not. Video is the line item that dominates. Audio and the LLM are rounding-error. The cut is the part you cannot offload.

Time & cost ledger · one 30-second vertical ad

Step | Tool | Min | USD
01 · Script & storyboard outline | Claude Sonnet 4.6 (free tier) | | $0.00
02 · Storyboard frames (6) | Imagen 4 Fast | | $0.06
03 · Video shots (5 × 6s) | Veo 3.1 Fast + Sora 2 | | $3.95
04 · Voiceover (28s) | ElevenLabs Flash v2.5 | | $0.07
05 · Music bed | Suno v4 (Pro, amortized) | | $0.50
06 · Cut & export | CapCut Desktop (free) | | $0.00
Totals | | 46 | $4.58
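The ledger's arithmetic can be checked in a few lines. The amounts are the USD figures from the ledger, held in integer cents; the dictionary layout is only an illustration.

```python
# USD column of the ledger, in cents.
LEDGER_CENTS = {
    "01 Script & storyboard outline": 0,
    "02 Storyboard frames (6)": 6,
    "03 Video shots (5 x 6s)": 395,
    "04 Voiceover (28s)": 7,
    "05 Music bed": 50,
    "06 Cut & export": 0,
}

total_cents = sum(LEDGER_CENTS.values())
for step, cents in LEDGER_CENTS.items():
    share = cents / total_cents
    print(f"{step:<32} ${cents / 100:>5.2f} {share:>6.1%}")
print(f"{'Total':<32} ${total_cents / 100:>5.2f}")
```

Video comes to roughly 86% of the spend, which is why that line item is where cost discipline matters.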

Compare the receipt to a traditional 30-second product spot. A small production-company quote in 2026 lands somewhere between $5,000 and $40,000 for a single deliverable, depending on talent, locations, and the size of the agency markup. The AI workflow trades two things for the price drop: control over fine craft (lighting that exactly matches the brand, an actor’s improvised moment, a real product hero shot), and the brand-safety story that comes with a human-led process. The AI version is not a strict replacement for the high-end shoot. It is a strict replacement for the bottom of the funnel: the ten variants you needed to test before you knew which one to actually shoot. That is the workflow this post is for.

What we’d change next time

The honest pass. Two beats did not quite work. The slow-pour shot, on its third generation, still has a subtle continuity break at the four-second mark where the cup re-positions a fraction of an inch between frames; next time we would generate that beat at four seconds rather than six, and cut on the motion rather than letting it run. The closing CTA card is a CapCut still over a tinted background; it is the weakest visual in the cut, and the right move is to spend a seventh image generation on it rather than trying to make typography carry the weight.

The bigger change. We mixed two video models because each had a strength worth using; we paid for that with two sets of credentials to manage and style mismatches to reconcile in the cut. For a next pass, the cleaner workflow is single-model: Veo 3.1 Fast for all five clips, with the storyboard frames as locked references, and accept a slightly worse mailbox-hook shot in exchange for visual consistency through the ad. The minute savings would be material; the cost savings would be modest.

The most useful pattern out of this whole exercise was not any single tool. It was the discipline of writing the brief first, the storyboard second, and the video third, in that order, with the LLM as a beat-sheet collaborator and the image model as a contract for the video model. The same discipline rescues the workflow whether you are running Sora, Veo, Kling, or whatever ships next quarter. The tools will be different in 2027. The shape of the work won’t. We intend to re-shoot this post every twelve months: the slug stays, the dates and tools update.

For the broader take on which video models earned their keep this spring, our spring 2026 catalog runs the same comparison across a longer test set. For the audio picks behind the voiceover and music steps, see the audio catalog. And for the underlying argument about why a curated five-tool stack beats a sprawling subscription pile, the curation post is the place to start.
