We Tested 7 AI Video Models on the Same DTC Brief. The Results Were Genuinely Surprising
Every AI video tool now claims to use "the best model on the market." Read the marketing closely, though, and "the best" varies by which week the page was written. Veo 3.1 was best in October. Sora 2 Pro was best in November. Kling 3.0 Pro was best for movement. Hailuo was best for lip-sync. The category is moving fast enough that any static claim is wrong by the next quarter.
What we actually wanted to know was simpler. Given a single, realistic DTC brief, what would each of the leading models produce, and what does the difference between them look like in practice? So we ran the test. This is an AI video model comparison for marketers who are tired of reading top-ten lists and want to know which model wins which job.
The short version: model choice matters less than most teams think, and prompt quality matters more. We will explain.

The methodology
We picked one brief that represents a common DTC ask. Specifically:
A 30-second testimonial-style ad for a magnesium glycinate supplement. Audience: stressed professionals aged 30 to 45 who struggle with sleep but have grown sceptical of melatonin. Hook: "I tried every sleep aid before this." Setting: bedroom, warm low-key lighting, mid-evening. The talent is a 35-year-old woman in casual loungewear, speaking directly to camera. Style reference: A24 muted film grading, soft window light, slight handheld feel. Mood: honest, unrehearsed, intimate.
The brief is designed to test the things DTC creative teams actually care about: photorealism of the talent, naturalness of the lighting, authenticity of the testimonial register, adherence to a specific stylistic reference, audio direction (the testimonial line itself), and basic cinematography conventions (the soft window light).
We ran the same brief through seven models, with identical settings where the platform allowed it (resolution, duration, aspect ratio, seed where exposed). Where a model's preferred input format differed (some want comma-separated terms, some want flowing prose, some want explicit audio direction), we adapted only the syntax, not the content.
The seven models tested:
- Veo 3.1 Standard (Google DeepMind)
- Veo 3.1 Fast (Google DeepMind)
- Sora 2 Pro (OpenAI)
- Kling 3.0 Pro (Kuaishou)
- Grok Imagine (xAI)
- Seedance 2.0 (ByteDance)
- Happy Horse / Hailuo (MiniMax)
All assessments below are based on our extensive testing across this brief and the broader test suite we run on category briefs each month. We are not embedding output footage in this post because each customer's experience will vary by brief and we do not want to claim a single test run as definitive. The patterns, however, are stable.
Skip the manual testing. Tonic auto-routes to the right model for your brief.
The seven models, head to head
Veo 3.1 Standard
Strengths. Veo 3.1 produced the most photorealistic skin tones and natural lighting of any model in the test. The soft window light read as actual diffused daylight rather than a generic "evening" preset. Talent micro-expressions felt unrehearsed in the way the brief asked for. Native audio direction is the killer feature: you can write "her voice softens on the third sentence" and the model will deliver something close to that.
Weaknesses. The model occasionally over-stylises the colour grade when "A24" is in the prompt, going further into desaturation than the reference actually does. Veo Standard is also slow and expensive at scale.
Where it wins. Hero testimonial pieces where photorealism, native audio, and cinematography fidelity matter more than turnaround time. Brand campaign anchors. Anywhere the cost of a re-shoot would be high.
Cost. The most expensive option in the lineup per generation, but the lowest re-generation rate, which often makes it cheaper per usable asset.
Veo 3.1 Fast
Strengths. Most of Veo 3.1 Standard's quality at roughly a third of the cost, and meaningfully faster. For typical paid social testimonial work this is the workhorse choice. Motion looks more natural than in any other fast-tier model in the lineup.
Weaknesses. Skin tones lose a touch of subtlety compared to Standard. Complex camera moves (slow dolly with a focus rack) sometimes simplify to a static frame. Audio direction is less precise than Standard.
Where it wins. High-volume testimonial work where you need ten to thirty assets per week and cost matters. Creative testing.
Cost. Roughly one-third of Standard. Currently the best price-to-quality ratio in the testimonial segment of our test suite.
Sora 2 Pro
Strengths. Sora 2 Pro is unmatched on action sequences and physical-world plausibility. If your brief involves the product being used (someone pouring magnesium powder into water, opening a bottle, taking a capsule), Sora handles the object physics better than anything else. The model is also excellent on multi-character scenes and dialogue.
Weaknesses. For a single-talent direct-to-camera testimonial, Sora's tendency toward dynamism worked against the brief. Talent often shifted, looked away, or turned when we wanted stillness. Skin tones can read slightly synthetic in close-ups. Currently more expensive than Veo Fast for similar quality on testimonial work.
Where it wins. Product-in-use scenes. Anything kinetic. Multi-character scenarios. Lifestyle b-roll where action carries the shot.
Cost. Comparable to Veo 3.1 Standard.
Kling 3.0 Pro
Strengths. Kling has the most sophisticated camera-movement vocabulary in the field. If your brief specifies "slow push-in transitioning to a 35mm portrait composition," Kling executes the move with a precision the other models only approximate. Excellent on stylised content where the cinematography is the hero.
Weaknesses. Kling's preferred prompt syntax is comma-separated cinematography terms, which is unintuitive for marketers used to writing prose briefs. Outputs can read more cinematic-stylised than authentic-testimonial, which made it the wrong choice for the brief we tested.
Where it wins. Hero brand pieces, product films, anywhere the brief is built around camera language. Style-led content.
Cost. Mid-tier.
Grok Imagine
Strengths. A surprise performer on stylised aesthetic content. Strong with bold colour grading and fashion-adjacent content. Iteration loops are fast.
Weaknesses. For a photorealistic, intimate testimonial brief, Grok's outputs leaned more toward a "look" than a "moment." Lip-sync and audio direction are noticeably weaker than the Veo line.
Where it wins. Bold visual identity work. Fashion. Beauty hero shots where the aesthetic is more important than testimonial register.
Cost. Mid-tier.
Seedance 2.0
Strengths. Seedance 2.0 is the strongest model in the lineup for product-reference work. If you have a real product image and need the model to render it accurately in a generated scene, Seedance is the answer. Also excellent on macro detail shots.
Weaknesses. Less suited to testimonial-style content with a person on camera. The model is optimised for product hero scenarios.
Where it wins. Product hero shots, packaging-led scenes, macro detail. The "bottle on the kitchen counter" shot.
Cost. Lower tier per generation.
Happy Horse (Hailuo)
Strengths. Best-in-class for handheld UGC aesthetic. The model has been tuned on a corpus that captures the slight imperfection of phone-shot content in a way the bigger labs have not matched. Multilingual lip-sync is also strong, which matters for brands running creative across multiple markets.
Weaknesses. Photorealism in close-ups is a step behind Veo. Camera-movement vocabulary is more limited.
Where it wins. Handheld UGC, especially for international markets where lip-sync matters. Lo-fi authenticity.
Cost. Lower tier per generation.
The big surprise: cinematography prompting matters more than model choice
Here is the result that genuinely surprised us. When we ran the same brief through every model with a generic prompt, the differences between models were significant. When we ran the cinematography-enriched brief through every model, the differences narrowed.
The generic prompt was something like "a 30-second testimonial for magnesium glycinate, woman in her thirties, bedroom setting." Each model interpreted this differently. Some did golden hour, some did fluorescent overhead, some did harsh midday. Some did a close-up, some did a wide. The variance between models was enormous.
The enriched prompt specified "50mm portrait lens, soft window light from camera-left, slight handheld feel, A24 muted grade, talent at three-quarter angle, evening warmth in the colour temperature, 30-second duration with a slow push-in across the line." Every model produced something recognisably similar to the brief. The variance between models was much smaller. The variance between models on the enriched brief was smaller than the variance between generic and enriched on a single model.
The takeaway is uncomfortable for the "best model" framing. The lift from better prompting is bigger than the lift from picking a better model. If you are spending time evaluating models without first locking down your prompting framework, you are optimising the wrong layer.
We have written more on this in our guide to writing AI video prompts that look professional, which walks through the ten cinematography elements that drive most of the perceived quality difference.
Want auto-translated, cinematography-enriched prompts? Start free with 50 credits.
How model translation changes results
The other surprise: the same enriched cinematography brief, translated to each model's preferred input syntax, produced meaningfully better results than the same brief sent verbatim to every model.
Models have dialects. Kling wants comma-separated cinematography terms in a specific order. Veo wants flowing prose with audio direction in natural language. Sora wants vivid action verbs and present-tense phrasing. Grok responds to aesthetic references and style adjectives. Hailuo prefers compact, declarative briefs.
When we ran a brief in the wrong dialect, output quality dropped roughly to the level of a generic prompt. When we ran the same brief in the right dialect, we got the cinematic results the brief was designed to produce. The dialect work itself is mechanical, but you have to know the dialects, and you have to know which model expects which.
This is what model translation is solving. The user writes one brief. The platform translates that brief into seven dialects and routes the right one to the right model. The user does not have to learn seven dialects to get the seven models to perform.
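To make the dialect idea concrete, here is a minimal sketch in Python. Everything in it (the `Brief` fields, the dialect rules, the phrasing templates) is our own illustration of the mechanics, not Tonic's schema and not any model's documented prompt API.

```python
from dataclasses import dataclass

@dataclass
class Brief:
    """One structured cinematography brief (illustrative fields only)."""
    subject: str
    lighting: str
    lens: str
    movement: str
    grade: str
    audio: str

def to_kling(b: Brief) -> str:
    # Kling-style dialect: terse, comma-separated cinematography terms.
    return ", ".join([b.lens, b.movement, b.lighting, b.grade, b.subject])

def to_veo(b: Brief) -> str:
    # Veo-style dialect: flowing prose with audio direction in natural language.
    return (f"{b.subject}, lit by {b.lighting}. Shot on a {b.lens} with "
            f"{b.movement}, graded with {b.grade}. Audio: {b.audio}.")

brief = Brief(
    subject="a woman in her thirties giving a testimonial in a bedroom",
    lighting="soft window light from camera-left",
    lens="50mm portrait lens",
    movement="a slow push-in and a slight handheld feel",
    grade="an A24 muted grade",
    audio="her voice softens on the third sentence",
)

print(to_kling(brief))  # comma-separated terms for Kling
print(to_veo(brief))    # prose with audio direction for Veo
```

The point is not this exact schema. The point is that once the dialect rules are written down, the translation is mechanical, which is exactly why it belongs in software rather than in a marketer's head.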
We covered the comparison framing more broadly in our piece on horizontal versus vertical AI video tools, but the dialect issue is the most concrete example of why orchestration matters more than model access.
What this means for tool choice
If you have to manually pick a model every time you generate, you are spending time on the wrong problem. The model choice is a routing decision, not a creative decision. Routing is exactly the kind of thing software is good at and humans are bad at.
The questions you actually want your team to be answering are creative. What is the hook? Who is the talent? What is the emotional register? What is the cinematography intent? Those are decisions humans should be making. Which of the seven leading models will best execute the cinematography intent for this specific brief is a decision a routing layer can make, faster and more accurately than a marketer reviewing release notes.
This is the gap horizontal tools leave open. They give you access to many models. They do not give you the routing layer that turns access into output. You end up either (a) defaulting to one model and giving up the upside of the others, or (b) doing the routing manually, which is time you should be spending on creative.
How Tonic's auto-routing works (in broad strokes)
Without giving away the full IP, here is the shape of what Tonic does.
A brief enters the platform. Phase 1 enriches the brief with cinematography appropriate to the content type (testimonial, product hero, lifestyle, etc.). The enrichment uses a learned framework that maps content types to the cinematography conventions that work for them.
Phase 2 selects the right model for the enriched brief based on the content type, the visual reference, the cost tier, and the aspect ratio. The same enriched brief is then translated into the dialect that model expects.
Phase 3 runs compliance checks for the brand's category and rewrites any non-compliant claims before generation. This matters more for some categories (supplements, skincare) than others, and the rewrites are surfaced in the audit trail so legal teams can verify what was changed.
The output is a generation that uses the right model, in the right dialect, with the right cinematography, screened for the right compliance framework. The user did one brief.
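As a rough mental model, here is what that three-phase shape looks like in code. Every name, table, and rule below is a hypothetical simplification we wrote for illustration; it is not Tonic's implementation, and the real enrichment and compliance frameworks are far richer than lookup tables.

```python
# A toy three-phase pipeline. All tables and names below are hypothetical.

ENRICHMENT = {  # Phase 1: content type -> cinematography conventions
    "testimonial": "50mm portrait lens, soft window light, slight handheld feel.",
    "product_hero": "Macro lens, controlled studio light, locked-off tripod.",
}

ROUTING = {  # Phase 2: (content type, cost tier) -> model
    ("testimonial", "premium"): "veo-3.1-standard",
    ("testimonial", "volume"): "veo-3.1-fast",
    ("product_hero", "volume"): "seedance-2.0",
}

CLAIM_REWRITES = {  # Phase 3: category -> non-compliant claim -> safe claim
    "supplements": {"cures insomnia": "supports restful sleep"},
}

def run_pipeline(brief: str, content_type: str, cost_tier: str, category: str):
    audit = {"original": brief}

    # Phase 1: enrich the brief with cinematography for its content type.
    enriched = f"{brief} {ENRICHMENT[content_type]}"
    audit["enrichment"] = ENRICHMENT[content_type]

    # Phase 2: select the model; dialect translation (elided here) would
    # then reshape the enriched brief for that model, as in the sketch above.
    model = ROUTING[(content_type, cost_tier)]
    audit["model"] = model

    # Phase 3: rewrite non-compliant claims and record every change.
    for bad, safe in CLAIM_REWRITES.get(category, {}).items():
        if bad in enriched:
            enriched = enriched.replace(bad, safe)
            audit.setdefault("rewrites", []).append((bad, safe))

    return model, enriched, audit

model, prompt, audit = run_pipeline(
    "A 30-second testimonial for a magnesium supplement that cures insomnia.",
    content_type="testimonial", cost_tier="volume", category="supplements",
)
print(model)   # veo-3.1-fast
print(audit)   # the full trail: enrichment, model choice, claim rewrites
```

The audit dictionary is the part worth noticing: every transformation is recorded, which is what makes the pipeline inspectable rather than a black box.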
The "show-me-the-prompt" UI exposes every step. You see what you wrote, what cinematography enrichment was applied, what compliance rewrites happened, which model got selected, and what the dialect-translated brief looked like. The whole pipeline is auditable.
The case for orchestration over picking models
The market for AI video tools is bifurcating. On one side, horizontal platforms compete on model breadth: who has access to Sora 2 Pro, who has Veo 3.1 Fast, who can offer Kling 3.0 Pro at the lowest price. On the other side, vertical platforms compete on orchestration: given access to all the models, who routes them most intelligently to your specific use case.
For DTC brands, the orchestration side is where the real lift is. Model breadth without orchestration is just a longer list of decisions you have to make manually. Orchestration without breadth is a smaller toolkit, but one that takes far less work to use.
Our bet is that orchestration wins for DTC specifically because DTC creative is a high-volume, high-iteration discipline where the team's time is the constraint. Anything that removes a manual decision per asset compounds quickly across a quarter's worth of paid social.
If you are running an AI video pipeline for a DTC brand right now and you are still picking models manually, the question is whether your time is better spent on that or on the creative work the routing decision is preventing you from getting to.
For brands running £5M+ in annual revenue, a guided walkthrough is the fastest way to see how routing performs against your real test suite. For everyone else, 50 welcome credits on the free tier is enough to test five or six briefs end to end.
Related reading
- How to Write AI Video Prompts That Actually Look Professional. Most AI videos look amateur because the prompts are amateur. Here is the cinematography vocabulary that separates professional output from mediocre.
- Higgsfield Alternative: Why DTC Brands Are Switching to Vertical-Specific AI Video Tools. Higgsfield is built for creators, not DTC brands. Here is what DTC marketers need from AI video tools and why they are switching.
- AI Video Tools for DTC Brands: Honest Comparison of 5 Options in 2026. Comprehensive comparison of 5 AI video tools for DTC brands: Tonic Studio, Higgsfield, Arcads, Runway, and Synthesia. Honest strengths, weaknesses, and pricing.
- Cost Per AI Video by Model in 2026: A 30x Spread Explained. There is no single answer to "what does an AI video cost in 2026". Per-second prices range 30x across the seven models that matter. Which model is worth which placement.
- How to Write AI Video Prompts for Veo 3.1. Veo 3.1 is the most expensive credible video model in 2026. How to brief it to actually justify the per-second premium, and when to route the work elsewhere.
- How to Write AI Video Prompts for Sora 2 Pro. Sora 2 Pro rewards different briefing patterns from Veo. Character bibles, dialogue annotation, continuity references, and where the model justifies its cost.
- How to Write AI Video Prompts for Kling 3.0. Kling 3.0 Pro is the workhorse model in well-run AI video pipelines. The syntax that works on Veo produces uneven Kling output. The brief structure that does work.
Try Tonic Studio free
30 seconds to your first AI-generated UGC video. No credit card required.
Get started