
How to make an AI car video

The honest answer, without the “future of automotive marketing” filler. Three categories of AI car video, what customers actually see in each, and what we’d build on if we were in your seat today.

By VoxRefine · Published April 18, 2026 · 6 min read

If you searched “how to create an AI car video,” you’re not actually asking about software. You’re asking: my BDC is sending thousands of lead emails a month, my reps can’t record a personal video for each one, and templates are dying — what’s the real move? That’s the question worth answering.

Here’s the short version: AI car video is not one thing. It’s three different things, and the category you pick changes what your customer sees when they open the email. Pick wrong and you scale spam. Pick right and your showroom handoffs get easier. We’ll walk through all three, what each actually produces, what it costs to run, and where we’d put money if we were opening a new point tomorrow.

This is written for GMs, BDC managers, and sales managers already running outbound video at some scale. If you’re a single-rep used lot working fifty leads a month, stop reading: manual Covideo is fine for you and nobody needs to sell you anything.

The three categories of AI car video (and what each one actually does)

Every tool in this market is really one of three approaches. Different bets on which part of the video should be AI-generated. Different tradeoffs on scale vs. trust. Naming them clearly is worth more than any feature sheet.

1. Manual record (the legacy tier)

This is the workflow most dealers already know. Your salesperson hits record, reads a script, hits send. The video is 100% real and 0% AI. Tools in this tier: Covideo classic (the 20-year incumbent, ~3,500 dealers, built the manual-record playbook), TradePending Video (Snapcell), VentaVid, and CarFilm. All of them do roughly the same thing: record, send, track the open.

The strength is authenticity — the person on camera is the person the customer will meet in-store. The weakness is a hard ceiling: rep time. A BDC can’t record 10,000 personal videos a month, no matter how many reps you add. This tier is great for ad-hoc walkarounds and one-off responses. It is not going to solve volume personalization for you.

2. Synthetic AI avatars (the generated-face tier)

Here, the face itself is generated. You type a script, AI produces a talking persona (“Megan,” “Laura,” “Lauren,” depending on the vendor), and the output is a video seconds later — no recording step. Tools in this tier: Covideo AI Video Agent (the headline move from the incumbent), Matador.AI (automotive-focused conversational AI that ships avatar video as part of its stack), and Synthesia-based setups (the generic AI-video engine used by a chunk of non-automotive players and a few automotive ones).

The strength is scale with zero recording friction. Any language, any script, any time. The weakness is the whole reason you probably landed on this blog post: customers increasingly smell the fake. The blink rhythm, the eye contact that never quite lands, the lip-sync that is 98% right and 2% wrong — these signals read as “AI,” and the moment the customer labels it that, the email becomes marketing, not a message from their salesperson. Worse: the face on screen isn’t on your sales floor. So when the customer walks in, they meet a stranger.

3. Real face + cloned voice (the tier nobody else is in)

This is VoxRefine, and we’re effectively alone here on purpose. The video is real unmodified footage of your actual salesperson. No synthetic face, no lip-sync manipulation. AI generates only the audio segments that change per customer — the name, the vehicle, the appointment time — using a clone of the salesperson’s own voice from a single 60–90 second source recording. One recording produces thousands of personalized videos. The customer sees the real person. Every send is personal.

The tradeoff: you need at least one team member willing to be the on-screen face, and you need a CRM the platform can pull personalization data from. For any dealer running CDK, Reynolds & Reynolds, Dealertrack, VinSolutions, or DriveCentric, that’s table stakes already.
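To make the split between fixed and per-customer audio concrete, here is a minimal sketch of how a script could be divided into segments. It is an illustration under stated assumptions, not a description of the production pipeline: the `SCRIPT` layout, the `plan_segments` function, and the CRM field names (`name`, `vehicle`, `time`) are all hypothetical.

```python
# Hypothetical script layout: "fixed" segments play from the original
# recording; "variable" segments are the only parts sent to the cloned voice.
SCRIPT = [
    ("fixed", "Hi"),
    ("variable", "{name}"),
    ("fixed", "just confirming your appointment to see the"),
    ("variable", "{vehicle}"),
    ("fixed", "at"),
    ("variable", "{time}"),
]


def plan_segments(crm_record):
    """Return the ordered audio plan for one customer (illustrative only)."""
    plan = []
    for kind, text in SCRIPT:
        if kind == "variable":
            # Fill the placeholder from the CRM record; this is the only
            # audio that would need to be generated per customer.
            plan.append({"source": "cloned_voice", "text": text.format(**crm_record)})
        else:
            plan.append({"source": "original_recording", "text": text})
    return plan


maria = {"name": "Maria", "vehicle": "2024 RAV4", "time": "10:30 on Saturday"}
for segment in plan_segments(maria):
    print(segment["source"], "->", segment["text"])
```

The point of the sketch is the ratio: three short generated phrases per customer, everything else reused from the one source recording.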

What customers actually see

Forget the feature grid for a second. Imagine Maria, who just set an appointment for a 2024 RAV4 at 10:30 on Saturday. She opens three versions of the confirmation video on her phone.

Manual record: It’s a generic clip from the BDC rep recorded at 8am Tuesday — same video everyone got this week. No name, no vehicle, no appointment time. Maria watches the first four seconds and closes it. The open counts as a win in the analytics dashboard. The appointment still no-shows on Saturday.

Synthetic avatar: A generated woman says “Hi Maria, confirming your 10:30 appointment on the RAV4.” The voice is clean, the words are right, but the eyes don’t track correctly and the smile resets between sentences. Maria spots it in under five seconds. She shows up Saturday annoyed, and then meets a completely different person at the dealership — a man, not the woman from the video. The trust baseline is already underwater before the test drive starts.

Real face + cloned voice: It’s Jason from the sales floor, on camera in front of the actual showroom. His mouth is moving naturally because the footage is real. The audio says her name, her RAV4, her 10:30. Maria smiles. Saturday she walks in and recognizes Jason from fifteen feet away. Handshake, coffee, test drive. The relationship started before she got in her car.

The three approaches all technically “send a personalized video.” Only one of them delivers the thing you actually want, which is continuity of trust from inbox to handshake.

Want to see what a voice-cloned appointment video from your actual salesperson looks like? Send us a short clip — we’ll send it back personalized to a test customer.


What it costs to run each approach

Ignore the sticker price for a second and look at the cost structure. It tells you what will happen at scale.

Manual record is labor-bound. You pay per seat, but the real cost is rep time per outbound video. A BDC rep recording 30 videos a day is doing almost nothing else. Doubling your outbound volume means hiring. That math doesn’t bend.
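A back-of-the-envelope check on that ceiling, using the 30-videos-a-day figure above and the 10,000-a-month volume mentioned earlier. The 22 working days per month and the `reps_needed` helper are assumptions for illustration; swap in your own numbers.

```python
import math


def reps_needed(videos_per_month, videos_per_rep_per_day=30, working_days_per_month=22):
    """Full-time recording reps required to hit a monthly volume.

    working_days_per_month=22 is an assumed figure, not from any vendor.
    """
    per_rep_per_month = videos_per_rep_per_day * working_days_per_month
    return math.ceil(videos_per_month / per_rep_per_month)


print(reps_needed(10_000))  # 16
print(reps_needed(20_000))  # 31
```

Under those assumptions, 10,000 videos a month takes 16 reps doing almost nothing but recording, and doubling the volume roughly doubles the headcount.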

Synthetic AI avatars are render-bound. Pricing usually looks like a per-minute or per-credit model (or a blended subscription that works out to the same thing). It scales with send volume, not with headcount. The real cost is the invisible one: the trust delta vs. a real-face send. Harder to put on a P&L, but it shows up in show rate.

Real face + cloned voice is compute-bound with a fixed ceiling. Our platform generates 10,000+ personalized videos per hour across a distributed GPU cluster, and that ceiling doesn’t care whether your BDC rep took lunch. One source recording, re-used for every appointment, follow-up, and service reminder for the life of that salesperson’s tenure.

None of the serious players list full dealership pricing publicly — every category requires a demo conversation — so take any specific “$X per video” number from a review site with skepticism. The cost structure is the real read.

What we’d do if we were starting today

Here’s the decision framework we’d use in your seat.

If your volume is under ~200 outbound videos a month and ad-hoc: Stay on manual record. Covideo classic or TradePending Video does the job. Don’t overbuy. When rep time becomes the bottleneck, revisit.

If your volume is automated and trust-sensitive (appointment confirmations, no-show follow-ups, service reminders, equity mining): Real face + cloned voice. This is the lane VoxRefine was built for. Face continuity from the video to the showroom is the whole game, and synthetic avatars break it.

If the message is transactional and the face really doesn’t matter (a generic service-campaign blast, an hours-change announcement, a promo where nobody cares who delivers it): AI avatars can fit. Know what you’re optimizing for: speed over relationship. Don’t use avatars for anything that needs to end in a handshake.
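The three rules above can be sketched as a small function. Treat it as a rough encoding, not a product feature: the `recommend_tier` name and its parameters are hypothetical, and the order in which the rules are checked (so that trust-sensitive sends always land on real face + cloned voice once you are past low-volume ad-hoc work) is our assumption for cases the rules don’t spell out.

```python
def recommend_tier(monthly_videos, automated, trust_sensitive):
    """Rough encoding of the decision framework (illustrative only)."""
    # Low-volume, ad-hoc sending: rep time is not yet the bottleneck.
    if monthly_videos < 200 and not automated:
        return "manual record"
    # Anything that should end in a handshake needs face continuity.
    if trust_sensitive:
        return "real face + cloned voice"
    # Transactional sends where the face does not matter.
    return "AI avatar"


print(recommend_tier(150, automated=False, trust_sensitive=True))
print(recommend_tier(5000, automated=True, trust_sensitive=True))
print(recommend_tier(5000, automated=True, trust_sensitive=False))
```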

The mistake we see over and over: dealers use avatars to send what should have been real-face videos, because avatars are cheaper per send. They save three cents and cost six points of show rate. The math is obvious if you let yourself look at it.
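Here is that math written out as a sketch. It assumes each send corresponds to one booked appointment and reads “six points” as a 0.06 drop in the share of appointments that show; the function names and the $500 value per kept appointment are hypothetical placeholders, not benchmarks.

```python
def break_even_value_per_show(savings_per_send, show_rate_drop):
    """Value of a kept appointment at which the cheaper send stops paying off."""
    return savings_per_send / show_rate_drop


def net_per_send(savings_per_send, show_rate_drop, value_per_show):
    """Savings minus the expected value lost to no-shows, per send."""
    return savings_per_send - show_rate_drop * value_per_show


# Three cents saved per send, six points of show rate lost.
print(break_even_value_per_show(0.03, 0.06))
# Hypothetical: a kept appointment is worth $500 to the store.
print(round(net_per_send(0.03, 0.06, 500.0), 2))
```

With those inputs the break-even works out to 50 cents: the avatar only wins if a kept appointment is worth less than that to you. Plug in whatever a shown appointment is actually worth at your store.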

One recording of the right salesperson, cloned once, reused forever. Every confirmation, every follow-up, every equity mining touch — in the voice of the person the customer will actually meet. That’s the play.

See it on your own salesperson

The fastest way to evaluate this category is to see your actual rep on screen with a cloned voice saying a test customer’s name. Send us a short clip. We’ll send it back.


Related questions

What counts as an AI car video?

Any dealership video where AI does meaningful work on the output — either generating the face, generating the voice, or stitching in personalized data (customer name, vehicle, appointment time) from the CRM. A human salesperson hitting record on Covideo and reading a script is not an AI car video. A Synthesia-style avatar reading a prompt is. A real recorded salesperson with AI-generated personalized voice lines is too. The category is wide, and the approaches are not interchangeable.

Are AI avatar videos effective for car dealerships?

They scale, which is their main appeal. The tradeoff is face continuity: the synthetic persona on screen is not a member of your staff, so the customer who shows up at the dealership meets someone they've never seen. For scripted, low-stakes messages — service explainers, promo announcements — AI avatars can work. For appointment confirmations and follow-ups, where the goal is relationship, a generated face starts the relationship with a lie. Customers increasingly spot avatars, and when they do, trust drops.

Do customers trust AI-generated car videos?

It depends on which part is AI-generated. Customers are quick to flag a synthetic face — the eyes, the blink cadence, the lip-sync give it away within a few seconds. Voice-cloned audio on top of real video footage is much harder to detect: blind tests commonly clear 98% accuracy against the real recorded voice. The honest answer: customers trust real faces. Leave the face alone and the rest of the personalization holds up.

How much does an AI car video platform cost?

Manual-record platforms like Covideo classic are priced per rep per month and scale linearly with seats: more reps recording means more spend. AI avatar platforms typically price per generated minute or per video-generation credit. Voice-cloned platforms like VoxRefine are usually priced per rooftop with volume tiers, because the underlying cost is compute rather than human recording time. None of the serious players publish full pricing (all require a demo conversation), but the cost structure is the real question. Manual is labor-bound, avatars are render-bound, cloned-voice is compute-bound.

What is personalized video for dealerships?

A video sent to a specific lead that names them, their vehicle of interest, and usually their appointment time or service milestone. Done manually, a salesperson records each one. Done at scale, the personalization is AI-generated from CRM data. The point is the same: break through template fatigue. The dividing line between tools is how much of the video is automated and how much of the customer's trust the automation costs.
