Is the video actually my sales manager or an AI avatar?

It is 100% your person on screen. They record one video, and our AI clones their voice pattern. We then generate personalized audio segments (names, dates, vehicle details) that sound exactly like them. The video stays authentic; only specific audio moments are modified. No synthetic faces, no deepfakes.

Can customers tell a VoxRefine video is AI-generated?

No. The video is genuinely your team member. VoxRefine's voice synthesis passes blind perception tests with 98%+ accuracy because it uses the team member's voice as the source model. There are no synthetic faces, no deepfakes, and no visible rendering glitches.

How many personalized videos can VoxRefine generate per month?

VoxRefine processes 10,000+ videos per hour across its distributed GPU cluster. Auto-scaling infrastructure maintains sub-50ms render time whether a dealer sends 50 videos or 50,000. Built for dealer groups running multi-rooftop operations.
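
To put the hourly figure in monthly terms, here is a rough back-of-the-envelope sketch. It uses only the 10,000-per-hour floor quoted above; the function names, the 30-day month, and the assumption of continuous operation are illustrative, not published capacity guarantees.

```python
# Floor throughput quoted in the text: 10,000+ videos per hour.
VIDEOS_PER_HOUR = 10_000


def hours_to_render(batch_size: int, videos_per_hour: int = VIDEOS_PER_HOUR) -> float:
    """Hours needed to clear a batch at a steady hourly rate."""
    return batch_size / videos_per_hour


def monthly_ceiling(videos_per_hour: int = VIDEOS_PER_HOUR,
                    hours_per_day: int = 24,
                    days: int = 30) -> int:
    """Upper bound on monthly volume.

    Assumption: the cluster runs continuously for a 30-day month.
    """
    return videos_per_hour * hours_per_day * days


print(hours_to_render(50_000))  # 5.0
print(monthly_ceiling())        # 7200000
```

At the quoted floor rate, a 50,000-video batch clears in about five hours, and the theoretical monthly ceiling sits well above what a multi-rooftop group would send.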

Which dealership CRMs and DMS platforms does VoxRefine work with?

Whatever CRM your BDC already uses: CDK, Reynolds & Reynolds, Dealertrack, VinSolutions, DriveCentric, or anything else. VoxRefine captures the lead data from your existing CRM workflow, so videos send automatically on the events the BDC already acts on: appointment set, status change, custom tag. Most dealers are fully live within 48 hours, with no integration project, no vendor-side sign-off, and no IT ticket.
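
As a rough illustration of event-triggered sends, the sketch below filters CRM events down to the three triggers named above. The event names, the `CrmEvent` fields, and the `should_send_video` helper are all hypothetical; this is not VoxRefine's actual integration or any CRM's real API.

```python
from dataclasses import dataclass

# Hypothetical event names for the three triggers named above.
TRIGGER_EVENTS = {"appointment_set", "status_change", "custom_tag"}


@dataclass
class CrmEvent:
    lead_id: str
    event_type: str
    salesperson: str  # the rep whose source video would be used


def should_send_video(event: CrmEvent) -> bool:
    """Return True when the event is one the BDC already acts on."""
    return event.event_type in TRIGGER_EVENTS


events = [
    CrmEvent("lead-001", "appointment_set", "Dana"),
    CrmEvent("lead-002", "note_added", "Dana"),
    CrmEvent("lead-003", "custom_tag", "Lee"),
]

queued = [e.lead_id for e in events if should_send_video(e)]
print(queued)  # ['lead-001', 'lead-003']
```

The point of the design is that nothing new is asked of the BDC: events they already create are the only input.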

What does a dealership need to provide to get started with VoxRefine?

One 60 to 90 second video per person being cloned. Smartphone quality is acceptable with good lighting and clear audio. VoxRefine provides a script template with the exact phrasing and pauses needed. The pipeline handles noise reduction, voice isolation, and model training. A working demo is typically ready within two hours of uploading.

Are customers actually able to tell AI avatars from real video?

More of them, more of the time. The general public has been trained on synthetic faces across marketing, customer support, and social media for the last few years, and pattern-recognition compounds fast. The signals that give an avatar away are micro: eye movement that doesn't track a real focal point, blink cadence on a metronome instead of a breath, a smile that resets cleanly between sentences. A customer doesn't need vocabulary for any of that. They know the face is wrong inside the first few seconds, and once they decide it isn't real, the message reclassifies in their head from "my salesperson" to "automated marketing." Some customers won't notice or won't care, especially on short scripted sends. The risk is asymmetric: the ones who do notice often assume the rest of the dealership's outreach is fake too, and that trust cost is hard to walk back.

What's the difference between AI voice cloning and AI avatars?

An AI avatar generates the face on screen. The person in the video does not exist. The platform produced the eyes, the mouth, the head movement from a model. AI voice cloning leaves the face alone. It takes a recording of a real person's voice and synthesizes new audio in that same voice. With a cloned voice approach, the salesperson on screen is the actual salesperson; the only AI-generated layer is the personalized audio segments: the customer's name, the specific vehicle, and the appointment time, all spoken in that rep's own voice. Different category, different trust profile, different failure mode.
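
One way to picture the split between the fixed video and the generated audio layer is the minimal sketch below. The `Lead` fields and the `personalized_segments` helper are illustrative assumptions, not the real pipeline; the only thing taken from the text is which three pieces change per customer.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Lead:
    first_name: str
    vehicle: str
    appointment: datetime


def personalized_segments(lead: Lead) -> dict[str, str]:
    """Text for the only parts that would be synthesized in the rep's voice.

    Everything else in the video is the original recording, untouched.
    """
    return {
        "name": lead.first_name,
        "vehicle": lead.vehicle,
        "appointment_time": lead.appointment.strftime("%A at %I:%M %p"),
    }


lead = Lead("Jordan", "2024 Honda CR-V", datetime(2026, 3, 14, 10, 30))
segments = personalized_segments(lead)
print(segments)
```

An avatar tool, by contrast, would have no fixed recording at all: the face, the voice, and the whole script would be generated.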

Is real-face plus cloned voice actually scalable?

Yes, because the bottleneck moves off rep time. A salesperson records one source video, usually 60 to 90 seconds, and the system generates thousands of personalized variants from that single recording. Throughput becomes compute-bound, not human-time-bound. A rooftop sending 10,000 emails a month can get every lead a personalized video from the assigned salesperson without anyone in the BDC recording anything new. Scale is real; the constraint is on the audio layer, which is where AI is already very good and improving fast.

Why would any dealership use AI avatars then?

Because for some sends the on-screen face isn't load-bearing. A recall notice, an hours-change announcement, a generic service-special blast: the customer isn't going to physically walk into the showroom and look for the person from the video. The avatar approach also wins on time to first send: pick a persona, type a script, ship. If a dealership values speed over face continuity for a specific send, an avatar is a reasonable engineering choice. The argument here isn't that avatars are bad. It's that the sends a dealership actually makes money on are the ones where face continuity matters most: appointment confirmations, no-show follow-ups, equity mining, and service reminders from the assigned advisor. Those are the sends where the avatar tradeoff lands wrong.

Covideo AI vs Real Salesperson Videos: Which Dealership Customers Actually Trust

The actual tradeoff

Personalized video for dealership outbound has been stuck in the same hole for a decade. Manual recording works, but rep time is the ceiling. Templates scale, but customers stopped opening them. Anyone trying to fix this had to pick a side. Generate the face, and you get scale. Hold the face, and you preserve trust. Nobody gets both for free.

Covideo, which has been in dealership video for 20+ years and serves around 3,500 rooftops, picked scale. Their newer AI Video Agent generates a synthetic face from a small library of named personas, lets the dealer script a message, and ships in minutes. No recording step. Zero salesperson coordination. The on-screen face is computed from a model, not pulled off the sales floor. That is a defensible engineering choice. For a vendor with 3,500 dealers to serve, it is arguably the only choice.

VoxRefine picked the other side. We leave the video alone. The face on screen is the actual salesperson, recorded once on real footage. AI is only used on the audio segments that change per customer: the name, the specific vehicle, the appointment time, spoken in a clone of that same rep's voice. The bet is that for dealerships specifically, the face has to match across the entire journey, and that requires holding the video constant.

What dealership customer behavior tells us

The thing that makes automotive different from almost every other industry that uses video outbound is this: the customer physically shows up. They drive to a parking lot, walk through a glass door, and scan the showroom for someone they recognize. Five seconds between the front door and the first handshake. That is the window everything before it was building toward.

Foureyes data on dealership lead behavior shows the obvious version of this: leads who get a personal touch from the assigned salesperson before the in-store visit convert better than leads who get a generic-feeling touch. Strolid and other BDC-focused groups have been saying the same thing for years. None of this is novel. What is changing is that the customer's read on what counts as a personal touch has tightened. A synthetic face used to feel futuristic. Now it feels like automated marketing. The bar moved.

In our R&D work with Premier Automotive, the consistent pattern is that the appointment-confirmation video is the send that does the most work. Get that right and show-rate moves. Get it wrong and the rest of the funnel cannot recover. The send that decides whether the customer walks in the door is the one place a dealership cannot afford to break face continuity.

The three approaches, side by side

Two synthetic-avatar options and the real-face option, on the dimensions a GM actually decides on.

Approach	What's synthetic	Face continuity to in-store	Best for
VoxRefine (real face + cloned voice)	Audio only: name, vehicle, appointment time	Preserved. Customer meets the same face from the video.	Appointment confirmations, no-show follow-ups, service reminders, equity mining
Covideo AI Video Agent	Face and voice: named AI personas (Megan, Laura)	Broken. Customer meets someone different at the door.	Recall notices, hours changes, generic announcements
Synthesia, HeyGen, similar avatar tools	Face and voice: fully generated speaker	Broken. Generic spokesperson, not dealer staff.	Training videos, internal comms, contexts with no in-person handshake

Where AI avatars are the right call

The honest version of this argument has to acknowledge where synthetic avatars win, because they genuinely do win in plenty of places. Anywhere the on-screen face is never going to physically meet the viewer, the avatar tradeoff inverts.

Training videos are the cleanest example. A 12-module compliance course narrated by a synthetic spokesperson is fine. Nobody is going to look for that person in a hallway later. Corporate internal comms, multi-language localization at scale, generic awareness ads where the face is decoration, support knowledge base walkthroughs, product explainer overlays inside a SaaS app: every one of these is a context where face continuity does not exist as a requirement in the first place. Synthesia and HeyGen built large businesses serving exactly those use cases. Covideo moving into avatars is a reasonable expansion into that same shelf.

The argument here is narrower than "avatars are bad." It is "avatars are wrong for the specific dealership sends that decide whether a customer walks in the door."

What is actually behind the customer trust shift in 2026

Two things are happening at once. First, consumer AI literacy is up sharply since 2023. Most adults have now interacted with a chatbot, an AI photo filter, and a synthetic-voice phone tree. Pattern recognition for "this is generated" happens faster than it did even 18 months ago. Second, synthetic-content fatigue is rising in the same population. The same people who were impressed by an AI demo two years ago are now annoyed by the third synthetic customer-service video this week.

The exact rate of either shift is hard to nail down. Every vendor publishing a number has a horse in the race, this one included. But the directional trend is consistent across the industry surveys that do exist. Treat the specifics as ranges. The qualitative point holds regardless of the decimal: the bar for what reads as a personal touch is moving, and it is moving against synthetic faces.

A note on Covideo's strategy specifically

Covideo expanded into AI avatars because the market asked for it. Their core manual-record product is solid; it has been the category default in dealership video for two decades for good reason. But manual record bottlenecks on rep time, and customer demand for at-scale video kept climbing. From a product perspective, they had two ways to scale: build a real-face plus cloned voice pipeline (technically harder, narrower category) or license a generative avatar stack (faster to ship, broader use cases including non-automotive). They picked the second. For a company with 3,500 dealers and an addressable market beyond automotive, that is a reasonable call.

We picked the first because we are not trying to be a general-purpose video platform. VoxRefine exists to fix one specific problem for one specific industry: scaled outbound video for car dealerships where the face on screen is the face the customer is about to meet. Different scope, different constraint, different answer. Both can be right.
