Straight answers on how voice cloning works, why we refuse to put synthetic faces on screen, which CRMs we plug into, and what it costs to get a rooftop live. No fluff.
Yes, 100%. Your team member records one 60–90 second source video, and that footage is what the customer sees. We never generate, morph, or lip-sync their face. The only thing AI touches is a handful of personalized audio segments — the customer's name, the vehicle, the appointment time — rendered in your salesperson's own cloned voice.
We take one clean recording from your salesperson, isolate their voice, and train a neural TTS model on their cadence, tone, and accent. When a video is generated for a specific lead, CRM data fills the personalized slots (name, vehicle, time) and the model renders those phrases in the rep's voice. The rendered audio is stitched into the original video at the exact marks we captured during the source recording. Blind-test accuracy runs 98%+: in those tests, listeners generally can't tell which words were spoken live and which were rendered.
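For the technically curious, the slot-filling step described above can be sketched in a few lines of Python. This is an illustration only, not the production pipeline: `SlotMark`, `build_render_plan`, the CRM field names, and the timestamps are all hypothetical, and the actual voice rendering and audio stitching are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlotMark:
    """A personalization slot captured in the source recording (hypothetical shape)."""
    name: str       # CRM field that fills this slot
    start_s: float  # where the slot begins in the source video, in seconds
    end_s: float    # where the slot ends, in seconds


def build_render_plan(marks, lead):
    """Pair each slot mark with the text to render for one lead, in playback order."""
    plan = []
    for mark in sorted(marks, key=lambda m: m.start_s):
        if mark.name not in lead:
            raise KeyError(f"lead is missing CRM field {mark.name!r}")
        plan.append({
            "slot": mark.name,
            "text": str(lead[mark.name]),
            "start_s": mark.start_s,
            "end_s": mark.end_s,
        })
    return plan


# Example values are made up for illustration.
marks = [
    SlotMark("vehicle", 12.4, 14.1),
    SlotMark("customer_name", 1.2, 1.9),
    SlotMark("appointment_time", 20.0, 21.5),
]
lead = {
    "customer_name": "Jordan",
    "vehicle": "2022 Honda CR-V",
    "appointment_time": "Tuesday at 4 PM",
}
plan = build_render_plan(marks, lead)
```

Each entry in `plan` says which phrase to render in the rep's voice and where it belongs in the source video. Failing loudly on a missing CRM field is a design choice in this sketch: a video with a blank name slot is worse than no video.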
Only if you keep the face real. The uncanny-valley problem is a face problem, not a voice problem: customers have been hearing AI voices in phone trees, GPS, and smart speakers for a decade and tune them out. Generated faces are the thing that breaks trust. VoxRefine ships AI where customers have already accepted it (voice) and keeps humans where it still matters (face).
One 60–90 second recording per person you want cloned. Smartphone quality is fine if the lighting is decent and the audio is clean. We send a script template with the exact phrasing and pauses we need for the slot marks. Our pipeline handles noise reduction, voice isolation, and model training. Most teams have a working demo within 2 hours of upload.
AI avatars generate a synthetic face reading a scripted message — the person on screen never existed. Voice-cloned video, the VoxRefine approach, keeps your actual salesperson on screen and only generates the personalized audio slots. The customer sees a real staff member they'll meet in person; the AI is doing the tedious per-lead customization in the background.
Most dealers we talk to say yes, and the data backs them up. Customers spot synthetic faces fast (odd micro-expressions, mismatched eye gaze, overly smooth skin), and once they notice, the whole message reads as inauthentic. Dealership trust is built in person over years. Starting that relationship with a fabricated face is a bad opening move. That's the entire reason VoxRefine exists in a category of one.
They trust videos that look and sound like the real person they'll meet. They don't trust videos where the face is generated — those get flagged, forwarded to friends as a joke, or ignored. Keep the face human and the personalization AI-driven and the trust problem largely goes away. That's the design choice VoxRefine made and AI-avatar tools like Covideo's 'Megan,' 'Laura,' and 'Lauren' didn't.
Record your salesperson once on their phone. Upload the clip. Tell us which phrases are the personalization slots (name, vehicle, appointment time). We clone their voice, train the model, and expose a CRM webhook that generates a per-lead video every time a trigger fires. No synthetic face, no avatar library, and no rep sitting there recording each video by hand the way TradePending Video (Snapcell), VentaVid, and CarFilm require. Just your actual people at scale.
Native integrations with CDK, Reynolds & Reynolds, Dealertrack, VinSolutions, DriveCentric, and the other major DMS platforms. Setup is one API key plus one webhook. Videos trigger on the events your BDC already uses: appointment set, appointment confirmed, status change, custom tag. The BDC workflow itself doesn't change; videos just start going out.
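To make the trigger model concrete, here is a minimal Python sketch of a receiver that turns a CRM event into a video-generation job. The payload shape, the event names, and `handle_webhook` are assumptions made for illustration; they are not a documented VoxRefine or CRM-vendor API.

```python
import json

# Hypothetical identifiers for the four trigger events listed above.
TRIGGER_EVENTS = {
    "appointment_set",
    "appointment_confirmed",
    "status_change",
    "custom_tag",
}


def handle_webhook(body: str):
    """Return a video-generation job for a trigger event, or None for anything else."""
    event = json.loads(body)
    if event.get("event") not in TRIGGER_EVENTS:
        return None  # not an event that should produce a video
    return {
        "lead_id": event["lead_id"],
        "rep_id": event["rep_id"],
        "trigger": event["event"],
    }


# Example payload with made-up IDs.
job = handle_webhook(json.dumps({
    "event": "appointment_set",
    "lead_id": "L-1001",
    "rep_id": "R-7",
}))
```

Events outside the trigger set are ignored rather than rejected, so a CRM can send its full event stream to one endpoint without extra configuration.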
10,000+ videos per hour across a distributed GPU cluster, with sub-50 ms render time per personalized segment. A 5-rooftop group sending 50,000 videos a month is well inside cruising speed. There's no queue, no degradation, no 'please wait.' Auto-scaling means a Monday-morning appointment blast behaves the same as a slow Sunday.
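The headroom claim is simple arithmetic, worked through below in Python. It assumes a 30-day month and treats the 10,000+ figure as exactly 10,000 videos per hour, which is the conservative reading.

```python
VIDEOS_PER_HOUR = 10_000   # stated throughput, taken at its floor
MONTHLY_SENDS = 50_000     # the 5-rooftop example above
HOURS_PER_MONTH = 30 * 24  # assumption: 30-day month

# Hours of rendering needed to produce a full month of sends.
hours_needed = MONTHLY_SENDS / VIDEOS_PER_HOUR

# Share of the month's total capacity those sends consume.
utilization = MONTHLY_SENDS / (VIDEOS_PER_HOUR * HOURS_PER_MONTH)

print(f"{hours_needed:.1f} hours of rendering per month")
print(f"{utilization:.2%} of monthly capacity")
```

Under those assumptions the whole month's volume fits in about 5 hours of rendering, or roughly 0.69% of monthly capacity.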
Most dealers are live within 48 hours. Day 1: record source videos, drop in the API key, configure one webhook. Day 2: test sends from a staging appointment, then flip the CRM trigger to production. No workflow change for the BDC team. No custom IT project.
We don't publish pricing: every dealer group has a different rooftop count, send volume, and integration surface, and a public list price would be wrong for 90% of them. Pricing is monthly per rooftop with volume tiers on videos per month. Under plausible assumptions about show-rate lift, payback typically lands in under 2 months. Request a quote and we'll send real numbers for your footprint the same day.
Covideo is priced per user per month — a standard per-seat SaaS model — with their AI avatar add-ons priced separately. That shape makes sense for a tool built around reps recording one-to-one videos. VoxRefine is priced per rooftop with volume tiers because the send volume is automated, not bounded by rep recording time. Different pricing shape, different problem being solved — the two can even coexist at the same dealership. For current Covideo list pricing, check their website or ask your rep.
Three things: number of rooftops, monthly video volume, and number of distinct people you want cloned (each voice is its own trained model). Integration complexity is flat — CDK and Reynolds plug in the same way. Standard onboarding is built into the engagement; anything genuinely custom (unusual DMS, multi-entity provisioning) is scoped separately.
Send us the specifics of your rooftop and we'll show you a personalized video built with one of your actual team members — typically within 48 hours.
Book a demo →