If you run a BDC, your outbound problem is math. You’re sending thousands of emails and texts a month. You want each one to feel like it came from the actual salesperson assigned to the lead. You can’t have that salesperson hit record thousands of times. So you have two honest choices: send templates everyone recognizes as templates, or put the voice on rails.
Voice cloning is how you put the voice on rails without giving up the real person. The dealer sends a personalized video — name, vehicle, appointment time — and the audio is genuinely in the assigned rep’s voice, because the underlying model was trained on that rep’s voice. No synthetic face. No avatar. No uncanny-valley tradeoff.
Here’s what’s actually happening, in language that doesn’t need a CS degree.
What voice cloning actually is (and is not)
Voice cloning is a neural model — a kind of text-to-speech system — trained on a specific person’s voice. You give it a 60–90 second source recording of that person speaking. It learns their tone, cadence, inflection, and vocal character. After that, you can feed it any text and it generates new speech in that person’s voice. Not a recording being replayed. Actual new audio, synthesized from the model.
It is often confused with three things it is not:
Not a voice filter. The TikTok-style effect that shifts your pitch or makes you sound like a chipmunk is signal processing on live audio. Voice cloning generates brand-new speech from text. Different problem, different stack.
Not old-school TTS. The robotic Siri-era voices from a decade ago were built from phoneme concatenation — slicing a fixed library of recorded sound-units and gluing them back together. Modern neural TTS trained on the source speaker’s voice is a different technology generation. The output has the speaker’s actual vocal timbre and natural prosody because the model learned them, not because a slicer approximated them.
Not an AI avatar. AI avatars generate a face and lip-sync. That’s video synthesis, and that’s where the uncanny valley lives — the eyes, the blink, the mouth. Voice cloning doesn’t touch the video at all. At VoxRefine the video is unmodified footage of the real salesperson. Voice cloning only generates the audio segments that change per customer.
The 60–90 second source recording
One of the fair questions we get from GMs is: that’s it? A minute and a half of Jason reading a script is enough to clone his voice? The short answer is yes, because of what that minute and a half actually captures.
A neural TTS model doesn’t need to hear every possible sentence a person might say. It needs enough of that person’s speech to model four things:
Phoneme coverage. The set of sounds a voice makes — roughly 44 distinct ones in English. A well-written source script covers most of them in a minute. The model learns how this specific speaker produces each sound.
Cadence and pacing. Where the rep pauses, how fast they move, where they stress a word. This is what makes a voice sound like a person and not a news reader.
Inflection. The rises and falls. Jason asking a question sounds different from Jason confirming an appointment, and the model learns that shape.
Timbre. The vocal signature — what makes Jason recognizably Jason. This is mostly baked into the first few seconds of clean source audio.
What does matter: a decent microphone (a USB condenser is plenty; this isn't a radio booth), a quiet room (the sales floor at 7am, before anyone's in, works fine), and a script varied enough to cover the phoneme range. Read a confirmation, a follow-up, a service reminder. Three use cases, ninety seconds, done.
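The phoneme-coverage idea is easy to sanity-check yourself. Here's a minimal sketch, with a hand-made ARPAbet lookup standing in for a real grapheme-to-phoneme tool (the `TINY_G2P` table is illustrative only; a real pipeline would use something like a CMU Pronouncing Dictionary lookup, and none of this is VoxRefine's actual tooling):

```python
# Rough sketch: how much of the English phoneme inventory does a
# source script touch? TINY_G2P is a hand-made, illustrative lookup;
# a real pipeline would use a proper grapheme-to-phoneme tool.
ENGLISH_PHONEME_COUNT = 44  # commonly cited inventory size

TINY_G2P = {
    "hey":   {"HH", "EY"},
    "sarah": {"S", "EH", "R", "AH"},
    "your":  {"Y", "UH", "R"},
    "tahoe": {"T", "AH", "HH", "OW"},
    "is":    {"IH", "Z"},
    "ready": {"R", "EH", "D", "IY"},
    "see":   {"S", "IY"},
    "you":   {"Y", "UW"},
    "at":    {"AE", "T"},
    "two":   {"T", "UW"},
}

def coverage(script: str) -> float:
    """Fraction of the phoneme inventory the script touches."""
    seen = set()
    for word in script.lower().split():
        seen |= TINY_G2P.get(word.strip(".,!?"), set())
    return len(seen) / ENGLISH_PHONEME_COUNT

script = "Hey Sarah, your Tahoe is ready. See you at two!"
print(f"covers {coverage(script):.0%} of the inventory")
```

Even this toy sentence hits 16 of the 44 sounds. A ninety-second script written across three use cases gets you the rest, which is why the source recording can be so short.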
How a “Hey Sarah, on that Tahoe…” video renders in under 50ms
Sarah books a 2pm appointment to look at a Tahoe. Your CRM (DriveCentric, VinSolutions, CDK, whichever) fires a webhook the second the appointment is set. Here’s what happens next, in order:
1. Webhook arrives. Customer data — name “Sarah,” vehicle “2024 Chevy Tahoe,” appointment “2:00 PM Saturday,” assigned rep “Jason” — lands in the render pipeline.
2. Template script fills in. Jason’s confirmation template has slots for name, vehicle, and time. The pipeline fills them: “Hey Sarah, looking forward to seeing you Saturday at 2 on the Tahoe.”
3. Neural TTS generates only the personalized segments. The model doesn’t regenerate the whole audio track. It generates just the words that change per customer — the name, the vehicle, the time. The fixed parts of the message are Jason’s original recorded audio from the source video. The generated segments match his voice because they’re produced from a model of his voice.
4. Audio stitches back into the video. The video layer is untouched — still Jason’s real footage. The personalized audio is timed to moments when his mouth is either not on camera or framed in a way that the stitch is invisible. The original source video is shot specifically with that in mind, which is why framing decisions matter on the record day.
5. Final MP4 streams out. On our distributed GPU cluster the render clears sub-50ms, and the pipeline handles 10,000+ videos per hour at steady state. In practice that means Sarah’s appointment-confirmation video is in her inbox before she’s out of the parking lot.
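The five steps above can be sketched end to end. This is a minimal illustration, not the actual VoxRefine API: the payload field names, the template, and the `synthesize_segment` stub (standing in for the neural TTS call, which returns audio in production) are all assumptions for the sketch.

```python
import re

# Hypothetical confirmation template; {slots} are the per-customer words.
TEMPLATE = "Hey {name}, looking forward to seeing you {time} on the {vehicle}."

def synthesize_segment(text: str, voice: str) -> str:
    # Stub standing in for the neural TTS call. A real system would
    # return generated audio in the rep's cloned voice.
    return f"<tts:{voice}:{text}>"

def render(payload: dict) -> list[str]:
    """Split the template into fixed vs. generated audio, in playback order.
    Only the slot values ever go through the voice model; everything
    else reuses the rep's original recorded audio."""
    segments = []
    for part in re.split(r"(\{\w+\})", TEMPLATE):
        if part.startswith("{"):
            # Personalized slot: generate it in the cloned voice.
            segments.append(synthesize_segment(payload[part[1:-1]], payload["rep"]))
        elif part:
            # Fixed copy: reuse the pre-recorded audio from the source video.
            segments.append(f"<recorded:{part}>")
    return segments

webhook = {"name": "Sarah", "vehicle": "Tahoe",
           "time": "Saturday at 2", "rep": "jason"}
for segment in render(webhook):
    print(segment)
```

The design point the sketch makes concrete: the model only ever generates the handful of words that change per customer, which is a big part of why the render stays fast. The stitch in step 4 then lays these segments over the untouched video track.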
Want to hear what your own salesperson’s cloned voice sounds like? Send us a short clip of them on camera and we’ll render a personalized test video back.
Request a personalized demo →
Why the output passes blind tests
In blind perception testing, the generated audio passes for the rep's real recorded voice in 98%+ of trials, meaning listeners can't reliably tell which is which. That number surprises people. It shouldn't.
The model isn’t impersonating the rep. It’s generating new sentences from a statistical model of the rep’s voice. Every acoustic feature — timbre, formant structure, breath pattern, micro-prosody — is derived from the source audio. The output has those features because they’re literally what the model was fit on. It’s not an imitation, it’s a resampling.
This is also why voice cloning clears perception tests and synthetic faces don’t. Human ears are forgiving on audio (we already accept phone compression, voicemail artifacts, Zoom packet loss). Human eyes are unforgiving on faces — we evolved to spot tiny errors in eye contact and blink cadence. Audio has a wider “sounds like a real person” tolerance zone. Cloned voice lands inside it. Synthetic video often doesn’t.
What voice cloning doesn’t solve
Worth being straight about this, because the pitch isn’t magic.
It doesn’t change the video. The rep on screen is always the same real rep from the source recording. You get the same visual every time. That’s a feature — it’s the exact person the customer will meet in-store — but it’s not a feature if you want a different face per lead.
It doesn’t touch mouth movements. Since we’re not doing lip-sync manipulation, the video has to be shot with framing that accommodates voice-over of the personalized segments — typically B-roll, over-shoulder shots, or cutaways during the words that will change per customer. This is a production decision, not a platform limitation, and it’s the main thing we coach dealers on when they record the source video.
It doesn’t work for languages you haven’t trained on. A voice model trained on English source audio speaks English. If your market needs Spanish confirmation videos, you record a Spanish source. Cross-lingual voice cloning is an active research area but it’s not what’s in production at VoxRefine today.
See it on your own salesperson
The fastest way to understand voice cloning is to hear it in your rep's actual voice. Send us 90 seconds of clean audio. We'll send back a personalized appointment-confirmation video for a test customer.