Two ways to build an AI voicebot: stitching versus realtime

The way an AI voicebot processes speech determines whether it sounds natural or comes across as sluggish and unreliable. Two architectures are currently in use amongst builders, and the choice between them has consequences for the call experience, the reliability of the system and what the bot can extract from a conversation. The older approach links three separate components in sequence, whilst the newer method handles the entire conversation with a single model.
The classical approach: stitching
When the first voicebots were built, it made sense to link three existing components together. Incoming speech went through a speech-to-text engine that converted it to text, then a language model read that text and formulated a response, and finally a text-to-speech engine converted that response back into audible speech. This architecture is called “stitching” in the industry, because you chain three independent systems together into one pipeline.
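The three-stage chain can be sketched as follows. This is a minimal illustration, not a real implementation: all three stage functions are hypothetical stand-ins for calls to an STT engine, a language model and a TTS engine.

```python
# A minimal sketch of a stitched pipeline. Each stage function is a
# hypothetical stand-in for a real supplier's API call.
def speech_to_text(audio: bytes) -> str:
    # Stand-in: a real system would call an STT engine here.
    return audio.decode("utf-8")

def generate_reply(transcript: str) -> str:
    # Stand-in: a real system would call a language model here.
    # Note it only ever sees text, never tone or emotion.
    return f"You said: {transcript}"

def text_to_speech(reply: str) -> bytes:
    # Stand-in: a real system would call a TTS engine here.
    return reply.encode("utf-8")

def handle_turn(audio: bytes) -> bytes:
    transcript = speech_to_text(audio)   # stage 1: transcription
    reply = generate_reply(transcript)   # stage 2: reasoning over text only
    return text_to_speech(reply)         # stage 3: synthesis

print(handle_turn(b"hello").decode("utf-8"))  # prints "You said: hello"
```

The point of the sketch is the shape: three sequential network hops per conversational turn, each adding latency and each able to fail on its own.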
For a time, that delivered usable results, and for teams that didn’t want to train their own speech model it was the only practical route. Yet in practice three vulnerabilities emerge, because each link in the chain can fail independently. Speech recognition can mishear a sentence, the language model can give a slow or incorrect response, and voice synthesis can go down at an inopportune moment. Many teams therefore build in a backup with an alternative TTS or LLM supplier, so the bot continues working during an outage. That addresses the downtime, but callers suddenly hear a completely different voice and become confused about whom they are actually speaking with.
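The supplier fallback described above typically looks like the sketch below. Both TTS functions are hypothetical stand-ins for two independent providers with different voices.

```python
# Sketch of a supplier fallback for one stage of a stitched pipeline.
# `primary_tts` and `backup_tts` are hypothetical stand-ins for two
# independent TTS suppliers.
def primary_tts(text: str) -> bytes:
    # Simulate an outage at the primary supplier.
    raise ConnectionError("primary TTS supplier is down")

def backup_tts(text: str) -> bytes:
    # Different supplier, different voice: the caller will notice.
    return text.encode("utf-8")

def synthesise_with_fallback(text: str) -> bytes:
    try:
        return primary_tts(text)
    except ConnectionError:
        # Keep the call alive, at the cost of a sudden voice change.
        return backup_tts(text)

print(synthesise_with_fallback("One moment please").decode("utf-8"))
```

The fallback keeps the call up, but it cannot hide the switch from the caller, which is exactly the confusion the article describes.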
The second drawback may weigh even more heavily. With stitching, the language model only sees a textual transcript, which means it cannot perceive the tone, volume, hesitation and emotion of the caller. An irritated customer and a satisfied customer sound identical to the model once their words appear on paper, and that comes at the cost of the contextual sensitivity that makes a conversation valuable. Signals about perceived age, native language or mood are lost in the translation to text, yet those signals are precisely what often determines how a staff member would conduct a call.
The new approach: one realtime speech model
Since OpenAI made gpt-realtime-1.5 available on 24 February 2026, there is a second way to build voicebots that works better in most cases. Instead of three separate components in sequence, one model listens and speaks directly, which means the entire layer of transcription and synthesis disappears. The model understands the words, the tone and the emotion of the caller at the same time, so it can respond to those directly in its answer. A demo from Charlierguo shows how smoothly that works in practice.
That delivers concrete advantages in everyday use. There is now only one point where something can go wrong instead of three, which means the risk of downtime falls significantly. Response time typically sits below 400 milliseconds, so the conversation flows naturally without the latency that emerges with stitching. Multilingual support is built in, which means the same model switches effortlessly between Irish, English, German and other languages without you needing to configure that switch beforehand. And because the model processes audio instead of text, it recognises an irritated customer by their voice and can transfer them directly to a staff member without needing a keyword or explicit escalation for that.
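The escalation behaviour can be sketched as routing logic over the model's output. The event shape below is hypothetical; it assumes the realtime model can report a sentiment label alongside each reply, a signal a stitched pipeline could never surface because its language model only sees text.

```python
# Sketch of escalation routing on a paralinguistic signal. The event
# dictionary is a hypothetical shape, not a real API payload.
def route(event: dict) -> str:
    if event.get("sentiment") == "irritated":
        # Hand over to a human, no keyword or explicit request needed.
        return "transfer_to_agent"
    return "continue_with_bot"

print(route({"sentiment": "irritated"}))  # prints "transfer_to_agent"
print(route({"sentiment": "neutral"}))    # prints "continue_with_bot"
```

The logic itself is trivial; what matters is that the sentiment signal exists at all, because it is derived from the audio rather than from a transcript.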
When stitching is still the right choice
There remains a niche where the older architecture fits better, and those are situations where no live conversation needs to take place but rather a recording is analysed afterwards. When a call centre wants to have conversations summarised, coded or screened for compliance after the call has ended, there is no latency requirement and you can safely choose a specialised language model. Think of a medical language model that recognises abbreviations and specialist terminology in healthcare, or a speech-to-text engine trained specifically on a regional dialect. The precision on that one component outweighs the overall call experience in those scenarios, because there is no caller on the line waiting for an answer.
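A post-call stitching pipeline might look like the sketch below. Because no caller is waiting, each stage can be a slower, specialised component; all three stage functions are hypothetical stand-ins.

```python
# Sketch of a post-call analysis pipeline built in the stitching style.
# Every stage function is a hypothetical stand-in.
def transcribe(audio: bytes) -> str:
    # Stand-in for a dialect-tuned speech-to-text engine.
    return audio.decode("utf-8")

def summarise(transcript: str) -> str:
    # Stand-in for a specialised (e.g. medical) language model;
    # here it just keeps the first sentence.
    return transcript.split(".")[0] + "."

def compliance_flags(transcript: str) -> list[str]:
    # Stand-in compliance screen: flag calls that mention card numbers.
    return ["pci"] if "card number" in transcript.lower() else []

def analyse_call(audio: bytes) -> dict:
    transcript = transcribe(audio)
    return {"summary": summarise(transcript),
            "flags": compliance_flags(transcript)}

print(analyse_call(b"Please read me your card number. Thank you."))
```

Each component can be swapped for the most precise model available for that one task, which is exactly the freedom the live realtime approach gives up.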
Our recommendation
For businesses that want to have live conversations handled by a voicebot, we advise the realtime approach in virtually all cases. The combination of faster response, lower vulnerability to outages, multilinguality without configuration and sensitivity to tone results in a call experience that callers do not perceive as robotic. For post-call analyses and other scenarios where precision on one specific component is decisive, we continue to deploy stitching architectures, because they still deliver the strongest result there.
Our team builds in both architectures
CallFactory builds voicebots in both architectures, depending on what fits best with your call flow. Whether you want a fully managed solution where our team sets up everything from start to finish, or you prefer a dedicated IVR on your own infrastructure, we deliver GDPR-compliant implementations that are available 24 hours a day, seven days a week.
Get in touch with our team to discuss which architecture fits your calls, how the integration with your existing systems works and within what timeframe the voicebot can go live. That way you get a clear estimate of the implementation timeline and the investment, and from day one you can have incoming and outgoing calls handled by a voicebot that speaks and listens at a level that was unthinkable until recently.
Frequently asked questions
When is stitching still the better choice?
Stitching is valuable when you don’t need a live conversation but instead want to analyse a recording afterwards. You then have the freedom to choose a specialised language model, such as a medical model for healthcare terminology or a speech-to-text engine trained on a regional dialect. In those cases, precision on a single component outweighs a smooth conversation experience.
How quickly does a realtime voicebot respond?
Response time typically sits below 400 milliseconds, which is comparable to a normal telephone call between two people. Because there are no separate components in sequence, the latency that emerges with stitching disappears entirely, meaning callers rarely notice that they are speaking with an AI.
Can a realtime voicebot handle multiple languages?
Yes. Realtime speech models are trained multilingually, allowing them to switch between Irish, English, German and other languages within the same conversation without you needing to configure that switch beforehand. For businesses with an international customer base, this eliminates an entire configuration step.
What happens if the realtime model goes down?
We build a fallback route into each project, so the call automatically transfers to a staff member or to a recorded message if the model fails. The caller only notices the call being transferred, which means your call flow remains intact even if there is a disruption on the supplier’s side.
Is the voicebot GDPR compliant?
Yes. We build the voicebot so that audio and metadata remain within the European Union, and we have a data processing agreement in place with all parties involved. For regulated sectors such as healthcare, banking and insurance, we also provide a self-hosted variant that runs entirely behind your own firewall.
