The way an AI voicebot processes speech determines whether it sounds natural or comes across as sluggish and unreliable. Two architectures are currently in use amongst builders, and the choice between them has consequences for the call experience, the reliability of the system and what the bot can extract from a conversation. The older approach links three separate components in sequence, whilst the newer method processes the entire conversation in one go.

[Diagram: the three stacked components of the stitching pipeline on the left, a single realtime speech model on the right]

The classical approach: stitching

When the first voicebots were built, it made sense to link three existing components together. Incoming speech went through a speech-to-text engine that converted it to text, then a language model read that text and formulated a response, and finally a text-to-speech engine converted that response back into audible speech. This architecture is called “stitching” in the industry, because you chain three independent systems together into one pipeline.
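The chain described above can be sketched in a few lines of Python. The three engine functions here are empty stand-ins, not real provider calls; any actual STT, LLM or TTS service could sit behind them.

```python
# A minimal sketch of the "stitching" pipeline. The three engine
# functions are illustrative placeholders for whichever STT, LLM
# and TTS providers a team actually uses.

def speech_to_text(audio: bytes) -> str:
    # Placeholder: a real STT engine would transcribe the audio here.
    return "what are your opening hours"

def language_model(transcript: str) -> str:
    # Placeholder: a real LLM would formulate a reply from the text.
    return f"You asked: '{transcript}'. We are open from 9 to 5."

def text_to_speech(reply: str) -> bytes:
    # Placeholder: a real TTS engine would synthesise audio here.
    return reply.encode("utf-8")

def handle_turn(audio: bytes) -> bytes:
    """One conversational turn: STT -> LLM -> TTS, strictly in sequence."""
    transcript = speech_to_text(audio)   # step 1: audio becomes text
    reply = language_model(transcript)   # step 2: text in, text out
    return text_to_speech(reply)         # step 3: text becomes audio
```

Note that every turn pays the latency of all three steps, and that the language model only ever sees the transcript string, which is exactly the limitation discussed below.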

For a time, that delivered usable results, and for teams that didn’t want to train their own speech model it was the only practical route. In practice, however, three vulnerabilities emerge, because each link in the chain can fail independently. Speech recognition can mishear a sentence, the language model can give a slow or incorrect response, and voice synthesis can go down at an inopportune moment. Many teams therefore build in a backup with an alternative TTS or LLM supplier, so that the bot keeps working during an outage. That solves the downtime, but callers suddenly hear a completely different voice and become confused about whom they are actually speaking with.
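A common way to implement that backup is a simple fallback wrapper around one stage of the pipeline. The sketch below shows the idea for TTS; the two provider functions and the simulated outage are illustrative assumptions, not real vendor APIs.

```python
# Sketch of a provider fallback for one pipeline stage (TTS here).
# Both providers are stand-ins; the primary is hard-wired to fail
# so the fallback path is exercised.

class ProviderDown(Exception):
    """Raised when a speech provider is unreachable."""

def primary_tts(text: str) -> bytes:
    raise ProviderDown("primary TTS is unreachable")  # simulate an outage

def backup_tts(text: str) -> bytes:
    # A different vendor, hence a noticeably different voice for the caller.
    return text.encode("utf-8")

def synthesise(text: str) -> bytes:
    """Try the primary supplier first; fall back on failure so the call
    survives the outage, at the cost of a sudden voice change."""
    try:
        return primary_tts(text)
    except ProviderDown:
        return backup_tts(text)
```

The same pattern applies to the STT or LLM stage: the call keeps going, but continuity of the experience is what gets sacrificed.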

The second drawback may weigh even more heavily. With stitching, the language model only sees a textual transcript, which means it cannot perceive the tone, volume, hesitation and emotion of the caller. An irritated customer and a satisfied customer sound identical to the model once their words appear on paper, and that comes at the cost of the contextual sensitivity that makes a conversation valuable. Signals about perceived age, native language or mood are lost in the translation to text, yet those signals are precisely what often determines how a staff member would conduct a call.

The new approach: one realtime speech model

Since OpenAI made gpt-realtime-1.5 available on 24 February 2026, there is a second way to build voicebots that works better in most cases. Instead of three separate components in sequence, one model listens and speaks directly, which means the entire layer of transcription and synthesis disappears. The model understands the words, the tone and the emotion of the caller at the same time, so it can respond to those directly in its answer. A demo from Charlierguo shows how smoothly this works in practice.

That delivers concrete advantages in everyday use. There is now only one point where something can go wrong instead of three, which means the risk of downtime falls significantly. Response time typically sits below 400 milliseconds, so the conversation flows naturally without the latency that stitching introduces. Multilingual support is built in, which means the same model switches effortlessly between Irish, English, German and other languages without you needing to configure that switch beforehand. And because the model processes audio instead of text, it recognises an irritated customer by their voice and can transfer them directly to a staff member without needing a keyword or explicit escalation.
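The shape of the realtime approach can be illustrated with a toy sketch: one object takes audio in and returns reply audio plus the signals it heard in the voice. The `RealtimeModel` class below is a hypothetical stand-in, not the actual OpenAI client API, and the tone detection is a deliberately crude placeholder.

```python
# Toy sketch of the single-model shape: audio in, audio out, and
# paralinguistic signals available in the same step. Everything here
# is an illustrative assumption, not a real client library.

class RealtimeModel:
    def respond(self, audio: bytes) -> dict:
        # A real speech-to-speech model hears tone, language and emotion
        # directly in the audio; this stub fakes "irritation" detection.
        irritated = b"!" in audio
        return {
            "audio": b"<reply audio>",        # spoken reply, no TTS stage
            "detected_language": "en",        # no language config needed
            "escalate_to_human": irritated,   # decided from the voice itself
        }

model = RealtimeModel()
turn = model.respond(b"<angry caller audio>!")
if turn["escalate_to_human"]:
    transfer_target = "staff member"  # no keyword was required
```

The point of the sketch is structural: escalation and language handling come out of the same call that produces the reply audio, rather than being bolted onto a text transcript.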

When stitching is still the right choice

There remains a niche where the older architecture fits better, and those are situations where no live conversation needs to take place but rather a recording is analysed afterwards. When a call centre wants to have conversations summarised, coded or screened for compliance after the call has ended, there is no latency requirement and you can safely choose a specialised language model. Think of a medical language model that recognises abbreviations and specialist terminology in healthcare, or a speech-to-text engine trained specifically on a regional dialect. The precision on that one component outweighs the overall call experience in those scenarios, because there is no caller on the line waiting for an answer.
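A post-call pipeline of this kind might look like the following sketch, where latency is irrelevant and each stage can be a specialised component. The function bodies are placeholders, and the medical example simply mirrors the one in the text.

```python
# Sketch of the post-call scenario where stitching still fits: no
# caller is waiting, so per-stage precision beats end-to-end latency.
# Both stage functions are illustrative stand-ins.

def transcribe_recording(recording: bytes) -> str:
    # Placeholder for a dialect- or domain-tuned STT engine.
    return "patient reports mild hypertension, rx renewed"

def specialised_llm_summary(transcript: str) -> dict:
    # Placeholder for e.g. a medical LLM that resolves abbreviations.
    return {
        "summary": transcript.replace("rx", "prescription"),
        "compliance_flags": [],  # a compliance screen could populate this
    }

def analyse_call(recording: bytes) -> dict:
    """Batch pipeline: run the stages in sequence, hours after the call."""
    return specialised_llm_summary(transcribe_recording(recording))
```

Because the stages run offline, each one can be swapped for the most precise component available without any effect on a live conversation.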

Our recommendation

For businesses that want to have live conversations handled by a voicebot, we advise the realtime approach in virtually all cases. The combination of faster responses, lower vulnerability to outages, multilingual support without configuration and sensitivity to tone results in a call experience that callers do not perceive as robotic. For post-call analyses and other scenarios where precision on one specific component is decisive, we continue to deploy stitching architectures, because they still deliver the strongest result there.

Our team builds in both architectures

CallFactory builds voicebots in both architectures, depending on what fits best with your call flow. Whether you want a fully managed solution where our team sets up everything from start to finish, or you prefer a dedicated IVR on your own infrastructure, we deliver GDPR-compliant implementations that are available 24 hours a day, seven days a week.

Get in touch with our team to discuss which architecture fits your calls, how the integration with your existing systems works and within what timeframe the voicebot can go live. That way you get a clear estimate of the implementation timeline and the investment, and from day one you can have incoming and outgoing calls handled by a voicebot that speaks and listens at a level that was unthinkable until recently.