
Why We’re Researching a Fast-Path System for Live Interpretation

strakeradmin · February 26, 2026 · 5 min read

Real-time interpretation is one of those problems that sounds solved until you actually try to build it for production.

When we started looking at building Verify Live, our live interpretation system, the obvious approach was to reach for the latest conversational AI APIs. OpenAI Realtime, Gemini, the usual suspects. These are impressive systems, and they’re getting better every month.

But here’s what we’re finding: conversational AI is designed for conversation, not interpretation. Sounds obvious when you say it out loud.

These systems are built around turn-taking. You speak, the AI responds. That’s great for chatbots, voice assistants, and basic translation. It’s not great for simultaneous interpretation, where a presenter speaks continuously for minutes at a time and listeners need to hear the translation in near real-time.

The latency numbers are the problem. General-purpose LLM pipelines average 1,800ms to 3,500ms from speech-end to audio-out. For a live conference, that’s too much. Listeners fall out of sync with the presenter and the experience falls apart.

So we’ve been exploring a different approach.

A Fast-Path Architecture

Rather than looking for a better LLM, we’re trying to treat live interpretation as its own problem. One that needs a purpose-built pipeline rather than a general-purpose one.

The approach we’re testing is built around three ideas:

1. Stream everything

Traditional approaches wait for complete sentences before processing. You finish speaking, the system translates, then synthesises speech. Each stage waits for the previous one.

We’re trying not to wait.

Audio streams continuously. When we detect a natural sentence boundary, translation starts. When translation finishes, speech synthesis starts. No batching, no queuing.
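To make that concrete, here’s a minimal sketch of the streaming idea in Python. The recognise, translate, synthesise and play coroutines are stand-ins for whatever backs each stage, not our actual components; the point is that every stage pushes its output downstream the moment it’s ready rather than waiting for the whole utterance.

```python
import asyncio

async def fast_path(audio_chunks, recognise, translate, synthesise, play):
    """Run recognition, translation and synthesis as concurrent stages."""
    sentences: asyncio.Queue = asyncio.Queue()     # recognised source-language sentences
    translations: asyncio.Queue = asyncio.Queue()  # translated text, ready for synthesis

    async def recognise_stage():
        async for chunk in audio_chunks:           # audio streams in continuously
            sentence = await recognise(chunk)      # returns text at a sentence boundary, else None
            if sentence:
                await sentences.put(sentence)
        await sentences.put(None)                  # end-of-stream marker

    async def translate_stage():
        while (sentence := await sentences.get()) is not None:
            await translations.put(await translate(sentence))
        await translations.put(None)

    async def synthesise_stage():
        while (text := await translations.get()) is not None:
            await play(await synthesise(text))     # listeners hear it as soon as it's ready

    await asyncio.gather(recognise_stage(), translate_stage(), synthesise_stage())
```

The queues here are just single-item hand-offs between concurrent stages, not batches: because the stages overlap, translation of one sentence can run while the next is still being recognised, which is where most of the latency saving comes from.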

2. Use natural boundaries, not artificial ones

Early prototypes used fixed-time chunking – process audio every 2.5 seconds regardless of content. The results were predictably bad. Sentences got cut mid-word. Translations lost context. The output sounded robotic.

What’s working better is getting our recognition system to tell us when a sentence ends naturally. Humans pause at sentence boundaries. Punctuation has acoustic signatures. By listening for these cues, we can trigger translation at points that make linguistic sense.

For continuous speech without pauses (some presenters really don’t breathe much!), there’s a fallback timeout. But it’s a safety net, not the primary mechanism.
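As a rough illustration of how the boundary detection and the fallback fit together (the thresholds here are made up for the example, not the ones we actually run with):

```python
import time

class SentenceSegmenter:
    """Emit segments at natural sentence boundaries, with a timeout as a safety net."""

    def __init__(self, pause_threshold_s=0.6, max_segment_s=8.0):
        self.pause_threshold_s = pause_threshold_s   # silence that usually marks a sentence end
        self.max_segment_s = max_segment_s           # fallback for presenters who never pause
        self.words, self.started_at = [], None

    def feed(self, word, trailing_silence_s, now=None):
        """Add one recognised word; return a finished segment, or None to keep accumulating."""
        now = now or time.monotonic()
        if self.started_at is None:
            self.started_at = now
        self.words.append(word)

        ends_sentence = word.endswith((".", "?", "!"))               # punctuation cue
        long_pause = trailing_silence_s >= self.pause_threshold_s    # acoustic cue
        timed_out = (now - self.started_at) >= self.max_segment_s    # safety net only

        if ends_sentence or long_pause or timed_out:
            segment = " ".join(self.words)
            self.words, self.started_at = [], None
            return segment
        return None
```

The timeout only fires when neither punctuation nor a pause has shown up for a while, so in practice almost every segment ends at a boundary that makes linguistic sense.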

3. Use the right model for the job

General-purpose LLMs optimise for flexibility. They can do many things reasonably well. But live interpretation doesn’t need flexibility – it needs speed and accuracy on a specific task.

We’re using optimised translation models rather than general LLMs for the translation layer. The quality difference is negligible for straightforward interpretation, but the latency difference is big: ~100ms versus ~800ms per translation.

The heavier models can be saved for cases that actually need them, like complex technical content or nuanced business communication.
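A hedged sketch of what that routing can look like. Both translate functions are stand-ins, and the term-list heuristic is purely illustrative; the point is that the fast model is the default and the heavy one is the exception.

```python
def pick_translation(segment, fast_translate, llm_translate, technical_terms=frozenset()):
    """Route a segment to the fast model by default, the heavier one only when needed.

    `fast_translate` / `llm_translate` are placeholders for whatever backs each path;
    the technical-term check is an illustrative heuristic, not production routing logic.
    """
    needs_heavy = any(word.lower().strip(".,?!") in technical_terms
                      for word in segment.split())
    return llm_translate(segment) if needs_heavy else fast_translate(segment)
```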

Early numbers

In testing, the fast-path approach is hitting 600ms to 950ms end-to-end latency. That’s from the moment a presenter finishes a sentence to when listeners hear the translated audio.

For context, that’s roughly satellite TV broadcast latency. It’s early, and these numbers will move as we push on edge cases, but they’re encouraging.

Consistency matters too. Latency spikes kill live interpretation faster than high average latency. A system that’s usually fast but occasionally takes 3 seconds is worse than one that’s reliably at 1.2 seconds. The pipeline is designed around predictability.
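In practice that means we watch tail latency, not the average. Something like this (illustrative, not our monitoring code) is enough to show why a single spike matters more than the mean:

```python
import statistics

def latency_report(latencies_ms):
    """Summarise per-segment latency; for live use the tail matters more than the mean."""
    ordered = sorted(latencies_ms)

    def pct(p):
        return ordered[min(len(ordered) - 1, int(p / 100 * len(ordered)))]

    return {
        "mean_ms": round(statistics.mean(ordered)),
        "p50_ms": pct(50),
        "p95_ms": pct(95),   # a spike here is what listeners actually notice
        "p99_ms": pct(99),
    }

# A run that looks fine on average but spikes once
print(latency_report([700, 720, 680, 750, 690, 3000, 710, 730]))
```

The mean sits just under a second, which looks acceptable on a dashboard; the p95 shows the 3-second spike that listeners actually feel.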

What we’re learning

A few things are becoming clear as we work through this:

Task-specific beats general-purpose for real-time systems. General LLMs are valuable for exploration and prototyping. But when you know exactly what task you’re solving, a purpose-built pipeline will outperform a general solution. That’s not a criticism of LLMs; different problems just need different tools. It’s a similar outlook to the one we had when we built our first Tiri SLMs.

Streaming changes how you think about and handle data. The “input → process → output” model is deeply ingrained. For real-time applications, you need to think in terms of continuous flow. Data should be moving through the pipeline constantly, not sitting in queues.

Latency is the product. It’s tempting to treat latency as just another metric. For live interpretation, latency directly determines whether the thing works at all. A 2-second delay isn’t “slower” — it’s a different user experience. This means we are constantly asking ourselves: “well, did this update affect our latency?”

The boring parts matter most. The exciting parts are the AI components — speech recognition, translation, synthesis. But the performance comes from the unglamorous work: efficient audio streaming, buffer management, minimising unnecessary data copies. The pipeline architecture, and how the AI fits into it, is what makes sub-second latency possible.
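One small example of the kind of unglamorous work we mean: writing incoming audio into a preallocated ring buffer so the hot path never allocates a new array per chunk. This is a simplified sketch, not our buffer code.

```python
import numpy as np

class AudioRingBuffer:
    """Fixed-size buffer for PCM samples: one in-place copy per frame, no per-chunk allocation."""

    def __init__(self, capacity_samples, dtype=np.int16):
        self.buf = np.zeros(capacity_samples, dtype=dtype)
        self.write_pos = 0

    def write(self, frame):
        """Write a frame (shorter than the buffer) at the current position, wrapping if needed."""
        n = len(frame)
        end = self.write_pos + n
        if end <= len(self.buf):
            self.buf[self.write_pos:end] = frame           # single slice assignment, no new array
        else:
            split = len(self.buf) - self.write_pos
            self.buf[self.write_pos:] = frame[:split]      # fill to the end of the buffer
            self.buf[:n - split] = frame[split:]           # wrap the remainder to the front
        self.write_pos = end % len(self.buf)
```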

What’s next

We’re exploring Verify Live initially for conferences, phone systems, and live broadcasts. The architecture also lends itself to a few other things we’re looking at:

Bidirectional conversation is working in early tests — two people speaking different languages, each hearing the other in their native tongue with minimal delay. Same streaming principles, just two pipelines running in opposite directions.

The design is transport-agnostic. Browser-based WebSocket for now, but the core engine doesn’t care how audio arrives. Phone integration, video conferencing, IVR systems — they’re all possible without rebuilding the interpretation pipeline. We’ll see where it goes.
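Roughly, the engine is written against an interface like this (the names are illustrative), so a transport only has to know how to hand us audio frames and accept synthesised audio back:

```python
from typing import AsyncIterator, Protocol

class AudioTransport(Protocol):
    """Anything that can deliver incoming audio frames and play synthesised audio back."""

    def incoming(self) -> AsyncIterator[bytes]: ...
    async def send(self, audio: bytes) -> None: ...

async def run_session(transport: AudioTransport, interpret) -> None:
    """The core engine never sees WebSockets, SIP or anything else transport-specific."""
    async for frame in transport.incoming():
        translated_audio = await interpret(frame)    # stand-in for the fast-path pipeline
        if translated_audio:
            await transport.send(translated_audio)
```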

Wrapping up

Live interpretation isn’t a solved problem. What we’re finding is that it needs purpose-built solutions (with AI) rather than general-purpose AI bolted onto existing infrastructure.

The fast-path approach is working because it’s designed around the actual constraints: continuous audio, natural language boundaries, and latency measured in hundreds of milliseconds, not seconds.

Sometimes the best AI solution isn’t the most sophisticated model. A statement that keeps coming up as we build more models and AI solutions.
