If you’ve been building AI agents lately, you’ve probably noticed something interesting: everyone is talking about PydanticAI, LangChain, and LlamaIndex, but almost nobody is talking about how to add voice capabilities without coupling your entire architecture to a single speech provider. And that’s a big problem, if you think about it for a moment.
We at Sayna have been dealing with this exact challenge, and I wanted to share some thoughts on why the abstraction pattern matters more than which framework or provider you choose.
The Real Problem with Voice Integration
Picture the situation: you have an AI agent that handles text input and output perfectly fine. Maybe you use PydanticAI because you like type safety, or LangChain because your team already knows it, or maybe you built something custom because existing frameworks didn’t fit your use case. It all works great.
But then someone asks, “Can we add voice to this?” and suddenly you’re dealing with a completely different world of problems.
The moment you start integrating voice, you’re not just adding TTS and STT: you’re adding latency requirements, streaming complexity, provider-specific APIs and a whole new layer of infrastructure concerns.
Most developers I talk to make the same mistake: they pick a TTS provider (say, ElevenLabs because it sounds good), an STT provider (maybe Whisper because it’s from OpenAI), wire them directly into their agent, and call it done. Six months later, they realize that ElevenLabs’ pricing won’t work at their scale, or that Whisper’s latency is too high for real-time conversation, and now they have to rewrite significant parts of their codebase.
This is exactly the vendor lock-in problem we’ve seen before with cloud providers, and it’s happening again with AI services, just faster this time.
Why Framework Agnosticism Matters
Here’s something that might surprise you: PydanticAI, LangChain, and LlamaIndex all take different approaches to handling voice, but none of them really solves the abstraction problem at the voice layer. They abstract LLM calls beautifully, but when it comes to speech processing, you’re mostly on your own.
LangChain’s approach is to let you chain components together: you bring your own STT function and your own TTS function and join them into a sequential chain. That’s flexible, but it puts the abstraction burden on you; every time you want to switch providers, you change the chain logic.
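To make that concrete, here’s roughly what that hand-rolled pipeline looks like using LangChain’s runnable composition. This is only a sketch: `transcribe_with_whisper`, `run_agent`, and `synthesize_with_elevenlabs` are hypothetical stand-ins for whatever SDK calls you’d actually wire in.

```python
from langchain_core.runnables import RunnableLambda

# Hypothetical provider-specific helpers you write and maintain yourself.
def transcribe_with_whisper(audio: bytes) -> str:
    # Call your STT provider's SDK here; placeholder transcript for the sketch.
    return "user said something"

def run_agent(transcript: str) -> str:
    # Your existing text-in / text-out agent (PydanticAI, LangChain, custom...).
    return f"agent reply to: {transcript}"

def synthesize_with_elevenlabs(text: str) -> bytes:
    # Call your TTS provider's SDK here; placeholder audio for the sketch.
    return text.encode()

# The voice "pipeline" is just a sequential chain: STT -> agent -> TTS.
voice_chain = (
    RunnableLambda(transcribe_with_whisper)
    | RunnableLambda(run_agent)
    | RunnableLambda(synthesize_with_elevenlabs)
)

reply_audio = voice_chain.invoke(b"raw microphone audio")
```

Notice that the provider names are baked into the chain itself: swapping Whisper for Deepgram means touching this code, and every other place where you built a chain like it.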
PydanticAI doesn’t have native voice support yet (there’s an open issue about it), which means developers are building custom solutions on top of it; again, the abstraction is your responsibility.
The point is not that these frameworks are bad; they are excellent at what they do. The point is that voice is a different layer, and treating it as just another tool in your agent toolkit misses the bigger picture.
The Abstraction Pattern You Actually Need
When I think about voice integration for AI agents, I think about it in three distinct layers:
Your Agent Logic: This is where PydanticAI, LangChain or your custom solution lives. It handles thinking, tool calls, memory and all the intelligent parts. This layer should not know anything about audio formats, speech synthesis or transcription models.
The Voice Abstraction Layer: This sits between your agent and the actual speech providers. It handles the complexity of streaming audio, manages WebSocket connections and voice activity detection, and, most importantly, abstracts provider-specific APIs behind a unified interface.
Speech Providers: These are your actual TTS and STT services: OpenAI, ElevenLabs, Deepgram, Cartesia, AssemblyAI, Google, Amazon Polly… the list goes on. Each has different strengths, pricing models, latency characteristics and API quirks.
The key insight is that your agent logic should talk to the voice abstraction layer, never directly to providers. That way, switching from ElevenLabs to Cartesia becomes a configuration change, not a code rewrite.
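Here’s a minimal sketch of what that can look like in practice, with illustrative adapter and config names (this isn’t Sayna’s actual API): the agent only ever depends on a `TextToSpeech` interface, and the concrete provider is picked from configuration.

```python
from dataclasses import dataclass
from typing import Protocol

class TextToSpeech(Protocol):
    def synthesize(self, text: str) -> bytes: ...

# Each adapter wraps one provider SDK behind the same interface.
class ElevenLabsTTS:
    def synthesize(self, text: str) -> bytes:
        return b"..."  # call the ElevenLabs SDK here

class CartesiaTTS:
    def synthesize(self, text: str) -> bytes:
        return b"..."  # call the Cartesia SDK here

@dataclass
class VoiceConfig:
    tts_provider: str = "elevenlabs"

TTS_REGISTRY: dict[str, type] = {
    "elevenlabs": ElevenLabsTTS,
    "cartesia": CartesiaTTS,
}

def build_tts(config: VoiceConfig) -> TextToSpeech:
    return TTS_REGISTRY[config.tts_provider]()

# The agent sees only TextToSpeech; switching providers is a config change:
tts = build_tts(VoiceConfig(tts_provider="cartesia"))
```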
Real-World Considerations
Let me share a few things that aren’t obvious until you’ve built a couple of voice agents.
Latency stacking is real. When you chain STT → LLM → TTS, every millisecond adds up, and users notice when response time climbs above roughly 500ms. That means you need streaming at every stage, not just from the LLM. Your voice layer needs to start TTS synthesis before your agent finishes generating the entire response, and this is not trivial to implement with direct provider integrations.
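One common way to tackle this, shown as a simplified sketch below, is to cut the LLM’s token stream at sentence boundaries and hand each sentence to TTS as soon as it’s complete. The token stream and `synthesize` function here are hypothetical placeholders.

```python
import re
from typing import Callable, Iterable, Iterator

SENTENCE_END = re.compile(r"[.!?]\s")

def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Group an LLM token stream into sentences so TTS can start early."""
    buffer = ""
    for token in tokens:
        buffer += token
        while (match := SENTENCE_END.search(buffer)):
            yield buffer[: match.end()].strip()
            buffer = buffer[match.end():]
    if buffer.strip():
        yield buffer.strip()

def speak_streaming(
    tokens: Iterable[str],
    synthesize: Callable[[str], bytes],
) -> Iterator[bytes]:
    """Synthesize each sentence as soon as it's complete, instead of
    waiting for the full LLM response before starting TTS."""
    for sentence in sentences_from_tokens(tokens):
        yield synthesize(sentence)

# Usage (hypothetical token stream and TTS adapter):
# for audio_chunk in speak_streaming(llm.stream(prompt), tts.synthesize):
#     playback_queue.put(audio_chunk)
```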
Provider characteristics vary wildly. Some TTS providers sound more natural but have higher latency. Some STT providers handle accents better but struggle with technical terminology. Some work great on high-quality audio but fall apart over phone lines (8kHz PSTN audio is very different from web audio). Being able to swap providers without code changes is not just nice, it’s essential for production systems.
Voice activity detection is harder than it looks. Knowing when a user starts and stops speaking, handling interruptions gracefully, filtering background noise: these are solved problems, but only if you’re using the right tools. Building all of this from scratch while also building your agent logic is a recipe for burnout.
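To give a feel for the moving parts, here’s a toy sketch of barge-in handling driven by VAD events; `vad_events`, `handle_turn`, and `play` are hypothetical stand-ins for your audio pipeline.

```python
import asyncio
from typing import Awaitable, Callable

async def conversation_loop(
    vad_events: asyncio.Queue,                    # yields "speech_start" / "speech_end"
    handle_turn: Callable[[], Awaitable[bytes]],  # STT -> agent -> TTS for one user turn
    play: Callable[[bytes], Awaitable[None]],     # plays audio back to the user
) -> None:
    """Toy barge-in handling: if the user starts speaking while the agent
    is talking, cancel playback and give the turn back to the user."""
    playback: asyncio.Task | None = None
    while True:
        event = await vad_events.get()
        if event == "speech_start" and playback and not playback.done():
            playback.cancel()                     # barge-in: stop the agent mid-sentence
        elif event == "speech_end":
            reply_audio = await handle_turn()
            playback = asyncio.create_task(play(reply_audio))
```

Even this toy version ignores noise filtering, end-of-speech debouncing, and half-duplex audio devices, which is exactly why it belongs in the voice layer and not in your agent.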
The Multi-Provider Advantage
Here’s why I’m passionate about this topic: if you design your voice integration with multi-provider support from day one, you gain several advantages that aren’t obvious at first.
Cost optimization becomes possible. Different providers have different pricing models: some charge per character, some per minute, some have volume discounts. When you can switch providers easily, you can move different types of conversations to different providers based on cost efficiency.
Reliability improves. Providers have outages. If your voice layer supports multiple providers, you can implement fallback logic. ElevenLabs is down? Route to Cartesia. Deepgram is slow? Fall back to Whisper. This is impossible when your agent code is tightly coupled to specific provider APIs.
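As a hedged sketch of that fallback logic, reusing the illustrative `TextToSpeech` adapters from the earlier sketch (real code would catch timeouts and provider errors specifically rather than bare `Exception`):

```python
class FallbackTTS:
    """Try providers in order; if one fails, move on to the next."""

    def __init__(self, providers: list):
        self.providers = providers  # e.g. [ElevenLabsTTS(), CartesiaTTS()]

    def synthesize(self, text: str) -> bytes:
        last_error: Exception | None = None
        for provider in self.providers:
            try:
                return provider.synthesize(text)
            except Exception as error:  # in practice: timeouts, 5xx responses, etc.
                last_error = error
        raise RuntimeError("All TTS providers failed") from last_error
```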
Quality testing becomes easier. Want to compare how your agent sounds with different TTS voices? With proper abstraction, you can A/B test providers in production without deploying new code.
Future-proofing. The speech AI space is moving fast: new providers appear, existing ones add features, pricing changes. A well-abstracted voice layer lets you adopt new technology without rewrites.
What Does Good Abstraction Look Like?
Without getting into specific code, a good voice abstraction layer should handle several things:
It should provide a single API for TTS and STT that works the same regardless of which provider you’re using underneath. For example, you shouldn’t need to know whether you’re using OpenAI or Deepgram when you call the transcription function.
It should handle streaming natively: both audio input and output should stream rather than wait for complete chunks. This is critical for a real-time feel.
It should manage the connection lifecycle: WebSocket connections, session management, authentication tokens. All of this should be handled by the abstraction layer, not your agent code.
It should provide voice activity detection as a first-class feature. Detecting when users begin speaking and when they stop or pause should work out of the box.
And most importantly, it should integrate cleanly with existing agent frameworks without requiring you to rewrite your agent logic.
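Pulled together, the surface your agent sees might look something like the interface below. This is a shape sketch, not any particular library’s API; the method names are illustrative.

```python
from typing import AsyncIterator, Protocol

class VoiceSession(Protocol):
    """One real-time voice session; everything provider-specific stays behind it."""

    async def send_audio(self, chunk: bytes) -> None:
        """Stream microphone audio in; buffering and provider sockets are handled here."""

    def transcripts(self) -> AsyncIterator[str]:
        """Stream transcripts out as the STT provider produces them."""

    def vad_events(self) -> AsyncIterator[str]:
        """First-class VAD: 'speech_start' / 'speech_end' events, whatever the provider."""

    async def speak(self, text: str) -> None:
        """Synthesize and stream a reply without exposing the TTS provider's API."""

    async def close(self) -> None:
        """Tear down WebSockets, sessions, and auth tokens."""
```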
The Trap of “We’ll Abstract Later”
I’ve seen this pattern many times: teams say, “Let’s just integrate OpenAI directly for now; we’ll add abstraction later when we need it.” Here’s the thing: “later” never comes. Your direct integration works, you ship features on top of it, and before you know it, provider-specific assumptions are baked into multiple layers of your codebase.
By the time you realize you need to switch providers (and you will), the cost of refactoring is significant. It’s the same story as technical debt everywhere else in software engineering, but with voice integration the coupling accumulates faster because you’re dealing with real-time streaming, binary audio data, and complex session management.
Starting with the right abstraction pattern from day one is not premature optimization: it’s responsible engineering.
Conclusion
Building voice capabilities into AI agents is not just about picking the right TTS and STT providers; it’s about designing an architecture that keeps your agent logic clean, your provider integrations swappable, and your future options open.
Whether you use PydanticAI because you love type safety, LangChain for its ecosystem, LlamaIndex for RAG-heavy applications, or roll your own custom agent, the voice layer should be a separate concern with proper abstraction.
The AI voice space is evolving rapidly: new models appear monthly, prices change quarterly, and what’s best today might not be best tomorrow. The teams that succeed will be the ones building for flexibility, not for today’s provider landscape.
When you start a voice agent project today, think about abstraction first and provider selection second. Your future self will thank you.
If you’re interested in how we solve this at Sayna, check out our docs at docs.sayna.ai. We’ve built exactly this sort of unified voice layer that works with any AI agent framework.
Let me know what you think about this approach; I’m always happy to discuss architectural patterns.
