📝 Quick Summary
The AI voicebot landscape shifted dramatically in 2025–2026. This guide cuts through the noise, covering the latest TTS and STT models, real-world latency and accuracy benchmarks, integration methods (AGI, ARI, AudioSocket), and the actual cost of running an Asterisk AI voicebot at scale.
Whether you’re searching for the best STT, the best TTS, or evaluating combined STT and TTS models for Asterisk text to speech and transcription, this guide covers what’s actually production-ready in 2026.
If you’re still reading about Google STT and Amazon Polly as if it’s 2022, you’re building on a map that’s already out of date.
The Asterisk AI voicebot stack evolved faster in the last 18 months than it did in the five years before that. Deepgram Nova-3, ElevenLabs Flash v2.5, Cartesia Sonic, and OpenAI’s Realtime API rewrote what “low latency” means for telephony. AudioSocket replaced AGI scripts as the go-to integration path for real-time AI pipelines. Speech-to-speech (STS) models collapsed the STT-TTS pipeline into a single API call.
If you’re building a real-time AI voicebot on Asterisk, or evaluating whether to, this is the model selection guide you actually need in 2026.
How Has the Asterisk AI Voicebot Stack Changed Since 2024?
The Asterisk AI voicebot stack in 2026 runs on two architectures: modular STT and TTS models (sub-500ms round-trip latency) and speech-to-speech (STS) systems like OpenAI Realtime that collapse the entire chain into a single API call. AudioSocket has replaced AGI scripts as the standard integration path.
The 2024 stack is already obsolete. Components got dramatically better, a new integration layer emerged, and your model choices, integration method, and latency targets all depend on understanding where things stand today.
Here’s how the two dominant architectures break down:
- STT + LLM + TTS Pipeline (Modular): Still the standard for most production deployments. Asterisk captures audio → STT converts it to text → LLM generates a response → TTS converts the response to audio → audio plays back. The improvement isn’t the pattern — it’s the components. Models like Deepgram Nova-3 (sub-300ms STT) and ElevenLabs Flash v2.5 (75ms TTS) have pushed total round-trip latency below 500ms for well-optimized setups, compared to 1–2+ seconds with 2024-era models.
- Speech-to-Speech (STS): OpenAI Realtime API and Google Gemini 2.5 Live bypass the modular pipeline entirely — they receive audio in, process it, and stream audio out without ever converting to text. This reduces architectural complexity and latency, but trades off flexibility (you can’t swap the LLM independently, and per-minute costs run higher at scale).
- AudioSocket is now the integration method of choice for real-time AI pipelines. Unlike AGI or ARI, it streams raw PCM audio over a persistent TCP connection; your Python or Node.js service receives and sends audio frames continuously, with no HTTP round-trips or file handoffs (see the minimal server sketch after this list).
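To make that concrete, here’s a minimal AudioSocket echo server sketch in Python (asyncio). The port is a placeholder we chose for illustration; the frame layout follows the published AudioSocket protocol (1-byte type, 2-byte big-endian payload length, then the payload). A production bot would forward audio to streaming STT and write TTS frames back instead of echoing.

```python
# Minimal AudioSocket echo server sketch (asyncio), assuming the service
# listens on port 9092. Frame layout per the AudioSocket protocol:
# 1-byte type, 2-byte big-endian payload length, then the payload.
import asyncio

TYPE_HANGUP, TYPE_UUID, TYPE_AUDIO = 0x00, 0x01, 0x10

async def handle_call(reader, writer):
    try:
        while True:
            header = await reader.readexactly(3)
            msg_type = header[0]
            length = int.from_bytes(header[1:3], "big")
            payload = await reader.readexactly(length) if length else b""
            if msg_type == TYPE_UUID:
                print(f"call started, uuid: {payload.hex()}")
            elif msg_type == TYPE_AUDIO:
                # 320 bytes = 20ms of 8kHz 16-bit signed PCM. A real bot
                # forwards this to streaming STT and writes TTS frames back
                # in the same framing; here we simply echo the caller.
                writer.write(header + payload)
                await writer.drain()
            elif msg_type == TYPE_HANGUP:
                break
    except asyncio.IncompleteReadError:
        pass  # peer closed mid-frame
    finally:
        writer.close()

async def main():
    server = await asyncio.start_server(handle_call, "0.0.0.0", 9092)
    async with server:
        await server.serve_forever()

if __name__ == "__main__":
    asyncio.run(main())
```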
The right architecture depends on your latency requirements, call volume, and whether you need component-level control, something any experienced Asterisk developer can help you nail down before you commit to a stack. The rest of this guide is designed to help you answer these questions.
Which Integration Method Should You Use to Connect Text to Speech Asterisk Services: AGI, ARI, or AudioSocket?
Asterisk offers three integration methods for connecting to STT and TTS models: AGI (file-based, high latency), ARI (REST/WebSocket, medium latency), and AudioSocket (streaming PCM over TCP, lowest latency). For real-time AI voicebots, AudioSocket is the only viable option.
Choosing the wrong method can make even the best STT or Asterisk TTS model feel broken, because the bottleneck isn’t the model; it’s the handoff.
Each method reflects a different design assumption about what Asterisk is connecting to, and they have fundamentally different latency profiles:
| Method | Best For | Latency Profile | Notes |
|---|---|---|---|
| AGI | Simple IVR, batch transcription | High (file-based) | Mature, well-documented. Poor fit for real-time streaming. |
| ARI | Call control, programmatic routing | Medium | Full call control via REST/WebSocket. Used for the ExternalMedia channel setup. |
| AudioSocket | Real-time AI voicebots | Lowest | Streams 8kHz 16-bit PCM in 20ms frames over TCP. Purpose-built for AI pipelines. |
For any real-time AI voicebot deployment, AudioSocket is the right choice. It delivers audio in 20ms frames over a persistent TCP connection, the only approach that keeps pace with today’s sub-300ms STT models. AGI remains appropriate for voicemail transcription or post-call processing where response time isn’t a constraint.
If your infrastructure is already running older dial-plan logic, you can modernize legacy Asterisk using ARI without rewriting everything before making the jump to AudioSocket.
If you’re building a new Asterisk AI voicebot in 2026, start with AudioSocket. Everything else in this guide assumes you’re streaming audio rather than batching it.
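For orientation, here’s a minimal dialplan sketch that hands a call to such a service. The extension, address, and port are placeholders, and it assumes uuidgen is available on the host, since the AudioSocket application expects a per-call UUID as its first argument:

```
; Minimal sketch: answer, generate a per-call UUID, bridge to the
; AudioSocket service at 127.0.0.1:9092 (placeholder address/port).
[voicebot]
exten => 100,1,Answer()
 same => n,Set(CALL_UUID=${SHELL(uuidgen | tr -d '\n')})
 same => n,AudioSocket(${CALL_UUID},127.0.0.1:9092)
 same => n,Hangup()
```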
What Are the Best STT Models for Asterisk in 2026? Top Speech-to-Text Picks
The best STT models for Asterisk in 2026 are Deepgram Nova-3 (best for real-time, sub-300ms), OpenAI gpt-4o-transcribe (best accuracy for batch), AssemblyAI Universal-2 (enriched transcription with built-in intelligence), Whisper Large V3 Turbo (open-source, on-premise), and Google Cloud Chirp 2 (broadest language coverage).
STT is where your voicebot earns or loses a caller’s trust in the first five seconds. Get one word wrong (a name, an account number, a service type) and the rest of the conversation is damage control. The challenge isn’t finding good STT and TTS models. It’s matching the right one to your deployment constraints: latency budget, language mix, domain vocabulary, and cloud vs. on-premise infrastructure.
Here’s how the leading options compare:
1. Deepgram Nova-3
The default choice for real-time Asterisk deployments in 2026. Nova-3 delivers sub-300ms streaming latency, supports real-time multilingual transcription across 30+ languages, and includes live vocabulary injection (keyterm prompting) for domain-specific terminology, useful for telecom, healthcare, and financial services use cases. It also ships with Deepgram Flux, a dedicated turn-detection model that predicts when a caller has finished speaking, reducing the awkward silence gap that plagues basic VAD-based implementations.
Best for: Real-time AI voicebots, contact centers, high-volume concurrent calls.
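As a sketch of what keyterm prompting looks like in practice, here’s how a streaming session might be opened against Deepgram’s /v1/listen WebSocket endpoint, with parameters matched to AudioSocket’s 8kHz PCM framing. The keyterm values are illustrative:

```python
# Sketch: build a Deepgram Nova-3 streaming URL tuned for Asterisk telephony
# audio. The keyterm values are illustrative; repeat the parameter per term.
from urllib.parse import urlencode

params = urlencode([
    ("model", "nova-3"),
    ("encoding", "linear16"),      # raw 16-bit PCM, matching AudioSocket frames
    ("sample_rate", "8000"),       # telephony narrowband
    ("interim_results", "true"),   # partial transcripts for low-latency turns
    ("keyterm", "AudioSocket"),    # domain vocabulary injection
    ("keyterm", "FXO failover"),
])
url = f"wss://api.deepgram.com/v1/listen?{params}"
print(url)
# Open this with any WebSocket client (Authorization: Token <DEEPGRAM_API_KEY>),
# send PCM frames, and read JSON transcript events back.
```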
2. OpenAI gpt-4o-transcribe
Released in March 2025, gpt-4o-transcribe delivers lower WER than Whisper across most benchmark datasets, with measurably stronger performance on accented speech and noisy environments. Latency is higher than Deepgram’s, and the model isn’t optimized for real-time streaming to the same extent. Cost-per-minute is also among the higher tiers in the market. This positions it well for accuracy-critical workflows (post-call transcription, compliance logging, quality assurance) rather than live conversation, where response speed matters.
Best for: High-accuracy batch transcription, compliance workflows, post-call analytics.
3. AssemblyAI Universal-2
Universal-2 competes on the accuracy-first end of the market and stands out for what it bundles alongside raw transcription. The model supports 99+ languages and ships with integrated speech intelligence, sentiment analysis, PII detection, and speaker diarization, all of which are returned in the same API response. For teams already thinking about Asterisk CRM integration, Universal-2’s built-in sentiment analysis and PII detection considerably reduce downstream processing steps.
Best for: Analytics platforms, multi-speaker call recording, and enriched transcription pipelines.
4. Whisper Large V3 Turbo (Open-Source)
Whisper remains the foundation of most on-premise STT deployments. The V3 Turbo variant (released October 2024) delivers roughly 5x the speed of the original model through architectural optimization, making it far more viable for near-real-time applications on adequate GPU hardware. Self-hosted, it offers strong batch accuracy and zero per-minute API costs, but real-time streaming requires additional engineering via wrappers such as faster-whisper or whisper.cpp, and GPU infrastructure introduces ongoing costs and maintenance overhead.
Best for: On-premise and air-gapped deployments, cost-controlled batch processing.
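A minimal sketch of the self-hosted path using the faster-whisper package mentioned above, assuming a CUDA GPU; whether the large-v3-turbo checkpoint name resolves depends on your package version:

```python
# Sketch: near-real-time transcription with the faster-whisper wrapper,
# assuming a CUDA GPU. Checkpoint name resolution depends on package version.
from faster_whisper import WhisperModel

model = WhisperModel("large-v3-turbo", device="cuda", compute_type="float16")
segments, info = model.transcribe("call_recording.wav", beam_size=5)
print(f"detected language: {info.language} (p={info.language_probability:.2f})")
for segment in segments:
    print(f"[{segment.start:6.2f}s -> {segment.end:6.2f}s] {segment.text}")
```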
5. Google Cloud Chirp 2
Chirp 2 offers the broadest language coverage of any commercial STT option (100+ languages) and integrates tightly with the Google Cloud ecosystem. If your Asterisk infrastructure already runs on GCP, or if global multilingual coverage is a hard requirement, Chirp 2 is the path of least resistance. Accuracy and cost-per-minute are less competitive than Deepgram Nova-3 at equivalent quality tiers, and it’s not the right choice if latency is the primary optimization target.
Best for: GCP-native deployments and global, multilingual contact centers.
Model selection here comes down to one core question: are you optimizing for speed or accuracy? For real-time conversation, Nova-3 is the default. For post-call precision, gpt-4o-transcribe or Universal-2 are the stronger picks. For on-premise data sovereignty, Whisper V3 Turbo remains the benchmark across the best STT models available today.
What Are the Best Asterisk TTS Models in 2026? Asterisk Text to Speech Compared
The leading Asterisk TTS models in 2026 are ElevenLabs Flash v2.5 (best voice quality, 75ms TTFA), Cartesia Sonic (lowest latency at 40–95ms), Deepgram Aura-2 (unified STT+TTS stack), OpenAI gpt-4o-mini-tts (OpenAI ecosystem fit), and Azure Neural TTS (regulated industry compliance).
Asterisk text to speech is the last mile – what your caller actually hears. A 2024-era TTS engine paired with a 2026 LLM is like a high-resolution screen with a broken speaker. The right Asterisk TTS choice today depends on whether you’re optimizing for voice quality, latency, or compliance constraints.
Here’s how the top options break down:
1. ElevenLabs Flash v2.5
The current benchmark for voice quality in real-time applications. Flash v2.5 achieves ~75ms time-to-first-audio, supports 70+ languages, and produces the most human-like synthesized speech available via API.
The voice catalog (4,000+ voices) and voice-cloning capability are unmatched in the commercial market and are useful for branded agent deployments where voice identity is part of the product experience. For customer-facing applications where voice quality directly affects user trust and CSAT, ElevenLabs Flash is the standard; everything else is compared against it.
Best for: Customer service bots, branded voice agents, any deployment where voice realism is a business requirement.
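A rough sketch of streaming Flash v2.5 over ElevenLabs’ HTTP streaming endpoint with the requests package. The voice ID is a placeholder, and pcm_8000 output availability can vary by plan tier:

```python
# Rough sketch: stream ElevenLabs Flash v2.5 synthesis as raw 8kHz PCM.
# VOICE_ID is a hypothetical placeholder; pcm_8000 availability varies by tier.
import os
import requests

VOICE_ID = "your-voice-id"
url = f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}/stream"
resp = requests.post(
    url,
    params={"output_format": "pcm_8000"},
    headers={"xi-api-key": os.environ["ELEVENLABS_API_KEY"]},
    json={"text": "Your appointment is confirmed for Tuesday at ten.",
          "model_id": "eleven_flash_v2_5"},
    stream=True,
)
resp.raise_for_status()
with open("reply_8k.pcm", "wb") as f:
    for chunk in resp.iter_content(chunk_size=320):  # 320 bytes = one 20ms frame
        f.write(chunk)
```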
2. Cartesia Sonic / Sonic Turbo
Cartesia’s architecture is built on State Space Models (SSMs) rather than transformers, which enables 95ms TTFA on standard Sonic, with Turbo variants targeting 40ms.
Voice quality is strong for conversational use, though language support (15 languages) is significantly narrower than ElevenLabs. Instant voice cloning is available without tier restrictions. Cartesia is the right choice when raw latency is the primary optimization target and language breadth isn’t a constraint.
Best for: Ultra-low-latency voice agents, contact center deployments where response speed is the top priority.
3. Deepgram Aura-2
Purpose-built for the same production telephony environment as Nova-3. Aura-2 targets ~90ms optimized latency, supports 7 languages, and pairs natively with Nova-3 in a single-vendor pipeline, which simplifies authentication, billing, and support escalation. Voice quality is functional rather than best-in-class, but for teams who want a clean, maintainable stack without managing multiple vendor relationships, the tradeoff is often worth it.
Best for: Teams running Deepgram Nova-3 for STT who want a unified, single-vendor STT and TTS stack.
4. OpenAI TTS (gpt-4o-mini-tts)
Available in three variants — tts-1 (speed), tts-1-hd (quality), and gpt-4o-mini-tts (improved benchmark accuracy). The strongest case for OpenAI TTS is ecosystem cohesion: if your stack already uses GPT-4o for the LLM layer, keeping Asterisk text-to-speech within the same API ecosystem means a single authentication, unified billing, and a single support relationship.
Voice range is solid; emotional nuance lags ElevenLabs. Latency is acceptable for near-real-time use, but it doesn’t match Cartesia or ElevenLabs Flash for low-latency streaming.
Best for: OpenAI-native stacks, unified LLM + Asterisk TTS pipelines.
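A short sketch with the official openai Python SDK; the voice name and prompt text are illustrative. Note that OpenAI’s raw PCM output is 24kHz mono, so a telephony pipeline must downsample to 8kHz before the AudioSocket leg:

```python
# Sketch using the official openai SDK; voice and text are illustrative.
# OpenAI's raw PCM output is 24kHz mono, so a telephony pipeline must
# downsample to 8kHz before writing frames to the AudioSocket leg.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
with client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="alloy",
    input="Thanks for calling. How can I help you today?",
    response_format="pcm",
) as response:
    response.stream_to_file("reply_24k.pcm")
```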
5. Amazon Polly / Google Cloud TTS / Azure Neural TTS
The hyperscaler options remain relevant for specific scenarios rather than being the default choice. Azure leads for regulated industries; it holds FedRAMP High, HIPAA, and DoD IL5 certifications and can be deployed as on-premise containers.
Amazon Polly fits AWS-native architectures where integration simplicity outweighs voice quality. Google Cloud TTS is well-suited to GCP environments and offers the widest language coverage (100+ languages) at competitive per-character pricing for standard voices. Neural voice quality from all three has improved meaningfully since 2024 and is production-viable — just no longer best-in-class for voice realism.
Best for: Cloud-native ecosystem alignment (GCP/AWS/Azure), regulated-industry compliance, and cost-sensitive, high-volume deployments.
The decision framework mirrors the STT side: if voice quality is your business differentiator, ElevenLabs. If latency is the constraint, Cartesia.
If you want to minimize vendor complexity, Deepgram’s unified stack is the cleanest option. The hyperscalers are the right answer when compliance requirements, existing cloud contracts, or language coverage needs take precedence over quality and latency comparisons.
How Do the Best STT Models Compare on Accuracy and Latency?
Word Error Rate (WER) is the primary accuracy metric for the best STT models; lower is better. Below 10% is production-ready; below 5% is high-accuracy; above 15% causes meaningful degradation in downstream LLM quality.
Benchmarks mean nothing without context. A 12% WER on clean English is very different from 12% on accented telephony audio with background noise – which is what Asterisk voicebots actually encounter. Real-world performance on your own call recordings is the only number that matters.
Here’s how the leading STT and TTS models compare on key STT production metrics:
| Model | WER (Real-World) | Streaming Latency | Languages | Deployment |
|---|---|---|---|---|
| Deepgram Nova-3 | 5.26–12.8% | Sub-300ms | 30+ | Cloud / On-prem |
| OpenAI gpt-4o-transcribe | 8.9% | Moderate | 50+ | Cloud |
| AssemblyAI Universal-2 | 14.5% | Moderate | 99+ | Cloud |
| Whisper Large V3 Turbo | ~10.6% | Seconds (self-hosted) | 99 | On-prem |
| Google Chirp 2 | 9.8% | Low-moderate | 100+ | Cloud |
WER figures sourced from Artificial Analysis independent benchmarks (May 2025) and provider documentation. Numbers vary by audio quality, accent distribution, and domain vocabulary.
These numbers give you a starting shortlist, not a final answer. Deepgram Nova-3 leads on the latency-accuracy tradeoff for real-time streaming. OpenAI and Google are closer on pure accuracy for batch use cases. Always benchmark against your own call recordings — specifically against your most common failure modes (heavy accents, noisy environments, domain jargon), before committing to a model in production.
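When you run that benchmark, WER is straightforward to compute yourself. Here’s a minimal sketch using the jiwer package against your own reference transcripts (the transcripts below are illustrative):

```python
# Minimal sketch: score candidate STT output against your own reference
# transcripts with the jiwer package. Transcripts here are illustrative.
import jiwer

reference = "my account number is four five one two"
hypotheses = {
    "model_a": "my account number is four five one two",
    "model_b": "my account number is for five one too",
}

for model, hyp in hypotheses.items():
    print(f"{model}: WER {jiwer.wer(reference, hyp):.1%}")
```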
Should You Use a Speech-to-Speech Model Instead of a Separate STT and TTS Pipeline?
Speech-to-speech (STS) models like OpenAI Realtime API and Google Gemini 2.5 Live receive audio in and return audio out — no separate STT and TTS step. They reduce complexity and can hit sub-500ms latency, but offer less component-level control and cost more per minute at scale.
The modular STT and TTS pipeline remains the standard for good reason: you can swap one component without breaking the others. But as STS models mature, the question of whether that modularity is worth the added complexity is legitimate.
Two platforms have mature STS integrations with Asterisk:
- OpenAI Realtime API accepts bidirectional audio over WebSocket. Audio goes in via input_audio_buffer.append, the model handles VAD turn-taking server-side, and response audio streams back as output_audio.delta chunks, injected directly into the RTP stream via ARI’s ExternalMedia channel. The entire STT and TTS pipeline is replaced with a single WebSocket connection (sketched after this list).
- Google Gemini 2.5 Live offers similar STS capability and has documented integration paths with Asterisk via the AudioSocket channel and the open-source AVR (Agent Voice Response) infrastructure project.
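Here’s a rough sketch of the Realtime audio loop described above, using the websockets package. The model query parameter and the exact server event names are assumptions that vary across API versions, so treat this as shape rather than a definitive implementation; a real session would also send a session.update first to pin the audio formats:

```python
# Rough sketch of the Realtime audio loop with the websockets package.
# Model name and event strings are assumptions that vary by API version;
# a real session would first send session.update to pin audio formats.
import asyncio, base64, json, os
import websockets  # >= 14 for additional_headers (older releases: extra_headers)

async def realtime_bridge(mic: asyncio.Queue, speaker: asyncio.Queue):
    url = "wss://api.openai.com/v1/realtime?model=gpt-realtime"
    headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}
    async with websockets.connect(url, additional_headers=headers) as ws:

        async def uplink():  # AudioSocket PCM -> Realtime input buffer
            while True:
                pcm = await mic.get()
                await ws.send(json.dumps({
                    "type": "input_audio_buffer.append",
                    "audio": base64.b64encode(pcm).decode(),
                }))

        async def downlink():  # Realtime audio deltas -> caller
            async for raw in ws:
                event = json.loads(raw)
                if event.get("type") == "response.output_audio.delta":
                    await speaker.put(base64.b64decode(event["delta"]))

        await asyncio.gather(uplink(), downlink())
```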
STS reduces complexity and can improve latency, but you lose independent control over transcription, LLM, and Asterisk text to speech tuning. Per-minute costs at scale are also higher than those of a well-optimized modular stack.
For high-volume production contact centers weighing architecture decisions, understanding the best Asterisk solutions for business growth helps frame whether STS or a modular pipeline better fits your scaling goals.
How Much Does It Cost to Run an Asterisk AI Voicebot with Cloud STT and TTS Models?
Running an Asterisk AI voicebot on cloud STT and TTS models costs roughly $11–25 per 1,000 call minutes for budget-to-mid-tier stacks, and $100–200 for premium voice quality or STS models. Asterisk TTS is the dominant cost variable; quality tiers vary by up to 40x across providers.
Cost modeling is harder than it looks because STT and TTS providers don’t use consistent billing units: STT is priced per minute, Asterisk text to speech per 1,000 characters, and STS per conversation minute. Comparing them requires normalization that most vendor pricing pages skip.
STT Pricing (per minute of audio):
| Provider | Model | Price/min | Billing Type | Free Tier | Notes |
|---|---|---|---|---|---|
| Deepgram | Nova-3 (streaming) | ~$0.0077 | Per minute | $200 free credit on signup | Pay-as-you-go; volume discounts from $0.0055/min at 1M+ minutes |
| Deepgram | Nova-3 (batch) | ~$0.0043 | Per minute | $200 free credit on signup | Batch/pre-recorded only; not suitable for real-time Asterisk streaming |
| AssemblyAI | Universal-2 | ~$0.006 | Per minute | ~$50 free credit | ~$0.10–0.15/hour; includes diarization, sentiment, and PII detection in the same call |
| OpenAI | gpt-4o-transcribe | ~$0.006 | Per minute | No free tier | Higher cost at volume; best for accuracy-first batch workflows |
| OpenAI | Whisper API | $0.006 | Per minute | No free tier | Batch only; no streaming support; fixed flat rate regardless of volume |
| Google Cloud | Chirp 2 | $0.016 | Per minute | 60 mins/month free | Priced higher than Deepgram; best for GCP-native or multilingual deployments |
TTS Pricing (per 1,000 characters):
| Provider | Model | Price/1K chars | Billing Type | Free Tier | Notes |
|---|---|---|---|---|---|
| Google Cloud | Standard voices | $0.004 | Per character | 4M chars/month free | Lowest cost option; functional quality; best for cost-sensitive high-volume deployments |
| Amazon Polly | Neural | $0.016 | Per character | 1M chars/month free (12 months) | AWS-native billing; neural voices only; no real-time streaming TTFA advantage |
| Azure | Neural TTS | ~$0.016 | Per character | 0.5M chars/month free | On-premise container available; HIPAA/FedRAMP compliant; 129 language locales |
| Deepgram | Aura-2 | $0.015 (est.) | Per character | $200 free credit on signup | Purpose-built for telephony; pairs natively with Nova-3; simplified single-vendor billing |
| OpenAI | tts-1 | ~$0.015 | Per character | No free tier | Speed-optimized; lower voice quality than tts-1-hd; good for latency-sensitive pipelines |
| OpenAI | tts-1-hd | ~$0.030 | Per character | No free tier | Quality-optimized; higher latency than tts-1; not recommended for real-time streaming |
| OpenAI | gpt-4o-mini-tts | Token-based | Per token | No free tier | Most accurate OpenAI TTS; priced per token, not character; costs vary with output length |
| Google Cloud | Chirp 3 HD | $0.030 | Per character | 4M chars/month free (standard only) | Premium tier; significantly better voice quality than standard; 100+ languages |
| Cartesia | Sonic | ~$0.030 | Per character | Limited free tier on signup | SSM-based architecture; 95ms TTFA; instant voice cloning included at all tiers |
| Cartesia | Sonic Turbo | ~$0.030 | Per character | Limited free tier on signup | ~40ms TTFA; same price as Sonic; choose Turbo when latency is the primary constraint |
| ElevenLabs | Flash v2.5 | ~$0.18–$0.30 | Per credit/character | 10K chars/month on free plan | Most expensive but best-in-class voice quality; 4,000+ voices; voice cloning available |
Real-World Cost Estimate: 1,000 Minutes of AI Voicebot Calls:
Using 1,000 call minutes as a baseline (3-minute average call = ~333 calls):
- Budget stack – Deepgram Nova-3 STT + Google Cloud TTS Standard: STT: $7.70 | TTS: ~$3–6 | Total: ~$11–14 per 1,000 call minutes
- Mid-tier stack – Deepgram Nova-3 STT + Deepgram Aura-2 TTS: Total: ~$20–25 per 1,000 call minutes (single vendor)
- Premium stack – Deepgram Nova-3 STT + ElevenLabs Flash v2.5 TTS: STT: $7.70 | TTS: ~$90–150 | Total: ~$100–160 per 1,000 call minutes
- STS stack – OpenAI Realtime API: Total: ~$100–200 per 1,000 call minutes
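A back-of-envelope sketch of how those estimates normalize, assuming the bot speaks roughly half of each call at a conversational pace (~750 TTS characters per call minute, an illustrative figure):

```python
# Illustrative normalization of the stacks above. Assumes the bot speaks
# ~half of each call at conversational pace (~750 TTS chars per call minute).
def stack_cost(minutes, stt_per_min, tts_per_1k_chars, tts_chars_per_min=750):
    stt = minutes * stt_per_min
    tts = minutes * tts_chars_per_min / 1000 * tts_per_1k_chars
    return stt, tts

for name, stt_rate, tts_rate in [
    ("budget: Nova-3 + Google Standard", 0.0077, 0.004),
    ("premium: Nova-3 + ElevenLabs Flash", 0.0077, 0.18),
]:
    stt, tts = stack_cost(1_000, stt_rate, tts_rate)
    print(f"{name}: STT ${stt:.2f} + TTS ${tts:.2f} = ${stt + tts:.2f}")
```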
The numbers make the decision straightforward: if your deployment can absorb a higher per-minute Asterisk TTS cost in exchange for best-in-class voice quality, ElevenLabs Flash delivers it. If you’re optimizing for cost at volume, Deepgram’s unified stack or the Google Standard voice tier are the right levers. STS makes sense when it simplifies your architecture, not when you’re trying to cut costs.
(Pricing as of early 2026. Actual costs vary by call patterns, TTS output volume, and vendor tier. Always model against your specific usage before committing.)
Which Open-Source STT and TTS Models Work with Asterisk for On-Premise Deployments?
The leading open-source STT models for on-premise Asterisk are Whisper Large V3 Turbo (best accuracy, requires GPU) and NVIDIA Canary/Parakeet TDT (top leaderboard WER). For Asterisk TTS, Coqui and Piper are the primary options, both also deployable as a TTS engine for Android in hybrid mobile deployments.
On-premise is a legitimate architectural path in 2026, not a workaround — especially for teams with data privacy requirements or high call volumes. Infrastructure typically breaks even vs. cloud APIs at 5,000–15,000 minutes/month, and open-source accuracy is now competitive with cloud alternatives in controlled conditions.
Here’s what’s worth evaluating:
STT:
- Whisper Large V3 Turbo (OpenAI, MIT license) – Best open-source STT accuracy. Requires a GPU for real-time performance. Not natively streaming; requires a wrapper like faster-whisper or whisper.cpp for Asterisk AudioSocket integration.
- NVIDIA Canary / Parakeet TDT – Currently tops the Hugging Face Open ASR Leaderboard with sub-6% WER. Requires NVIDIA GPU infrastructure.
- Vosk — Lightweight, CPU-runnable. Lower accuracy than Whisper but viable for controlled-vocabulary IVR use cases or resource-constrained environments.
TTS:
- Coqui TTS – Open-source, solid voice quality for English. Community-maintained. Also deployable as a TTS engine for Android in mobile companion apps.
- Piper / Kokoro – Lightweight, CPU-runnable options used widely in local AI agent setups and edge device deployments, including as a TTS engine for Android.
- Azure Neural TTS Container – Commercial, but deployable on-premise with full regulatory compliance documentation (HIPAA, FedRAMP).
The economics are straightforward: GPU cloud instances suitable for real-time Whisper run at $200–$800/month depending on instance type, plus engineering overhead. At 5,000–15,000 call minutes per month, on-premise typically breaks even with cloud API costs. Below that threshold, cloud APIs are almost always the more efficient choice.
Wrapping Up
Picking the right models gets you halfway there. The other half is execution: AudioSocket integration, LLM orchestration, latency tuning, and a stack that holds up at scale. That’s where most deployments hit a wall.
Whether you’re choosing between the best STT options, evaluating TTS and STT models combined, or weighing speech-to-speech architectures, the model selection decision shapes everything downstream.
Hire a VoIP developer who specializes in building Asterisk AI voicebot pipelines from the ground up, with the kind of hands-on experience that saves months of trial and error. If the architecture in this guide is what you want to end up with, start here.