BUILT FOR STREAMING AND AGENTS

Streaming TTS that starts from your first LLM token

Clone any voice from 6s of audio. Speak 8 languages with a native accent. $0.015/min, up to 5x cheaper than ElevenLabs Multilingual. WebSocket API, Python & TypeScript SDKs.

Try Playground

Get API Key

Read Docs

4 lines to streaming TTS

from palabra import PalabraClient
client = PalabraClient("pk_live_...")
voice = client.clone_voice("ref.wav")          # 6s audio → cloned voice
async for chunk in client.stream(text=llm, voice=voice, language="de"):
    player.play(chunk)                   # ~100ms to first audio


DROP-IN COMPATIBLE WITH

Vapi

Retell

LiveKit

Pipecat

Vocode

Other TTS APIs wait for a full sentence. We don't.

Traditional TTS buffers a full sentence of your LLM's output before synthesis starts. That's 300–800ms of dead air your user hears as lag. Palabra begins synthesizing after just 2 words, while the LLM is still generating.

This is the single biggest improvement you can make to perceived voice agent speed — and only Palabra offers it.

Palabra starts synthesis after 2 words. Others wait for the full sentence.

LLM: Tokens generating…

Others: Waiting… then audio (~500ms)

Palabra: Audio from word 2 (~100ms)
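
To make the contrast concrete, here is a minimal sketch of the two-word flushing idea applied to a plain list of tokens. It is an illustration only, not Palabra's implementation: the function name `flush_after_words`, the `min_words` parameter, and the sample tokens are all hypothetical, and words are split on spaces only.

```python
from typing import Iterable, Iterator


def flush_after_words(tokens: Iterable[str], min_words: int = 2) -> Iterator[str]:
    """Yield text as soon as `min_words` complete words are buffered.

    Hypothetical helper for illustration. A word counts as complete
    only once a space follows it, so partial tokens are never cut.
    """
    buffer = ""
    for token in tokens:
        buffer += token
        last_space = buffer.rfind(" ")
        if last_space == -1:
            continue  # no complete word yet
        complete = buffer[: last_space + 1]
        if len(complete.split()) >= min_words:
            yield complete
            buffer = buffer[last_space + 1 :]
    if buffer:
        yield buffer  # flush whatever is left when the LLM stops


# Tokens as an LLM might emit them (made-up example).
tokens = ["Hel", "lo", " wor", "ld,", " how", " are", " you", "?"]
segments = list(flush_after_words(tokens))
print(segments)
```

A sentence-buffering approach would hand nothing to the synthesizer until the final "?" arrives; here the first segment is ready as soon as the second word is complete.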

Purpose-built for real-time voice

Not a general-purpose TTS. Built for where latency, naturalness, and unit economics all matter at once.

VAPI • RETELL • LIVEKIT

Voice Agents

Streaming input eliminates text buffering — the #1 source of perceived lag. Clean barge-in handling. Your agent responds in real-time.

Μ-LAW • PCM • G.711

Telephony & IVR

Native G.711 μ-law for direct SIP trunk integration. ~100ms time to first audio keeps calls flowing. Clone your brand voice once, use it across every region.
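
For readers unfamiliar with the format, G.711 μ-law compresses each 16-bit PCM sample into a single byte. The sketch below shows the standard encoding step in pure Python. It only illustrates the codec and is not Palabra's code; `linear_to_ulaw` is a name chosen here. With native μ-law output you would not need to run this step yourself.

```python
BIAS = 0x84    # bias added before the segment search (standard for μ-law)
CLIP = 32635   # largest magnitude that still fits after adding the bias


def linear_to_ulaw(sample: int) -> int:
    """Encode one signed 16-bit PCM sample as one G.711 μ-law byte."""
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), CLIP) + BIAS

    # Find the segment (exponent): position of the highest set bit, from 7 down to 0.
    exponent = 7
    mask = 0x4000
    while exponent > 0 and not (magnitude & mask):
        exponent -= 1
        mask >>= 1

    # Four mantissa bits taken just below the leading bit of the segment.
    mantissa = (magnitude >> (exponent + 3)) & 0x0F

    # μ-law bytes are transmitted bit-inverted.
    return ~(sign | (exponent << 4) | mantissa) & 0xFF


def encode_pcm16(samples: list[int]) -> bytes:
    """Encode a list of PCM16 samples into μ-law bytes."""
    return bytes(linear_to_ulaw(s) for s in samples)


print(encode_pcm16([0, 32767, -32768]).hex())
```

Silence encodes to `0xFF`, and the loudest positive and negative samples encode to `0x80` and `0x00`.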

DEACCENTING • 8 LANGUAGES

Multilingual Content

One speaker, 8 languages, zero foreign accent. Your German content sounds German. That used to require 8 voice actors.

CAPABILITIES

Next-Gen AI TTS Engine: What You Can't Get Anywhere Else

Two exclusive features plus the fastest, cheapest voice cloning on the market.

Real-Time Deaccenting

★ Only Palabra

Clone a voice in English, speak in German — it sounds like a native German speaker. The voice identity stays. The foreign accent disappears. No other provider does this.

True Streaming Input — From 2 Words

★ Only Palabra

Pipe LLM tokens directly into our WebSocket. Audio starts generating before the sentence finishes. Every other provider buffers a full sentence. This eliminates 300–800ms of text buffer wait.

6-Second AI TTS Voice Cloning

ElevenLabs needs 60+ seconds. Cartesia doesn't offer cloning. We clone from 6 seconds via dual conditioning — discrete codes for the LLM, continuous embeddings for the decoder.

$0.015/minute. Transparent.

2.5x cheaper than ElevenLabs Flash. 5x cheaper than Multilingual. No concurrency caps. No tiered latency penalties. Voice cloning included.

Production-Grade Streaming

Full-duplex WebSocket. PCM, Opus, MP3, μ-law output. Built-in speed control. Clean barge-in — cancel/flush with zero trailing artifacts. 99.9% SLA.

~100ms TTFA

First audio chunk in ~100ms (P90). Competitive with Cartesia Sonic-3 at a lower per-character price. 2x faster than Deepgram.
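
A P90 figure means 90% of requests reached first audio at or under that time. To check TTFA in your own environment, a minimal sketch could look like the one below. The helper names and the fake stream are assumptions for illustration, and the code times a plain synchronous iterator, so it would need adapting for an async SDK.

```python
import time
from typing import Callable, Iterable, List


def time_to_first_chunk_ms(make_stream: Callable[[], Iterable[bytes]]) -> float:
    """Milliseconds from starting the request until the first audio chunk."""
    start = time.perf_counter()
    for _chunk in make_stream():
        return (time.perf_counter() - start) * 1000.0
    raise RuntimeError("stream ended without producing audio")


def p90(samples_ms: List[float]) -> float:
    """Nearest-rank 90th percentile of the samples."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    rank = (9 * len(ordered) + 9) // 10  # ceil(0.9 * n) in integer math
    return ordered[rank - 1]


def fake_stream() -> Iterable[bytes]:
    """Stand-in for a real TTS stream (hypothetical)."""
    time.sleep(0.01)  # simulated network and synthesis delay
    yield b"\x00" * 320


samples = [time_to_first_chunk_ms(fake_stream) for _ in range(20)]
print(f"P90 TTFA: {p90(samples):.1f} ms")
```

Measure from the moment the first text token is sent, and from the region your users are in; network round-trip time is part of what they hear.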

Built for production from day one

Try Playground

Get pricing

DEVELOPER-FIRST

Ship in an afternoon

WebSocket API. Python & TypeScript SDKs. Drop-in compatible with the platforms you already use. Average integration time: 40 minutes.

VAPI

RETELL

LIVEKIT

PIPECAT

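from palabra import PalabraClient
client = PalabraClient("pk_live_...")

# clone any voice from 6s audio
voice = client.clone_voice("ref.wav")

# llm_token_stream: your LLM's token iterator; player: your audio sink
async def speak(llm_token_stream, player):
    # stream TTS – starts at 2 words
    async for chunk in client.stream(
        text=llm_token_stream,
        voice=voice,
        language="de",          # native accent
        format="opus",
    ):
        player.play(chunk)       # ~100ms TTFA

# run inside an event loop, e.g. asyncio.run(speak(llm_token_stream, player))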

import { PalabraClient } from 'palabra'
const client = new PalabraClient("pk_live_...")

// clone any voice from 6s audio
const voice = await client.cloneVoice("ref.wav")

// stream TTS – starts at 2 words
for await (const chunk of client.stream({
    text: llmTokenStream,
    voice: voice,
    language: "de",        // native accent
    format: "opus",
})) {
    player.play(chunk)       // ~100ms TTFA
}

# clone voice
curl -X POST https://api.palabra.ai/v1/voice/clone \
  -H "Authorization: Bearer pk_live_..." \
  -F "file=@ref.wav"   # form field name "file" is an assumption

# stream TTS
curl -X POST https://api.palabra.ai/v1/tts/stream \
  -H "Authorization: Bearer pk_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "text": "llm_token_stream",
    "voice": "voice_id",
    "language": "de",
    "format": "opus"
  }'

Transparent TTS Pricing: How Palabra Stacks Up

We show where we win and where others lead. You decide what matters.

Streaming Input

★ From 2 words of sentence

Full sentence

Partial

Realtime Deaccenting

★ Yes

Voice Cloning

6 seconds

60s+

Yes

Price / 1K chars

$0.02

$0.05–$0.10

~$0.030

$0.030

Price / Minute*

~$0.015

~$0.04–$0.08

~$0.027

~$0.023

TTFA (P90)

~100ms

~150ms

90ms

~200ms

Languages

70+

Perceptual Quality

Optimized for speed & business fit

Optimized for natural-sounding voice

Good

See what you'd save

Drag the slider to your monthly volume. Based on published API rates, not subscription estimates.

Monthly minutes

50,000

$750

$1,150

$1,350

$3,750

You save $3,000/mo vs ElevenLabs Multilingual — that's 80%
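
The savings figure follows from simple per-minute arithmetic. A minimal sketch, assuming the $0.015/min Palabra rate stated on this page and an ElevenLabs Multilingual rate of $0.075/min, which is inferred here from the $3,750 figure at 50,000 minutes rather than quoted directly:

```python
from decimal import Decimal

PALABRA_RATE = Decimal("0.015")       # $/min, as stated on this page
MULTILINGUAL_RATE = Decimal("0.075")  # $/min, assumption: $3,750 / 50,000 min


def monthly_cost(minutes: int, rate_per_min: Decimal) -> Decimal:
    """Monthly spend in dollars for a given volume and per-minute rate."""
    return rate_per_min * minutes


minutes = 50_000
ours = monthly_cost(minutes, PALABRA_RATE)
theirs = monthly_cost(minutes, MULTILINGUAL_RATE)
saved = theirs - ours
percent = saved / theirs * 100

print(f"${ours:,.0f} vs ${theirs:,.0f}: save ${saved:,.0f}/mo ({percent:.0f}%)")
```

Because both costs scale linearly with minutes, the percentage saved stays the same at any volume; only the dollar amount changes.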

Start building today.

Get API key

FROM OUR USERS

Developers ship faster with Palabra

"Switched from ElevenLabs Multilingual. TTS bill dropped 80%, latency is comparable, and streaming input eliminated our text buffering step entirely."

Voice Agent Startup

Series A • Bay Area

"Deaccenting is magic. We serve 6 EU markets with one cloned brand voice and each market thinks it's local. Replaced 6 voice actors."

Enterprise SaaS

EU • Multilingual support

"Integration was 40 minutes. Swapped out Deepgram, TTFA dropped from 210ms to 98ms. Vapi plugin made it trivial."

Contact Center Platform

Vapi integration

Eight languages. Each trained to near-native.

55.4% WER improvement over baseline via GRPO training.

English

German

Spanish

French

Italian

Dutch

Portuguese

Russian

Hindi

Mandarin

Korean

Japanese

Built for production. Certified for compliance.

Cloud, self-hosted, or on-premises.

ISO 27001

Certified infrastructure

GDPR

EU data processing, DPA ready

99.9% SLA

Status page & incident response

Zero Training

Your data never trains our models

FAQ About Our Neural TTS

How does voice quality compare to ElevenLabs?

ElevenLabs optimizes for studio-quality English voice. Palabra is optimized for production voice agents: low latency, multilingual cloning with a native accent, and transparent pricing.

Can I self-host or deploy on-premises?

Yes. Palabra TTS is available as a cloud API, self-hosted deployment, or on-premises installation. All options are ISO 27001-certified and GDPR-compliant. Your data never trains our models. Contact sales for self-hosted and on-premises options.

How fast is integration?

Average integration time is 40 minutes.

What does "streaming input" actually mean?

Every other TTS provider requires a complete sentence before synthesis starts, and that buffer adds 300–800ms of dead air your users hear as lag. Palabra accepts LLM tokens as they stream and begins audio synthesis after just 2 words, while the LLM is still generating. Time-to-first-audio is ~100ms (P90).

How does voice cloning work, and how much audio do I need?

Just 6 seconds of reference audio. We use dual conditioning — discrete codes for the LLM and continuous embeddings for the decoder — to capture both voice identity and natural prosody without fine-tuning. ElevenLabs requires 60+ seconds; Cartesia doesn't offer cloning at all.

What is deaccenting and why does it matter?

Most TTS engines carry the phonetic habits of a cloned voice into every language it speaks — so an English-cloned voice sounds foreign in German, even if the words are correct. Deaccenting separates voice identity (timbre, resonance, prosodic character) from accent, so the output sounds like a native speaker of the target language while still sounding like the original person.

What does it cost?

$0.015 per minute, voice cloning included. No concurrency caps, no tiered latency penalties at volume. That's up to 5x cheaper than ElevenLabs Multilingual and 2x faster than Deepgram.