examples/multi-modal
Strong tier · 9 min build

Real-time voice agent in 80 lines (STT → LLM → TTS)

Three modalities, one key: Deepgram streaming transcription feeds Kimi K2.6, which streams into ElevenLabs — voice-to-voice with p50 latency under 650ms. Runs on the free Kimi tier through Apr 30.

9 min read · published 2026-04-25 · category: Multi-modal
Speech-to-text to LLM to text-to-speech pipeline with total p50 latency under 650ms

A voice agent is the clearest showcase of a unified gateway. The pipeline has three completely different modalities — streaming speech, a text LLM, streaming speech synthesis — and most teams lose a week wiring them together with three SDKs, three invoices, and three rate-limit headaches.

On AIgateway it's eighty lines: one key, one SDK, three model slugs. The sub-650ms end-to-end budget comes from streaming every stage: STT emits partial transcripts, the LLM streams tokens, and the TTS receives tokens batched into sentences and speaks them in real time.

Prerequisites: AIgateway key · Python 3.11+ · websockets (audio stream) · pyaudio (mic + speaker)
Note
The Kimi K2.6 trial covers the LLM step on the free tier through Apr 30, 2026. Deepgram and ElevenLabs bill pass-through at their published rates — under a quarter per 60-minute voice call for a typical workload.

Build it in four steps

  1. STEP 01

    Open the mic, stream to Deepgram

    Send raw 16-bit PCM from the mic to AIgateway's Deepgram-compatible streaming endpoint. Partial transcripts come back every ~100ms; treat each finalized utterance as a turn.

    import asyncio, json, websockets, pyaudio
    
    STT_URL = "wss://api.aigateway.sh/v1/audio/transcriptions/stream?model=deepgram/nova-3"
    
    async def listen(on_utterance):
        mic = pyaudio.PyAudio().open(rate=16000, channels=1, format=pyaudio.paInt16,
                                     input=True, frames_per_buffer=1024)
        async with websockets.connect(STT_URL, extra_headers={"Authorization": "Bearer sk-aig-..."}) as ws:
            async def mic_loop():
                while True:
                    # mic.read blocks; hand it to a thread so the event loop keeps draining
                    await ws.send(await asyncio.to_thread(mic.read, 1024, exception_on_overflow=False))
            sender = asyncio.create_task(mic_loop())
            try:
                async for msg in ws:
                    evt = json.loads(msg)
                    if evt.get("is_final"):
                        await on_utterance(evt["transcript"])
            finally:
                sender.cancel()
  2. STEP 02

    Stream the transcript to Kimi K2.6

    Each utterance becomes a user turn. Stream the LLM response token-by-token so the TTS can start speaking before the LLM is done thinking.

    from openai import AsyncOpenAI
    
    client = AsyncOpenAI(base_url="https://api.aigateway.sh/v1", api_key="sk-aig-...")
    SYSTEM = "You are a helpful voice assistant. Keep replies under 3 sentences, conversational, no markdown."
    
    async def think(history: list[dict], on_token):
        stream = await client.chat.completions.create(
            model="moonshot/kimi-k2.6",
            messages=[{"role": "system", "content": SYSTEM}, *history],
            stream=True,
            extra_headers={"x-aig-tag": "voice.llm"},
        )
        async for chunk in stream:
            tok = chunk.choices[0].delta.content
            if tok:
                await on_token(tok)
  3. STEP 03

    Speak tokens with ElevenLabs Turbo

    Open a TTS websocket, push sentences as they form (flush on ". " / "? " / "! "). ElevenLabs streams audio back in real time; write it to the speaker.

    TTS_URL = "wss://api.aigateway.sh/v1/audio/speech/stream?model=elevenlabs/eleven-turbo-v3&voice=rachel"
    
    async def speak():
        buf = ""
        # connect without `async with` — the socket must outlive this function,
        # which returns on_token for the LLM stage to call later
        ws = await websockets.connect(TTS_URL, extra_headers={"Authorization": "Bearer sk-aig-..."})
        speaker = pyaudio.PyAudio().open(rate=24000, channels=1, format=pyaudio.paInt16, output=True)
    
        async def on_token(tok):
            nonlocal buf
            buf += tok
            # flush on sentence boundaries so audio starts before the reply is complete
            if buf and buf[-1] in ".?!":
                await ws.send(json.dumps({"text": buf}))
                buf = ""
    
        async def drain():
            async for audio in ws:
                speaker.write(audio)
    
        on_token._drain = asyncio.create_task(drain())  # keep a reference so the task isn't GC'd
        return on_token
  4. STEP 04

    Wire it together

    Hook the three stages up and you've got a full voice loop. Keep history in memory, add a hang-up keyword, ship.

    async def main():
        history = []
        on_token = await speak()
    
        async def on_utterance(text: str):
            print("user:", text)
            if text.strip().lower() in {"bye", "goodbye", "hang up"}:
                raise SystemExit
            history.append({"role": "user", "content": text})
            reply = []
            async def tee(tok):  # speak the token and keep it for history
                reply.append(tok)
                await on_token(tok)
            await think(history, tee)
            history.append({"role": "assistant", "content": "".join(reply)})
    
        await listen(on_utterance)
    
    asyncio.run(main())

Latency budget

The headline p50 of 640ms end-to-end breaks down as: 140ms STT final, 320ms first-token-from-Kimi, 180ms TTS first-audio. The three stages overlap because every link is streamed — first audio leaves the speaker before the LLM finishes thinking.
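The budget is additive over first outputs only — a quick sanity check on the numbers above:

```python
# p50 time-to-first-output per stage, from the budget above (ms)
stt_final = 140        # final transcript after end of speech
llm_first_token = 320  # Kimi K2.6 first token
tts_first_audio = 180  # ElevenLabs first audio chunk

# streamed pipeline: first audio waits on each stage's *first* output,
# not its full response, so the stages overlap after that point
first_audio = stt_final + llm_first_token + tts_first_audio
print(first_audio)  # → 640
```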

If you need tighter latency, the biggest wins are a shorter system prompt (every system-prompt token adds a flat cost to first-token time) and voice-tuned TTS settings. Kimi K2.6 is already comfortably fastest-in-class for voice workloads; swapping in GPT-5.4 typically adds 80-150ms to first-token time.

Add a tool belt

A voice agent with tools is where product value lives — "book me the flight," "check my calendar," "order the usual." Kimi K2.6 handles tool calling natively, so you can add handlers to the `think` step and route voice output around them.

Same session, same history, same key. The only change is a `tools=[...]` argument on the chat-completions call.

tools = [{
    "type": "function",
    "function": {
        "name": "lookup_order",
        "description": "Find a customer's last order.",
        "parameters": {"type": "object", "properties": {"email": {"type": "string"}},
                       "required": ["email"]},
    },
}]

# Pass tools into think() and handle tool_calls in the stream.
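A sketch of the stream-side handling, assuming OpenAI-style streamed `tool_calls` deltas (the name arrives once, the JSON arguments arrive as string fragments); `accumulate_tool_calls` is a hypothetical helper for illustration, not part of any SDK:

```python
import json

def accumulate_tool_calls(deltas):
    """Merge streamed tool_call deltas into complete calls.

    Each delta looks like {"index": 0, "function": {"name": ..., "arguments": ...}};
    the name arrives once, the arguments arrive as JSON string fragments."""
    calls = {}
    for d in deltas:
        slot = calls.setdefault(d["index"], {"name": "", "arguments": ""})
        fn = d.get("function", {})
        if fn.get("name"):
            slot["name"] = fn["name"]
        if fn.get("arguments"):
            slot["arguments"] += fn["arguments"]
    return [{"name": c["name"], "arguments": json.loads(c["arguments"])}
            for c in calls.values()]

# fragments as they might arrive over the stream
deltas = [
    {"index": 0, "function": {"name": "lookup_order", "arguments": ""}},
    {"index": 0, "function": {"arguments": '{"email": '}},
    {"index": 0, "function": {"arguments": '"a@b.co"}'}},
]
print(accumulate_tool_calls(deltas))
# → [{'name': 'lookup_order', 'arguments': {'email': 'a@b.co'}}]
```

Once a call is complete, run your handler, append the tool result to `history` as a `tool` message, and make one more streamed completion so the agent can speak the answer.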

FAQ

Do I need Deepgram and ElevenLabs accounts?

No. Your AIgateway key reaches both providers — the gateway bills their usage pass-through against your balance with a 5% platform fee. No second signup, no second invoice, no second rate limit to manage.

Can I swap voices?

Change the `voice=` query param on the TTS URL. ElevenLabs' full voice library is reachable, and Cartesia, PlayHT, and Deepgram Aura are available with a provider prefix change.

What about on-device voice?

The mic and speaker loops are local; only the three inference calls are network. For fully offline inference, swap to the Workers-AI Whisper (STT) and MeloTTS (TTS) slugs plus a local Kimi K2.6 Base or Llama 4.1 deployment — the Python code doesn't change.

How do I handle interruption?

When STT fires a finalized utterance while TTS is still speaking, cancel the outstanding TTS write and start the new think() call. The example repo has the four extra lines in the github gist linked from the CTA banner.

Is this low-latency enough for phone calls?

Yes for most use cases. Production phone systems usually sit at p50 500-800ms; our 640ms is in-band. For telco-level requirements (p50 under 400ms), pre-warm the TTS connection and pin the region to your carrier's PoP.

What does a real 60-minute voice call cost?

Under $0.25 on typical settings — STT around $0.09, Kimi K2.6 free on trial (or ~$0.04 paid), ElevenLabs Turbo around $0.10. A cap on the voice tag is a good idea to prevent a runaway session.
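The arithmetic, using the per-stage estimates above at paid Kimi rates:

```python
# per-stage cost estimates for a 60-minute call, from the answer above (USD)
stt = 0.09  # Deepgram streaming STT
llm = 0.04  # Kimi K2.6 at paid rates (free on trial)
tts = 0.10  # ElevenLabs Turbo
total = round(stt + llm + tts, 2)
print(f"${total:.2f}")  # → $0.23, under the $0.25 headline
```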

Can I record the conversation for audit?

Yes — Enterprise tier stores both the raw audio and the transcript with a signed URL. Free and Pro tiers get transcripts via `x-aig-store-transcript: true` and audio retention as a paid add-on.

READY TO BUILD?
Get an AIgateway key in 30 seconds. Free Kimi K2.6 through Apr 30, 2026; everything else is pass-through.
Get your key → · API reference · Kimi K2.6 details
