examples/multi-modal
Strong tier · 9 min build

Real-time voice agent in 80 lines (STT → LLM → TTS)

Three modalities, one key: Deepgram streaming transcription feeds Kimi K2.6, which streams into ElevenLabs — voice-to-voice with p50 latency under 650ms. Runs on the free Kimi tier through Apr 30.

9 min read · published 2026-04-25 · category: Multi-modal
Speech-to-text to LLM to text-to-speech pipeline with total p50 latency under 650ms

A voice agent is the clearest showcase of a unified gateway. The pipeline has three completely different modalities — streaming speech, a text LLM, streaming speech synthesis — and most teams lose a week wiring them together with three SDKs, three invoices, and three rate-limit headaches.

On AIgateway it's eighty lines: one key, one SDK, three model slugs. The sub-650ms end-to-end budget comes from streaming every stage: STT emits partial transcripts, the LLM streams tokens, and the TTS receives tokens batched into sentences and speaks them in real time.

Prerequisites: AIgateway key · Python 3.11+ · websockets (audio stream) · pyaudio (mic + speaker)
Note
The Kimi K2.6 trial covers the LLM step on the free tier through Apr 30, 2026. Deepgram and ElevenLabs bill pass-through at their published rates — under a quarter per 60-minute voice call for a typical workload.

Build it in four steps

  1. STEP 01

    Open the mic, stream to Deepgram

    Send raw 16-bit PCM from the mic to AIgateway's Deepgram-compatible streaming endpoint. Partial transcripts come back every ~100ms; treat each finalized utterance as a turn.

    import asyncio, json, websockets, pyaudio
    
    STT_URL = "wss://api.aigateway.sh/v1/audio/transcriptions/stream?model=deepgram/nova-3"
    
    async def listen(on_utterance):
        mic = pyaudio.PyAudio().open(rate=16000, channels=1, format=pyaudio.paInt16,
                                     input=True, frames_per_buffer=1024)
        async with websockets.connect(STT_URL, extra_headers={"Authorization": "Bearer sk-aig-..."}) as ws:
            async def mic_loop():
                while True:
                    # mic.read blocks; hand it to a thread so the event loop keeps draining
                    await ws.send(await asyncio.to_thread(mic.read, 1024, exception_on_overflow=False))
            sender = asyncio.create_task(mic_loop())
            try:
                async for msg in ws:
                    evt = json.loads(msg)
                    if evt.get("is_final"):
                        await on_utterance(evt["transcript"])
            finally:
                sender.cancel()
  2. STEP 02

    Stream the transcript to Kimi K2.6

    Each utterance becomes a user turn. Stream the LLM response token-by-token so the TTS can start speaking before the LLM is done thinking.

    from openai import AsyncOpenAI
    
    client = AsyncOpenAI(base_url="https://api.aigateway.sh/v1", api_key="sk-aig-...")
    SYSTEM = "You are a helpful voice assistant. Keep replies under 3 sentences, conversational, no markdown."
    
    async def think(history: list[dict], on_token):
        stream = await client.chat.completions.create(
            model="moonshot/kimi-k2.6",
            messages=[{"role": "system", "content": SYSTEM}, *history],
            stream=True,
            extra_headers={"x-aig-tag": "voice.llm"},
        )
        async for chunk in stream:
            tok = chunk.choices[0].delta.content
            if tok:
                await on_token(tok)
  3. STEP 03

    Speak tokens with ElevenLabs Turbo

    Open a TTS websocket, push sentences as they form (flush on ". " / "? " / "! "). ElevenLabs streams audio back in real time; write it to the speaker.

    TTS_URL = "wss://api.aigateway.sh/v1/audio/speech/stream?model=elevenlabs/eleven-turbo-v3&voice=rachel"
    
    async def speak():
        buf = ""
        # connect without `async with` — the socket must outlive this function,
        # which returns on_token for the LLM stage to call later
        ws = await websockets.connect(TTS_URL, extra_headers={"Authorization": "Bearer sk-aig-..."})
        speaker = pyaudio.PyAudio().open(rate=24000, channels=1, format=pyaudio.paInt16, output=True)
    
        async def on_token(tok):
            nonlocal buf
            buf += tok
            # flush on sentence boundaries so audio starts before the reply is complete
            if buf and buf[-1] in ".?!":
                await ws.send(json.dumps({"text": buf}))
                buf = ""
    
        async def drain():
            async for audio in ws:
                speaker.write(audio)
    
        on_token._drain = asyncio.create_task(drain())  # keep a reference so the task isn't GC'd
        return on_token
  4. STEP 04

    Wire it together

    Hook the three stages up and you've got a full voice loop. Keep history in memory, add a hang-up keyword, ship.

    async def main():
        history = []
        on_token = await speak()
    
        async def on_utterance(text: str):
            print("user:", text)
            if text.strip().lower() in {"bye", "goodbye", "hang up"}:
                raise SystemExit
            history.append({"role": "user", "content": text})
            reply = []
            async def tee(tok):  # speak the token and keep it for history
                reply.append(tok)
                await on_token(tok)
            await think(history, tee)
            history.append({"role": "assistant", "content": "".join(reply)})
    
        await listen(on_utterance)
    
    asyncio.run(main())

Latency budget

The headline p50 of 640ms end-to-end breaks down as: 140ms STT final, 320ms first-token-from-Kimi, 180ms TTS first-audio. The three stages overlap because every link is streamed — first audio leaves the speaker before the LLM finishes thinking.
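The budget is additive over first outputs only — a quick sanity check on the numbers above:

```python
# p50 time-to-first-output per stage, from the budget above (ms)
stt_final = 140        # final transcript after end of speech
llm_first_token = 320  # Kimi K2.6 first token
tts_first_audio = 180  # ElevenLabs first audio chunk

# streamed pipeline: first audio waits on each stage's *first* output,
# not its full response, so the stages overlap after that point
first_audio = stt_final + llm_first_token + tts_first_audio
print(first_audio)  # → 640
```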

If you need tighter latency, the biggest wins are a shorter system prompt (every system-prompt token adds a flat cost to first-token time) and voice-tuned TTS settings. Kimi K2.6 is already comfortably fastest-in-class for voice workloads; swapping in GPT-5.4 typically adds 80-150ms to first-token time.

Add a tool belt

A voice agent with tools is where product value lives — "book me the flight," "check my calendar," "order the usual." Kimi K2.6 handles tool calling natively, so you can add handlers to the `think` step and route voice output around them.

Same session, same history, same key. The only change is a `tools=[...]` argument on the chat-completions call.

tools = [{
    "type": "function",
    "function": {
        "name": "lookup_order",
        "description": "Find a customer's last order.",
        "parameters": {"type": "object", "properties": {"email": {"type": "string"}},
                       "required": ["email"]},
    },
}]

# Pass tools into think() and handle tool_calls in the stream.
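A sketch of the stream-side handling, assuming OpenAI-style streamed `tool_calls` deltas (the name arrives once, the JSON arguments arrive as string fragments); `accumulate_tool_calls` is a hypothetical helper for illustration, not part of any SDK:

```python
import json

def accumulate_tool_calls(deltas):
    """Merge streamed tool_call deltas into complete calls.

    Each delta looks like {"index": 0, "function": {"name": ..., "arguments": ...}};
    the name arrives once, the arguments arrive as JSON string fragments."""
    calls = {}
    for d in deltas:
        slot = calls.setdefault(d["index"], {"name": "", "arguments": ""})
        fn = d.get("function", {})
        if fn.get("name"):
            slot["name"] = fn["name"]
        if fn.get("arguments"):
            slot["arguments"] += fn["arguments"]
    return [{"name": c["name"], "arguments": json.loads(c["arguments"])}
            for c in calls.values()]

# fragments as they might arrive over the stream
deltas = [
    {"index": 0, "function": {"name": "lookup_order", "arguments": ""}},
    {"index": 0, "function": {"arguments": '{"email": '}},
    {"index": 0, "function": {"arguments": '"a@b.co"}'}},
]
print(accumulate_tool_calls(deltas))
# → [{'name': 'lookup_order', 'arguments': {'email': 'a@b.co'}}]
```

Once a call is complete, run your handler, append the tool result to `history` as a `tool` message, and make one more streamed completion so the agent can speak the answer.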

FAQ

Do I need Deepgram and ElevenLabs accounts?

No. Your AIgateway key reaches both providers — the gateway bills their usage pass-through against your balance with a 5% platform fee. No second signup, no second invoice, no second rate limit to manage.

Can I swap voices?

Change the `voice=` query param on the TTS URL. ElevenLabs' full voice library is reachable, and Cartesia, PlayHT, and Deepgram Aura are available with a provider prefix change.

What about on-device voice?

The mic and speaker loops are local; only the three inference calls are network. For fully offline inference, swap to the Workers-AI Whisper (STT) and MeloTTS (TTS) slugs plus a local Kimi K2.6 Base or Llama 4.1 deployment — the Python code doesn't change.

How do I handle interruption?

When STT fires a finalized utterance while TTS is still speaking, cancel the outstanding TTS write and start the new think() call. The example repo has the four extra lines in the github gist linked from the CTA banner.

Is this low-latency enough for phone calls?

Yes for most use cases. Production phone systems usually sit at p50 500-800ms; our 640ms is in-band. For telco-level requirements (p50 under 400ms), pre-warm the TTS connection and pin the region to your carrier's PoP.

What does a real 60-minute voice call cost?

Under $0.25 on typical settings — STT around $0.09, Kimi K2.6 free on trial (or ~$0.04 paid), ElevenLabs Turbo around $0.10. A cap on the voice tag is a good idea to prevent a runaway session.
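The arithmetic, using the per-stage estimates above at paid Kimi rates:

```python
# per-stage cost estimates for a 60-minute call, from the answer above (USD)
stt = 0.09  # Deepgram streaming STT
llm = 0.04  # Kimi K2.6 at paid rates (free on trial)
tts = 0.10  # ElevenLabs Turbo
total = round(stt + llm + tts, 2)
print(f"${total:.2f}")  # → $0.23, under the $0.25 headline
```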

Can I record the conversation for audit?

Yes — Enterprise tier stores both the raw audio and the transcript with a signed URL. Free and Pro tiers get transcripts via `x-aig-store-transcript: true` and audio retention as a paid add-on.

READY TO BUILD?
Get an AIgateway key in 30 seconds. Free Kimi K2.6 through Apr 30, 2026; everything else is pass-through.
Get your key → · API reference · Kimi K2.6 details
