Three modalities, one key: Deepgram streaming transcription feeds Kimi K2.6, which streams into ElevenLabs — voice-to-voice with p50 latency under 650ms. Runs on the free Kimi tier through Apr 30.
A voice agent is the clearest showcase of a unified gateway. The pipeline has three completely different modalities — streaming speech, a text LLM, streaming speech synthesis — and most teams lose a week wiring them together with three SDKs, three invoices, and three rate-limit headaches.
On AIgateway it's eighty lines. One key, one SDK, three model slugs. The sub-650ms end-to-end latency budget comes from streaming every stage: STT emits partial transcripts, the LLM streams tokens, and the TTS receives sentence-sized chunks and speaks them in real time.
Send raw 16-bit PCM from the mic to AIgateway's Deepgram-compatible streaming endpoint. Partial transcripts come back every ~100ms; treat each finalized utterance as a turn.
```python
import asyncio
import json

import pyaudio
import websockets

STT_URL = "wss://api.aigateway.sh/v1/audio/transcriptions/stream?model=deepgram/nova-3"

async def listen(on_utterance):
    mic = pyaudio.PyAudio().open(rate=16000, channels=1, format=pyaudio.paInt16,
                                 input=True, frames_per_buffer=1024)
    async with websockets.connect(STT_URL,
                                  extra_headers={"Authorization": "Bearer sk-aig-..."}) as ws:
        # Pump raw PCM frames to the gateway in the background.
        async def mic_loop():
            while True:
                await ws.send(mic.read(1024, exception_on_overflow=False))
        asyncio.create_task(mic_loop())
        # Transcript events stream back; act only on finalized utterances.
        async for msg in ws:
            evt = json.loads(msg)
            if evt.get("is_final"):
                await on_utterance(evt["transcript"])
```

Each utterance becomes a user turn. Stream the LLM response token by token so the TTS can start speaking before the LLM is done thinking.
```python
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="https://api.aigateway.sh/v1", api_key="sk-aig-...")

SYSTEM = "You are a helpful voice assistant. Keep replies under 3 sentences, conversational, no markdown."

async def think(history: list[dict], on_token):
    stream = await client.chat.completions.create(
        model="moonshot/kimi-k2.6",
        messages=[{"role": "system", "content": SYSTEM}, *history],
        stream=True,
        extra_headers={"x-aig-tag": "voice.llm"},
    )
    # Forward each token as it arrives so the TTS can start mid-reply.
    async for chunk in stream:
        tok = chunk.choices[0].delta.content
        if tok:
            await on_token(tok)
```

Open a TTS websocket and push sentences as they form (flush on `.`, `?`, or `!`). ElevenLabs streams audio back in real time; write it to the speaker.
```python
import json

TTS_URL = "wss://api.aigateway.sh/v1/audio/speech/stream?model=elevenlabs/eleven-turbo-v3&voice=rachel"

async def speak():
    buf = ""
    # Connect without a context manager: the socket must outlive speak(),
    # since on_token keeps writing to it after we return.
    ws = await websockets.connect(TTS_URL,
                                  extra_headers={"Authorization": "Bearer sk-aig-..."})
    speaker = pyaudio.PyAudio().open(rate=24000, channels=1, format=pyaudio.paInt16, output=True)

    # Buffer tokens into sentences; flush each finished sentence to the TTS socket.
    async def on_token(tok):
        nonlocal buf
        buf += tok
        if buf and buf[-1] in ".?!":
            await ws.send(json.dumps({"text": buf}))
            buf = ""

    # Play audio chunks as they stream back.
    async def drain():
        async for audio in ws:
            speaker.write(audio)
    asyncio.create_task(drain())
    return on_token
```

Hook the three stages up and you've got a full voice loop. Keep history in memory, add a hang-up keyword, ship.
```python
async def main():
    history = []
    on_token = await speak()

    async def on_utterance(text: str):
        print("user:", text)
        if text.strip().lower() in {"bye", "goodbye", "hang up"}:
            raise SystemExit
        history.append({"role": "user", "content": text})
        await think(history, on_token)

    await listen(on_utterance)

asyncio.run(main())
```

The headline p50 of 640ms end-to-end breaks down as 140ms to the final STT transcript, 320ms to Kimi's first token, and 180ms to the first TTS audio. Because every link streams, the three stages overlap: the first audio leaves the speaker before the LLM finishes thinking.
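To check your own numbers against that breakdown, timestamping the first occurrence of each milestone is enough. A minimal sketch (the `StageClock` helper and stage names are ours, not part of the gateway SDK):

```python
import time

class StageClock:
    """Record the first occurrence of each pipeline milestone, in ms from start."""
    def __init__(self):
        self.t0 = time.monotonic()
        self.marks: dict[str, float] = {}

    def mark(self, stage: str) -> None:
        # Keep only the first mark per stage (e.g. first LLM token, first TTS audio).
        self.marks.setdefault(stage, (time.monotonic() - self.t0) * 1000)
```

Call `clock.mark("stt_final")`, `clock.mark("llm_first_token")`, and `clock.mark("tts_first_audio")` at the matching points in `listen()`, `think()`, and `speak()`, then print `clock.marks` at the end of a turn.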
If you need tighter latency, the biggest wins are a shorter system prompt (every system-prompt token is a flat cost on first-token time) and voice-tuned TTS settings. Kimi K2.6 is already comfortably fastest-in-class for voice workloads; swapping in GPT-5.4 typically adds 80-150ms to first-token time.
A voice agent with tools is where product value lives — "book me the flight," "check my calendar," "order the usual." Kimi K2.6 handles tool calling natively, so you can add handlers to the `think` step and route voice output around them.
Same session, same history, same key. The only change is a `tools=[...]` argument on the chat-completions call.
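The schema goes in the request; the other half is consuming the streamed `tool_calls` on the way back, since the function name lands in one chunk and the JSON arguments arrive as fragments across later chunks. A minimal accumulator against OpenAI-style delta shapes (`accumulate_tool_calls` and the plain-dict chunk format are our own sketch):

```python
import json

def accumulate_tool_calls(deltas):
    """Merge streamed tool_call deltas (index, id, name, argument fragments) into full calls."""
    calls: dict[int, dict] = {}
    for d in deltas:
        slot = calls.setdefault(d["index"], {"id": None, "name": None, "arguments": ""})
        if d.get("id"):
            slot["id"] = d["id"]
        fn = d.get("function", {})
        if fn.get("name"):
            slot["name"] = fn["name"]
        slot["arguments"] += fn.get("arguments", "")  # JSON arrives in pieces
    return [
        {"id": c["id"], "name": c["name"], "arguments": json.loads(c["arguments"])}
        for c in calls.values()
    ]
```

Inside `think()`, collect the `chunk.choices[0].delta.tool_calls` entries into a list, run the accumulator when the stream ends, execute each handler, append the results as `role: "tool"` messages, and call the model once more for the spoken reply.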
```python
tools = [{
    "type": "function",
    "function": {
        "name": "lookup_order",
        "description": "Find a customer's last order.",
        "parameters": {
            "type": "object",
            "properties": {"email": {"type": "string"}},
            "required": ["email"],
        },
    },
}]
# Pass tools into think() and handle tool_calls in the stream.
```

You don't need separate Deepgram or ElevenLabs accounts: your AIgateway key reaches both providers, and the gateway bills their usage pass-through against your balance with a 5% platform fee. No second signup, no second invoice, no second rate limit to manage.
To use a different voice, change the `voice=` query param on the TTS URL. ElevenLabs' full voice library is reachable, and Cartesia, PlayHT, and Deepgram Aura voices are available with a provider-prefix change.
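A tiny helper makes the swap a one-argument change. The non-ElevenLabs slug in the comment is illustrative; check the gateway's model list for exact names:

```python
from urllib.parse import urlencode

TTS_BASE = "wss://api.aigateway.sh/v1/audio/speech/stream"

def tts_url(model: str = "elevenlabs/eleven-turbo-v3", voice: str = "rachel") -> str:
    """Build the streaming-TTS URL for any provider/voice pair."""
    # safe='/' keeps the provider/model slug readable in the query string
    return f"{TTS_BASE}?{urlencode({'model': model, 'voice': voice}, safe='/')}"

# e.g. tts_url("cartesia/sonic", "luna") swaps providers with no other code change
```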
The mic and speaker loops stay local; only the three inference calls touch the network. For fully offline inference, swap to the Workers-AI Whisper (STT) and MeloTTS (TTS) slugs plus a local Kimi K2.6 Base or Llama 4.1 deployment; the Python code doesn't change.
To handle barge-in: when STT fires a finalized utterance while TTS is still speaking, cancel the outstanding TTS write and start the new `think()` call. The example repo has the four extra lines in the GitHub gist linked from the CTA banner.
For most use cases, 640ms is fast enough for phone calls: production phone systems usually sit at a p50 of 500-800ms, so we're in-band. For telco-level requirements (p50 under 400ms), pre-warm the TTS connection and pin the region to your carrier's PoP.
A typical session runs under $0.25: STT around $0.09, Kimi K2.6 free on trial (or ~$0.04 paid), and ElevenLabs Turbo around $0.10. A spend cap on the voice tag is a good idea to prevent a runaway session.
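If you'd rather not rely on a gateway-side cap alone, a belt-and-braces client-side guard is a few lines. This class is our own sketch, not a gateway API:

```python
class CostGuard:
    """Client-side spend cap: call charge() with each stage's estimated cost."""
    def __init__(self, cap_usd: float):
        self.cap_usd = cap_usd
        self.spent_usd = 0.0

    def charge(self, usd: float) -> None:
        self.spent_usd += usd
        if self.spent_usd > self.cap_usd:
            # A tripped guard should hang up instead of running all night.
            raise RuntimeError(f"voice session exceeded ${self.cap_usd:.2f} cap")
```

Call `guard.charge(...)` after each STT utterance, LLM turn, and TTS flush with your per-stage estimate, and treat the exception like the hang-up keyword.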
Call recording is available on the Enterprise tier, which stores both the raw audio and the transcript behind a signed URL. Free and Pro tiers get transcripts via `x-aig-store-transcript: true`; audio retention is a paid add-on.