Inference

Audio · TTS / STT

Speech-to-text and text-to-speech through the same OpenAI-shaped endpoints you already know: swap the model field to hit Whisper, Deepgram Nova 3, ElevenLabs Flash / V3, Cartesia Sonic, or OpenAI's TTS-HD. Supported inputs are OGG/Opus/WAV/MP3/M4A; outputs are MP3/WAV/Opus.

Transcribe audio (STT)

curl https://api.aigateway.sh/v1/audio/transcriptions \
  -H "Authorization: Bearer $AIG_KEY" \
  -F "file=@meeting.m4a" \
  -F "model=deepgram/nova-3" \
  -F "response_format=verbose_json" \
  -F "timestamp_granularities[]=word"
// → { "text": "...", "words": [{ "word": "hi", "start": 0.4, "end": 0.52 }, ...] }

Synthesize speech (TTS)

curl https://api.aigateway.sh/v1/audio/speech \
  -H "Authorization: Bearer $AIG_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "elevenlabs/eleven-flash-2.5",
    "voice": "rachel",
    "input": "Hello from AIgateway.",
    "response_format": "mp3"
  }' \
  --output voice.mp3

Pick a model

| Model | Best for |
| --- | --- |
| openai/whisper-1 | Cheap, 90+ languages, good default STT. |
| deepgram/nova-3 | Fastest STT, word-level timestamps, diarization. |
| elevenlabs/eleven-flash-2.5 | Fastest TTS, ~100ms latency, phone-grade voices. |
| elevenlabs/eleven-v3 | Best TTS quality, emotional range, multilingual. |
| cartesia/sonic-2 | Realtime streaming TTS for voice agents. |
| openai/tts-hd | Balanced quality/cost, fewest pronunciation errors. |

Streaming TTS

Pass stream: true on /v1/audio/speech to receive audio chunks as they are synthesized. The content type is audio/mpeg (MP3) or audio/ogg (Opus), so you can pipe the stream directly into a browser <audio> element via MediaSource. Realtime voice agents use this to achieve end-to-end latency under 400ms.
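A minimal streaming sketch from the command line, assuming stream: true is accepted alongside the usual TTS body (the "default" voice name and the pipe into a local player like mpg123 are illustrative, not part of the API):

```shell
# -N disables curl's output buffering so chunks flow as they arrive
curl -N https://api.aigateway.sh/v1/audio/speech \
  -H "Authorization: Bearer $AIG_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "cartesia/sonic-2",
    "voice": "default",
    "input": "Streaming hello from AIgateway.",
    "response_format": "mp3",
    "stream": true
  }' | mpg123 -
```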

Diarization + word timestamps

Deepgram Nova 3 returns word-level timestamps and speaker labels when you request response_format=verbose_json with diarize=true. Whisper returns segment-level timestamps only.
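A sketch of a diarized transcription request; passing diarize as an extra form field is an assumption about how the gateway forwards provider options, and the speaker field in the sample response follows Deepgram's convention of integer speaker indices:

```shell
curl https://api.aigateway.sh/v1/audio/transcriptions \
  -H "Authorization: Bearer $AIG_KEY" \
  -F "file=@meeting.m4a" \
  -F "model=deepgram/nova-3" \
  -F "response_format=verbose_json" \
  -F "diarize=true"
# → words carry speaker labels, e.g.
# { "words": [{ "word": "hi", "start": 0.4, "end": 0.52, "speaker": 0 }, ...] }
```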