Inference
Audio · TTS / STT
Speech-to-text and text-to-speech through the same OpenAI-shaped endpoints you already know — swap the `model` parameter to hit Whisper, Deepgram Nova 3, ElevenLabs Flash / V3, Cartesia Sonic, or OpenAI's TTS-HD. Accepted input formats are OGG, Opus, WAV, MP3, and M4A; output formats are MP3, WAV, and Opus.
Transcribe audio (STT)
```bash
curl https://api.aigateway.sh/v1/audio/transcriptions \
  -H "Authorization: Bearer $AIG_KEY" \
  -F "file=@meeting.m4a" \
  -F "model=deepgram/nova-3" \
  -F "response_format=verbose_json" \
  -F "timestamp_granularities[]=word"
```

Response:

```json
{ "text": "...", "words": [{ "word": "hi", "start": 0.4, "end": 0.52 }, ...] }
```
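The word-level timestamps in `verbose_json` are easy to post-process locally — for example, grouping words into timed caption lines wherever a pause occurs. A minimal Python sketch; the sample `words` payload is illustrative, not a real API response:

```python
def words_to_captions(words, max_gap=0.5):
    """Group word-level timestamps into caption lines.

    A new line starts whenever the silence between consecutive
    words exceeds max_gap seconds.
    """
    lines = []
    current = []
    for w in words:
        if current and w["start"] - current[-1]["end"] > max_gap:
            lines.append(current)
            current = []
        current.append(w)
    if current:
        lines.append(current)
    return [
        {
            "start": line[0]["start"],
            "end": line[-1]["end"],
            "text": " ".join(w["word"] for w in line),
        }
        for line in lines
    ]

# Illustrative words array in the shape returned by verbose_json
words = [
    {"word": "hi", "start": 0.4, "end": 0.52},
    {"word": "there", "start": 0.55, "end": 0.8},
    {"word": "welcome", "start": 2.1, "end": 2.5},
]
print(words_to_captions(words))
# → two caption lines: "hi there" (0.4–0.8) and "welcome" (2.1–2.5)
```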
Synthesize speech (TTS)
```bash
curl https://api.aigateway.sh/v1/audio/speech \
  -H "Authorization: Bearer $AIG_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "elevenlabs/eleven-flash-2.5",
    "voice": "rachel",
    "input": "Hello from AIgateway.",
    "response_format": "mp3"
  }' \
  --output voice.mp3
```
Pick a model
| Model | Best for |
|---|---|
| `openai/whisper-1` | Cheap, 90+ languages, good default STT. |
| `deepgram/nova-3` | Fastest STT, word-level timestamps, diarization. |
| `elevenlabs/eleven-flash-2.5` | Fastest TTS, ~100ms latency, phone-grade voices. |
| `elevenlabs/eleven-v3` | Best TTS quality, emotional range, multilingual. |
| `cartesia/sonic-2` | Realtime streaming TTS for voice agents. |
| `openai/tts-hd` | Balanced quality/cost, fewest pronunciation errors. |
Streaming TTS
Pass `stream: true` on `/v1/audio/speech` to receive audio chunks as they synthesize. The content type is `audio/mpeg` (MP3) or `audio/ogg` (Opus) — pipe chunks directly into a browser `<audio>` element via MediaSource. Realtime voice agents use this to achieve end-to-end latency under 400 ms.
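Client-side, the pattern is to hand each chunk to the audio sink the moment it arrives instead of buffering the full file. A Python sketch of that consumption loop — the in-memory `fake_stream` below stands in for a real streamed HTTP body (with the `requests` library you would iterate `resp.iter_content(chunk_size=4096)` on a `stream=True` response instead):

```python
import io

def relay_audio(chunks, sink):
    """Feed streamed audio chunks to a sink as they arrive.

    chunks: any iterable of bytes (e.g. resp.iter_content(4096)
    from a stream: true request). sink: a writable binary stream,
    such as a file or a pipe into an audio player.
    Returns the total number of bytes relayed.
    """
    total = 0
    for chunk in chunks:
        if chunk:  # skip keep-alive empty chunks
            sink.write(chunk)
            total += len(chunk)
    return total

# Stand-in for a streamed response body (illustrative, not real audio)
fake_stream = iter([b"ID3", b"\x00" * 4096, b"\x00" * 1024])
buf = io.BytesIO()
n = relay_audio(fake_stream, buf)
print(n)  # → 5123 (3 + 4096 + 1024)
```

The same shape works for playback: replace `buf` with `sys.stdout.buffer` piped into a player, or flush each chunk into a MediaSource buffer on the browser side.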
Diarization + word timestamps
Deepgram Nova 3 returns word-level timestamps and speaker labels when you request `response_format=verbose_json` with `diarize=true`. Whisper returns segment-level timestamps only.
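Speaker-labelled words fold naturally into conversational turns. A sketch assuming each word entry carries a `speaker` field alongside its timestamps (the exact field name and shape may differ — check the response you actually receive):

```python
def words_to_turns(words):
    """Collapse speaker-labelled words into consecutive speaker turns."""
    turns = []
    for w in words:
        if turns and turns[-1]["speaker"] == w["speaker"]:
            # Same speaker as the previous word: extend the current turn
            turns[-1]["text"] += " " + w["word"]
            turns[-1]["end"] = w["end"]
        else:
            # Speaker changed: start a new turn
            turns.append({
                "speaker": w["speaker"],
                "start": w["start"],
                "end": w["end"],
                "text": w["word"],
            })
    return turns

# Illustrative diarized words (hypothetical shape, not a real response)
words = [
    {"word": "hello", "start": 0.1, "end": 0.4, "speaker": 0},
    {"word": "there", "start": 0.5, "end": 0.8, "speaker": 0},
    {"word": "hi",    "start": 1.0, "end": 1.2, "speaker": 1},
]
print(words_to_turns(words))
# → two turns: speaker 0 says "hello there", speaker 1 says "hi"
```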