Audio · TTS / STT
Speech-to-text and text-to-speech run through the same OpenAI-shaped endpoints you already know. Change the model parameter to target Whisper, Deepgram Nova 3, ElevenLabs Flash / V3, Cartesia Sonic, or OpenAI's TTS-HD. Inputs are OGG/Opus/WAV/MP3/M4A; outputs are MP3/WAV/Opus.
Transcribe audio (STT)
    curl https://api.aigateway.sh/v1/audio/transcriptions \
      -H "Authorization: Bearer $AIG_KEY" \
      -F "file=@meeting.m4a" \
      -F "model=deepgram/nova-3" \
      -F "response_format=verbose_json" \
      -F "timestamp_granularities[]=word"
    # → { "text": "...", "words": [{ "word": "hi", "start": 0.4, "end": 0.52 }, ...] }
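The word-level timestamps in the verbose_json response are enough to build captions. The sketch below turns the `words` array shown above into SRT cues. The helper names (`srtTime`, `wordsToSrt`) and the `maxWords` grouping rule are illustrative assumptions, not part of the API; it relies only on the `word`, `start`, and `end` fields from the sample response.

```javascript
// Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm
function srtTime(seconds) {
  const ms = Math.round(seconds * 1000);
  const pad = (n, width = 2) => String(n).padStart(width, "0");
  const h = Math.floor(ms / 3600000);
  const m = Math.floor((ms % 3600000) / 60000);
  const s = Math.floor((ms % 60000) / 1000);
  return `${pad(h)}:${pad(m)}:${pad(s)},${pad(ms % 1000, 3)}`;
}

// Group words into caption cues of at most `maxWords` words each.
// `maxWords` is a hypothetical knob chosen for this example.
function wordsToSrt(words, maxWords = 7) {
  const cues = [];
  for (let i = 0; i < words.length; i += maxWords) {
    const group = words.slice(i, i + maxWords);
    const start = srtTime(group[0].start);
    const end = srtTime(group[group.length - 1].end);
    const text = group.map((w) => w.word).join(" ");
    cues.push(`${cues.length + 1}\n${start} --> ${end}\n${text}\n`);
  }
  // SRT separates cues with a blank line.
  return cues.join("\n");
}
```

Call `wordsToSrt(response.words)` on the parsed JSON body and write the result to a `.srt` file. A fixed word count is the simplest grouping rule; splitting on pauses between words would likely read better.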
Async transcription (batch)
For long recordings, transcribe out of band: add async: true (or a webhook_url) and the call returns a job id immediately. Poll GET /v1/jobs/<id> for the transcript, or have the signed result pushed to your webhook. Available on deepgram/nova-3 and deepgram/flux.
    curl -X POST https://api.aigateway.sh/v1/audio/transcriptions \
      -H "Authorization: Bearer $AIG_KEY" \
      -H "Content-Type: application/json" \
      -d '{"model":"deepgram/nova-3","audio_url":"https://example.com/call.wav","async":true}'
    # → { "id": "<job_id>", "status": "processing" }

    curl "https://api.aigateway.sh/v1/jobs/<job_id>" \
      -H "Authorization: Bearer $AIG_KEY"
    # → { "status": "completed", "result": { "transcript": { "text": "...", "segments": [...] } } }
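A polling loop for the job endpoint might look like the sketch below. It assumes only the two statuses shown above ("processing" and "completed"); treating any other status as a failure is a guess, as are the helper names, the 2-second interval, and the attempt limit.

```javascript
const BASE_URL = "https://api.aigateway.sh/v1";

// Return the transcript text once the job has completed, or null while it is still running.
function transcriptFromJob(job) {
  if (job.status === "completed") return job.result.transcript.text;
  if (job.status === "processing") return null;
  // Assumption: any status other than the two documented ones means the job failed.
  throw new Error(`Unexpected job status: ${job.status}`);
}

// Poll GET /v1/jobs/<id> until the transcript is ready.
// intervalMs, maxAttempts, and fetchImpl are hypothetical options for this example.
async function waitForTranscript(jobId, apiKey, options = {}) {
  const { intervalMs = 2000, maxAttempts = 150, fetchImpl = fetch } = options;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const res = await fetchImpl(`${BASE_URL}/jobs/${encodeURIComponent(jobId)}`, {
      headers: { Authorization: `Bearer ${apiKey}` },
    });
    if (!res.ok) throw new Error(`Job lookup failed: HTTP ${res.status}`);
    const text = transcriptFromJob(await res.json());
    if (text !== null) return text;
    // Wait before asking again.
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error("Timed out waiting for the transcription job");
}
```

For very long recordings a webhook avoids holding a polling loop open; polling is the simpler choice for scripts and one-off jobs.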
Realtime streaming transcription
Stream audio over a WebSocket and get interim + final transcripts live — for voice agents, live captions, and call centers. Connect to /v1/realtime, send raw audio frames, and end with { "type": "CloseStream" }. Browsers pass the key as ?api_key=; servers can use the Authorization header. Available on deepgram/nova-3 and deepgram/flux.
    const ws = new WebSocket(
      "wss://api.aigateway.sh/v1/realtime?model=deepgram/nova-3&encoding=linear16&sample_rate=16000&interim_results=true&api_key=" + AIG_KEY,
    );

    ws.onmessage = (e) => {
      const msg = JSON.parse(e.data);
      if (msg.type === "Results") console.log(msg.channel.alternatives[0].transcript, msg.is_final);
    };

    // once the socket is open, stream raw linear16 PCM frames with ws.send(chunk), then end the stream:
    ws.send(JSON.stringify({ type: "CloseStream" }));
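Two pieces are commonly needed around this socket: converting captured audio to linear16, and merging interim and final results into one running transcript. The sketch below shows one way to do both. The function names and the `{ final, interim }` state shape are assumptions made for this example; the message fields (`type`, `channel.alternatives[0].transcript`, `is_final`) are the ones used in the snippet above.

```javascript
// Convert Float32 samples in [-1, 1] (for example from the Web Audio API) to 16-bit PCM.
// Int16Array uses the platform byte order, which is little-endian on common hardware.
function floatToLinear16(samples) {
  const pcm = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i])); // clamp out-of-range samples
    pcm[i] = Math.round(s < 0 ? s * 0x8000 : s * 0x7fff);
  }
  return pcm;
}

// Fold one "Results" message into a running transcript.
// Final results are appended; interim results replace the previous interim text.
function applyResult(state, msg) {
  if (msg.type !== "Results") return state;
  const text = msg.channel.alternatives[0].transcript;
  if (!msg.is_final) return { final: state.final, interim: text };
  if (!text) return { final: state.final, interim: "" };
  const final = state.final ? `${state.final} ${text}` : text;
  return { final, interim: "" };
}
```

In the `onmessage` handler, keep a `state` variable starting at `{ final: "", interim: "" }` and set `state = applyResult(state, msg)`; display `state.final` followed by `state.interim`. To send audio, pass `floatToLinear16(chunk).buffer` to `ws.send`.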
Synthesize speech (TTS)
    curl https://api.aigateway.sh/v1/audio/speech \
      -H "Authorization: Bearer $AIG_KEY" \
      -H "Content-Type: application/json" \
      -d '{
        "model": "elevenlabs/eleven-flash-2.5",
        "voice": "rachel",
        "input": "Hello from AIgateway.",
        "response_format": "mp3"
      }' \
      --output voice.mp3
Pick a model
| Model | Best for |
|---|---|
| openai/whisper-1 | Cheap, 90+ languages, good default STT. |
| deepgram/nova-3 | Fastest STT, word-level timestamps, diarization. |
| elevenlabs/eleven-flash-2.5 | Fastest TTS, ~100ms latency, phone-grade voices. |
| elevenlabs/eleven-v3 | Best TTS quality, emotional range, multilingual. |
| cartesia/sonic-2 | Realtime streaming TTS for voice agents. |
| openai/tts-hd | Balanced quality/cost, fewest pronunciation errors. |
Streaming TTS
Pass stream: true on /v1/audio/speech to receive audio chunks as they are synthesized. The response content type is audio/mpeg (MP3) or audio/ogg (Opus), so you can pipe it directly into a browser <audio> element via MediaSource. Realtime voice agents use this to keep end-to-end latency under 400ms.
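Reading the streamed response chunk by chunk can be sketched with `fetch` and the response body reader. The helper names, the option names, and the default model and voice (borrowed from the curl example) are assumptions for illustration; the request fields are the ones this page documents.

```javascript
const SPEECH_URL = "https://api.aigateway.sh/v1/audio/speech";

// Build the fetch options for a streaming speech request.
function speechRequestInit(input, apiKey, options = {}) {
  const { model = "elevenlabs/eleven-flash-2.5", voice = "rachel", responseFormat = "mp3" } = options;
  return {
    method: "POST",
    headers: { Authorization: `Bearer ${apiKey}`, "Content-Type": "application/json" },
    body: JSON.stringify({ model, voice, input, response_format: responseFormat, stream: true }),
  };
}

// Yield encoded audio chunks (Uint8Array) as they arrive.
// fetchImpl is a hypothetical injection point, useful for testing.
async function* streamSpeech(input, apiKey, options = {}, fetchImpl = fetch) {
  const res = await fetchImpl(SPEECH_URL, speechRequestInit(input, apiKey, options));
  if (!res.ok) throw new Error(`Speech request failed: HTTP ${res.status}`);
  const reader = res.body.getReader();
  try {
    while (true) {
      const { done, value } = await reader.read();
      if (done) return;
      yield value;
    }
  } finally {
    reader.releaseLock();
  }
}
```

Consume it with `for await (const chunk of streamSpeech("Hello", AIG_KEY)) { ... }`. In a browser each chunk would be appended to a MediaSource `SourceBuffer`; on a server it can be written to a file or forwarded to the caller.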
Diarization + word timestamps
Deepgram Nova 3 returns word-level timestamps and speaker labels when you request response_format=verbose_json with diarize=true. Whisper returns segment-level timestamps only.
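With speaker labels on each word, consecutive words can be merged into speaker turns. The sketch below assumes each word object carries a `speaker` field next to `word`, `start`, and `end`. That field name is an assumption, not confirmed by this page, so check an actual diarized response before relying on it. `groupBySpeaker` is a hypothetical helper.

```javascript
// Merge consecutive words from the same speaker into turns.
// Assumes each word looks like { word, start, end, speaker } (the `speaker` key is a guess).
function groupBySpeaker(words) {
  const turns = [];
  for (const w of words) {
    const last = turns[turns.length - 1];
    if (last && last.speaker === w.speaker) {
      // Same speaker keeps talking: extend the current turn.
      last.text += " " + w.word;
      last.end = w.end;
    } else {
      // Speaker changed (or first word): start a new turn.
      turns.push({ speaker: w.speaker, start: w.start, end: w.end, text: w.word });
    }
  }
  return turns;
}
```

Each turn keeps the start time of its first word and the end time of its last, which is enough to render a transcript such as "Speaker 0 (0.4s): hi there".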