Inference

Audio · TTS / STT

Speech-to-text and text-to-speech through the same OpenAI-shaped endpoints you already know: swap the model field to hit Whisper, Deepgram Nova 3, ElevenLabs Flash / V3, Cartesia Sonic, or OpenAI's TTS-HD. Supported inputs are OGG/Opus/WAV/MP3/M4A; outputs are MP3/WAV/Opus.

Transcribe audio (STT)

curl https://api.aigateway.sh/v1/audio/transcriptions \
  -H "Authorization: Bearer $AIG_KEY" \
  -F "file=@meeting.m4a" \
  -F "model=deepgram/nova-3" \
  -F "response_format=verbose_json" \
  -F "timestamp_granularities[]=word"
// → { "text": "...", "words": [{ "word": "hi", "start": 0.4, "end": 0.52 }, ...] }

Synthesize speech (TTS)

curl https://api.aigateway.sh/v1/audio/speech \
  -H "Authorization: Bearer $AIG_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "elevenlabs/eleven-flash-2.5",
    "voice": "rachel",
    "input": "Hello from AIgateway.",
    "response_format": "mp3"
  }' \
  --output voice.mp3

Pick a model

| Model | Best for |
| --- | --- |
| openai/whisper-1 | Cheap, 90+ languages, good default STT. |
| deepgram/nova-3 | Fastest STT, word-level timestamps, diarization. |
| elevenlabs/eleven-flash-2.5 | Fastest TTS, ~100ms latency, phone-grade voices. |
| elevenlabs/eleven-v3 | Best TTS quality, emotional range, multilingual. |
| cartesia/sonic-2 | Realtime streaming TTS for voice agents. |
| openai/tts-hd | Balanced quality/cost, fewest pronunciation errors. |

Streaming TTS

Pass stream: true on /v1/audio/speech to receive audio chunks as they are synthesized. The content type is audio/mpeg (MP3) or audio/ogg (Opus), so you can pipe the stream directly into a browser <audio> element via MediaSource. Realtime voice agents use this to achieve end-to-end latency under 400ms.
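A minimal streaming sketch from the command line, assuming stream: true is accepted alongside the usual TTS body (the "default" voice name and the pipe into a local player like mpg123 are illustrative, not part of the API):

```shell
# -N disables curl's output buffering so chunks flow as they arrive
curl -N https://api.aigateway.sh/v1/audio/speech \
  -H "Authorization: Bearer $AIG_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "cartesia/sonic-2",
    "voice": "default",
    "input": "Streaming hello from AIgateway.",
    "response_format": "mp3",
    "stream": true
  }' | mpg123 -
```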

Diarization + word timestamps

Deepgram Nova 3 returns word-level timestamps and speaker labels when you request response_format=verbose_json with diarize=true. Whisper returns segment-level timestamps only.
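A sketch of a diarized transcription request; passing diarize as an extra form field is an assumption about how the gateway forwards provider options, and the speaker field in the sample response follows Deepgram's convention of integer speaker indices:

```shell
curl https://api.aigateway.sh/v1/audio/transcriptions \
  -H "Authorization: Bearer $AIG_KEY" \
  -F "file=@meeting.m4a" \
  -F "model=deepgram/nova-3" \
  -F "response_format=verbose_json" \
  -F "diarize=true"
# → words carry speaker labels, e.g.
# { "words": [{ "word": "hi", "start": 0.4, "end": 0.52, "speaker": 0 }, ...] }
```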