Smart routing + fallbacks
Most popular open-weight models are served by 5+ providers (Fireworks, Together, Groq, DeepInfra, Cerebras, Novita, SambaNova). Smart routing picks the best live route per request based on your chosen policy — latency, cost, or throughput — and fails over when a provider is unhealthy, rate-limited, or slow.
Default behaviour
Every model ID resolves to a preferred provider. If that provider's 99th-percentile time-to-first-token (TTFT) stays within the 2-second health budget, we use it; otherwise we try the next-best route in the pool. You don't need to do anything: routing is on by default and invisible to your code.
Pick a routing policy
```
POST /v1/chat/completions
{
  "model": "meta-llama/llama-4-maverick-instruct",
  "messages": [...],
  "routing": {
    "policy": "lowest_latency",
    "max_cost_per_1m_input": 0.30,
    "providers": ["fireworks", "groq", "together"],
    "allow_fallback": true
  }
}
```
| Policy | Picks the route with… |
|---|---|
| `lowest_latency` | Lowest p50 TTFT over a trailing 30-second window (default for streaming). |
| `lowest_cost` | Cheapest provider that meets the health budget. |
| `highest_throughput` | Best tokens/sec decode speed; suited to long outputs. |
| `pinned` | The exact provider in `providers[0]`. No fallback. |
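The selection logic behind these policies can be sketched as a simple comparison over live routes. This is an illustrative model only: the field names (`p50_ttft_ms`, `cost_per_1m_input`, `tokens_per_sec`) and the `Route` type are hypothetical stand-ins, not the gateway's internals.

```python
from dataclasses import dataclass

@dataclass
class Route:
    provider: str
    p50_ttft_ms: float       # trailing-30s median time-to-first-token
    cost_per_1m_input: float  # USD per 1M input tokens
    tokens_per_sec: float     # decode throughput
    healthy: bool = True

def pick_route(routes, policy, pinned=None):
    """Pick one route per request according to the routing policy."""
    live = [r for r in routes if r.healthy]
    if policy == "pinned":
        # Pinned routes bypass health filtering: no fallback at all.
        return next(r for r in routes if r.provider == pinned)
    if policy == "lowest_latency":
        return min(live, key=lambda r: r.p50_ttft_ms)
    if policy == "lowest_cost":
        return min(live, key=lambda r: r.cost_per_1m_input)
    if policy == "highest_throughput":
        return max(live, key=lambda r: r.tokens_per_sec)
    raise ValueError(f"unknown policy: {policy}")

routes = [
    Route("fireworks", 180, 0.22, 110),
    Route("groq", 90, 0.30, 300),
    Route("together", 250, 0.18, 95),
]
print(pick_route(routes, "lowest_latency").provider)  # groq
```

The same candidate pool yields a different winner per policy, which is why the choice matters for long generations (throughput) versus interactive chat (latency).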
Fallback chains
When a provider returns a 429, returns a 500, or times out, we automatically retry on the next route (up to 3 hops, within an overall 8-second budget). Set `allow_fallback: false` if you want deterministic errors, which is useful for testing or when a specific provider's behaviour matters.
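The retry behaviour above can be modelled with a short loop. This is a hedged sketch, not gateway code: `RouteError`, `with_fallback`, and the simulated `fake_call` are all hypothetical names introduced here to mirror the documented limits (3 hops, 8-second budget, opt-out via `allow_fallback`).

```python
import time

MAX_HOPS = 3
TOTAL_BUDGET_S = 8.0

class RouteError(Exception):
    """Stand-in for a 429, 5xx, or timeout from a provider."""

def with_fallback(providers, call_provider, allow_fallback=True):
    start = time.monotonic()
    last_err = None
    for provider in providers[:MAX_HOPS]:
        if time.monotonic() - start > TOTAL_BUDGET_S:
            break  # overall budget exhausted; stop hopping
        try:
            return call_provider(provider)
        except RouteError as e:
            last_err = e
            if not allow_fallback:
                raise  # deterministic error: surface the failure as-is
    raise last_err

# Simulated: the first provider is rate-limited, the second succeeds.
def fake_call(provider):
    if provider == "groq":
        raise RouteError("429 rate limited")
    return f"ok from {provider}"

print(with_fallback(["groq", "fireworks", "together"], fake_call))
# → ok from fireworks
```

With `allow_fallback=False`, the same call raises on the first 429 instead of hopping, matching the deterministic-error behaviour described above.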
Cross-model fallback
Pass an array as model to degrade gracefully to a different model on error. The gateway tries each in order until one succeeds.
```
{
  "model": [
    "anthropic/claude-4.6-sonnet",
    "openai/gpt-5.4",
    "meta-llama/llama-4-maverick-instruct"
  ],
  "messages": [...]
}
```

Inspecting the chosen route
Every response includes `x-aig-provider`, `x-aig-route-policy`, and `x-aig-attempts` headers. The Observability page in your dashboard also shows the full fallback trace for each request.
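Reading those headers is enough to detect a fallback after the fact. A minimal sketch, assuming your HTTP client exposes response headers as a dict (the header names are from the docs; the `resp_headers` values are made up for illustration):

```python
# Stand-in for response.headers from your HTTP client.
resp_headers = {
    "x-aig-provider": "fireworks",        # provider that actually served it
    "x-aig-route-policy": "lowest_latency",
    "x-aig-attempts": "2",                # total routes tried, as a string
}

provider = resp_headers["x-aig-provider"]
attempts = int(resp_headers["x-aig-attempts"])
if attempts > 1:
    # attempts includes the successful hop, so failures = attempts - 1
    print(f"fell back to {provider} after {attempts - 1} failed attempt(s)")
```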