
# Smart routing + fallbacks

Most popular open-weight models are served by 5+ providers (Fireworks, Together, Groq, DeepInfra, Cerebras, Novita, SambaNova). Smart routing picks the best live route per request based on your chosen policy — latency, cost, or throughput — and fails over when a provider is unhealthy, rate-limited, or slow.

## Default behaviour

Every model ID resolves to a preferred provider. If it responds within the health budget (p99 TTFT under 2 s), we use it; otherwise we try the next-best route in the pool. You don't need to do anything: routing is on by default and invisible to your code.
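For illustration, a minimal request that relies on the defaults looks like this (no `routing` block required; the message content is a placeholder):

```json
{
  "model": "meta-llama/llama-4-maverick-instruct",
  "messages": [
    {"role": "user", "content": "Hello"}
  ]
}
```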

## Pick a routing policy

```http
POST /v1/chat/completions
{
  "model": "meta-llama/llama-4-maverick-instruct",
  "messages": [...],
  "routing": {
    "policy": "lowest_latency",
    "max_cost_per_1m_input": 0.30,
    "providers": ["fireworks", "groq", "together"],
    "allow_fallback": true
  }
}
```
| Policy | Picks the route with… |
| --- | --- |
| `lowest_latency` | Lowest 30-second trailing p50 TTFT (default for streaming). |
| `lowest_cost` | Cheapest provider that meets the health budget. |
| `highest_throughput` | Best tokens/sec decode speed, for long outputs. |
| `pinned` | The exact provider in `providers[0]`. No fallback. |
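As a rough sketch of how the `lowest_cost` policy could behave (the route fields and health check here are illustrative assumptions, not the gateway's actual internals):

```python
def pick_lowest_cost(routes, ttft_budget_s=2.0):
    """Pick the cheapest route whose p99 TTFT is within the health budget.

    `routes` is a list of dicts with hypothetical keys:
    provider, p99_ttft_s, cost_per_1m_input.
    """
    healthy = [r for r in routes if r["p99_ttft_s"] <= ttft_budget_s]
    if not healthy:
        raise RuntimeError("no healthy route in the pool")
    return min(healthy, key=lambda r: r["cost_per_1m_input"])

routes = [
    {"provider": "groq", "p99_ttft_s": 0.4, "cost_per_1m_input": 0.50},
    {"provider": "deepinfra", "p99_ttft_s": 1.2, "cost_per_1m_input": 0.20},
    {"provider": "novita", "p99_ttft_s": 3.5, "cost_per_1m_input": 0.10},
]
print(pick_lowest_cost(routes)["provider"])  # deepinfra: cheapest healthy route
```

Note that the nominally cheapest provider loses here because it misses the health budget, which is the key difference between `lowest_cost` and a static price ranking.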

## Fallback chains

When a provider returns a 429 or 500, or times out, we automatically retry on the next route (up to 3 hops, within an overall 8 s budget). Set `allow_fallback: false` if you want deterministic errors, which is useful for testing or when a specific provider's behaviour matters.
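The retry behaviour described above is roughly equivalent to this client-side sketch, assuming `routes` is a hypothetical list of callables returning `(status, body)`:

```python
import time

RETRYABLE = {429, 500}

def call_with_fallback(routes, max_hops=3, budget_s=8.0):
    """Try each route in order, failing over on retryable errors."""
    start = time.monotonic()
    last_error = None
    for route in routes[:max_hops]:
        if time.monotonic() - start > budget_s:
            break  # overall budget exhausted
        try:
            status, body = route()
        except TimeoutError:
            last_error = "timeout"
            continue
        if status == 200:
            return body
        if status not in RETRYABLE:
            raise RuntimeError(f"non-retryable status {status}")
        last_error = status
    raise RuntimeError(f"all routes exhausted (last error: {last_error})")

# Usage: first provider is rate-limited, second succeeds.
flaky = lambda: (429, None)
healthy = lambda: (200, "completion text")
print(call_with_fallback([flaky, healthy]))  # completion text
```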

## Cross-model fallback

Pass an array as `model` to degrade gracefully to a different model on error. The gateway tries each entry in order until one succeeds.

```json
{
  "model": [
    "anthropic/claude-4.6-sonnet",
    "openai/gpt-5.4",
    "meta-llama/llama-4-maverick-instruct"
  ],
  "messages": [...]
}
```

## Inspecting the chosen route

Every response includes `x-aig-provider`, `x-aig-route-policy`, and `x-aig-attempts` headers. The Observability page in your dashboard also shows the full fallback trace for each request.
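A small helper can pull that routing metadata out of a response. The header names come from this page; the dict-style access is an assumption about your HTTP client:

```python
def routing_info(headers):
    """Extract the gateway's routing headers into a plain dict."""
    return {
        "provider": headers.get("x-aig-provider"),
        "policy": headers.get("x-aig-route-policy"),
        "attempts": int(headers.get("x-aig-attempts", "1")),
    }

# Example headers as they might appear on a response.
headers = {
    "x-aig-provider": "fireworks",
    "x-aig-route-policy": "lowest_latency",
    "x-aig-attempts": "2",
}
print(routing_info(headers))
```

Logging `attempts` alongside latency is an easy way to spot when a request only succeeded after one or more fallback hops.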