A single OpenAI-SDK call that tries Claude Opus first, falls back to GPT-5.4 on rate limit or timeout, and finally to Kimi K2.6 — with retry budgets, per-provider timeouts, and header-level observability. Zero extra lines of orchestration.
Every model has a bad day. Opus gets rate-limited during a launch, GPT's 5xx rate spikes for ten minutes, a new Gemini rollout breaks tool calling. If your product hangs on any single provider, you ship its outage too.
The fix is a chain: try the best model first, then a fallback, then a safety net. What nobody wants is the ten-line try/except pyramid, the per-provider circuit breaker, the retry budget logic, and the timeout-per-model table that usually comes with it.
On AIgateway the chain is the request. You declare an ordered list of models, set a per-provider timeout, and the gateway does the rest — at the edge, before your client sees a single error.
One `x-aig-fallback` header with comma-separated model slugs in preference order. The `model` field becomes the primary; the header lists the fallbacks. Per-provider timeouts and total retry budget live in the same header payload.
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.aigateway.sh/v1",
    api_key="sk-aig-...",
)

resp = client.chat.completions.create(
    model="anthropic/claude-opus-4.7",  # primary
    messages=[{"role": "user", "content": "Summarize this filing: ..."}],
    extra_headers={
        "x-aig-fallback": "openai/gpt-5.4,moonshot/kimi-k2.6",
        "x-aig-timeout-ms": "8000",         # per-provider timeout
        "x-aig-retry-budget": "2",          # total retries across chain
        "x-aig-tag": "summarize.fallback",  # cost attribution
    },
)

print(resp.choices[0].message.content)
print("served_by:", resp.model)
```

The response's `model` field reflects the winning model in the chain — use it for logging, tagging, or switching downstream behavior. For richer detail, three response headers come back for free.
```python
raw = client.with_raw_response.chat.completions.create(...)

print(raw.headers.get("x-aig-served-by"))   # "openai/gpt-5.4"
print(raw.headers.get("x-aig-chain"))       # "anthropic/claude-opus-4.7:429→openai/gpt-5.4:200"
print(raw.headers.get("x-aig-latency-ms"))  # "540"
```

A misconfigured chain can silently triple costs if the primary keeps 5xx-ing. Put a hard cap on the tag and the gateway will stop dispatching once the monthly budget trips — no runaway retries.
```shell
curl -X POST https://api.aigateway.sh/v1/budgets \
  -H "Authorization: Bearer sk-aig-..." \
  -d '{ "tag": "summarize.fallback", "monthly_cap_cents": 5000 }'
```

Primary selection is yours — you know which model you prefer for this call. Everything after that is the gateway's job: provider health, timeout enforcement, retry budget accounting, and which fallback to try next.
The retry budget is per-call, not per-provider. A budget of 2 lets the gateway retry across at most two provider boundaries before giving up; it's the knob that keeps a bad day from becoming a thundering-herd incident.
Observability is on the response, not in logs you have to ship somewhere. `x-aig-chain` tells you exactly which models got tried and what they returned. Ship that to your APM once, read it any time you get paged.
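That header is a compact string, so a few lines of client code can turn it into structured events before shipping it to your APM. A minimal sketch: only the `model:status` hops joined by `→` come from the gateway's format shown above; the parser and tuple shape are my own.

```python
def parse_chain(chain_header: str) -> list[tuple[str, int]]:
    """Split an x-aig-chain value like
    'anthropic/claude-opus-4.7:429→openai/gpt-5.4:200'
    into (model, status) pairs, one per attempt."""
    hops = []
    for hop in chain_header.split("→"):
        # rpartition tolerates any extra colons in the model slug
        model, _, status = hop.rpartition(":")
        hops.append((model, int(status)))
    return hops

hops = parse_chain("anthropic/claude-opus-4.7:429→openai/gpt-5.4:200")
print(hops)  # [('anthropic/claude-opus-4.7', 429), ('openai/gpt-5.4', 200)]
```

Emit one event per tuple and you can alert on "primary returned 429 more than N times per minute" without touching provider dashboards.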
Already writing code against the Anthropic SDK? Point `base_url` at `https://api.aigateway.sh/anthropic` and keep using Anthropic's shape — the fallback headers work identically.
The chain logic lives at the gateway, not in your client. Whatever shape you send, the same primary/secondary/tertiary dispatch runs underneath.
```python
import anthropic

client = anthropic.Anthropic(
    base_url="https://api.aigateway.sh/anthropic",
    api_key="sk-aig-...",
    default_headers={
        "x-aig-fallback": "openai/gpt-5.4,moonshot/kimi-k2.6",
        "x-aig-timeout-ms": "8000",
    },
)
```

Three differences. (1) The primary is a normal `model` field — not buried in an array — so your code reads like a single-model call. (2) Per-provider timeouts and a global retry budget are first-class headers, not best-effort defaults. (3) The `x-aig-chain` response header tells you exactly what got tried with what outcome, without shipping logs.
**What triggers a fallback?** A 429 rate limit, a 5xx server error, a timeout (as set by `x-aig-timeout-ms`), or the gateway's health monitor flagging the provider as degraded. Content-filter flags do not trigger fallback by default — set `x-aig-fallback-on=4xx,5xx,timeout,filter` if you want them to.
**Does fallback work with streaming?** Yes, and the fallback decision happens before the first token. If the primary errors mid-stream, the gateway does not silently switch — you'd get a mid-stream shift in voice. Instead the stream errors cleanly so your client can retry, and the fallback kicks in on the retry.
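The client-side half of that contract is a small retry wrapper around the stream. A sketch assuming the SDK surfaces the mid-stream failure as an exception; `stream_with_retry` and the fake stream below are illustrative, not gateway API.

```python
def stream_with_retry(start_stream, max_attempts=2):
    """Consume a token stream; if it errors mid-stream, discard the
    partial output and start a fresh request so the gateway's fallback
    chain can kick in on the retry."""
    for attempt in range(max_attempts):
        chunks = []
        try:
            for chunk in start_stream():
                chunks.append(chunk)
            return "".join(chunks)
        except Exception:
            if attempt == max_attempts - 1:
                raise  # budget exhausted: surface the error

# Illustrative stand-in: the first stream dies mid-way, the retry succeeds.
calls = {"n": 0}
def fake_stream():
    calls["n"] += 1
    if calls["n"] == 1:
        yield "The qui"
        raise RuntimeError("primary errored mid-stream")
    yield from ("The ", "quick ", "fox")

print(stream_with_retry(fake_stream))  # The quick fox
```

Discarding the partial chunks matters: concatenating the truncated primary output with the fallback's would reintroduce exactly the mid-stream voice shift the gateway refuses to cause.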
**Can a chain mix providers with different native APIs?** Yes. The gateway normalizes requests and responses to the canonical shape and back. Your chain can freely mix providers with different native shapes — the fallback chain reads the same from your code.
**Are failed attempts billed?** No. Only the winning request counts against your usage. Failed attempts against a degraded primary are not billed to the caller — that's the point of the health monitor.
**What happens when every tier fails?** You get a standard error with `x-aig-chain` populated, so you know exactly what each tier returned. The retry budget is the safety valve — with `x-aig-retry-budget: 2` the gateway makes at most three attempts total before failing fast.
**Can different calls use different chains?** Yes. The headers are per-request, so a `summarize` call can fall back to Kimi while a `code-generate` call falls back to GPT-5.4 — same key, same endpoint.
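In practice that usually becomes a small per-call-type table. A sketch using the same header names as above; the `CHAINS` table and `headers_for` helper are hypothetical, not part of the gateway.

```python
# Hypothetical per-call-type chains: (primary model, fallback list).
CHAINS = {
    "summarize":     ("anthropic/claude-opus-4.7", "openai/gpt-5.4,moonshot/kimi-k2.6"),
    "code-generate": ("moonshot/kimi-k2.6", "openai/gpt-5.4"),
}

def headers_for(call_type: str) -> tuple[str, dict]:
    """Return (primary model, extra_headers) for a given call type."""
    primary, fallbacks = CHAINS[call_type]
    return primary, {
        "x-aig-fallback": fallbacks,
        "x-aig-tag": f"{call_type}.fallback",  # per-type cost attribution
    }

model, headers = headers_for("code-generate")
print(model)                      # moonshot/kimi-k2.6
print(headers["x-aig-fallback"])  # openai/gpt-5.4
```

Pass the pair straight through as `model=model, extra_headers=headers` on each `client.chat.completions.create` call, and each call type gets its own chain and its own budget tag on the same key.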