examples/cost + ops
Flagship · 7 min build

Ship a zero-downtime LLM chain: Opus → GPT-5.4 → Kimi K2.6

A single OpenAI-SDK call that tries Claude Opus first, falls back to GPT-5.4 on rate limit or timeout, and finally to Kimi K2.6 — with retry budgets, per-provider timeouts, and header-level observability. Zero extra lines of orchestration.

7 min read · published 2026-04-25 · category · Cost + ops
Primary, secondary, tertiary — a three-model fallback chain with one API call

Every model has a bad day. Opus gets rate-limited during a launch, GPT's 5xx rate spikes for ten minutes, a new Gemini rollout breaks tool calling. If your product hangs on any single provider, you ship its outage too.

The fix is a chain: try the best model first, then a fallback, then a safety net. What nobody wants is the ten-line try/except pyramid, the per-provider circuit breaker, the retry budget logic, and the timeout-per-model table that usually come with it.
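That client-side orchestration, sketched in miniature (illustrative names only; `call` stands in for any single-provider invocation that raises on failure):

```python
def call_with_fallback(call, models, retry_budget=2):
    """Try each model in preference order; cap failed attempts at retry_budget.

    Illustrative sketch of the loop a gateway-less client would have to own.
    """
    last_err = None
    failures = 0
    for model in models:
        if failures > retry_budget:
            break  # budget exhausted: stop dispatching
        try:
            return model, call(model)
        except Exception as err:  # 429, 5xx, timeout, ...
            last_err = err
            failures += 1
    raise last_err
```

And that's before per-provider timeouts, health tracking, and observability get bolted on.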

On AIgateway the chain is the request. You declare an ordered list of models, set a per-provider timeout, and the gateway does the rest — at the edge, before your client sees a single error.

AIgateway key · OpenAI SDK · One header · Zero orchestration code
Note
We watch provider health on a 30-second tick. If Opus is degraded when you call, the gateway skips it before trying — your p99 doesn't pay the failed-request tax.

Build it in three steps

  1. STEP 01

    Declare your chain

    One `x-aig-fallback` header with comma-separated model slugs in preference order. The `model` field becomes the primary; the header lists the fallbacks. Per-provider timeouts and total retry budget live in the same header payload.

    from openai import OpenAI
    
    client = OpenAI(
        base_url="https://api.aigateway.sh/v1",
        api_key="sk-aig-...",
    )
    
    resp = client.chat.completions.create(
        model="anthropic/claude-opus-4.7",           # primary
        messages=[{"role": "user", "content": "Summarize this filing: ..."}],
        extra_headers={
            "x-aig-fallback": "openai/gpt-5.4,moonshot/kimi-k2.6",
            "x-aig-timeout-ms": "8000",              # per-provider timeout
            "x-aig-retry-budget": "2",               # total retries across chain
            "x-aig-tag": "summarize.fallback",       # cost attribution
        },
    )
    
    print(resp.choices[0].message.content)
    print("served_by:", resp.model)
  2. STEP 02

    Read which model actually answered

    The response's `model` field reflects the winning model in the chain — use it for logging, tagging, or switching downstream behavior. For richer detail, three response headers come back for free.

    raw = client.with_raw_response.chat.completions.create(...)
    print(raw.headers.get("x-aig-served-by"))    # "openai/gpt-5.4"
    print(raw.headers.get("x-aig-chain"))        # "anthropic/claude-opus-4.7:429→openai/gpt-5.4:200"
    print(raw.headers.get("x-aig-latency-ms"))   # "540"
  3. STEP 03

    Cap the total spend

    A misconfigured chain can silently triple costs if the primary keeps 5xx-ing. Put a hard cap on the tag and the gateway will stop dispatching once the monthly budget trips — no runaway retries.

    curl -X POST https://api.aigateway.sh/v1/budgets \
      -H "Authorization: Bearer sk-aig-..." \
      -d '{ "tag": "summarize.fallback", "monthly_cap_cents": 5000 }'
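The same call from Python, as a standard-library sketch (`budget_request` is an illustrative helper, not an SDK function):

```python
import json
import urllib.request

def budget_request(tag, monthly_cap_cents, api_key):
    """Build the POST to /v1/budgets; send it with urllib.request.urlopen."""
    payload = json.dumps({"tag": tag, "monthly_cap_cents": monthly_cap_cents})
    return urllib.request.Request(
        "https://api.aigateway.sh/v1/budgets",
        data=payload.encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Send it with `urllib.request.urlopen(budget_request("summarize.fallback", 5000, "sk-aig-..."))`.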

What gets decided where

Primary selection is yours — you know which model you prefer for this call. Everything after that is the gateway's job: provider health, timeout enforcement, retry budget accounting, and which fallback to try next.

The retry budget is per-call, not per-chain. A budget of 2 lets the gateway retry across at most two provider boundaries before giving up; it's the knob that keeps a bad day from becoming a thundering-herd incident.

Observability is on the response, not in logs you have to ship somewhere. `x-aig-chain` tells you exactly which models got tried and what they returned. Ship that to your APM once, read it any time you get paged.
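Since `x-aig-chain` is just `model:status` hops joined by `→` (as in the Step 02 example), a few lines turn it into structured events for your APM. This parser is an illustrative sketch, not part of any SDK:

```python
def parse_chain(header: str):
    """Split "model:status→model:status" into a list of attempt dicts."""
    attempts = []
    for hop in header.split("→"):
        # rpartition keeps slashes in the model slug intact
        model, _, status = hop.rpartition(":")
        attempts.append({"model": model, "status": int(status)})
    return attempts
```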

OpenAI SDK + Anthropic SDK both work

Already writing code against the Anthropic SDK? Point `base_url` at `https://api.aigateway.sh/anthropic` and keep using Anthropic's shape — the fallback headers work identically.

The chain logic lives at the gateway, not in your client. Whatever shape you send, the same primary/secondary/tertiary dispatch runs underneath.

import anthropic

client = anthropic.Anthropic(
    base_url="https://api.aigateway.sh/anthropic",
    api_key="sk-aig-...",
    default_headers={
        "x-aig-fallback": "openai/gpt-5.4,moonshot/kimi-k2.6",
        "x-aig-timeout-ms": "8000",
    },
)

FAQ

How is this different from OpenRouter's fallback?

Three differences. (1) Primary is a normal `model` field — not buried in an array — so your code reads like a single-model call. (2) Per-provider timeouts and a global retry budget are first-class headers, not best-effort defaults. (3) The `x-aig-chain` response header tells you exactly what got tried with what outcome, without shipping logs.

What triggers a fallback?

429 rate-limit, 5xx server error, timeout (as set by `x-aig-timeout-ms`), or the gateway's health monitor flagging the provider as degraded. Content-filter flags do not trigger fallback by default — set `x-aig-fallback-on=4xx,5xx,timeout,filter` if you want them to.
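As a sketch, those per-request headers can be built with a small helper (`fallback_headers` is illustrative, not an SDK function) and passed via `extra_headers=`:

```python
def fallback_headers(models, on=("4xx", "5xx", "timeout")):
    """Build the fallback headers, optionally widening the triggers."""
    return {
        "x-aig-fallback": ",".join(models),
        "x-aig-fallback-on": ",".join(on),
    }
```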

Will streaming work?

Yes, and the fallback decision happens before the first token. If the primary errors mid-stream, the gateway does not silently switch, because splicing in a new model would shift the voice mid-stream. Instead the stream errors cleanly so your client can retry, and the fallback kicks in on the retry.
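The client-side retry that answer implies can be sketched like this, with `open_stream` standing in for `client.chat.completions.create(..., stream=True)` (illustrative, not gateway code):

```python
def stream_with_retry(open_stream, max_attempts=2):
    """Consume a stream; on a mid-stream error, discard the partial output
    and retry, so the gateway's fallback chain kicks in on the next attempt."""
    for attempt in range(max_attempts):
        chunks = []
        try:
            for chunk in open_stream():
                chunks.append(chunk)
            return "".join(chunks)
        except Exception:
            if attempt + 1 == max_attempts:
                raise  # out of attempts: surface the error
```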

Can I chain across shapes (OpenAI + Anthropic + Cohere)?

Yes. The gateway normalizes requests and responses to the canonical shape and back. Your chain can freely mix providers of different native shapes — the fallback chain reads the same from your code.

Will retries double-bill me?

No. Only the winning request counts against your usage. Failed attempts against a degraded primary are not billed to the caller — that's the point of the health monitor.

What if all three fail?

You get a standard error with `x-aig-chain` populated, so you know exactly what each tier returned. The retry budget is the safety valve — with `retry-budget: 2` the gateway makes at most three attempts total before failing fast.

Can I set different fallbacks per request?

Yes. The headers are per-request, so a `summarize` call can fall back to Kimi while a `code-generate` call falls back to GPT-5.4 — same key, same endpoint.

READY TO BUILD?
Get an AIgateway key in 30 seconds. Free Kimi K2.6 through Apr 30, 2026; everything else is pass-through.
Get your key → · API reference · Kimi K2.6 details

More examples