Auto Router
Set model:"auto" and the router reads each request, picks the cheapest model in a curated pool that still clears the quality floor, and never charges you more than the premium model you would have called yourself. It works across text, image, video, speech, transcription, music, and embeddings — one field covers everything you build. Every routed response tells you exactly what ran and what you saved.
This is distinct from Smart routing + fallbacks, which picks the best live provider for one model ID and fails over when a provider is unhealthy. The Auto Router instead selects which model runs for a request based on its complexity. You can use both: auto picks the model, smart routing then picks the route to it.
Usage
The only change from a normal call is the model value. Send model:"auto" to route by modality inferred from the endpoint, scope it to a specific modality with model:"auto/<modality>", or omit model entirely. Everything else is OpenAI-compatible.
POST /v1/chat/completions Authorization: Bearer sk-aig-... { "model": "auto", "messages": [ { "role": "user", "content": "What time zone is Lisbon in?" } ] } // scope to one modality { "model": "auto/image", "prompt": "concept sketch of a kayak" } // set your own ceiling + bias the pick { "model": "auto", "baseline_model": "anthropic/claude-opus-4.8", "messages": [...] } // + header: x-routing: cost
Supported modalities
Auto spans every generative modality, not just text. Scope a request to one lane with the auto/<modality> form:
| Model value | Modality | Endpoint |
|---|---|---|
auto/text | Chat / completion | /v1/chat/completions |
auto/image | Image generation | /v1/images/generations |
auto/video | Video generation | /v1/videos |
auto/tts | Text-to-speech | /v1/audio/speech |
auto/stt | Transcription | /v1/audio/transcriptions |
auto/music | Music generation | /v1/audio/music |
auto/embedding | Embeddings | /v1/embeddings |
Bare model:"auto" (or an omitted model) routes within the modality implied by the endpoint you hit, so a /v1/embeddings call with model:"auto" behaves like auto/embedding.
How routing down works
The router reads the complexity of each request and uses it as the cost-versus-quality knob. A factual lookup reads as low complexity and drops to a cheaper tier; a multi-step reasoning prompt holds near the top of the pool. For text this is a fast classifier over your messages; for generative media it reads prompt richness (length, detail cues like cinematic / photoreal / high-res).
| Complexity read | Signals | Tends toward |
|---|---|---|
simple | Short, factual, single-turn, no code or multi-step ask | Economy tier |
moderate | Some length, light reasoning, a few turns of context | Standard tier |
complex | Long, code-heavy, multi-step, deep conversation, large system prompt | Premium tier (up to your baseline) |
The router only ever routes down from the baseline. Candidates more expensive than the baseline are filtered out before scoring, so the worst case is that auto picks the baseline itself — never anything pricier.
Curated pools + quality floor
Auto only selects from a curated, tiered pool per modality — premium, standard, and economy — maintained by hand against public benchmark leaderboards. Every member carries a quality prior in [0, 1], refined by real eval scores where they exist. Candidates below the modality's quality floor are filtered out before routing, so the router trades down on price, never down to garbage.
The text pool, for example:
| Tier | Members (text) |
|---|---|
premium | openai/gpt-5.5-pro, anthropic/claude-opus-4.8, google/gemini-3.1-pro, xai/grok-4 |
standard | anthropic/claude-sonnet-4.6, moonshot/kimi-k2.6, google/gemini-3-flash, xai/grok-4-fast |
economy | anthropic/claude-haiku-4.5, openai/gpt-oss-120b, google/gemma-4-26b-a4b-it, meta/llama-3.3-70b-instruct-fp8-fast |
Each modality has its own pool and its own baseline. Pools are versioned and revised as models and benchmarks change; you do not pin a pool — you pin a baseline, and the router works the curated set beneath it.
Parameters: baseline_model + x-routing
baseline_model
Every request carries a baseline: the model you would otherwise have called. Set it explicitly in the body with baseline_model, or leave it unset and it defaults to the premium model for that modality. The baseline does double duty:
- Measuring stick — savings are computed against what the baseline would have cost for this exact request.
- Hard cost ceiling — the router only considers models no more expensive than the baseline, so you can never pay more than the baseline price.
Per-modality default baselines: text → anthropic/claude-opus-4.8, image → recraft/recraftv4-pro, video → google/veo-3.1, tts → minimax/speech-2.8-hd, stt → openai/gpt-4o-transcribe, music → fal-ai/elevenlabs/music, embedding → baai/bge-m3.
x-routing header
Bias the pick with the x-routing request header. It does not change the ceiling — every option stays at or under your baseline — it only changes how the router weighs cost against quality.
x-routing | Picks… |
|---|---|
cost | The cheapest model that clears the quality floor. |
speed | The fastest qualifying model in the pool. |
quality | The strongest model still at or under your baseline. |
auto | Balances cost and quality from the complexity read. Default when the header is absent. |
Transparency response headers
Auto is not a black box. Every routed response returns the full picture: the model that ran, why, the complexity read, the quality score, your premium baseline, and the exact dollars saved. Log these and you can audit any single call or chart your savings over time.
| Header | Meaning |
|---|---|
X-Auto-Routed | true on any auto-routed call. Absent on explicit-model calls. |
X-Routing-Selected | The model id that actually ran. |
X-Routing-Reason | Short trace, e.g. auto simple -> anthropic/claude-haiku-4.5 (vs anthropic/claude-opus-4.8). |
X-Routing-Complexity | The complexity read: simple, moderate, or complex. |
X-Routing-Quality | Quality score of the selected model, 0–1, three decimals. |
X-Auto-Baseline-Model | The premium baseline this call was measured and capped against. |
X-Auto-Baseline-Cost-Cents | What the baseline would have cost for this exact request, in cents. |
X-Auto-Route-Fee-Cents | The platform fee on this call, in cents (the savings share). |
X-Auto-Savings-Cents | Net dollars you saved versus the baseline, after the fee, in cents. |
X-Routing-Selected, X-Routing-Reason, X-Routing-Complexity, X-Auto-Baseline-Model, X-Auto-Routed) arrive up front, before the first token. Cost and savings headers settle once the stream closes.Example response headers on a routed text call:
HTTP/1.1 200 OK X-Auto-Routed: true X-Routing-Selected: anthropic/claude-haiku-4.5 X-Routing-Reason: auto simple -> anthropic/claude-haiku-4.5 (vs anthropic/claude-opus-4.8) X-Routing-Complexity: simple X-Routing-Quality: 0.780 X-Auto-Baseline-Model: anthropic/claude-opus-4.8 X-Auto-Baseline-Cost-Cents: 0.9975 X-Auto-Route-Fee-Cents: 0.2394 X-Auto-Savings-Cents: 0.5586 X-Cost-Cents: 0.4389
Quickstart
Drop auto into the model field of any call you already make. Read the X-Routing-* headers off the response to see what ran.
curl
curl https://api.aigateway.sh/v1/chat/completions -i \ -H "Authorization: Bearer sk-aig-..." \ -H "Content-Type: application/json" \ -H "x-routing: cost" \ -d '{ "model": "auto", "baseline_model": "anthropic/claude-opus-4.8", "messages": [{ "role": "user", "content": "What time zone is Lisbon in?" }] }'
Python
from openai import OpenAI client = OpenAI( base_url="https://api.aigateway.sh/v1", api_key="sk-aig-...", ) resp = client.chat.completions.with_raw_response.create( model="auto", messages=[{"role": "user", "content": "What time zone is Lisbon in?"}], extra_body={"baseline_model": "anthropic/claude-opus-4.8"}, extra_headers={"x-routing": "cost"}, ) # the headers tell you what ran and what you saved print(resp.headers["x-routing-selected"]) # anthropic/claude-haiku-4.5 print(resp.headers["x-auto-savings-cents"]) # 0.5586 completion = resp.parse()
TypeScript
import OpenAI from "openai"; const client = new OpenAI({ baseURL: "https://api.aigateway.sh/v1", apiKey: "sk-aig-...", }); const { data, response } = await client.chat.completions .create( { model: "auto", messages: [{ role: "user", content: "What time zone is Lisbon in?" }], // baseline_model is an AIgateway extension; pass it through baseline_model: "anthropic/claude-opus-4.8", } as any, { headers: { "x-routing": "cost" } }, ) .withResponse(); console.log(response.headers.get("x-routing-selected")); // anthropic/claude-haiku-4.5 console.log(response.headers.get("x-auto-savings-cents")); // 0.5586
Pricing
The headline is simple: on every routed call you pay less than the premium baseline, guaranteed. Here is the exact mechanic.
On a routed call you pay the selected model's cost (at the standard platform per-call fee) plus a share of what the router saved you:
- Base platform fee — the same 5% per-call fee you pay on any AIgateway call, applied to the routed model's pass-through cost.
- Savings share — 30% of the dollars the router saved versus your baseline. You keep the other 70% of every dollar saved.
- Never above the baseline — by construction, your charge is always at or below what the baseline would have cost. The baseline is a hard ceiling.
- $0 fee when there is no saving — if a request routes to the baseline itself (no cheaper qualifying model), there is no saving and no savings-share fee; you pay the plain baseline price.
Concretely, the charge is routed_cost + 30% × max(0, baseline_cost − routed_cost), where both costs already include the 5% per-call fee. Because the share is a fraction of a positive saving, the result is always ≤ baseline_cost, with equality only when there is nothing to save.
Worked example (text)
A simple support lookup, ~400 input + 300 output tokens, with baseline_model: anthropic/claude-opus-4.8 ($5 in / $25 out per Mtok). The router reads low complexity and drops to anthropic/claude-haiku-4.5 ($1 in / $5 out per Mtok):
| Line | Per call |
|---|---|
| Baseline cost (Opus 4.8, incl. fee) | $0.00997 |
| Routed model cost (Haiku 4.5, incl. fee) | $0.00199 |
| Gross savings vs baseline | $0.00798 |
| Savings share (30% of gross) | $0.00239 |
| You pay | $0.00439 (vs $0.00997 baseline) |
| You keep (70% of savings) | $0.00559 — 56% cheaper |
Across 1,000 such calls that is $4.39 instead of $9.98. The same mechanic applies to media modalities, priced on their billing unit (per image, per second, per 1k characters, per minute) rather than tokens.
For the full pricing model across all modalities, see Pricing.
What to try next
- Run a request in the Playground with
model:"auto"and watch the headers. - Pair auto with Smart routing + fallbacks so the chosen model also gets the best live route.
- Set a tighter ceiling with
baseline_modelon a per-endpoint basis — e.g. your top coding model on an agent loop.