How do I use the Auto Router?

Set model:"auto" on any request, or scope it to a modality with model:"auto/text" (also image, video, tts, stt, music, embedding). You can also omit the model field entirely. Everything else stays OpenAI-compatible — only the model value changes.

Will auto ever cost more than calling the model I would have picked?

No. Every request carries a baseline (set it with baseline_model, or it defaults to the premium model for that modality). The router only ever selects models no more expensive than the baseline, so the baseline is a hard cost ceiling. On routed calls you pay strictly less than the premium price; in the worst case you pay exactly the baseline.

How do I see what the router picked?

Every routed response returns transparency headers: X-Routing-Selected (the model that ran), X-Routing-Reason, X-Routing-Complexity, X-Routing-Quality, X-Auto-Baseline-Model, X-Auto-Baseline-Cost-Cents, and X-Auto-Savings-Cents. On streaming responses the routing-decision headers arrive up front.

Can I bias the router toward speed or quality?

Yes. Send the x-routing header set to cost, speed, quality, or auto. Cost favors the cheapest model that clears the floor; quality favors the strongest model still at or under your baseline; auto balances both from the complexity read.

Which model does it actually choose from?

A curated, tiered pool per modality (premium / standard / economy), maintained by hand against public benchmark leaderboards. Every candidate carries a quality prior, refined by real eval scores, and anything below the modality quality floor is filtered out before routing. Your explicit model ids always reach the full catalog — auto just keeps you inside the curated set.

What does the Auto Router cost?

You pay less than the premium baseline on every routed call, guaranteed. On a routed call you pay the selected model cost plus 30% of the dollars the router saved you versus your baseline; you keep the other 70% of every dollar saved, and the bill never rises above the baseline. When there is no saving the fee is $0.

View as /docs.md

Platform

Auto Router

Set model:"auto" and the router reads each request, picks the cheapest model in a curated pool that still clears the quality floor, and never charges you more than the premium model you would have called yourself. It works across text, image, video, speech, transcription, music, and embeddings — one field covers everything you build. Every routed response tells you exactly what ran and what you saved.

This is distinct from Smart routing + fallbacks, which picks the best live provider for one model ID and fails over when a provider is unhealthy. The Auto Router instead selects which model runs for a request based on its complexity. You can use both: auto picks the model, smart routing then picks the route to it.

Usage

The only change from a normal call is the model value. Send model:"auto" to route by modality inferred from the endpoint, scope it to a specific modality with model:"auto/<modality>", or omit model entirely. Everything else is OpenAI-compatible.

POST /v1/chat/completions
Authorization: Bearer sk-aig-...

{ "model": "auto",
  "messages": [
    { "role": "user", "content": "What time zone is Lisbon in?" }
  ] }

// scope to one modality
{ "model": "auto/image", "prompt": "concept sketch of a kayak" }

// set your own ceiling + bias the pick
{ "model": "auto",
  "baseline_model": "anthropic/claude-opus-4.8",
  "messages": [...] }   // + header: x-routing: cost

Supported modalities

Auto spans every generative modality, not just text. Scope a request to one lane with the auto/<modality> form:

Model value	Modality	Endpoint
`auto/text`	Chat / completion	`/v1/chat/completions`
`auto/image`	Image generation	`/v1/images/generations`
`auto/video`	Video generation	`/v1/videos`
`auto/tts`	Text-to-speech	`/v1/audio/speech`
`auto/stt`	Transcription	`/v1/audio/transcriptions`
`auto/music`	Music generation	`/v1/audio/music`
`auto/embedding`	Embeddings	`/v1/embeddings`

Bare model:"auto" (or an omitted model) routes within the modality implied by the endpoint you hit, so a /v1/embeddings call with model:"auto" behaves like auto/embedding.

How routing down works

The router reads the complexity of each request and uses it as the cost-versus-quality knob. A factual lookup reads as low complexity and drops to a cheaper tier; a multi-step reasoning prompt holds near the top of the pool. For text this is a fast classifier over your messages; for generative media it reads prompt richness (length, detail cues like cinematic / photoreal / high-res).

Complexity read	Signals	Tends toward
`simple`	Short, factual, single-turn, no code or multi-step ask	Economy tier
`moderate`	Some length, light reasoning, a few turns of context	Standard tier
`complex`	Long, code-heavy, multi-step, deep conversation, large system prompt	Premium tier (up to your baseline)

The router only ever routes down from the baseline. Candidates more expensive than the baseline are filtered out before scoring, so the worst case is that auto picks the baseline itself — never anything pricier.

Curated pools + quality floor

Auto only selects from a curated, tiered pool per modality — premium, standard, and economy — maintained by hand against public benchmark leaderboards. Every member carries a quality prior in [0, 1], refined by real eval scores where they exist. Candidates below the modality's quality floor are filtered out before routing, so the router trades down on price, never down to garbage.

Quality floor — auto keeps you inside the curated set. Your explicit model ids always reach the full catalog; only auto is scoped to vetted, eval-covered models.

The text pool, for example:

Tier	Members (text)
`premium`	`openai/gpt-5.5-pro`, `anthropic/claude-opus-4.8`, `google/gemini-3.1-pro`, `xai/grok-4`
`standard`	`anthropic/claude-sonnet-4.6`, `moonshot/kimi-k2.7-code`, `google/gemini-3-flash`, `xai/grok-4-fast`
`economy`	`anthropic/claude-haiku-4.5`, `openai/gpt-oss-120b`, `google/gemma-4-26b-a4b-it`, `meta/llama-3.3-70b-instruct-fp8-fast`

Each modality has its own pool and its own baseline. Pools are versioned and revised as models and benchmarks change; you do not pin a pool — you pin a baseline, and the router works the curated set beneath it.

Parameters: baseline_model + x-routing

baseline_model

Every request carries a baseline: the model you would otherwise have called. Set it explicitly in the body with baseline_model, or leave it unset and it defaults to the premium model for that modality. The baseline does double duty:

Measuring stick — savings are computed against what the baseline would have cost for this exact request.
Hard cost ceiling — the router only considers models no more expensive than the baseline, so you can never pay more than the baseline price.

Per-modality default baselines: text → anthropic/claude-opus-4.8, image → recraft/recraftv4-pro, video → google/veo-3.1, tts → minimax/speech-2.8-hd, stt → openai/gpt-4o-transcribe, music → fal-ai/elevenlabs/music, embedding → baai/bge-m3.

x-routing header

Bias the pick with the x-routing request header. It does not change the ceiling — every option stays at or under your baseline — it only changes how the router weighs cost against quality.

`x-routing`	Picks…
`cost`	The cheapest model that clears the quality floor.
`speed`	The fastest qualifying model in the pool.
`quality`	The strongest model still at or under your baseline.
`auto`	Balances cost and quality from the complexity read. Default when the header is absent.

Transparency response headers

Auto is not a black box. Every routed response returns the full picture: the model that ran, why, the complexity read, the quality score, your premium baseline, and the exact dollars saved. Log these and you can audit any single call or chart your savings over time.

Header	Meaning
`X-Auto-Routed`	`true` on any auto-routed call. Absent on explicit-model calls.
`X-Routing-Selected`	The model id that actually ran.
`X-Routing-Reason`	Short trace, e.g. `auto simple -> anthropic/claude-haiku-4.5 (vs anthropic/claude-opus-4.8)`.
`X-Routing-Complexity`	The complexity read: `simple`, `moderate`, or `complex`.
`X-Routing-Quality`	Quality score of the selected model, 0–1, three decimals.
`X-Auto-Baseline-Model`	The premium baseline this call was measured and capped against.
`X-Auto-Baseline-Cost-Cents`	What the baseline would have cost for this exact request, in cents.
`X-Auto-Route-Fee-Cents`	The platform fee on this call, in cents (the savings share).
`X-Auto-Savings-Cents`	Net dollars you saved versus the baseline, after the fee, in cents.

Streaming — the routing-decision headers (X-Routing-Selected, X-Routing-Reason, X-Routing-Complexity, X-Auto-Baseline-Model, X-Auto-Routed) arrive up front, before the first token. Cost and savings headers settle once the stream closes.

Example response headers on a routed text call:

HTTP/1.1 200 OK
X-Auto-Routed: true
X-Routing-Selected: anthropic/claude-haiku-4.5
X-Routing-Reason: auto simple -> anthropic/claude-haiku-4.5 (vs anthropic/claude-opus-4.8)
X-Routing-Complexity: simple
X-Routing-Quality: 0.780
X-Auto-Baseline-Model: anthropic/claude-opus-4.8
X-Auto-Baseline-Cost-Cents: 0.9975
X-Auto-Route-Fee-Cents: 0.2394
X-Auto-Savings-Cents: 0.5586
X-Cost-Cents: 0.4389

Quickstart

Drop auto into the model field of any call you already make. Read the X-Routing-* headers off the response to see what ran.

curl

curl https://api.aigateway.sh/v1/chat/completions -i \
  -H "Authorization: Bearer sk-aig-..." \
  -H "Content-Type: application/json" \
  -H "x-routing: cost" \
  -d '{
    "model": "auto",
    "baseline_model": "anthropic/claude-opus-4.8",
    "messages": [{ "role": "user", "content": "What time zone is Lisbon in?" }]
  }'

Python

from openai import OpenAI

client = OpenAI(
    base_url="https://api.aigateway.sh/v1",
    api_key="sk-aig-...",
)

resp = client.chat.completions.with_raw_response.create(
    model="auto",
    messages=[{"role": "user", "content": "What time zone is Lisbon in?"}],
    extra_body={"baseline_model": "anthropic/claude-opus-4.8"},
    extra_headers={"x-routing": "cost"},
)

# the headers tell you what ran and what you saved
print(resp.headers["x-routing-selected"])      # anthropic/claude-haiku-4.5
print(resp.headers["x-auto-savings-cents"])    # 0.5586
completion = resp.parse()

TypeScript

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.aigateway.sh/v1",
  apiKey: "sk-aig-...",
});

const { data, response } = await client.chat.completions
  .create(
    {
      model: "auto",
      messages: [{ role: "user", content: "What time zone is Lisbon in?" }],
      // baseline_model is an AIgateway extension; pass it through
      baseline_model: "anthropic/claude-opus-4.8",
    } as any,
    { headers: { "x-routing": "cost" } },
  )
  .withResponse();

console.log(response.headers.get("x-routing-selected"));   // anthropic/claude-haiku-4.5
console.log(response.headers.get("x-auto-savings-cents")); // 0.5586

Pricing

The headline is simple: on every routed call you pay less than the premium baseline, guaranteed. Here is the exact mechanic.

On a routed call you pay the selected model's cost (at the standard platform per-call fee) plus a share of what the router saved you:

Base platform fee — the same 5% per-call fee you pay on any AIgateway call, applied to the routed model's pass-through cost.
Savings share — 30% of the dollars the router saved versus your baseline. You keep the other 70% of every dollar saved.
Never above the baseline — by construction, your charge is always at or below what the baseline would have cost. The baseline is a hard ceiling.
$0 fee when there is no saving — if a request routes to the baseline itself (no cheaper qualifying model), there is no saving and no savings-share fee; you pay the plain baseline price.

Concretely, the charge is routed_cost + 30% × max(0, baseline_cost − routed_cost), where both costs already include the 5% per-call fee. Because the share is a fraction of a positive saving, the result is always ≤ baseline_cost, with equality only when there is nothing to save.

Worked example (text)

A simple support lookup, ~400 input + 300 output tokens, with baseline_model: anthropic/claude-opus-4.8 ($5 in / $25 out per Mtok). The router reads low complexity and drops to anthropic/claude-haiku-4.5 ($1 in / $5 out per Mtok):

Line	Per call
Baseline cost (Opus 4.8, incl. fee)	`$0.00997`
Routed model cost (Haiku 4.5, incl. fee)	`$0.00199`
Gross savings vs baseline	`$0.00798`
Savings share (30% of gross)	`$0.00239`
You pay	`$0.00439` (vs $0.00997 baseline)
You keep (70% of savings)	`$0.00559` — 56% cheaper

Across 1,000 such calls that is $4.39 instead of $9.98. The same mechanic applies to media modalities, priced on their billing unit (per image, per second, per 1k characters, per minute) rather than tokens.

BYOK — when you bring your own provider key, AIgateway does not pay the provider, so there is no per-call markup on the model cost. The savings (and the 30% share) are computed on provider list prices, which is your real saving on your own provider bill. See BYOK providers.

For the full pricing model across all modalities, see Pricing.

What to try next

Run a request in the Playground with model:"auto" and watch the headers.
Pair auto with Smart routing + fallbacks so the chosen model also gets the best live route.
Set a tighter ceiling with baseline_model on a per-endpoint basis — e.g. your top coding model on an agent loop.

← PreviousSmart routing Next →Caching