View as /docs.md
Platform

Auto Router

Set model:"auto" and the router reads each request, picks the cheapest model in a curated pool that still clears the quality floor, and never charges you more than the premium model you would have called yourself. It works across text, image, video, speech, transcription, music, and embeddings — one field covers everything you build. Every routed response tells you exactly what ran and what you saved.

This is distinct from Smart routing + fallbacks, which picks the best live provider for one model ID and fails over when a provider is unhealthy. The Auto Router instead selects which model runs for a request based on its complexity. You can use both: auto picks the model, smart routing then picks the route to it.

Usage

The only change from a normal call is the model value. Send model:"auto" to route by modality inferred from the endpoint, scope it to a specific modality with model:"auto/<modality>", or omit model entirely. Everything else is OpenAI-compatible.

POST /v1/chat/completions
Authorization: Bearer sk-aig-...

{ "model": "auto",
  "messages": [
    { "role": "user", "content": "What time zone is Lisbon in?" }
  ] }

// scope to one modality
{ "model": "auto/image", "prompt": "concept sketch of a kayak" }

// set your own ceiling + bias the pick
{ "model": "auto",
  "baseline_model": "anthropic/claude-opus-4.8",
  "messages": [...] }   // + header: x-routing: cost

Supported modalities

Auto spans every generative modality, not just text. Scope a request to one lane with the auto/<modality> form:

Model valueModalityEndpoint
auto/textChat / completion/v1/chat/completions
auto/imageImage generation/v1/images/generations
auto/videoVideo generation/v1/videos
auto/ttsText-to-speech/v1/audio/speech
auto/sttTranscription/v1/audio/transcriptions
auto/musicMusic generation/v1/audio/music
auto/embeddingEmbeddings/v1/embeddings

Bare model:"auto" (or an omitted model) routes within the modality implied by the endpoint you hit, so a /v1/embeddings call with model:"auto" behaves like auto/embedding.

How routing down works

The router reads the complexity of each request and uses it as the cost-versus-quality knob. A factual lookup reads as low complexity and drops to a cheaper tier; a multi-step reasoning prompt holds near the top of the pool. For text this is a fast classifier over your messages; for generative media it reads prompt richness (length, detail cues like cinematic / photoreal / high-res).

Complexity readSignalsTends toward
simpleShort, factual, single-turn, no code or multi-step askEconomy tier
moderateSome length, light reasoning, a few turns of contextStandard tier
complexLong, code-heavy, multi-step, deep conversation, large system promptPremium tier (up to your baseline)

The router only ever routes down from the baseline. Candidates more expensive than the baseline are filtered out before scoring, so the worst case is that auto picks the baseline itself — never anything pricier.

Curated pools + quality floor

Auto only selects from a curated, tiered pool per modality — premium, standard, and economy — maintained by hand against public benchmark leaderboards. Every member carries a quality prior in [0, 1], refined by real eval scores where they exist. Candidates below the modality's quality floor are filtered out before routing, so the router trades down on price, never down to garbage.

Quality floor — auto keeps you inside the curated set. Your explicit model ids always reach the full catalog; only auto is scoped to vetted, eval-covered models.

The text pool, for example:

TierMembers (text)
premiumopenai/gpt-5.5-pro, anthropic/claude-opus-4.8, google/gemini-3.1-pro, xai/grok-4
standardanthropic/claude-sonnet-4.6, moonshot/kimi-k2.6, google/gemini-3-flash, xai/grok-4-fast
economyanthropic/claude-haiku-4.5, openai/gpt-oss-120b, google/gemma-4-26b-a4b-it, meta/llama-3.3-70b-instruct-fp8-fast

Each modality has its own pool and its own baseline. Pools are versioned and revised as models and benchmarks change; you do not pin a pool — you pin a baseline, and the router works the curated set beneath it.

Parameters: baseline_model + x-routing

baseline_model

Every request carries a baseline: the model you would otherwise have called. Set it explicitly in the body with baseline_model, or leave it unset and it defaults to the premium model for that modality. The baseline does double duty:

Per-modality default baselines: text anthropic/claude-opus-4.8, image recraft/recraftv4-pro, video google/veo-3.1, tts minimax/speech-2.8-hd, stt openai/gpt-4o-transcribe, music fal-ai/elevenlabs/music, embedding baai/bge-m3.

x-routing header

Bias the pick with the x-routing request header. It does not change the ceiling — every option stays at or under your baseline — it only changes how the router weighs cost against quality.

x-routingPicks…
costThe cheapest model that clears the quality floor.
speedThe fastest qualifying model in the pool.
qualityThe strongest model still at or under your baseline.
autoBalances cost and quality from the complexity read. Default when the header is absent.

Transparency response headers

Auto is not a black box. Every routed response returns the full picture: the model that ran, why, the complexity read, the quality score, your premium baseline, and the exact dollars saved. Log these and you can audit any single call or chart your savings over time.

HeaderMeaning
X-Auto-Routedtrue on any auto-routed call. Absent on explicit-model calls.
X-Routing-SelectedThe model id that actually ran.
X-Routing-ReasonShort trace, e.g. auto simple -> anthropic/claude-haiku-4.5 (vs anthropic/claude-opus-4.8).
X-Routing-ComplexityThe complexity read: simple, moderate, or complex.
X-Routing-QualityQuality score of the selected model, 0–1, three decimals.
X-Auto-Baseline-ModelThe premium baseline this call was measured and capped against.
X-Auto-Baseline-Cost-CentsWhat the baseline would have cost for this exact request, in cents.
X-Auto-Route-Fee-CentsThe platform fee on this call, in cents (the savings share).
X-Auto-Savings-CentsNet dollars you saved versus the baseline, after the fee, in cents.
Streaming — the routing-decision headers (X-Routing-Selected, X-Routing-Reason, X-Routing-Complexity, X-Auto-Baseline-Model, X-Auto-Routed) arrive up front, before the first token. Cost and savings headers settle once the stream closes.

Example response headers on a routed text call:

HTTP/1.1 200 OK
X-Auto-Routed: true
X-Routing-Selected: anthropic/claude-haiku-4.5
X-Routing-Reason: auto simple -> anthropic/claude-haiku-4.5 (vs anthropic/claude-opus-4.8)
X-Routing-Complexity: simple
X-Routing-Quality: 0.780
X-Auto-Baseline-Model: anthropic/claude-opus-4.8
X-Auto-Baseline-Cost-Cents: 0.9975
X-Auto-Route-Fee-Cents: 0.2394
X-Auto-Savings-Cents: 0.5586
X-Cost-Cents: 0.4389

Quickstart

Drop auto into the model field of any call you already make. Read the X-Routing-* headers off the response to see what ran.

curl

curl https://api.aigateway.sh/v1/chat/completions -i \
  -H "Authorization: Bearer sk-aig-..." \
  -H "Content-Type: application/json" \
  -H "x-routing: cost" \
  -d '{
    "model": "auto",
    "baseline_model": "anthropic/claude-opus-4.8",
    "messages": [{ "role": "user", "content": "What time zone is Lisbon in?" }]
  }'

Python

from openai import OpenAI

client = OpenAI(
    base_url="https://api.aigateway.sh/v1",
    api_key="sk-aig-...",
)

resp = client.chat.completions.with_raw_response.create(
    model="auto",
    messages=[{"role": "user", "content": "What time zone is Lisbon in?"}],
    extra_body={"baseline_model": "anthropic/claude-opus-4.8"},
    extra_headers={"x-routing": "cost"},
)

# the headers tell you what ran and what you saved
print(resp.headers["x-routing-selected"])      # anthropic/claude-haiku-4.5
print(resp.headers["x-auto-savings-cents"])    # 0.5586
completion = resp.parse()

TypeScript

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.aigateway.sh/v1",
  apiKey: "sk-aig-...",
});

const { data, response } = await client.chat.completions
  .create(
    {
      model: "auto",
      messages: [{ role: "user", content: "What time zone is Lisbon in?" }],
      // baseline_model is an AIgateway extension; pass it through
      baseline_model: "anthropic/claude-opus-4.8",
    } as any,
    { headers: { "x-routing": "cost" } },
  )
  .withResponse();

console.log(response.headers.get("x-routing-selected"));   // anthropic/claude-haiku-4.5
console.log(response.headers.get("x-auto-savings-cents")); // 0.5586

Pricing

The headline is simple: on every routed call you pay less than the premium baseline, guaranteed. Here is the exact mechanic.

On a routed call you pay the selected model's cost (at the standard platform per-call fee) plus a share of what the router saved you:

Concretely, the charge is routed_cost + 30% × max(0, baseline_cost − routed_cost), where both costs already include the 5% per-call fee. Because the share is a fraction of a positive saving, the result is always ≤ baseline_cost, with equality only when there is nothing to save.

Worked example (text)

A simple support lookup, ~400 input + 300 output tokens, with baseline_model: anthropic/claude-opus-4.8 ($5 in / $25 out per Mtok). The router reads low complexity and drops to anthropic/claude-haiku-4.5 ($1 in / $5 out per Mtok):

LinePer call
Baseline cost (Opus 4.8, incl. fee)$0.00997
Routed model cost (Haiku 4.5, incl. fee)$0.00199
Gross savings vs baseline$0.00798
Savings share (30% of gross)$0.00239
You pay$0.00439 (vs $0.00997 baseline)
You keep (70% of savings)$0.00559 — 56% cheaper

Across 1,000 such calls that is $4.39 instead of $9.98. The same mechanic applies to media modalities, priced on their billing unit (per image, per second, per 1k characters, per minute) rather than tokens.

BYOK — when you bring your own provider key, AIgateway does not pay the provider, so there is no per-call markup on the model cost. The savings (and the 30% share) are computed on provider list prices, which is your real saving on your own provider bill. See BYOK providers.

For the full pricing model across all modalities, see Pricing.

What to try next