A real 30-day rebuild: semantic cache, complexity-based routing, and per-feature hard caps took one team's GPT-only monthly spend from $4,800 to $1,420 without touching product code. Every lever is a header, a config, or a one-line POST.
The before: a mid-size team running a product chat on GPT-5.4 exclusively, with no cache, no routing, and no per-feature budget. Monthly bill: $4,800. Nobody on the team could say which feature was eating the most tokens.
Thirty days of three boring changes later, the bill was $1,420 — same product, same latency, same quality bar. No refactor. No migration. Three headers and one POST.
The levers transfer to almost every team. If you're running above $1,000/month on LLMs and haven't pulled them yet, there's roughly 40-70% on the table.
Before you cut anything, you need to see where the money goes. Add `x-aig-tag` to every call with the feature name; it's ten minutes of work. The dashboard then tells you which feature is 41% of the bill, and that's the one to rebuild first.
from openai import OpenAI

# The gateway is an OpenAI-compatible proxy, so the stock client works
# with a swapped base_url.
client = OpenAI(base_url="https://api.aigateway.sh/v1", api_key="sk-aig-...")

resp = client.chat.completions.create(
    model="openai/gpt-5.4",
    messages=[...],
    extra_headers={"x-aig-tag": "chat.summary"},  # every feature gets one
)
# => GET /v1/usage/by-tag?month=2026-04 now shows:
#    chat.summary     $1,968  (41%)
#    chat.reply       $1,340  (28%)
#    moderate.inline  $  862  (18%)
#    ...

For the biggest tag, turn on semantic caching with one header. The gateway embeds the request and serves a cached response when similarity is above threshold. Savings from this lever alone: $1,680/mo.
resp = client.chat.completions.create(
    model="openai/gpt-5.4",
    messages=[...],
    extra_headers={
        "x-aig-tag": "chat.summary",
        "x-aig-cache": "semantic",         # exact + semantic
        "x-aig-cache-ttl": "3600",         # seconds
        "x-aig-cache-similarity": "0.93",
    },
)
# Cache hits bill at 10% of the uncached cost.
# Typical semantic hit-rate for a chat summarizer: 30-45%.

Most prompts don't need Opus or GPT-5.4. The gateway's built-in classifier ranks each request's complexity and picks a model from a list — Kimi K2.6 on the easy 70%, GPT-5.4 only when the prompt truly needs it. Savings: $1,240/mo.
resp = client.chat.completions.create(
    model="openai/gpt-5.4",  # upper bound
    messages=[...],
    extra_headers={
        "x-aig-tag": "chat.summary",
        "x-aig-route": "auto",
        "x-aig-route-tier": "kimi-k2.6,gpt-5-mini,gpt-5.4",
    },
)
# Response headers tell you which tier served:
# x-aig-served-by: moonshot/kimi-k2.6
# x-aig-complexity: 0.24  (scale 0-1; threshold auto-set per tag)

One user-facing feature in this study was set to upgrade to Opus whenever the input was long. A single abusive user could have produced a $1,000 day. A hard cap on the tag bounds monthly spend at the source: over-budget calls return a clean 402 before dispatch. Savings (as a prevention lever): $460/mo.
curl -X POST https://api.aigateway.sh/v1/budgets \
  -H "Authorization: Bearer sk-aig-..." \
  -d '{"tag": "chat.summary", "monthly_cap_cents": 150000}'
# The cap is enforced at the edge. A call over budget:
# HTTP/1.1 402 Payment Required
# x-aig-cap-tag: chat.summary
# x-aig-cap-remaining-cents: 0

Caching cuts repeated work: the easiest dollars, because a hit bills at a tenth of full price. Routing cuts wasted capacity: the biggest dollars, because they compound across every call. Caps cut tail-risk: the most important dollars, because one bad day can erase a quarter's savings.
Teams often reach for fine-tuning or prompt surgery first. Both can help, but they require engineering time and can hurt quality. The three levers above are header changes; the only side-effect is a smaller bill.
The `/v1/usage/by-tag` endpoint gives you per-feature spend for any window. Pair it with a simple daily-cost alert and you'll know within a day if one feature is drifting — not a month later when the invoice lands.
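The drift check itself is a few lines once the by-tag response is reduced to a tag-to-cents mapping (the exact JSON shape isn't documented above, so that mapping is an assumption):

```python
def drifting_tags(today: dict[str, int], trailing_avg: dict[str, int],
                  ratio: float = 1.5) -> list[str]:
    """Return tags whose spend today exceeds `ratio` x their trailing daily
    average. Both dicts map tag -> spend in cents; feed them from two
    /v1/usage/by-tag windows."""
    return sorted(t for t, cents in today.items()
                  if cents > ratio * trailing_avg.get(t, 0))

# A tag jumping past 1.5x its trailing average gets flagged:
drifting_tags({"chat.summary": 9800, "chat.reply": 4400},
              {"chat.summary": 6500, "chat.reply": 4500})
# => ["chat.summary"]
```

Wire the result to whatever pages your team; the point is that the alert fires the same day, off the same endpoint the dashboard uses.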
For finance-grade output, the same endpoint returns CSV when you pass `format=csv`. Hand it to your CFO without touching a spreadsheet.
curl "https://api.aigateway.sh/v1/usage/by-tag?window=30d&format=csv" \
  -H "Authorization: Bearer sk-aig-..." > spend.csv
Did quality drop after the routing change?

No observable regression on their internal evals. The complexity classifier gates on prompt shape, not topic — easy prompts stay easy regardless of user. For the 70% that routed to Kimi, output was rated within-noise of the GPT-5.4 baseline by human reviewers.
What cache hit rate should I expect?

Depends on workload. Chat with highly repetitive patterns (summaries, classifications, quick moderations) hits 40-60%. Open-ended creative writing hits 10-15%. The gateway shows you hit-rate per tag so you can see what caching buys before enabling it broadly.
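Those hit rates convert directly into a cost multiplier, since hits bill at 10% of the uncached price (the figure from the cache section above). A quick back-of-envelope helper:

```python
def cached_cost_multiplier(hit_rate: float, hit_price_frac: float = 0.10) -> float:
    """Fraction of the uncached bill you still pay at a given hit rate,
    with cache hits billed at `hit_price_frac` of full price."""
    return (1 - hit_rate) + hit_rate * hit_price_frac

cached_cost_multiplier(0.40)  # ~0.64: a 40% hit rate cuts that tag's bill ~36%
```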
How does the complexity classifier decide?

A 300M-parameter edge model (Granite Micro) scores each prompt on a 0-1 scale based on reasoning depth, output length expectation, and input structure. The threshold per tag auto-tunes from the first 100 requests; you can override it with `x-aig-complexity-threshold`.
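The auto-tuning rule isn't documented beyond "first 100 requests", but a plausible sketch is a percentile cut: pick the score below which the easy share of early traffic falls (70% in this case study), then route against it. Two-tier routing shown for brevity; the real tier list above has three models.

```python
def auto_threshold(scores: list[float], easy_share: float = 0.70) -> float:
    """Percentile-style cutoff over the first N complexity scores (0-1)."""
    ranked = sorted(scores)
    return ranked[min(int(len(ranked) * easy_share), len(ranked) - 1)]

def pick_model(score: float, threshold: float) -> str:
    """Cheap tier at or below the cutoff, frontier above it."""
    return "moonshot/kimi-k2.6" if score <= threshold else "openai/gpt-5.4"

warmup = [i / 100 for i in range(100)]   # first 100 scored requests
t = auto_threshold(warmup)               # -> 0.7 with this warmup
pick_model(0.24, t)                      # the 0.24 prompt from above goes cheap
```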
Will the cap cut off a response mid-stream?

No — the cap check happens before dispatch. A streaming call that's pre-approved runs to completion; the post-stream usage is debited against the cap. Only the next request over budget returns 402.
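That admit-then-settle behavior can be pictured as follows — a toy model of the edge check, not the gateway's actual code:

```python
class TagBudget:
    """Toy model of the edge-side cap: check before dispatch, debit after."""

    def __init__(self, cap_cents: int):
        self.cap_cents = cap_cents
        self.spent_cents = 0

    def admit(self) -> bool:
        # Pre-dispatch check only; an in-flight stream is never cut off.
        return self.spent_cents < self.cap_cents

    def settle(self, cost_cents: int) -> None:
        # Post-stream usage is debited, even if it overshoots the cap.
        self.spent_cents += cost_cents

b = TagBudget(cap_cents=150_000)
b.settle(149_999)
b.admit()        # True: this stream runs to completion...
b.settle(500)    # ...and may overshoot on settlement
b.admit()        # False: the *next* request gets the 402
```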
What happens when a tag hits its cap on legitimate traffic?

You can lift it with a one-line PATCH to the same endpoint. The gateway returns 402 until you do. For production, bind the 402 to a pager; the tag-level cap is the safest early-warning signal you have.
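The PATCH itself, sketched with the standard library; the method and body shape are assumed to mirror the POST that created the cap:

```python
import json
import urllib.request

def lift_cap(tag: str, new_cap_cents: int, api_key: str) -> urllib.request.Request:
    """Build the PATCH that raises a tag's monthly cap; send with urlopen()."""
    body = json.dumps({"tag": tag, "monthly_cap_cents": new_cap_cents}).encode()
    return urllib.request.Request(
        "https://api.aigateway.sh/v1/budgets",
        data=body,
        method="PATCH",
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )

req = lift_cap("chat.summary", 250_000, "sk-aig-...")
# urllib.request.urlopen(req)  # dispatch once you've confirmed the new cap
```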
Does this work with the stock OpenAI SDK?

Yes — every lever is a header. The OpenAI client's `extra_headers` parameter passes them through unchanged. No SDK fork, no middleware layer.
Do cache hits work across API keys?

Yes, the cache is independent of which key hit the provider. Enterprise tier also ships replay — you can ask the gateway to re-serve any historical request, cached or not, byte-for-byte.
How much should a team expect to save?

Workloads vary. A rough heuristic: caching saves 20-40% for chat-heavy workloads, 5-10% for generation-heavy workloads. Routing saves 30-50% when you're starting from a frontier-only default and much less when you're already on mid-tier models. Tags + caps prevent the single bad day that usually costs more than either lever saves.
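Those heuristics compound rather than add, since a cached call never reaches the router. A rough range calculator under stated assumptions — the mid-tier routing figure (5-15%) is my reading of "much less", not a number from the study:

```python
def savings_range(monthly_spend: float, chat_heavy: bool,
                  frontier_only: bool) -> tuple[int, int]:
    """Low/high monthly savings estimate from the heuristics above."""
    cache = (0.20, 0.40) if chat_heavy else (0.05, 0.10)
    route = (0.30, 0.50) if frontier_only else (0.05, 0.15)
    # Multiplicative: routing only applies to the uncached remainder.
    lo = monthly_spend * (1 - (1 - cache[0]) * (1 - route[0]))
    hi = monthly_spend * (1 - (1 - cache[1]) * (1 - route[1]))
    return round(lo), round(hi)

savings_range(4800, chat_heavy=True, frontier_only=True)
# => (2112, 3360) — the case study's $3,380 lands near the top of the range
```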