A real 30-day rebuild: semantic cache, complexity-based routing, and per-feature hard caps took one team's GPT-only monthly spend from $4,800 to $1,420 without touching product code. Every lever is a header, a config, or a one-line POST.
The before: a mid-size team running a product chat on GPT-5.4 exclusively, with no cache, no routing, and no per-feature budget. Monthly bill: $4,800. Nobody on the team could say which feature was eating the most tokens.
Thirty days of three boring changes later, the bill was $1,420 — same product, same latency, same quality bar. No refactor. No migration. Three headers and one POST.
The levers transfer to almost every team. If you're running above $1,000/month on LLMs and haven't pulled them yet, there's roughly 40-70% on the table.
Before you cut anything, you need to see where the money goes. Add `x-aig-tag` to every call with the feature name; it's ten minutes of work. The dashboard then tells you which feature is 41% of the bill, and that's the one to rebuild first.
from openai import OpenAI

# The gateway is an OpenAI-compatible proxy, so the stock client works
# with a swapped base_url.
client = OpenAI(base_url="https://api.aigateway.sh/v1", api_key="sk-aig-...")

resp = client.chat.completions.create(
    model="openai/gpt-5.4",
    messages=[...],
    extra_headers={"x-aig-tag": "chat.summary"},  # every feature gets one
)
# => GET /v1/usage/by-tag?month=2026-04 now shows:
#    chat.summary     $1,968  (41%)
#    chat.reply       $1,340  (28%)
#    moderate.inline  $  862  (18%)
#    ...

For the biggest tag, turn on semantic caching with one header. The gateway embeds the request and serves a cached response when similarity is above threshold. Savings from this lever alone: $1,680/mo.
resp = client.chat.completions.create(
    model="openai/gpt-5.4",
    messages=[...],
    extra_headers={
        "x-aig-tag": "chat.summary",
        "x-aig-cache": "semantic",         # exact + semantic
        "x-aig-cache-ttl": "3600",         # seconds
        "x-aig-cache-similarity": "0.93",
    },
)
# Cache hits bill at 10% of the uncached cost.
# Typical semantic hit-rate for a chat summarizer: 30-45%.

Most prompts don't need Opus or GPT-5.4. The gateway's built-in classifier ranks each request's complexity and picks a model from a list — Kimi K2.6 on the easy 70%, GPT-5.4 only when the prompt truly needs it. Savings: $1,240/mo.
resp = client.chat.completions.create(
    model="openai/gpt-5.4",  # upper bound
    messages=[...],
    extra_headers={
        "x-aig-tag": "chat.summary",
        "x-aig-route": "auto",
        "x-aig-route-tier": "kimi-k2.6,gpt-5-mini,gpt-5.4",
    },
)
# Response headers tell you which tier served:
# x-aig-served-by: moonshot/kimi-k2.6
# x-aig-complexity: 0.24  (scale 0-1; threshold auto-set per tag)

One user-facing feature in this study was set to upgrade to Opus whenever the input was long. A single abusive user could have produced a $1,000 day. A hard cap on the tag bounds monthly spend at the source: over-budget calls return a clean 402 before dispatch. Savings (as a prevention lever): $460/mo.
curl -X POST https://api.aigateway.sh/v1/budgets \
  -H "Authorization: Bearer sk-aig-..." \
  -d '{"tag": "chat.summary", "monthly_cap_cents": 150000}'
# The cap is enforced at the edge. A call over budget:
# HTTP/1.1 402 Payment Required
# x-aig-cap-tag: chat.summary
# x-aig-cap-remaining-cents: 0

Caching cuts repeated work: the easiest dollars, because a hit bills at a tenth of full price. Routing cuts wasted capacity: the biggest dollars, because they compound across every call. Caps cut tail-risk: the most important dollars, because one bad day can erase a quarter's savings.
Teams often reach for fine-tuning or prompt surgery first. Both can help, but they require engineering time and can hurt quality. The three levers above are header changes; the only side-effect is a smaller bill.
The `/v1/usage/by-tag` endpoint gives you per-feature spend for any window. Pair it with a simple daily-cost alert and you'll know within a day if one feature is drifting — not a month later when the invoice lands.
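The drift check itself is a few lines once the by-tag response is reduced to a tag-to-cents mapping (the exact JSON shape isn't documented above, so that mapping is an assumption):

```python
def drifting_tags(today: dict[str, int], trailing_avg: dict[str, int],
                  ratio: float = 1.5) -> list[str]:
    """Return tags whose spend today exceeds `ratio` x their trailing daily
    average. Both dicts map tag -> spend in cents; feed them from two
    /v1/usage/by-tag windows."""
    return sorted(t for t, cents in today.items()
                  if cents > ratio * trailing_avg.get(t, 0))

# A tag jumping past 1.5x its trailing average gets flagged:
drifting_tags({"chat.summary": 9800, "chat.reply": 4400},
              {"chat.summary": 6500, "chat.reply": 4500})
# => ["chat.summary"]
```

Wire the result to whatever pages your team; the point is that the alert fires the same day, off the same endpoint the dashboard uses.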
For finance-grade output, the same endpoint returns CSV when you pass `format=csv`. Hand it to your CFO without touching a spreadsheet.
curl "https://api.aigateway.sh/v1/usage/by-tag?window=30d&format=csv" \
  -H "Authorization: Bearer sk-aig-..." > spend.csv
Did quality drop after the routing change?

No observable regression on their internal evals. The complexity classifier gates on prompt shape, not topic — easy prompts stay easy regardless of user. For the 70% that routed to Kimi, output was rated within-noise of the GPT-5.4 baseline by human reviewers.
What cache hit rate should I expect?

Depends on workload. Chat with highly repetitive patterns (summaries, classifications, quick moderations) hits 40-60%. Open-ended creative writing hits 10-15%. The gateway shows you hit-rate per tag so you can see what caching buys before enabling it broadly.
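Those hit rates convert directly into a cost multiplier, since hits bill at 10% of the uncached price (the figure from the cache section above). A quick back-of-envelope helper:

```python
def cached_cost_multiplier(hit_rate: float, hit_price_frac: float = 0.10) -> float:
    """Fraction of the uncached bill you still pay at a given hit rate,
    with cache hits billed at `hit_price_frac` of full price."""
    return (1 - hit_rate) + hit_rate * hit_price_frac

cached_cost_multiplier(0.40)  # ~0.64: a 40% hit rate cuts that tag's bill ~36%
```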
How does the complexity classifier decide?

A 300M-parameter edge model (Granite Micro) scores each prompt on a 0-1 scale based on reasoning depth, output length expectation, and input structure. The threshold per tag auto-tunes from the first 100 requests; you can override it with `x-aig-complexity-threshold`.
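The auto-tuning rule isn't documented beyond "first 100 requests", but a plausible sketch is a percentile cut: pick the score below which the easy share of early traffic falls (70% in this case study), then route against it. Two-tier routing shown for brevity; the real tier list above has three models.

```python
def auto_threshold(scores: list[float], easy_share: float = 0.70) -> float:
    """Percentile-style cutoff over the first N complexity scores (0-1)."""
    ranked = sorted(scores)
    return ranked[min(int(len(ranked) * easy_share), len(ranked) - 1)]

def pick_model(score: float, threshold: float) -> str:
    """Cheap tier at or below the cutoff, frontier above it."""
    return "moonshot/kimi-k2.6" if score <= threshold else "openai/gpt-5.4"

warmup = [i / 100 for i in range(100)]   # first 100 scored requests
t = auto_threshold(warmup)               # -> 0.7 with this warmup
pick_model(0.24, t)                      # the 0.24 prompt from above goes cheap
```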
Will the cap cut off a response mid-stream?

No — the cap check happens before dispatch. A streaming call that's pre-approved runs to completion; the post-stream usage is debited against the cap. Only the next request over budget returns 402.
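That admit-then-settle behavior can be pictured as follows — a toy model of the edge check, not the gateway's actual code:

```python
class TagBudget:
    """Toy model of the edge-side cap: check before dispatch, debit after."""

    def __init__(self, cap_cents: int):
        self.cap_cents = cap_cents
        self.spent_cents = 0

    def admit(self) -> bool:
        # Pre-dispatch check only; an in-flight stream is never cut off.
        return self.spent_cents < self.cap_cents

    def settle(self, cost_cents: int) -> None:
        # Post-stream usage is debited, even if it overshoots the cap.
        self.spent_cents += cost_cents

b = TagBudget(cap_cents=150_000)
b.settle(149_999)
b.admit()        # True: this stream runs to completion...
b.settle(500)    # ...and may overshoot on settlement
b.admit()        # False: the *next* request gets the 402
```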
What happens when a tag hits its cap on legitimate traffic?

You can lift it with a one-line PATCH to the same endpoint. The gateway returns 402 until you do. For production, bind the 402 to a pager; the tag-level cap is the safest early-warning signal you have.
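The PATCH itself, sketched with the standard library; the method and body shape are assumed to mirror the POST that created the cap:

```python
import json
import urllib.request

def lift_cap(tag: str, new_cap_cents: int, api_key: str) -> urllib.request.Request:
    """Build the PATCH that raises a tag's monthly cap; send with urlopen()."""
    body = json.dumps({"tag": tag, "monthly_cap_cents": new_cap_cents}).encode()
    return urllib.request.Request(
        "https://api.aigateway.sh/v1/budgets",
        data=body,
        method="PATCH",
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )

req = lift_cap("chat.summary", 250_000, "sk-aig-...")
# urllib.request.urlopen(req)  # dispatch once you've confirmed the new cap
```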
Does this work with the stock OpenAI SDK?

Yes — every lever is a header. The OpenAI client's `extra_headers` parameter passes them through unchanged. No SDK fork, no middleware layer.
Do cache hits work across API keys?

Yes, the cache is independent of which key hit the provider. Enterprise tier also ships replay — you can ask the gateway to re-serve any historical request, cached or not, byte-for-byte.
How much should a team expect to save?

Workloads vary. A rough heuristic: caching saves 20-40% for chat-heavy workloads, 5-10% for generation-heavy workloads. Routing saves 30-50% when you're starting from a frontier-only default and much less when you're already on mid-tier models. Tags + caps prevent the single bad day that usually costs more than either lever saves.
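Those heuristics compound rather than add, since a cached call never reaches the router. A rough range calculator under stated assumptions — the mid-tier routing figure (5-15%) is my reading of "much less", not a number from the study:

```python
def savings_range(monthly_spend: float, chat_heavy: bool,
                  frontier_only: bool) -> tuple[int, int]:
    """Low/high monthly savings estimate from the heuristics above."""
    cache = (0.20, 0.40) if chat_heavy else (0.05, 0.10)
    route = (0.30, 0.50) if frontier_only else (0.05, 0.15)
    # Multiplicative: routing only applies to the uncached remainder.
    lo = monthly_spend * (1 - (1 - cache[0]) * (1 - route[0]))
    hi = monthly_spend * (1 - (1 - cache[1]) * (1 - route[1]))
    return round(lo), round(hi)

savings_range(4800, chat_heavy=True, frontier_only=True)
# => (2112, 3360) — the case study's $3,380 lands near the top of the range
```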