Case study · 30 days · -70%

Cut your LLM bill 70% — a $4,800 → $1,420 case study

A real 30-day rebuild: semantic cache, complexity-based routing, and per-feature hard caps took one team's GPT-only monthly spend from $4,800 to $1,420 without touching product code. Every lever is a header, a config, or a one-line POST.

9 min read · published 2026-04-25 · category: Cost + ops
Figure: before-and-after monthly LLM bill — $4,800 reduced to $1,420 with three levers.

This is a real before-and-after. A mid-size team was running a product chat on GPT-5.4 exclusively, no cache, no routing, no per-feature budget. Monthly bill: $4,800. Nobody on the team could tell you which feature was eating the most tokens.

Thirty days of three boring changes later, the bill was $1,420 — same product, same latency, same quality bar. No refactor. No migration. Three headers and one POST.

The levers transfer to almost every team. If you're running above $1,000/month on LLMs and haven't pulled them yet, there's roughly 40-70% on the table.

AIgateway key · x-aig-cache header · x-aig-route auto · x-aig-tag + /v1/budgets
Note
The single most important step is the one everyone skips: tag every request. Without tags, you can't see which feature is expensive, so you can't decide what to fix. Start there — the other two levers depend on it.

Three levers, in order

  1. Lever 00 — tag everything first

    Before you cut anything, you need to see. Add `x-aig-tag` to every call with the feature name. Ten minutes of work. The dashboard tells you which feature is 41% of the bill — that's the one to rebuild first.

    resp = client.chat.completions.create(
        model="openai/gpt-5.4",
        messages=[...],
        extra_headers={"x-aig-tag": "chat.summary"},   # every feature gets one
    )
    
    # => GET /v1/usage/by-tag?month=2026-04 now shows:
    #    chat.summary    $1,968   (41%)
    #    chat.reply      $1,340   (28%)
    #    moderate.inline $   862  (18%)
    #    ...
  2. Lever 01 — semantic cache the hot tag

    For the biggest tag, turn on semantic caching with one header. The gateway embeds the request and serves a cached response when similarity is above threshold. Savings from this lever alone: $1,680/mo.

    resp = client.chat.completions.create(
        model="openai/gpt-5.4",
        messages=[...],
        extra_headers={
            "x-aig-tag": "chat.summary",
            "x-aig-cache": "semantic",      # exact + semantic
            "x-aig-cache-ttl": "3600",      # seconds
            "x-aig-cache-similarity": "0.93",
        },
    )
    
    # Cache hits bill at 10% of the uncached cost.
    # Typical semantic hit-rate for a chat summarizer: 30-45%.
  3. Lever 02 — route by complexity

    Most prompts don't need Opus or GPT-5.4. The gateway's built-in classifier ranks each request's complexity and picks a model from a list — Kimi K2.6 on the easy 70%, GPT-5.4 only when the prompt truly needs it. Savings: $1,240/mo.

    resp = client.chat.completions.create(
        model="openai/gpt-5.4",   # upper bound
        messages=[...],
        extra_headers={
            "x-aig-tag": "chat.summary",
            "x-aig-route": "auto",
            "x-aig-route-tier": "kimi-k2.6,gpt-5-mini,gpt-5.4",
        },
    )
    
    # Response headers tell you which tier served:
    # x-aig-served-by: moonshot/kimi-k2.6
    # x-aig-complexity: 0.24    (scale 0-1; threshold auto-set per tag)
  4. Lever 03 — hard cap the runaway feature

    One user-facing feature in this study was set to upgrade to Opus whenever the input was long. A single abusive user could have produced a $1,000 day. A hard cap on the tag caps the monthly spend at source — over-budget calls return a clean 402 before dispatch. Savings (as a prevention lever): $460/mo.

    curl -X POST https://api.aigateway.sh/v1/budgets \
      -H "Authorization: Bearer sk-aig-..." \
      -d '{ "tag": "chat.summary", "monthly_cap_cents": 150000 }'
    
    # The cap is enforced at the edge. A call over budget:
    # HTTP/1.1 402 Payment Required
    # x-aig-cap-tag: chat.summary
    # x-aig-cap-remaining-cents: 0

Why these three and not others

Caching cuts repeated work — the easiest dollars because they're free. Routing cuts wasted capacity — the biggest dollars because they compound across every call. Caps cut tail-risk — the most important dollars because one bad day can erase a quarter's savings.

Teams often reach for fine-tuning or prompt surgery first. Both can help, but they require engineering time and can hurt quality. The three levers above are header changes; the only side-effect is a smaller bill.
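
To make that concrete, all three request-side levers fit in one headers dict. A minimal sketch, assuming the same header names shown in the steps above; the `lever_headers` helper is hypothetical, not part of any SDK:

```python
def lever_headers(tag, ttl_seconds=3600,
                  tiers="kimi-k2.6,gpt-5-mini,gpt-5.4"):
    """Build the gateway headers for the three request-side levers.
    Lever 03 (the budget cap) is configured server-side per tag."""
    return {
        "x-aig-tag": tag,              # lever 00: per-feature attribution
        "x-aig-cache": "semantic",     # lever 01: exact + semantic cache
        "x-aig-cache-ttl": str(ttl_seconds),
        "x-aig-route": "auto",         # lever 02: complexity-based routing
        "x-aig-route-tier": tiers,
    }

# Pass the dict through the OpenAI client unchanged:
# resp = client.chat.completions.create(
#     model="openai/gpt-5.4", messages=[...],
#     extra_headers=lever_headers("chat.summary"),
# )
```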

Read the bill at any moment

The `/v1/usage/by-tag` endpoint gives you per-feature spend for any window. Pair it with a simple daily-cost alert and you'll know within a day if one feature is drifting — not a month later when the invoice lands.

For finance-grade output, the same endpoint returns CSV when you pass `format=csv`. Hand it to your CFO without touching a spreadsheet.

curl "https://api.aigateway.sh/v1/usage/by-tag?window=30d&format=csv" \
  -H "Authorization: Bearer sk-aig-..." > spend.csv
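
The daily-cost alert mentioned above is a few lines against the same endpoint. A sketch; the response field names (`tag`, `spend_cents`) are assumptions about the JSON shape, so adjust to what the endpoint actually returns:

```python
import json
import urllib.request

GATEWAY = "https://api.aigateway.sh"

def spend_by_tag(api_key, window="1d"):
    """Fetch per-tag spend from /v1/usage/by-tag for the window."""
    req = urllib.request.Request(
        f"{GATEWAY}/v1/usage/by-tag?window={window}",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req) as r:
        return json.load(r)

def drifting_tags(rows, daily_limit_cents=20000):
    """Tags whose daily spend crossed the limit ($200 here; pick your own)."""
    return [r["tag"] for r in rows if r["spend_cents"] > daily_limit_cents]
```

Run it from cron once a day and page on any non-empty result.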

FAQ

Did quality drop?

No observable regression on their internal evals after the routing change. The complexity classifier gates on prompt shape, not topic — easy prompts stay easy regardless of user. For the 70% of requests that routed to Kimi, output was rated within-noise of the GPT-5.4 baseline by human reviewers.

What's the cache hit-rate ceiling?

Depends on workload. Chat with highly repetitive patterns (summaries, classifications, quick moderations) hits 40-60%. Open-ended creative writing hits 10-15%. The gateway shows you hit-rate per tag so you can see what caching buys before enabling it broadly.

How does the classifier decide complexity?

A 300M-parameter edge model (Granite Micro) scores each prompt on a 0-1 scale based on reasoning depth, output length expectation, and input structure. The threshold per tag auto-tunes from the first 100 requests; you can override it with `x-aig-complexity-threshold`.

Does the cap break streaming?

No — the cap check happens before dispatch. A streaming call that's pre-approved runs to completion; the post-stream usage is debited against the cap. Only the next request over budget returns 402.
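
Client-side, this makes retry policy simple: a 402 is not transient, so alert instead of retrying. A minimal sketch; the function and its return labels are illustrative, not a gateway API:

```python
def cap_action(status_code, retry_count=0):
    """Decide how to handle a gateway response status.
    A 402 means the tag's monthly cap is exhausted; retrying won't help,
    so surface it (e.g. page on-call) until the cap is raised."""
    if status_code == 402:
        return "alert"       # budget gone: fail fast, don't retry
    if status_code >= 500 and retry_count < 3:
        return "retry"       # transient provider/gateway error
    return "ok" if status_code < 400 else "fail"
```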

What if I hit the cap mid-day?

You can lift it with a one-line PATCH to the same endpoint. The gateway returns 402 until you do. For production, bind the 402 to a pager; the tag-level cap is the safest early-warning signal you have.
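
A sketch of that lift, assuming `/v1/budgets` accepts a PATCH with the same JSON shape as the POST shown earlier (the exact body is an assumption):

```shell
# Raise the chat.summary cap from $1,500/mo to $2,500/mo.
curl -X PATCH https://api.aigateway.sh/v1/budgets \
  -H "Authorization: Bearer sk-aig-..." \
  -d '{ "tag": "chat.summary", "monthly_cap_cents": 250000 }'
```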

Do these levers work with the OpenAI SDK?

Yes — every lever is a header. The OpenAI client's `extra_headers` parameter passes them through unchanged. No SDK fork, no middleware layer.

Can I layer caching on top of BYO-key / proxy mode?

Yes, the cache is independent of which key hit the provider. Enterprise tier also ships replay — you can ask the gateway to re-serve any historical request, cached or not, byte-for-byte.

Will the numbers be similar for my team?

Workloads vary. A rough heuristic: caching saves 20-40% for chat-heavy workloads, 5-10% for generation-heavy workloads. Routing saves 30-50% when you're starting from a frontier-only default and much less when you're already on mid-tier models. Tags + caps prevent the single bad day that usually costs more than either lever saves.
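
As arithmetic, the heuristic compounds: routing acts on whatever spend caching leaves behind. A toy calculator under that assumption (the function is illustrative, not part of the gateway):

```python
def estimated_bill_cents(monthly_cents, cache_save_pct, route_save_pct):
    """Rough post-lever bill: apply the caching saving first,
    then the routing saving on what remains. Integer cents throughout."""
    after_cache = monthly_cents * (100 - cache_save_pct) // 100
    return after_cache * (100 - route_save_pct) // 100

# Chat-heavy workload on a frontier-only default:
# ~30% from caching, ~40% from routing the remainder.
print(estimated_bill_cents(100_000, 30, 40))   # a $1,000 bill drops to $420
```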

READY TO BUILD?
Get an AIgateway key in 30 seconds. Free Kimi K2.6 through Apr 30, 2026; everything else is pass-through.
Get your key → · API reference · Kimi K2.6 details
