Case study · 30 days · -70%

Cut your LLM bill 70% — a $4,800 → $1,420 case study

A real 30-day rebuild: semantic cache, complexity-based routing, and per-feature hard caps took one team's GPT-only monthly spend from $4,800 to $1,420 without touching product code. Every lever is a header, a config, or a one-line POST.

9 min read · published 2026-04-25 · category: Cost + ops
Figure: before-and-after monthly LLM bill, $4,800 reduced to $1,420 with three levers

This is a real before-and-after. A mid-size team was running a product chat on GPT-5.4 exclusively, no cache, no routing, no per-feature budget. Monthly bill: $4,800. Nobody on the team could tell you which feature was eating the most tokens.

Thirty days and three boring changes later, the bill was $1,420: same product, same latency, same quality bar. No refactor. No migration. Three headers and one POST.

The levers transfer to almost every team. If you're running above $1,000/month on LLMs and haven't pulled them yet, there's roughly 30-55% on the table.

AIgateway key · x-aig-cache header · x-aig-route auto · x-aig-tag + /v1/budgets
Note
The single most important step is the one everyone skips: tag every request. Without tags, you can't see which feature is expensive, so you can't decide what to fix. Start there; the cache, routing, and cap levers all depend on it.

Three levers, in order

  1. STEP 01

    Lever 00 — tag everything first

    Before you cut anything, you need to see. Add `x-aig-tag` to every call with the feature name. Ten minutes of work. The dashboard tells you which feature is 41% of the bill — that's the one to rebuild first.

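    from openai import OpenAI

    # Assumption: the gateway is OpenAI-compatible at this base URL (inferred
    # from the /v1/budgets endpoint used later); confirm it in your dashboard.
    client = OpenAI(base_url="https://api.aigateway.sh/v1", api_key="sk-aig-...")

    resp = client.chat.completions.create(
        model="openai/gpt-5.4",
        messages=[...],
        extra_headers={"x-aig-tag": "chat.summary"},   # every feature gets one
    )

    # => GET /v1/usage/by-tag?month=2026-04 now shows:
    #    chat.summary    $1,968   (41%)
    #    chat.reply      $1,340   (28%)
    #    moderate.inline $  862   (18%)
    #    ...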
  2. STEP 02

    Lever 01 — semantic cache the hot tag

    For the biggest tag, turn on semantic caching with one header. The gateway embeds the request and serves a cached response when its similarity to an earlier request is above the threshold. With AIgateway's flat 50% discount on cached requests, savings from this lever alone: $935/mo.

    resp = client.chat.completions.create(
        model="openai/gpt-5.4",
        messages=[...],
        extra_headers={
            "x-aig-tag": "chat.summary",
            "x-aig-cache": "semantic",      # exact + semantic
            "x-aig-cache-ttl": "3600",      # seconds
            "x-aig-cache-similarity": "0.93",
        },
    )
    
    # Cached requests get a 50% discount on the uncached cost.
    # Typical semantic hit-rate for a chat summarizer: 30-45%.
  3. STEP 03

    Lever 02 — route by complexity

    Most prompts don't need Opus or GPT-5.4. The gateway's built-in classifier ranks each request's complexity and picks a model from a list — Kimi K2.6 on the easy 70%, GPT-5.4 only when the prompt truly needs it. Savings: $1,240/mo.

    resp = client.chat.completions.create(
        model="openai/gpt-5.4",   # upper bound
        messages=[...],
        extra_headers={
            "x-aig-tag": "chat.summary",
            "x-aig-route": "auto",
            "x-aig-route-tier": "kimi-k2.6,gpt-5-mini,gpt-5.4",
        },
    )
    
    # Response headers tell you which tier served:
    # x-aig-served-by: moonshot/kimi-k2.6
    # x-aig-complexity: 0.24    (scale 0-1; threshold auto-set per tag)
  4. STEP 04

    Lever 03 — hard cap the runaway feature

    One user-facing feature in this study was set to upgrade to Opus whenever the input was long. A single abusive user could have produced a $1,000 day. A hard cap on the tag limits monthly spend at the source: over-budget calls return a clean 402 before dispatch. Savings (as a prevention lever): $460/mo.

    curl -X POST https://api.aigateway.sh/v1/budgets \
      -H "Authorization: Bearer sk-aig-..." \
      -H "Content-Type: application/json" \
      -d '{ "tag": "chat.summary", "monthly_cap_cents": 150000 }'
    
    # The cap is enforced at the edge. A call over budget:
    # HTTP/1.1 402 Payment Required
    # x-aig-cap-tag: chat.summary
    # x-aig-cap-remaining-cents: 0
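
The request headers from the steps above can be collected into one helper so that every call site tags, caches, and routes the same way. This is a minimal sketch, not part of the gateway's SDK: `lever_headers` and `read_routing` are hypothetical names, and the header names and values simply mirror the examples above.

```python
from typing import Mapping, Optional, Sequence, Tuple


def lever_headers(
    tag: str,
    cache_ttl_s: Optional[int] = None,
    similarity: Optional[float] = None,
    route_tiers: Optional[Sequence[str]] = None,
) -> dict:
    """Build the x-aig-* request headers shown in the steps above."""
    if not tag:
        raise ValueError("every request needs a feature tag")
    headers = {"x-aig-tag": tag}
    if cache_ttl_s is not None:
        headers["x-aig-cache"] = "semantic"
        headers["x-aig-cache-ttl"] = str(cache_ttl_s)
        if similarity is not None:
            headers["x-aig-cache-similarity"] = str(similarity)
    if route_tiers:
        headers["x-aig-route"] = "auto"
        headers["x-aig-route-tier"] = ",".join(route_tiers)
    return headers


def read_routing(response_headers: Mapping[str, str]) -> Tuple[Optional[str], Optional[float]]:
    """Return (served_by, complexity) from the response headers, if present."""
    served_by = response_headers.get("x-aig-served-by")
    raw = response_headers.get("x-aig-complexity")
    return served_by, (float(raw) if raw is not None else None)


headers = lever_headers(
    "chat.summary",
    cache_ttl_s=3600,
    similarity=0.93,
    route_tiers=["kimi-k2.6", "gpt-5-mini", "gpt-5.4"],
)
```

Passing `headers` as `extra_headers=` on the `create` call reproduces the Lever 01 and Lever 02 examples in one request. Keeping the tag mandatory in the helper is a design choice: it makes an untagged call fail loudly instead of disappearing from the per-tag report.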

Why these three and not others

Caching cuts repeated work — the easiest dollars because they're free. Routing cuts wasted capacity — the biggest dollars because they compound across every call. Caps cut tail-risk — the most important dollars because one bad day can erase a quarter's savings.

Teams often reach for fine-tuning or prompt surgery first. Both can help, but they require engineering time and can hurt quality. The three levers above are header changes; the only side-effect is a smaller bill.

Read the bill at any moment

The `/v1/usage/by-tag` endpoint gives you per-feature spend for any window. Pair it with a simple daily-cost alert and you'll know within a day if one feature is drifting — not a month later when the invoice lands.
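
A daily-cost alert can be as small as comparing each tag's latest day against its own recent average. The sketch below assumes you have already collected per-tag daily costs into a dict (the endpoint's response shape is not shown here, so the input format, the `drifting_tags` name, and the 1.5x factor are all illustrative assumptions).

```python
from statistics import mean
from typing import Dict, List


def drifting_tags(
    daily_cost_by_tag: Dict[str, List[float]],
    factor: float = 1.5,
) -> List[str]:
    """Flag tags whose latest day cost more than `factor` times their prior average."""
    flagged = []
    for tag, costs in daily_cost_by_tag.items():
        if len(costs) < 2:
            continue  # not enough history to compare against
        baseline = mean(costs[:-1])
        if baseline > 0 and costs[-1] > factor * baseline:
            flagged.append(tag)
    return flagged


# Made-up daily costs in dollars, oldest first.
history = {
    "chat.summary": [60.0, 62.0, 58.0, 140.0],
    "chat.reply": [44.0, 45.0, 43.0, 46.0],
}
print(drifting_tags(history))  # ['chat.summary']
```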

For finance-grade output, the same endpoint returns CSV when you pass `format=csv`. Hand it to your CFO without touching a spreadsheet.

curl "https://api.aigateway.sh/v1/usage/by-tag?window=30d&format=csv" \
  -H "Authorization: Bearer sk-aig-..." > spend.csv
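
Once the CSV is on disk, per-feature share of spend is a few lines of standard-library Python. The export's column names are not documented in this article, so `tag` and `cost_usd` below are assumptions, as is the sample data; adjust them to whatever header row the real file has.

```python
import csv
import io
from typing import Dict


def share_by_tag(csv_text: str) -> Dict[str, float]:
    """Return each tag's share of total spend as a fraction of 1."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    total = sum(float(r["cost_usd"]) for r in rows)
    if total == 0:
        return {}
    return {r["tag"]: float(r["cost_usd"]) / total for r in rows}


# Illustrative sample only; the real export's columns and rows may differ.
sample = "tag,cost_usd\nchat.summary,1968\nchat.reply,1340\nother,1492\n"
shares = share_by_tag(sample)
print(round(shares["chat.summary"], 2))  # 0.41
```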

FAQ

Did quality drop?

No observable regression on their internal evals after the routing change. The complexity classifier gates on prompt shape, not topic, so easy prompts stay easy regardless of user. For the 70% of requests that routed to Kimi, human reviewers rated the output within noise of the GPT-5.4 baseline.

What's the cache hit-rate ceiling?

Depends on workload. Chat with highly repetitive patterns (summaries, classifications, quick moderations) hits 40-60%. Open-ended creative writing hits 10-15%. The gateway shows you hit-rate per tag so you can see what caching buys before enabling it broadly.
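
To turn a hit-rate into dollars, multiply the tag's spend by the hit-rate and by the cached-request discount. The sketch below is a back-of-envelope estimate, not the gateway's billing formula: it assumes the hit-rate is measured as a share of cost (not request count) and uses the 50% cached-request discount quoted earlier. The `cache_savings` name and the $1,000 example are hypothetical.

```python
def cache_savings(monthly_spend: float, hit_rate: float, discount: float = 0.5) -> float:
    """Estimate monthly dollars saved by caching one tag."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    if not 0.0 <= discount <= 1.0:
        raise ValueError("discount must be between 0 and 1")
    return monthly_spend * hit_rate * discount


# A tag spending $1,000/month with a 40% hit-rate and a 50% discount.
print(round(cache_savings(1000.0, 0.40), 2))  # 200.0
```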

How does the classifier decide complexity?

A 300M-parameter edge model (Granite Micro) scores each prompt on a 0-1 scale based on reasoning depth, output length expectation, and input structure. The threshold per tag auto-tunes from the first 100 requests; you can override it with `x-aig-complexity-threshold`.
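
Overriding the threshold is one more header on the request. The helper below is a hedged sketch: `with_threshold` is a hypothetical name, and sending the value as a two-decimal string is an assumption about the expected format. The only facts taken from this article are the header name and the 0-1 scale.

```python
def with_threshold(headers: dict, threshold: float) -> dict:
    """Return a copy of `headers` with a complexity-threshold override added."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be on the classifier's 0-1 scale")
    updated = dict(headers)  # copy so the caller's dict is left untouched
    updated["x-aig-complexity-threshold"] = f"{threshold:.2f}"  # assumed format
    return updated


base = {"x-aig-tag": "chat.summary", "x-aig-route": "auto"}
overridden = with_threshold(base, 0.3)
```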

Does the cap break streaming?

No — the cap check happens before dispatch. A streaming call that's pre-approved runs to completion; the post-stream usage is debited against the cap. Only the next request over budget returns 402.
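
Because the 402 arrives before any tokens stream, client code can treat it as an ordinary failed request. A minimal sketch of the check, using the status code and the `x-aig-cap-tag` header shown earlier (`capped_tag` is a hypothetical helper, and a plain dict lookup like this is case-sensitive):

```python
from typing import Mapping, Optional


def capped_tag(status_code: int, headers: Mapping[str, str]) -> Optional[str]:
    """Return the capped tag if this response is a budget-cap 402, else None."""
    if status_code != 402:
        return None
    # Fall back to a placeholder if the header is missing for any reason.
    return headers.get("x-aig-cap-tag", "unknown")
```

With the OpenAI Python SDK, a non-2xx response raises `openai.APIStatusError`; its `status_code` and `response.headers` can be passed straight into this function.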

What if I hit the cap mid-day?

You can lift it with a one-line PATCH to the same endpoint. The gateway returns 402 until you do. For production, bind the 402 to a pager; the tag-level cap is the safest early-warning signal you have.
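
The PATCH can also be built from Python with only the standard library. This sketch assumes the PATCH accepts the same JSON body as the POST shown earlier, which the article does not confirm; `budget_request` and the $2,000 cap are illustrative. The request is constructed but not sent.

```python
import json
import urllib.request

BUDGETS_URL = "https://api.aigateway.sh/v1/budgets"


def budget_request(
    api_key: str, tag: str, monthly_cap_cents: int, method: str = "POST"
) -> urllib.request.Request:
    """Build (but do not send) a budget request for one tag."""
    body = json.dumps({"tag": tag, "monthly_cap_cents": monthly_cap_cents}).encode()
    return urllib.request.Request(
        BUDGETS_URL,
        data=body,
        method=method,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


# Raise the chat.summary cap to $2,000 (assumed PATCH body shape).
req = budget_request("sk-aig-...", "chat.summary", 200_000, method="PATCH")
# urllib.request.urlopen(req) would send it; omitted so the sketch has no side effects.
```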

Do these levers work with the OpenAI SDK?

Yes — every lever is a header. The OpenAI client's `extra_headers` parameter passes them through unchanged. No SDK fork, no middleware layer.

Can I layer caching on top of BYO-key / proxy mode?

Yes, the cache is independent of which key hit the provider. Enterprise tier also ships replay — you can ask the gateway to re-serve any historical request, cached or not, byte-for-byte.

Will the numbers be similar for my team?

Workloads vary. A rough heuristic: caching saves 20-40% for chat-heavy workloads, 5-10% for generation-heavy workloads. Routing saves 30-50% when you're starting from a frontier-only default and much less when you're already on mid-tier models. Tags + caps prevent the single bad day that usually costs more than either lever saves.
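
The heuristic ranges can be combined into a rough estimate for your own bill. The sketch below assumes the two levers compound, meaning routing applies to whatever spend is left after caching; that is a modelling assumption rather than something measured in this study. It uses the chat-heavy caching range and the frontier-only routing range quoted above, and `savings_range` is a hypothetical name.

```python
from typing import Tuple


def savings_range(
    monthly_spend: float,
    cache: Tuple[float, float] = (0.20, 0.40),
    routing: Tuple[float, float] = (0.30, 0.50),
) -> Tuple[int, int]:
    """Return (low, high) estimated monthly savings in whole dollars."""
    low = monthly_spend * (1 - (1 - cache[0]) * (1 - routing[0]))
    high = monthly_spend * (1 - (1 - cache[1]) * (1 - routing[1]))
    return round(low), round(high)


print(savings_range(4800))  # (2112, 3360)
```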

READY TO BUILD?
Get an AIgateway key in 30 seconds. $5 signup credit covers Kimi K2.6 and six other curated picks; everything else is pass-through.