Prompt caching
Reuse long system prompts, tool definitions, and RAG contexts across requests at a fraction of the normal input cost. The gateway implements caching over every provider that supports it (Anthropic, OpenAI, Gemini, DeepSeek, Groq) and gives you a unified shape — so your code doesn't branch per provider.
How it works
Mark any message or tool with cache_control. On the first request, we write the block to the provider's prefix cache (a small one-time fee). On every subsequent request within the TTL that shares the same prefix, we pay only the cache-read rate, typically 10% of the normal input price, and pass that rate through to you unchanged.
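A back-of-envelope sketch of the savings. The prices and the cache-write premium below are hypothetical placeholders, not the gateway's actual rates; only the ~10% cache-read figure comes from the paragraph above.

```python
# Illustrative cost math for prefix caching. All rates are assumptions.
INPUT_PRICE = 3.00 / 1_000_000   # $ per input token (hypothetical)
WRITE_MULT = 1.25                # assumed one-time cache-write premium
READ_MULT = 0.10                 # cache reads at ~10% of input cost (per above)

def prefix_cost(prefix_tokens: int, requests: int, cached: bool) -> float:
    """Total cost of sending the same prefix across `requests` calls."""
    if not cached:
        return requests * prefix_tokens * INPUT_PRICE
    write = prefix_tokens * INPUT_PRICE * WRITE_MULT           # first request
    reads = (requests - 1) * prefix_tokens * INPUT_PRICE * READ_MULT
    return write + reads
```

Under these assumed rates, a 12k-token prompt reused across 10 requests costs roughly a fifth of the uncached price; a prefix used only once costs slightly more than not caching at all, because of the write premium.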
Marking a cacheable block
```json
{
  "model": "anthropic/claude-4.6-sonnet",
  "messages": [
    {
      "role": "system",
      "content": [
        {
          "type": "text",
          "text": "<long 12k-token system prompt>",
          "cache_control": { "type": "ephemeral", "ttl": "1h" }
        }
      ]
    },
    { "role": "user", "content": "Summarize..." }
  ]
}
```

Supported TTLs
| TTL | Use for |
|---|---|
"5m" (default) | Multi-turn conversations, same session. |
"1h" | Workflows where the same system prompt runs for up to an hour. |
"24h" (Anthropic only) | Stable system prompts and knowledge bases. |
Reading the savings
The usage block in every response breaks out cache_read_input_tokens and cache_creation_input_tokens. Regular input tokens, cache reads, and cache writes are all billed separately and visible in your dashboard → Usage.
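A minimal sketch of reading these fields from a response. The field names are the ones documented above; the hit-rate metric and the helper itself are our own, and we assume input_tokens counts only uncached tokens.

```python
def cache_summary(usage: dict) -> dict:
    """Break out cached vs. fresh input tokens from a response's usage block.

    Assumes `input_tokens` excludes cached tokens (an assumption; check
    your provider's accounting).
    """
    read = usage.get("cache_read_input_tokens", 0)
    write = usage.get("cache_creation_input_tokens", 0)
    fresh = usage.get("input_tokens", 0)
    total = read + write + fresh
    return {
        "cache_read": read,
        "cache_write": write,
        "uncached_input": fresh,
        "hit_rate": read / total if total else 0.0,
    }
```

On a warm cache, hit_rate should approach 1.0 for a long, stable prefix with a short user turn.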
Rules of the road
- Order matters — cached blocks must stay at the start of the prompt.
- Minimum cache size is ~1,024 tokens on most providers.
- Up to 4 cache breakpoints per request (Anthropic).
- On providers without native caching (Mistral, Cohere), cache_control is silently ignored: you pay the normal input price and no error is thrown.
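The rules above can be enforced client-side. This is a sketch, not part of the gateway: mark_prefix and the est_tokens estimator are hypothetical helpers, and the limits are hard-coded from the list above.

```python
MAX_BREAKPOINTS = 4       # Anthropic's per-request limit (per the rules above)
MIN_CACHE_TOKENS = 1024   # approximate minimum cacheable block size

def mark_prefix(blocks: list[dict], est_tokens, ttl: str = "5m") -> list[dict]:
    """Attach cache_control to leading blocks only, respecting the limits.

    `est_tokens(block)` is a caller-supplied token estimator (hypothetical).
    Marking stops at the first block that falls under the minimum, since
    cached blocks must form a contiguous prefix.
    """
    out, marked, at_prefix = [], 0, True
    for block in blocks:
        block = dict(block)  # don't mutate the caller's data
        if at_prefix and marked < MAX_BREAKPOINTS and est_tokens(block) >= MIN_CACHE_TOKENS:
            block["cache_control"] = {"type": "ephemeral", "ttl": ttl}
            marked += 1
        else:
            at_prefix = False  # the cacheable prefix has ended
        out.append(block)
    return out
```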