Prompt caching
Reuse long system prompts, tool definitions, and RAG contexts across requests at a fraction of the normal input cost. The gateway implements caching over every provider that supports it (Anthropic, OpenAI, Gemini, DeepSeek, Groq) and gives you a unified shape — so your code doesn't branch per provider.
How it works
Mark any message or tool with cache_control. On the first request, we write the block to the provider's prefix cache (a small one-time fee). On every subsequent request within the TTL that shares the same prefix, we pay only the cache-read rate, typically 10% of the normal input price, and pass that rate through to you unchanged.
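A back-of-envelope sketch of the savings. The prices and the cache-write premium below are hypothetical placeholders, not the gateway's actual rates; only the ~10% cache-read figure comes from the paragraph above.

```python
# Illustrative cost math for prefix caching. All rates are assumptions.
INPUT_PRICE = 3.00 / 1_000_000   # $ per input token (hypothetical)
WRITE_MULT = 1.25                # assumed one-time cache-write premium
READ_MULT = 0.10                 # cache reads at ~10% of input cost (per above)

def prefix_cost(prefix_tokens: int, requests: int, cached: bool) -> float:
    """Total cost of sending the same prefix across `requests` calls."""
    if not cached:
        return requests * prefix_tokens * INPUT_PRICE
    write = prefix_tokens * INPUT_PRICE * WRITE_MULT           # first request
    reads = (requests - 1) * prefix_tokens * INPUT_PRICE * READ_MULT
    return write + reads
```

Under these assumed rates, a 12k-token prompt reused across 10 requests costs roughly a fifth of the uncached price; a prefix used only once costs slightly more than not caching at all, because of the write premium.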
Marking a cacheable block
```json
{
  "model": "anthropic/claude-4.6-sonnet",
  "messages": [
    {
      "role": "system",
      "content": [
        {
          "type": "text",
          "text": "<long 12k-token system prompt>",
          "cache_control": { "type": "ephemeral", "ttl": "1h" }
        }
      ]
    },
    { "role": "user", "content": "Summarize..." }
  ]
}
```

Supported TTLs
| TTL | Use for |
|---|---|
"5m" (default) | Multi-turn conversations, same session. |
"1h" | Workflows where the same system prompt runs for up to an hour. |
"24h" (Anthropic only) | Stable system prompts and knowledge bases. |
Reading the savings
The usage block in every response breaks out cache_read_input_tokens and cache_creation_input_tokens. Regular input tokens, cache reads, and cache writes are all billed separately and visible in your dashboard → Usage.
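A minimal sketch of reading these fields from a response. The field names are the ones documented above; the hit-rate metric and the helper itself are our own, and we assume input_tokens counts only uncached tokens.

```python
def cache_summary(usage: dict) -> dict:
    """Break out cached vs. fresh input tokens from a response's usage block.

    Assumes `input_tokens` excludes cached tokens (an assumption; check
    your provider's accounting).
    """
    read = usage.get("cache_read_input_tokens", 0)
    write = usage.get("cache_creation_input_tokens", 0)
    fresh = usage.get("input_tokens", 0)
    total = read + write + fresh
    return {
        "cache_read": read,
        "cache_write": write,
        "uncached_input": fresh,
        "hit_rate": read / total if total else 0.0,
    }
```

On a warm cache, hit_rate should approach 1.0 for a long, stable prefix with a short user turn.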
Rules of the road
- Order matters — cached blocks must stay at the start of the prompt.
- Minimum cache size is ~1,024 tokens on most providers.
- Up to 4 cache breakpoints per request (Anthropic).
- On providers without native caching (Mistral, Cohere), cache_control is silently ignored: you pay the normal input price and no error is thrown.
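The rules above can be enforced client-side. This is a sketch, not part of the gateway: mark_prefix and the est_tokens estimator are hypothetical helpers, and the limits are hard-coded from the list above.

```python
MAX_BREAKPOINTS = 4       # Anthropic's per-request limit (per the rules above)
MIN_CACHE_TOKENS = 1024   # approximate minimum cacheable block size

def mark_prefix(blocks: list[dict], est_tokens, ttl: str = "5m") -> list[dict]:
    """Attach cache_control to leading blocks only, respecting the limits.

    `est_tokens(block)` is a caller-supplied token estimator (hypothetical).
    Marking stops at the first block that falls under the minimum, since
    cached blocks must form a contiguous prefix.
    """
    out, marked, at_prefix = [], 0, True
    for block in blocks:
        block = dict(block)  # don't mutate the caller's data
        if at_prefix and marked < MAX_BREAKPOINTS and est_tokens(block) >= MIN_CACHE_TOKENS:
            block["cache_control"] = {"type": "ephemeral", "ttl": ttl}
            marked += 1
        else:
            at_prefix = False  # the cacheable prefix has ended
        out.append(block)
    return out
```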