
Prompt caching

Reuse long system prompts, tool definitions, and RAG contexts across requests at a fraction of the normal input cost. The gateway exposes caching on every provider that supports it (Anthropic, OpenAI, Gemini, DeepSeek, Groq) through one unified request shape, so your code doesn't need to branch per provider.

How it works

Mark any message or tool with cache_control. On the first request, we write the marked prefix to the provider's cache (a small one-time write fee). Every subsequent request within the TTL that shares the same prefix is billed at the cache-read rate, typically around 10% of the normal input price, and that discounted rate is passed through to you unchanged.
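The economics above can be sketched as a simple cost model. The multipliers below (a 1.25x write surcharge, a 0.1x read rate) are illustrative assumptions, not the gateway's published rates, which vary by provider:

```python
def estimated_cost(prefix_tokens, requests, input_rate_per_mtok,
                   write_mult=1.25, read_mult=0.10):
    """Total input cost for a shared prefix over `requests` calls.

    Assumes the first request pays the cache-write rate and the
    remaining requests all land within the TTL and pay the read rate.
    """
    per_tok = input_rate_per_mtok / 1_000_000
    write = prefix_tokens * per_tok * write_mult            # first request
    reads = prefix_tokens * per_tok * read_mult * (requests - 1)
    return write + reads

def uncached_cost(prefix_tokens, requests, input_rate_per_mtok):
    """Same prefix, full input rate on every request."""
    return prefix_tokens * (input_rate_per_mtok / 1_000_000) * requests
```

For a 12k-token prefix at $3/MTok over 100 requests, the cached total works out to a small fraction of the uncached one under these assumed rates; the break-even point is the second request.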

Marking a cacheable block

```json
{
  "model": "anthropic/claude-4.6-sonnet",
  "messages": [
    {
      "role": "system",
      "content": [
        {
          "type": "text",
          "text": "<long 12k-token system prompt>",
          "cache_control": { "type": "ephemeral", "ttl": "1h" }
        }
      ]
    },
    { "role": "user", "content": "Summarize..." }
  ]
}
```
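If you build request bodies in code, the same structure can be assembled with a small helper. This is a sketch; the helper name and the payload-construction style are ours, not part of the gateway's SDK:

```python
def cached_system_message(text, ttl="1h"):
    """Wrap a system prompt so its prefix is cached for `ttl`."""
    return {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": text,
                "cache_control": {"type": "ephemeral", "ttl": ttl},
            }
        ],
    }

payload = {
    "model": "anthropic/claude-4.6-sonnet",
    "messages": [
        cached_system_message("<long 12k-token system prompt>"),
        {"role": "user", "content": "Summarize..."},
    ],
}
```

Keep the cached block byte-identical across requests; any change to the marked text produces a different prefix and forces a fresh cache write.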

Supported TTLs

| TTL | Use for |
| --- | --- |
| `"5m"` (default) | Multi-turn conversations, same session. |
| `"1h"` | Workflows where the same system prompt runs for up to an hour. |
| `"24h"` (Anthropic only) | Stable system prompts and knowledge bases. |

Reading the savings

The usage block in every response breaks out cache_read_input_tokens and cache_creation_input_tokens. Regular input tokens, cache reads, and cache writes are all billed separately and visible in your dashboard → Usage.
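Those usage fields are enough to compute your effective spend per response. The rate multipliers below are the same illustrative assumptions as earlier (1.25x for writes, 0.1x for reads), and the flat `usage` dict shape is assumed from the field names described above:

```python
def cache_summary(usage, rate_per_mtok, write_mult=1.25, read_mult=0.10):
    """Break a response's usage block into cost and cache savings."""
    per_tok = rate_per_mtok / 1_000_000
    regular = usage.get("input_tokens", 0)
    reads = usage.get("cache_read_input_tokens", 0)
    writes = usage.get("cache_creation_input_tokens", 0)
    cost = (regular * per_tok
            + reads * per_tok * read_mult
            + writes * per_tok * write_mult)
    # What the cache-read tokens would have cost at the full input rate.
    saved = reads * per_tok * (1 - read_mult)
    return {"cost": cost, "saved_vs_uncached": saved}
```

Summing `saved_vs_uncached` across a day's responses gives a quick sanity check against the savings shown in the dashboard.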

Rules of the road