Inference

Reasoning models

Reasoning-capable models (DeepSeek R1, Kimi K2.6, OpenAI o-series, Claude Opus 4.8 with adaptive thinking, Gemini 3 Pro with thought summaries) return their chain-of-thought separately from the final answer. We normalize every provider's convention into a single field — reasoning_content on the assistant message — so your code stays portable across models.

Response shape

// non-streaming
{ "choices": [{
    "message": {
      "role": "assistant",
      "content": "The answer is 42.",
      "reasoning_content": "Let me think step by step..."
    }
  }] }

// streaming — reasoning and content flow as separate deltas
{ "choices": [{ "delta": { "reasoning_content": "First, " } }] }
{ "choices": [{ "delta": { "reasoning_content": "consider..." } }] }
{ "choices": [{ "delta": { "content": "The answer" } }] }
{ "choices": [{ "delta": { "content": " is 42." } }] }

Render the reasoning trace in a collapsible block (the playground ships a reference UI), or drop it entirely if you don't want it surfaced. Non-reasoning models simply omit the field.
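
As a minimal sketch of consuming the streaming shape above, the helper below accumulates the two delta types into separate strings. The function name collect_stream is hypothetical, the chunks are assumed to be already parsed from JSON into dicts, and only a single choice per chunk is assumed.

```python
from typing import Iterable


def collect_stream(chunks: Iterable[dict]) -> tuple[str, str]:
    """Accumulate streamed deltas into (reasoning, content).

    Hypothetical helper: assumes each chunk is a parsed JSON dict shaped
    like the streaming examples, with a single choice per chunk.
    """
    reasoning_parts: list[str] = []
    content_parts: list[str] = []
    for chunk in chunks:
        choices = chunk.get("choices") or []
        if not choices:
            continue
        delta = choices[0].get("delta") or {}
        # Either key may be missing (or None) on any given chunk.
        if delta.get("reasoning_content"):
            reasoning_parts.append(delta["reasoning_content"])
        if delta.get("content"):
            content_parts.append(delta["content"])
    return "".join(reasoning_parts), "".join(content_parts)


# The four deltas from the streaming example.
chunks = [
    {"choices": [{"delta": {"reasoning_content": "First, "}}]},
    {"choices": [{"delta": {"reasoning_content": "consider..."}}]},
    {"choices": [{"delta": {"content": "The answer"}}]},
    {"choices": [{"delta": {"content": " is 42."}}]},
]
reasoning, content = collect_stream(chunks)
print(reasoning)  # First, consider...
print(content)    # The answer is 42.
```

Keeping the two buffers separate makes it easy to render the trace in a collapsible block or discard it without touching the answer text.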

Controlling reasoning effort

Thinking-capable models accept a single reasoning_effort parameter that controls how deeply the model reasons — and, on models that support it, how many tokens it spends overall. One knob, same field, every provider:

"none" — skip the thinking pass entirely; fastest and cheapest.
"low" / "medium" / "high" — progressively deeper reasoning. "high" is the default when you omit the field.
"xhigh" / "max" — extended depth for long-horizon agentic and coding work, on the models that support them (Claude Opus). On models that don't, they clamp down to "high" — so the same request stays portable.

Higher effort produces more carefully reasoned output at the cost of more tokens and latency. On the latest Claude models, reasoning depth is calibrated adaptively per request: reasoning_effort sets the ceiling, and the model thinks only as much as the task needs.

{ "model": "anthropic/claude-opus-4.8",
  "messages": [{ "role": "user", "content": "explain RSA in 3 steps" }],
  "reasoning_effort": "medium" }

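The defaulting and clamping rules described in this section can be sketched client-side as follows. This is only an illustration of the documented behavior: the service applies these rules itself, and the function name effective_effort and the supports_extended flag are hypothetical, not part of the API.

```python
from typing import Optional

# Values listed in this section.
EFFORT_LEVELS = ("none", "low", "medium", "high", "xhigh", "max")
EXTENDED_LEVELS = ("xhigh", "max")


def effective_effort(requested: Optional[str], supports_extended: bool) -> str:
    """Return the effort level a request would effectively run at.

    Hypothetical illustration: `supports_extended` stands in for whether
    the target model supports the "xhigh" / "max" levels.
    """
    if requested is None:
        return "high"  # documented default when the field is omitted
    if requested not in EFFORT_LEVELS:
        raise ValueError(f"unknown reasoning_effort: {requested!r}")
    if requested in EXTENDED_LEVELS and not supports_extended:
        return "high"  # extended levels clamp down on unsupported models
    return requested


print(effective_effort(None, supports_extended=False))    # high
print(effective_effort("max", supports_extended=False))   # high
print(effective_effort("max", supports_extended=True))    # max
print(effective_effort("none", supports_extended=False))  # none
```
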
Billing

Reasoning tokens are billed at the same completion_tokens rate as the final answer. They show up in usage.completion_tokens_details, so you can separate them in your accounting. DeepSeek and Kimi are materially cheaper for long-reasoning workloads.
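
A rough sketch of separating the two in your own accounting is shown below. It assumes the reasoning count is exposed as a reasoning_tokens subfield inside completion_tokens_details (the OpenAI-style convention, not confirmed on this page) and that completion_tokens already includes those reasoning tokens. The helper name and the sample numbers are illustrative only.

```python
def split_completion_tokens(usage: dict) -> tuple[int, int]:
    """Return (reasoning_tokens, answer_tokens) from a usage object.

    Assumptions: the subfield is named `reasoning_tokens`, and
    `completion_tokens` is the total that includes reasoning tokens.
    """
    details = usage.get("completion_tokens_details") or {}
    reasoning = details.get("reasoning_tokens") or 0
    total = usage.get("completion_tokens") or 0
    return reasoning, total - reasoning


# Illustrative numbers, not real billing data.
usage = {
    "prompt_tokens": 12,
    "completion_tokens": 350,
    "completion_tokens_details": {"reasoning_tokens": 300},
}
reasoning_tokens, answer_tokens = split_completion_tokens(usage)
print(reasoning_tokens, answer_tokens)  # 300 50
```

A non-reasoning model that omits the details object simply yields zero reasoning tokens under this sketch.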

Don't feed reasoning back in

Reasoning traces are not meant to be carried into later turns as context. For multi-turn conversations, pass back only message.content; dropping reasoning_content keeps your context cleaner and your prompt costs lower. OpenAI specifically voids model safety guarantees if you feed o-series its own reasoning back.
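
One way to apply this, sketched under the assumption that your conversation history is a list of message dicts shaped like the response examples, is to strip the field before building the next request. The helper name strip_reasoning is hypothetical.

```python
def strip_reasoning(messages: list[dict]) -> list[dict]:
    """Return copies of the messages without the reasoning_content key.

    Hypothetical helper; the input list and its dicts are left unmodified.
    """
    return [
        {key: value for key, value in message.items() if key != "reasoning_content"}
        for message in messages
    ]


history = [
    {"role": "user", "content": "What is 6 x 7?"},
    {
        "role": "assistant",
        "content": "The answer is 42.",
        "reasoning_content": "Let me think step by step...",
    },
]

# Build the next turn from the cleaned history plus the new user message.
next_messages = strip_reasoning(history) + [
    {"role": "user", "content": "And 6 x 8?"}
]
print(next_messages[1])  # {'role': 'assistant', 'content': 'The answer is 42.'}
```

Returning copies rather than mutating in place lets you keep the original trace around for logging or display while sending only the answer text back.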
