GPT-5.4
OpenAI's frontier model. 400K context, 128K output, native JSON mode, tight tool calling, strong math and code. Reasoning-grade — so some of its knobs differ from older GPT-4-era models.
Quickstart
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.aigateway.sh/v1",
    api_key="sk-aig-...",
)
r = client.chat.completions.create(
    model="openai/gpt-5.4",
    messages=[{"role": "user", "content": "Solve: if f(x) = x^3 - 2x + 1, find f'(2)."}],
    max_completion_tokens=2048,  # NOTE: not max_tokens
)
print(r.choices[0].message.content)
```

Model card
- Slug: openai/gpt-5.4
- Provider: OpenAI
- Released: 2026-03-05
- Context window: 400,000 tokens
- Max output: 128,000 tokens
- Modality: Text + vision
- Capabilities: Streaming, tool calling, JSON mode, structured outputs, batch, caching, reasoning
- Pricing: $2.50 / 1M input, $15.00 / 1M output, $0.25 / 1M cache reads. Pass-through — 5% fee at credit top-up.
Two knobs that are different on GPT-5.x
- Use `max_completion_tokens`, not `max_tokens`. GPT-5.x and the o-series reasoning models renamed this parameter. Our gateway accepts either and translates, but sending `max_completion_tokens` directly avoids any ambiguity.
- Sampling controls are limited. `temperature` and `top_p` are accepted but have reduced effect vs GPT-4-class models, since the model reasons internally. For deterministic output, rely on `response_format` and tool schemas instead of `temperature=0`.
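To make the parameter translation concrete, here is a minimal sketch of the kind of normalization the gateway performs on your behalf. The helper name and logic are illustrative, not the gateway's actual code:

```python
def normalize_params(payload: dict) -> dict:
    """Hypothetical sketch: rename legacy max_tokens to the
    GPT-5.x parameter name if the caller hasn't set it already."""
    out = dict(payload)
    if "max_tokens" in out and "max_completion_tokens" not in out:
        out["max_completion_tokens"] = out.pop("max_tokens")
    return out

req = {"model": "openai/gpt-5.4", "max_tokens": 2048}
print(normalize_params(req))
# {'model': 'openai/gpt-5.4', 'max_completion_tokens': 2048}
```

Sending `max_completion_tokens` yourself skips this translation step entirely.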
Request
```json
{
  "model": "openai/gpt-5.4",
  "messages": [
    { "role": "system", "content": "You are a careful analyst." },
    { "role": "user", "content": "..." }
  ],
  "max_completion_tokens": 4096,
  "stream": false,
  "tools": [ /* OpenAI function spec */ ],
  "tool_choice": "auto",
  "parallel_tool_calls": true,
  "response_format": {
    "type": "json_schema",
    "json_schema": {
      "name": "invoice",
      "schema": {
        "type": "object",
        "properties": {
          "total_cents": { "type": "integer" },
          "line_items": {
            "type": "array",
            "items": { "type": "object" }
          }
        },
        "required": ["total_cents", "line_items"]
      },
      "strict": true
    }
  }
}
```

Response
```json
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "model": "openai/gpt-5.4",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "{\"total_cents\": 12345, \"line_items\": [...]}",
        "reasoning_content": "Parsing the invoice..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 240,
    "completion_tokens": 180,
    "total_tokens": 420
  }
}
```

Structured outputs (strict mode)
GPT-5.4 enforces JSON schemas at the decoder level when strict: true. The response is guaranteed to parse against your schema — no post-hoc validation needed. This is the killer feature vs older GPT models:
```python
import json

# Payload guaranteed to parse — no try/except around json.loads
data = json.loads(r.choices[0].message.content)
assert isinstance(data["total_cents"], int)
```
Tool calling + parallel calls
GPT-5.4 excels at emitting multiple parallel tool calls in a single turn. Set parallel_tool_calls: true (default) and execute them concurrently client-side.
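A minimal sketch of client-side concurrent execution, using a thread pool. The tool names and functions are hypothetical, and tool calls are shown as plain dicts; the SDK returns objects with matching attributes (`call.id`, `call.function.name`, `call.function.arguments`):

```python
import json
from concurrent.futures import ThreadPoolExecutor

# Hypothetical local tools — illustrative, not part of the API.
def get_weather(city: str) -> str:
    return f"Sunny in {city}"

def get_time(city: str) -> str:
    return f"12:00 in {city}"

TOOLS = {"get_weather": get_weather, "get_time": get_time}

def run_tool_calls(tool_calls):
    """Execute the model's parallel tool calls concurrently and
    return one {"role": "tool", ...} message per call, in order."""
    def run_one(call):
        fn = TOOLS[call["function"]["name"]]
        args = json.loads(call["function"]["arguments"])
        return {
            "role": "tool",
            "tool_call_id": call["id"],
            "content": fn(**args),
        }
    with ThreadPoolExecutor() as pool:
        return list(pool.map(run_one, tool_calls))
```

Append the returned tool messages to the conversation and call `chat.completions.create` again to let the model incorporate the results.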
Use GPT-5.4 in Cursor
```
# Cursor → Settings → Models → Override OpenAI Base URL
Base URL: https://api.aigateway.sh/v1
API key:  sk-aig-...
Model ID: openai/gpt-5.4
```
Use GPT-5.4 in LangChain
```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="openai/gpt-5.4",
    base_url="https://api.aigateway.sh/v1",
    api_key="sk-aig-...",
    max_completion_tokens=4096,  # not max_tokens
)
```

Batch API (50% discount)
GPT-5.4 supports OpenAI's batch endpoint — submit up to 50,000 requests in a file, results come back within 24h at half price. Great for overnight data-extraction jobs.
See the batch docs for the workflow.
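As a sketch of the first step, this builds a batch input file in OpenAI's JSONL format (one request object per line, each with a `custom_id`, `method`, `url`, and `body`). The document texts and output filename are illustrative:

```python
import json

docs = ["Invoice #1 text...", "Invoice #2 text..."]  # illustrative inputs

# One JSON object per line, per OpenAI's batch input format.
lines = []
for i, doc in enumerate(docs):
    lines.append(json.dumps({
        "custom_id": f"doc-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "openai/gpt-5.4",
            "messages": [{"role": "user", "content": doc}],
            "max_completion_tokens": 1024,
        },
    }))

with open("batch_input.jsonl", "w") as f:
    f.write("\n".join(lines))
```

Upload the file, create the batch job, then poll for the output file; results map back to your requests via `custom_id`.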
Benchmarks
- MMLU: 91.8%
- HumanEval: 94.0%
- GSM8K: 98.2% — best-in-class on math
When to use GPT-5.4
- You need guaranteed-parseable JSON from an LLM (strict mode). GPT-5.4 is the most reliable model for this today.
- Math-heavy or structured-extraction workloads.
- Very long context ingestion (400K window) with disciplined output caps.
For pure agentic coding in the SWE-bench style, Claude Opus 4.7 still edges it out; see the Opus guide.
Pricing worked example
An extraction task — 4K-token PDF converted to JSON with ~600 tokens of output:
- Input: 4,000 × $2.50 / 1M = $0.010
- Output: 600 × $15.00 / 1M = $0.009
- ~$0.019 per document. 50 docs per dollar.
- Via Batch API (50% discount): ~$0.0095 per document.
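The arithmetic above can be wrapped in a small helper for estimating your own workloads, using the rates from the model card (cache reads ignored for simplicity):

```python
# Rates from the model card, in dollars per 1M tokens.
INPUT_RATE, OUTPUT_RATE = 2.50, 15.00

def doc_cost(input_tokens: int, output_tokens: int, batch: bool = False) -> float:
    """Dollar cost of one request; batch=True applies the 50% discount."""
    cost = (input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE) / 1_000_000
    return cost / 2 if batch else cost

print(doc_cost(4_000, 600))              # ~0.019
print(doc_cost(4_000, 600, batch=True))  # ~0.0095
```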