
AI Model Leaderboard — 2026

Every model you can call through AIgateway, ranked by the benchmarks engineers actually care about: SWE-bench for agentic coding, MMLU + GPQA for reasoning, HumanEval for code generation — plus input and output price per 1M tokens and context window size. Numbers are sourced from each provider's reported eval results; blank cells mean the lab has not published that benchmark. Sort any column. The same data is available as JSON at /api/leaderboard.

964 models across 76 providers. Every row is a click away from a full model page with pricing, quickstart code, and a live playground.

Showing 70 of 964 models · JSON at /api/leaderboard
| Model | Provider | SWE-bench | MMLU | GPQA | HumanEval | Input $/M | Output $/M | Context |
|---|---|---|---|---|---|---|---|---|
| Claude Opus 4.8 (anthropic/claude-opus-4.8) | Anthropic | | | | | $5.00 | $25.00 | 1M |
| MiniMax M3 (minimax/m3) | MiniMax | | | | | $0.30 | $1.20 | 1M |
| Gemini 3.1 Pro (google/gemini-3.1-pro) | Google | | 89.6 | | | $2.00 | $12.00 | 1M |
| Claude Sonnet 4.6 (anthropic/claude-sonnet-4.6) | Anthropic | 62.1 | 87.2 | | 93.4 | $3.00 | $15.00 | 200K |
| GPT-5.5 (openai/gpt-5.5) | OpenAI | | | | | $5.00 | $30.00 | 1M |
| O3 (openai/o3) | OpenAI | | | | | $2.00 | $8.00 | 200K |
| Gemini 2.5 Pro (google/gemini-2.5-pro) | Google | | | | | $1.25 | $10.00 | 1M |
| Kimi-K2.6 (moonshot/kimi-k2.6) | Moonshot | 68.2 | | | 92.7 | $0.95 | $4.00 | 262K |
| Qwen 3.5 397B A17B (alibaba/qwen3.5-397b-a17b) | Alibaba | | | | | $0.60 | $3.60 | 262K |
| Grok 4.20 Multi-Agent (xai/grok-4.20-multi-agent-0309) | xAI | | | | | $2.00 | $6.00 | 2M |
| GPT-5.4 Mini (openai/gpt-5.4-mini) | OpenAI | | 84.5 | | 88.6 | $0.75 | $4.50 | 128K |
| GPT-5.5 Pro (openai/gpt-5.5-pro) | OpenAI | | | | | $30.00 | $180.00 | 1M |
| Gemma-4-26b-A4b-IT (google/gemma-4-26b-a4b-it) | Google | | | | | $0.34 | $0.56 | 131K |
| Claude Haiku 4.5 (anthropic/claude-haiku-4.5) | Anthropic | | 80.1 | | 85.2 | $1.00 | $5.00 | 200K |
| O4-Mini (openai/o4-mini) | OpenAI | | | | | $1.10 | $4.40 | 200K |
| Gpt-Oss-120b (openai/gpt-oss-120b) | OpenAI | | | | | $0.35 | $0.75 | 131K |
| Sonar Reasoning Pro (perplexity/sonar-reasoning-pro) | Perplexity | | | | | $2.00 | $8.00 | 127K |
| Nemotron-3-120b-A12b (nvidia/nemotron-3-120b-a12b) | Nvidia | | | | | $0.50 | $1.20 | 131K |
| Sonar Deep Research (perplexity/sonar-deep-research) | Perplexity | | | | | $2.00 | $8.00 | 127K |
| Gemma-2b-IT-Lora (google/gemma-2b-it-lora) | Google | | | | | $0.030 | $0.060 | 4K |
| Llama-3.2-3b-Instruct (meta/llama-3.2-3b-instruct) | Meta | | | | | $0.030 | $0.060 | 128K |
| Mistral-7b-Instruct-V0.2-Lora (mistral/mistral-7b-instruct-v0.2-lora) | Mistral | | | | | $0.050 | $0.10 | 4K |
| Llama-3.1-8b-Instruct-Fp8 (meta/llama-3.1-8b-instruct-fp8) | Meta | | | | | $0.050 | $0.10 | 131K |
| Llama-3.2-1b-Instruct (meta/llama-3.2-1b-instruct) | Meta | | | | | $0.015 | $0.030 | 128K |
| Glm-4.7-Flash (zai-org/glm-4.7-flash) | Zai-org | | | | | $0.050 | $0.10 | 131K |
| Llama-2-7b-Chat-HF-Lora (meta-llama/llama-2-7b-chat-hf-lora) | Meta-llama | | | | | $0.040 | $0.080 | 4K |
| Llama-3.3-70b-Instruct-Fp8-Fast (meta/llama-3.3-70b-instruct-fp8-fast) | Meta | | | | | $0.29 | $2.25 | 131K |
| Granite-4.0-H-Micro (ibm-granite/granite-4.0-h-micro) | Ibm-granite | | | | | $0.020 | $0.11 | 8K |
| Qwen2.5-Coder-32b-Instruct (qwen/qwen2.5-coder-32b-instruct) | Alibaba Qwen | | | | | $0.66 | $1.00 | 131K |
| Gemma-Sea-Lion-V4-27b-IT (aisingapore/gemma-sea-lion-v4-27b-it) | AI Singapore | | | | | $0.30 | $0.50 | 4K |
| Qwen3-30b-A3b-Fp8 (qwen/qwen3-30b-a3b-fp8) | Alibaba Qwen | | | | | $0.25 | $0.50 | 131K |
| Gemma-7b-IT-Lora (google/gemma-7b-it-lora) | Google | | | | | $0.080 | $0.16 | 4K |
| Mistral-Small-3.1-24b-Instruct (mistralai/mistral-small-3.1-24b-instruct) | Mistral | | | | | $0.35 | $0.55 | 131K |
| Gpt-Oss-20b (openai/gpt-oss-20b) | OpenAI | | | | | $0.20 | $0.30 | 131K |
| Llama-4-Scout-17b-16e-Instruct (meta/llama-4-scout-17b-16e-instruct) | Meta | | | | | $0.27 | $0.85 | 131K |
| Grok 4 Fast (xai/grok-4-fast) | xAI | | | | | $0.50 | $2.00 | 256K |
| Grok 4 (xai/grok-4) | xAI | | | | | $5.00 | $15.00 | 256K |
| Claude Opus 4.5 (anthropic/claude-opus-4.5) | Anthropic | | | | | $5.00 | $25.00 | 200K |
| Claude Opus 4.6 (anthropic/claude-opus-4.6) | Anthropic | | | | | $5.00 | $25.00 | 1M |
| Claude Opus 4.7 (anthropic/claude-opus-4.7) | Anthropic | 72.5 | 90.4 | | 95.1 | $5.00 | $25.00 | 1M |
| Claude Sonnet 4 (anthropic/claude-sonnet-4) | Anthropic | | | | | $3.00 | $15.00 | 200K |
| Claude Sonnet 4.5 (anthropic/claude-sonnet-4.5) | Anthropic | | | | | $3.00 | $15.00 | 200K |
| GPT-4.1 (openai/gpt-4.1) | OpenAI | | | | | $2.00 | $8.00 | 1.0M |
| GPT-4.1 Mini (openai/gpt-4.1-mini) | OpenAI | | | | | $0.40 | $1.60 | 1.0M |
| GPT-4.1 Nano (openai/gpt-4.1-nano) | OpenAI | | | | | $0.10 | $0.40 | 1M |
| GPT-4o (openai/gpt-4o) | OpenAI | | | | | $2.50 | $10.00 | 128K |
| GPT-4o Mini (openai/gpt-4o-mini) | OpenAI | | | | | $0.15 | $0.60 | 128K |
| GPT-5 (openai/gpt-5) | OpenAI | | | | | $1.25 | $10.00 | 128K |
| GPT-5 Chat (openai/gpt-5-chat) | OpenAI | | | | | $1.25 | $10.00 | 128K |
| GPT-5 Mini (openai/gpt-5-mini) | OpenAI | | | | | $0.25 | $2.00 | 128K |
| GPT-5 Nano (openai/gpt-5-nano) | OpenAI | | | | | $0.050 | $0.40 | 128K |
| GPT-5.1 (openai/gpt-5.1) | OpenAI | | | | | $1.25 | $10.00 | 128K |
| GPT-5.1 Chat (openai/gpt-5.1-chat) | OpenAI | | | | | $1.25 | $10.00 | 128K |
| GPT-5.4 (openai/gpt-5.4) | OpenAI | | 91.8 | | 94.0 | $2.50 | $15.00 | 1M |
| GPT-5.4 Nano (openai/gpt-5.4-nano) | OpenAI | | | | | $0.20 | $1.25 | 128K |
| GPT-5.4 Pro (openai/gpt-5.4-pro) | OpenAI | | | | | $30.00 | $180.00 | 1M |
| Gemini 2.5 Flash (google/gemini-2.5-flash) | Google | | | | | $0.30 | $2.50 | 1M |
| Gemini 2.5 Flash Lite (google/gemini-2.5-flash-lite) | Google | | | | | $0.10 | $0.40 | 1M |
| Gemini 3 Flash (google/gemini-3-flash) | Google | | 82.3 | | | $0.50 | $3.00 | 1M |
| Gemini 3.1 Flash Lite (google/gemini-3.1-flash-lite) | Google | | | | | $0.25 | $1.50 | 1M |
| Grok 4.20 Non-Reasoning (xai/grok-4.20-0309-non-reasoning) | xAI | | | | | $2.00 | $6.00 | 2M |
| Grok 4.20 Reasoning (xai/grok-4.20-0309-reasoning) | xAI | | | | | $2.00 | $6.00 | 2M |
| Grok 4.3 (xai/grok-4.3) | xAI | | | | | $1.25 | $2.50 | 1M |
| MiniMax M2.7 (minimax/m2.7) | MiniMax | | | | | $0.30 | $1.20 | 128K |
| Qwen 3 Max (alibaba/qwen3-max) | Alibaba | | | | | $1.20 | $6.00 | 262K |
| O3-Mini (openai/o3-mini) | OpenAI | | | | | $1.10 | $4.40 | 200K |
| IndicTrans2 EN→Indic 1B (ai4bharat/indictrans2-en-indic-1B) | AI4Bharat | | | | | $0.021 | $0.042 | |
| BART Large CNN (facebook/bart-large-cnn) | Meta | | | | | $0.050 | $0.10 | |
| DistilBERT SST-2 (huggingface/distilbert-sst-2-int8) | Hugging Face | | | | | | | |
| M2M100 1.2B (meta/m2m100-1.2b) | Meta | | | | | $0.021 | $0.042 | |

By category

Shortcuts by what you're actually optimizing for.

Five opinionated slices: coding agents, reasoning-heavy work, cheapest model with tool calling, longest context, and fastest text models. Each slice cites the underlying benchmark or pricing field directly.

Best at coding · SWE-bench

| Model | Provider | SWE-bench |
|---|---|---|
| Claude Opus 4.7 (anthropic/claude-opus-4.7) | Anthropic | 72.5 |
| Kimi-K2.6 (moonshot/kimi-k2.6) | Moonshot | 68.2 |
| Claude Sonnet 4.6 (anthropic/claude-sonnet-4.6) | Anthropic | 62.1 |

Best reasoning · GPQA / MMLU

| Model | Provider | Score |
|---|---|---|
| GPT-5.4 (openai/gpt-5.4) | OpenAI | 91.8 MMLU |
| Claude Opus 4.7 (anthropic/claude-opus-4.7) | Anthropic | 90.4 MMLU |
| Gemini 3.1 Pro (google/gemini-3.1-pro) | Google | 89.6 MMLU |
| Claude Sonnet 4.6 (anthropic/claude-sonnet-4.6) | Anthropic | 87.2 MMLU |
| GPT-5.4 Mini (openai/gpt-5.4-mini) | OpenAI | 84.5 MMLU |
| Gemini 3 Flash (google/gemini-3-flash) | Google | 82.3 MMLU |
| Claude Haiku 4.5 (anthropic/claude-haiku-4.5) | Anthropic | 80.1 MMLU |

Cheapest with tool calling

| Model | Provider | In + Out /M |
|---|---|---|
| Granite-4.0-H-Micro (ibm-granite/granite-4.0-h-micro) | Ibm-granite | $0.13 |
| GPT-5 Nano (openai/gpt-5-nano) | OpenAI | $0.45 |
| Gpt-Oss-20b (openai/gpt-oss-20b) | OpenAI | $0.50 |
| GPT-4.1 Nano (openai/gpt-4.1-nano) | OpenAI | $0.50 |
| Mistral-Small-3.1-24b-Instruct (mistralai/mistral-small-3.1-24b-instruct) | Mistral | $0.90 |
| Gpt-Oss-120b (openai/gpt-oss-120b) | OpenAI | $1.10 |
| Llama-4-Scout-17b-16e-Instruct (meta/llama-4-scout-17b-16e-instruct) | Meta | $1.12 |
| GPT-5.4 Nano (openai/gpt-5.4-nano) | OpenAI | $1.45 |
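
The "In + Out /M" figure appears to be the input price plus the output price per 1M tokens; for example, Granite-4.0-H-Micro is listed at $0.020 input and $0.11 output in the main table, which sums to $0.13. The sketch below reproduces that ranking for a few rows. The tuple layout and the `combined_price` helper are hypothetical, and the main table does not show tool-calling support, so membership here is simply taken from this slice.

```python
from decimal import Decimal

# (model id, input $/M, output $/M), copied from the main table.
# Prices are kept as strings so Decimal can parse them exactly.
PRICES = [
    ("openai/gpt-5-nano", "0.050", "0.40"),
    ("ibm-granite/granite-4.0-h-micro", "0.020", "0.11"),
    ("openai/gpt-oss-20b", "0.20", "0.30"),
    ("mistralai/mistral-small-3.1-24b-instruct", "0.35", "0.55"),
]


def combined_price(input_per_m: str, output_per_m: str) -> Decimal:
    """Sum of input and output price per 1M tokens (hypothetical helper)."""
    return Decimal(input_per_m) + Decimal(output_per_m)


# Cheapest combined price first.
ranked = sorted(PRICES, key=lambda row: combined_price(row[1], row[2]))

for model_id, inp, out in ranked:
    print(f"{model_id}: ${combined_price(inp, out):.2f}")
```

A plain sum weights input and output tokens equally; a workload that is mostly input or mostly output may rank these models differently.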

Longest context window

| Model | Provider | Context |
|---|---|---|
| Grok 4.20 Multi-Agent (xai/grok-4.20-multi-agent-0309) | xAI | 2M |
| Grok 4.20 Non-Reasoning (xai/grok-4.20-0309-non-reasoning) | xAI | 2M |
| Grok 4.20 Reasoning (xai/grok-4.20-0309-reasoning) | xAI | 2M |
| GPT-4.1 (openai/gpt-4.1) | OpenAI | 1.0M |
| GPT-4.1 Mini (openai/gpt-4.1-mini) | OpenAI | 1.0M |
| Claude Opus 4.6 (anthropic/claude-opus-4.6) | Anthropic | 1M |
| Claude Opus 4.7 (anthropic/claude-opus-4.7) | Anthropic | 1M |
| Claude Opus 4.8 (anthropic/claude-opus-4.8) | Anthropic | 1M |
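
Context sizes are shown as rounded labels ("2M", "1.0M", "262K"), so comparing them in code needs a conversion step first. The following is a minimal sketch of one way to do that. The `context_tokens` helper is hypothetical, and it assumes decimal units (K = 1,000, M = 1,000,000); because the labels are rounded, the results are approximate token counts, not exact limits.

```python
def context_tokens(label: str) -> int:
    """Convert a label such as '262K' or '1.0M' to an approximate token count.

    Assumes decimal units; the page does not state which convention it uses.
    """
    units = {"K": 1_000, "M": 1_000_000}
    label = label.strip().upper()
    if not label:
        raise ValueError("empty context label")
    suffix = label[-1]
    if suffix in units:
        return round(float(label[:-1]) * units[suffix])
    return int(label)  # plain number with no suffix


labels = ["200K", "2M", "1.0M", "262K"]
longest_first = sorted(labels, key=context_tokens, reverse=True)
print(longest_first)
```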

Fastest · edge + mini tier

| Model | Provider | Tier |
|---|---|---|
| Llama-3.3-70b-Instruct-Fp8-Fast (meta/llama-3.3-70b-instruct-fp8-fast) | Meta | edge |
| Llama-4-Scout-17b-16e-Instruct (meta/llama-4-scout-17b-16e-instruct) | Meta | edge |
| Grok 4 Fast (xai/grok-4-fast) | xAI | edge |
| Claude Haiku 4.5 (anthropic/claude-haiku-4.5) | Anthropic | edge |
| GPT-4.1 Mini (openai/gpt-4.1-mini) | OpenAI | edge |
| GPT-4.1 Nano (openai/gpt-4.1-nano) | OpenAI | edge |
| GPT-4o Mini (openai/gpt-4o-mini) | OpenAI | edge |
| GPT-5 Mini (openai/gpt-5-mini) | OpenAI | edge |

How this is built

Public benchmarks. Live pricing. Free JSON.

Benchmarks: sourced from each lab's published eval card (Anthropic system cards, OpenAI release notes, Google's technical reports, Moonshot's Kimi paper, Meta's Llama reports). Missing cells mean the lab hasn't disclosed that benchmark — not that the model failed it.
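
Because a missing cell means "not published" and not "scored zero", code that ranks models should keep unpublished scores out of the numeric comparison. The sketch below shows one way to do that, using SWE-bench values from the table above. The dictionary layout and the `rank_by_score` helper are hypothetical, and `None` is an assumed stand-in for a blank cell.

```python
from typing import Optional

# SWE-bench scores from the table above; None marks an unpublished result.
SWE_BENCH: dict[str, Optional[float]] = {
    "anthropic/claude-sonnet-4.6": 62.1,
    "openai/gpt-5.5": None,
    "anthropic/claude-opus-4.7": 72.5,
    "moonshot/kimi-k2.6": 68.2,
    "google/gemini-3.1-pro": None,
}


def rank_by_score(scores: dict[str, Optional[float]]) -> list[str]:
    """Model ids sorted by score, highest first, with unpublished scores last."""
    # The first key element pushes None entries to the end; the second
    # orders the published scores from high to low.
    return sorted(
        scores,
        key=lambda m: (scores[m] is None, -(scores[m] or 0.0)),
    )


ranking = rank_by_score(SWE_BENCH)
print(ranking)
```

Treating a blank as 0 would sort unpublished models below every published one as if they had failed, which is the reading the note above warns against.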

Pricing: the exact pass-through rate you pay on AIgateway. The 5% platform fee is applied at top-up, not per request, so the $/M figures here are what the provider charges, with nothing added.
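
As a worked example of how the per-1M-token prices translate into spend, the sketch below prices a single request and then estimates the credit left after a top-up. Both helpers are hypothetical. The page does not say whether the 5% fee is deducted from the amount paid or added on top of it; the code assumes the former, so treat `credit_after_top_up` as an illustration only.

```python
from decimal import Decimal

PLATFORM_FEE = Decimal("0.05")  # 5% at top-up, per the pricing note


def request_cost(input_tokens: int, output_tokens: int,
                 input_per_m: str, output_per_m: str) -> Decimal:
    """Provider cost of one request, given prices per 1M tokens."""
    million = Decimal(1_000_000)
    return (Decimal(input_tokens) / million * Decimal(input_per_m)
            + Decimal(output_tokens) / million * Decimal(output_per_m))


def credit_after_top_up(amount_paid: str, fee: Decimal = PLATFORM_FEE) -> Decimal:
    """Usable credit after a top-up.

    ASSUMPTION: the fee is deducted from the amount paid. The page only
    says the fee is applied at top-up, not how.
    """
    return Decimal(amount_paid) * (1 - fee)


# Claude Sonnet 4.6 is listed at $3.00 input and $15.00 output per 1M tokens.
cost = request_cost(200_000, 20_000, "3.00", "15.00")
credit = credit_after_top_up("100")
print(f"request: ${cost:.2f}, credit after $100 top-up: ${credit:.2f}")
```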

Updates: the catalog updates on every release, so this page moves when a new frontier model lands. The JSON endpoint is at GET /api/leaderboard with a one-hour browser cache — embed it on your own comparison post.
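
A minimal sketch of reading the endpoint from Python, using only the standard library. The page gives the path `/api/leaderboard` but not the host or the response schema, so `BASE_URL` is a placeholder and `extract_rows` guesses at two plausible payload shapes (a bare list, or an object with a `models` list). Check the actual response before relying on either.

```python
import json
import urllib.request

BASE_URL = "https://example.com"  # placeholder: replace with the real host


def fetch_leaderboard(base_url: str = BASE_URL, timeout: float = 10.0):
    """GET /api/leaderboard and return the decoded JSON payload."""
    url = f"{base_url}/api/leaderboard"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def extract_rows(payload) -> list:
    """Return the list of model rows from the payload.

    ASSUMPTION: the payload is either a bare list or an object that wraps
    the list under a "models" key. Neither shape is documented on the page.
    """
    if isinstance(payload, list):
        return payload
    if isinstance(payload, dict) and isinstance(payload.get("models"), list):
        return payload["models"]
    raise ValueError("unrecognized leaderboard payload shape")


# Usage, once BASE_URL points at the real host:
#   rows = extract_rows(fetch_leaderboard())
#   print(len(rows), "models")
```

Since the endpoint is served with a one-hour browser cache, polling it more often than that from a script is unlikely to surface newer data.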

Want to run any of these? Every row links to a model page with quickstart code. Or try the cost calculator to project spend across candidate models for your workload.