Send the same 50-row dataset to Opus 4.7, GPT-5.4, Kimi K2.6, Gemini 3.1, and Llama 4.1 in parallel through one AIgateway key, grade every response with an LLM judge, and publish a scorecard — 40 lines of Python, no eval framework required.
Every team ends up in the same argument: which model is best for our workload? The honest answer is "the one that wins on your data" — and the fastest way to find that out is a parallel eval.
This example runs the same 50-row dataset through Opus 4.7, GPT-5.4, Kimi K2.6, Gemini 3.1, and Llama 4.1 — all at once, through one AIgateway key — then grades every response with an LLM judge and writes a CSV scorecard. No eval framework. No second account. No rate-limit management. Forty lines of Python, ten minutes of wall time.
While the Kimi free tier runs (through Apr 30, 2026), the Kimi column costs nothing; the other four bill at provider pass-through rates. A typical 50-row, five-model eval lands under twenty cents.
A CSV with two columns: `input` (the user prompt) and `reference` (the ideal answer, if you have one — otherwise the judge scores on rubric alone). Fifty rows is usually enough to see a gap.
`dataset.csv`:

```csv
input,reference
"Summarize this earnings call in 80 words: ...","<ideal summary>"
"Translate to Japanese preserving tone: ...","<ideal translation>"
...
```
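If you'd rather fail before any tokens are spent, a small loader can validate rows up front. This is a hypothetical helper, not part of the 40-line harness; `load_dataset` is a name introduced here for illustration:

```python
import csv
import io

def load_dataset(fileobj) -> list[dict]:
    """Parse eval rows: every row needs an 'input'; 'reference' may be empty."""
    rows = list(csv.DictReader(fileobj))
    for i, row in enumerate(rows):
        if not (row.get("input") or "").strip():
            raise ValueError(f"row {i}: empty 'input' column")
        # Normalize a missing reference column to "" so the judge path is uniform.
        row["reference"] = row.get("reference") or ""
    return rows

rows = load_dataset(io.StringIO(
    'input,reference\n"Summarize this earnings call in 80 words: ...","<ideal summary>"\n'
))
print(len(rows))  # -> 1
```

Pointing it at a real file is `load_dataset(open("dataset.csv"))`; the rows slot straight into the harness below.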
One async loop, five tasks per row. The OpenAI SDK points at AIgateway; the only thing that changes between tasks is the model slug.
```python
import asyncio, csv
from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="https://api.aigateway.sh/v1",
    api_key="sk-aig-...",
)

MODELS = [
    "anthropic/claude-opus-4.7",
    "openai/gpt-5.4",
    "moonshot/kimi-k2.6",
    "google/gemini-3.1-pro",
    "meta/llama-4.1-405b",
]

async def run_one(model: str, prompt: str) -> str:
    r = await client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        extra_headers={"x-aig-tag": f"eval.{model.split('/')[-1]}"},
    )
    return r.choices[0].message.content or ""

async def eval_row(row: dict) -> dict:
    outs = await asyncio.gather(*(run_one(m, row["input"]) for m in MODELS))
    return {"input": row["input"], "reference": row.get("reference", ""), **dict(zip(MODELS, outs))}
```

A small prompt asks Opus to score each candidate 0-100 against the reference (or against a rubric if you have none). One judge call per candidate per row, parallelized the same way.
JUDGE_PROMPT = """You are a strict evaluator. Score the CANDIDATE 0-100 against the REFERENCE.
Criteria: factual accuracy, completeness, style match. Reply JSON only: {"score": <int>, "reason": "<15 words>"}.
INPUT: {input}
REFERENCE: {reference}
CANDIDATE: {candidate}"""
async def judge(row: dict, model: str) -> dict:
r = await client.chat.completions.create(
model="anthropic/claude-opus-4.7",
messages=[{"role": "user", "content": JUDGE_PROMPT.format(
input=row["input"], reference=row["reference"], candidate=row[model]
)}],
response_format={"type": "json_object"},
extra_headers={"x-aig-tag": "eval.judge"},
)
import json
return json.loads(r.choices[0].message.content)Average the per-row scores per model. Sort. Print the winner. That's the whole eval — under 40 lines and cheap enough to run before every prompt change.
```python
async def main():
    rows = list(csv.DictReader(open("dataset.csv")))
    evaluated = await asyncio.gather(*(eval_row(r) for r in rows))
    scores = {m: [] for m in MODELS}
    for row in evaluated:
        verdicts = await asyncio.gather(*(judge(row, m) for m in MODELS))
        for m, v in zip(MODELS, verdicts):
            scores[m].append(v["score"])
    board = sorted(((sum(s) / len(s), m) for m, s in scores.items()), reverse=True)
    for avg, m in board:
        print(f"{avg:6.1f} {m}")

asyncio.run(main())
# =>  84.3 moonshot/kimi-k2.6
#     82.1 anthropic/claude-opus-4.7
#     79.8 openai/gpt-5.4
#     77.4 google/gemini-3.1-pro
#     72.6 meta/llama-4.1-405b
```

One endpoint, one key, five providers. On raw provider SDKs you'd be managing five auth flows, five response shapes, five rate limits, and five invoices. Here the only thing that varies between calls is the model slug.
Because every call carries an `x-aig-tag`, the scorecard and the bill are the same data. After the eval finishes, `GET /v1/usage/by-tag?month=2026-04` shows exactly what each model cost to evaluate: useful when you're explaining the ten-minute experiment to finance.
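That usage report can be folded straight back into the scorecard. A minimal sketch, assuming the endpoint returns JSON of the form `{"tags": [{"tag": ..., "cost_cents": ...}]}` (the real response shape may differ; check the gateway's docs before relying on it):

```python
import json
import urllib.request

def fetch_usage(month: str, api_key: str) -> dict:
    """GET /v1/usage/by-tag for one month."""
    req = urllib.request.Request(
        f"https://api.aigateway.sh/v1/usage/by-tag?month={month}",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def eval_spend(usage: dict) -> dict:
    """Collapse per-tag cents into {tag: dollars} for the eval.* tags only."""
    return {
        entry["tag"]: entry["cost_cents"] / 100
        for entry in usage.get("tags", [])
        if entry["tag"].startswith("eval.")
    }
```

Calling `eval_spend(fetch_usage("2026-04", "sk-aig-..."))` yields one dollar figure per model column plus one for the judge.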
Add a sixth model? Append a slug to the `MODELS` list. Swap judges? Change one string. Broaden the eval to 5,000 rows? Move `dataset.csv` to a bigger file — the concurrency model already handles it because every call is a coroutine on the same gateway.
For production evals, point `x-aig-tag` at a per-run tag (`eval.2026-04-25`) and set a hard cap so a runaway judge loop can't bankrupt the experiment.
```bash
curl -X POST https://api.aigateway.sh/v1/budgets \
  -H "Authorization: Bearer sk-aig-..." \
  -d '{"tag": "eval.2026-04-25", "monthly_cap_cents": 500}'
```

No. One `sk-aig-…` key from AIgateway reaches every model in the catalog. The only thing that varies between calls is the model slug.
For A/B between models on the same task, yes — Opus as a judge agrees with human raters 85–92% of the time when given a reference and a rubric. For absolute benchmarks, add a second judge (e.g., GPT-5.4) and average; disagreement between judges flags rows worth a human look.
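The second-judge averaging and the disagreement flag are a few lines of pure Python. `second_opinion` here is an illustrative helper (the 15-point disagreement threshold is an arbitrary starting value, not a recommendation from the harness):

```python
def second_opinion(scores_a: list[int], scores_b: list[int], gap: int = 15):
    """Per-row average of two judges, plus the indices where they disagree."""
    avgs = [(a + b) / 2 for a, b in zip(scores_a, scores_b)]
    flagged = [
        i for i, (a, b) in enumerate(zip(scores_a, scores_b))
        if abs(a - b) > gap  # judges diverge: send this row to a human
    ]
    return avgs, flagged
```

Rank models on `avgs`; route the `flagged` rows to manual review before trusting the board.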
Drop the reference and give the judge a rubric instead — "score 0-100 on factual accuracy, completeness, and brand voice." Scores will be noisier but still useful for ranking.
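A reference-free variant of the judge prompt might look like the sketch below. The rubric wording is an example to adapt, and the doubled braces matter: they survive `.format()` as the literal JSON braces the judge should echo back.

```python
RUBRIC_JUDGE_PROMPT = """You are a strict evaluator. Score the CANDIDATE 0-100 on:
- factual accuracy
- completeness
- brand voice
There is no reference; judge against the rubric alone.
Reply JSON only: {{"score": <int>, "reason": "<15 words>"}}.
INPUT: {input}
CANDIDATE: {candidate}"""
```

Drop this in place of `JUDGE_PROMPT` and remove the `reference=` argument from the `.format()` call; nothing else in the harness changes.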
Fifty rows × five models × ~500 output tokens, plus 250 judge calls (one per candidate per row), lands around $0.12–$0.18, and the Kimi column is free on AIgateway through Apr 30, 2026. Double the rows, double the cost.
Yes — AIgateway hashes the full request and returns the prior response on exact match. A small-code-change eval re-run hits the cache for every row that didn't change, so you only pay for the new ones.
Change one string. Run the same eval judged by Kimi K2.6 and by Opus and compare — when judges agree you have a strong signal, when they disagree you've found the ambiguous rows that need a human.
For internal model-swap decisions, yes. For production-grade regression testing (golden sets, statistical significance, drift detection), pair this with a framework like promptfoo or Braintrust — point either at AIgateway and all five providers appear as one.