Send the same 50-row dataset to Opus 4.7, GPT-5.4, Kimi K2.6, Gemini 3.1, and Llama 4.1 in parallel through one AIgateway key, grade every response with an LLM judge, and publish a scorecard — 40 lines of Python, no eval framework required.
Every team ends up in the same argument: which model is best for our workload? The honest answer is "the one that wins on your data" — and the fastest way to find that out is a parallel eval.
This example runs the same 50-row dataset through Opus 4.7, GPT-5.4, Kimi K2.6, Gemini 3.1, and Llama 4.1 — all at once, through one AIgateway key — then grades every response with an LLM judge and writes a CSV scorecard. No eval framework. No second account. No rate-limit management. Forty lines of Python, ten minutes of wall time.
While the Kimi free tier runs (through Apr 30, 2026), the Kimi column costs nothing; the other four bill at provider pass-through rates. A typical 50-row, five-model eval lands under twenty cents.
A CSV with two columns: `input` (the user prompt) and `reference` (the ideal answer, if you have one — otherwise the judge scores on rubric alone). Fifty rows is usually enough to see a gap.
`dataset.csv`:

```csv
input,reference
"Summarize this earnings call in 80 words: ...","<ideal summary>"
"Translate to Japanese preserving tone: ...","<ideal translation>"
...
```
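If you'd rather fail before any tokens are spent, a small loader can validate rows up front. This is a hypothetical helper, not part of the 40-line harness; `load_dataset` is a name introduced here for illustration:

```python
import csv
import io

def load_dataset(fileobj) -> list[dict]:
    """Parse eval rows: every row needs an 'input'; 'reference' may be empty."""
    rows = list(csv.DictReader(fileobj))
    for i, row in enumerate(rows):
        if not (row.get("input") or "").strip():
            raise ValueError(f"row {i}: empty 'input' column")
        # Normalize a missing reference column to "" so the judge path is uniform.
        row["reference"] = row.get("reference") or ""
    return rows

rows = load_dataset(io.StringIO(
    'input,reference\n"Summarize this earnings call in 80 words: ...","<ideal summary>"\n'
))
print(len(rows))  # -> 1
```

Pointing it at a real file is `load_dataset(open("dataset.csv"))`; the rows slot straight into the harness below.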
One async loop, five tasks per row. The OpenAI SDK points at AIgateway; the only thing that changes between tasks is the model slug.
```python
import asyncio, csv
from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="https://api.aigateway.sh/v1",
    api_key="sk-aig-...",
)

MODELS = [
    "anthropic/claude-opus-4.7",
    "openai/gpt-5.4",
    "moonshot/kimi-k2.6",
    "google/gemini-3.1-pro",
    "meta/llama-4.1-405b",
]

async def run_one(model: str, prompt: str) -> str:
    r = await client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        extra_headers={"x-aig-tag": f"eval.{model.split('/')[-1]}"},
    )
    return r.choices[0].message.content or ""

async def eval_row(row: dict) -> dict:
    outs = await asyncio.gather(*(run_one(m, row["input"]) for m in MODELS))
    return {"input": row["input"], "reference": row.get("reference", ""), **dict(zip(MODELS, outs))}
```

A small prompt asks Opus to score each candidate 0-100 against the reference (or against a rubric if you have none). One judge call per candidate per row, parallelized the same way.
JUDGE_PROMPT = """You are a strict evaluator. Score the CANDIDATE 0-100 against the REFERENCE.
Criteria: factual accuracy, completeness, style match. Reply JSON only: {"score": <int>, "reason": "<15 words>"}.
INPUT: {input}
REFERENCE: {reference}
CANDIDATE: {candidate}"""
async def judge(row: dict, model: str) -> dict:
r = await client.chat.completions.create(
model="anthropic/claude-opus-4.7",
messages=[{"role": "user", "content": JUDGE_PROMPT.format(
input=row["input"], reference=row["reference"], candidate=row[model]
)}],
response_format={"type": "json_object"},
extra_headers={"x-aig-tag": "eval.judge"},
)
import json
return json.loads(r.choices[0].message.content)Average the per-row scores per model. Sort. Print the winner. That's the whole eval — under 40 lines and cheap enough to run before every prompt change.
```python
async def main():
    rows = list(csv.DictReader(open("dataset.csv")))
    evaluated = await asyncio.gather(*(eval_row(r) for r in rows))
    scores = {m: [] for m in MODELS}
    for row in evaluated:
        verdicts = await asyncio.gather(*(judge(row, m) for m in MODELS))
        for m, v in zip(MODELS, verdicts):
            scores[m].append(v["score"])
    board = sorted(((sum(s) / len(s), m) for m, s in scores.items()), reverse=True)
    for avg, m in board:
        print(f"{avg:6.1f} {m}")

asyncio.run(main())
# =>  84.3 moonshot/kimi-k2.6
#     82.1 anthropic/claude-opus-4.7
#     79.8 openai/gpt-5.4
#     77.4 google/gemini-3.1-pro
#     72.6 meta/llama-4.1-405b
```

One endpoint, one key, five providers. On raw provider SDKs you'd be managing five auth flows, five response shapes, five rate limits, and five invoices. Here the only thing that varies between calls is the model slug.
Because every call carries an `x-aig-tag`, the scorecard and the bill are the same data. After the eval finishes, `GET /v1/usage/by-tag?month=2026-04` shows exactly what each model cost to evaluate: useful when you're explaining the ten-minute experiment to finance.
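That usage report can be folded straight back into the scorecard. A minimal sketch, assuming the endpoint returns JSON of the form `{"tags": [{"tag": ..., "cost_cents": ...}]}` (the real response shape may differ; check the gateway's docs before relying on it):

```python
import json
import urllib.request

def fetch_usage(month: str, api_key: str) -> dict:
    """GET /v1/usage/by-tag for one month."""
    req = urllib.request.Request(
        f"https://api.aigateway.sh/v1/usage/by-tag?month={month}",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def eval_spend(usage: dict) -> dict:
    """Collapse per-tag cents into {tag: dollars} for the eval.* tags only."""
    return {
        entry["tag"]: entry["cost_cents"] / 100
        for entry in usage.get("tags", [])
        if entry["tag"].startswith("eval.")
    }
```

Calling `eval_spend(fetch_usage("2026-04", "sk-aig-..."))` yields one dollar figure per model column plus one for the judge.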
Add a sixth model? Append a slug to the `MODELS` list. Swap judges? Change one string. Broaden the eval to 5,000 rows? Move `dataset.csv` to a bigger file — the concurrency model already handles it because every call is a coroutine on the same gateway.
For production evals, point `x-aig-tag` at a per-run tag (`eval.2026-04-25`) and set a hard cap so a runaway judge loop can't bankrupt the experiment.
```bash
curl -X POST https://api.aigateway.sh/v1/budgets \
  -H "Authorization: Bearer sk-aig-..." \
  -d '{"tag": "eval.2026-04-25", "monthly_cap_cents": 500}'
```

No. One `sk-aig-…` key from AIgateway reaches every model in the catalog. The only thing that varies between calls is the model slug.
For A/B between models on the same task, yes — Opus as a judge agrees with human raters 85–92% of the time when given a reference and a rubric. For absolute benchmarks, add a second judge (e.g., GPT-5.4) and average; disagreement between judges flags rows worth a human look.
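The second-judge averaging and the disagreement flag are a few lines of pure Python. `second_opinion` here is an illustrative helper (the 15-point disagreement threshold is an arbitrary starting value, not a recommendation from the harness):

```python
def second_opinion(scores_a: list[int], scores_b: list[int], gap: int = 15):
    """Per-row average of two judges, plus the indices where they disagree."""
    avgs = [(a + b) / 2 for a, b in zip(scores_a, scores_b)]
    flagged = [
        i for i, (a, b) in enumerate(zip(scores_a, scores_b))
        if abs(a - b) > gap  # judges diverge: send this row to a human
    ]
    return avgs, flagged
```

Rank models on `avgs`; route the `flagged` rows to manual review before trusting the board.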
Drop the reference and give the judge a rubric instead — "score 0-100 on factual accuracy, completeness, and brand voice." Scores will be noisier but still useful for ranking.
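A reference-free variant of the judge prompt might look like the sketch below. The rubric wording is an example to adapt, and the doubled braces matter: they survive `.format()` as the literal JSON braces the judge should echo back.

```python
RUBRIC_JUDGE_PROMPT = """You are a strict evaluator. Score the CANDIDATE 0-100 on:
- factual accuracy
- completeness
- brand voice
There is no reference; judge against the rubric alone.
Reply JSON only: {{"score": <int>, "reason": "<15 words>"}}.
INPUT: {input}
CANDIDATE: {candidate}"""
```

Drop this in place of `JUDGE_PROMPT` and remove the `reference=` argument from the `.format()` call; nothing else in the harness changes.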
Fifty rows × five models × ~500 output tokens, plus 250 judge calls (one per candidate per row), lands around $0.12–$0.18, and the Kimi column is free on AIgateway through Apr 30, 2026. Double the rows, double the cost.
Yes — AIgateway hashes the full request and returns the prior response on exact match. A small-code-change eval re-run hits the cache for every row that didn't change, so you only pay for the new ones.
Change one string. Run the same eval judged by Kimi K2.6 and by Opus and compare — when judges agree you have a strong signal, when they disagree you've found the ambiguous rows that need a human.
For internal model-swap decisions, yes. For production-grade regression testing (golden sets, statistical significance, drift detection), pair this with a framework like promptfoo or Braintrust — point either at AIgateway and all five providers appear as one.