Flagship · 15 min read

Build a deep-research agent in a weekend — 200 lines, pennies per run

A planner fans out parallel web searches, an extractor pulls citations, a contradiction check catches hallucinations, and a reporter writes a 1,000-word memo with sources — end-to-end on Kimi K2.6's free tier, with any step swappable to Opus when the stakes are high.

Published 2026-04-25 · Agents + swarms
Planner fans out to four parallel searchers and merges into a cited 1,000-word memo

Perplexity taught the market to expect one thing from research agents: ask a hard question, wait 30 seconds, get a cited memo. The mechanic underneath is not complicated — plan, search in parallel, extract, check for contradictions, write.

This example rebuilds exactly that in 200 lines of Python. Kimi K2.6's 256K context means the agent reads every source it fetches without a vector store. The free tier runs the whole pipeline for pennies through Apr 30, 2026. Swap any step to Opus 4.7 or GPT-5.4 when the stakes rise — the surrounding code does not change.

What you'll need: an AIgateway key · Kimi K2.6 (free) · asyncio fan-out · a web-search tool · Markdown output
Note
Why do this over Perplexity's API? Ownership. The pipeline runs on your key, hits your corpus (not just the open web), writes memos in your voice, and costs pennies because the planner decides when to stop searching.

Build it in five steps

  1. STEP 01

    Plan

    One Kimi call turns the user's question into a ranked list of sub-queries. The planner decides fan-out width based on how specific the question is — narrow questions get 3 queries, broad ones get up to 8.

    from openai import AsyncOpenAI
    import asyncio, json
    
    client = AsyncOpenAI(base_url="https://api.aigateway.sh/v1", api_key="sk-aig-...")
    MODEL = "moonshot/kimi-k2.6"
    
    PLAN = """You are a research planner. Given a user question, produce 3-8 web search
    queries that together would answer it. Reply JSON: {"queries": ["...", ...]}."""
    
    async def plan(question: str) -> list[str]:
        r = await client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "system", "content": PLAN},
                      {"role": "user", "content": question}],
            response_format={"type": "json_object"},
            extra_headers={"x-aig-tag": "research.plan"},
        )
        return json.loads(r.choices[0].message.content)["queries"]
  2. STEP 02

    Search in parallel

    Every sub-query fires concurrently through a web search tool (swap in Tavily, Serper, or your own crawler). Because the searches run under `asyncio.gather`, wall time is the slowest single search, not the sum.

    import httpx
    
    async def search_one(q: str) -> list[dict]:
        # Tavily's search endpoint takes a POST with a JSON body; swap in your
        # provider of choice — the pipeline only needs title/url/content per hit.
        async with httpx.AsyncClient() as http:
            r = await http.post("https://api.tavily.com/search",
                                json={"query": q, "max_results": 5},
                                headers={"Authorization": "Bearer TAVILY_KEY"})
            r.raise_for_status()
            return r.json()["results"]
    
    async def fan_out_search(queries: list[str]) -> list[dict]:
        results = await asyncio.gather(*(search_one(q) for q in queries))
        return [{"q": q, "hits": hits} for q, hits in zip(queries, results)]
  3. STEP 03

    Read everything (256K context)

    Instead of chunking and embedding, Kimi K2.6's 256K context eats every fetched page whole. The extractor pulls 3-5 key claims from each source with inline citation markers.

    EXTRACT = """You are a research analyst. Given the sources below, extract 3-5 key
    claims with inline citations [1], [2], .... Reply JSON: {"claims": ["...", ...], "citations": [{"n": 1, "url": "...", "title": "..."}, ...]}."""
    
    async def extract(question: str, batch: list[dict]) -> dict:
        sources = "\n\n".join(f"[{i+1}] {h['title']} — {h['url']}\n{h['content']}"
                               for i, h in enumerate(sum((b['hits'] for b in batch), [])))
        r = await client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "system", "content": EXTRACT},
                      {"role": "user", "content": f"QUESTION: {question}\nSOURCES:\n{sources}"}],
            response_format={"type": "json_object"},
            extra_headers={"x-aig-tag": "research.extract"},
        )
        return json.loads(r.choices[0].message.content)
  4. STEP 04

    Contradiction check

    A second Kimi pass reads the extracted claims and flags pairs that disagree. Each flagged pair gets a human-readable note in the memo — "two sources disagree on this point."

    CHECK = """You are a fact checker. Given the claims below, return JSON {"contradictions":
    [{"a": <int>, "b": <int>, "issue": "<15 words>"}]} — pairs where claim a and b disagree."""
    
    async def check(claims: list[str]) -> list[dict]:
        r = await client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "system", "content": CHECK},
                      {"role": "user", "content": json.dumps({"claims": claims})}],
            response_format={"type": "json_object"},
            extra_headers={"x-aig-tag": "research.check"},
        )
        return json.loads(r.choices[0].message.content)["contradictions"]
  5. STEP 05

    Write the memo

    A final Kimi call takes the claims, citations, and contradictions and writes a 1,000-word markdown memo in your voice. Total cost for a real run: about $0.04. Total wall time: 12-20 seconds.

    WRITE = """You are a senior analyst. Write a 1,000-word memo that answers the user's
    question using only the claims/citations provided. Call out contradictions explicitly.
    End with a Sources list."""
    
    async def write(question: str, claims: list[str], citations: list[dict], contras: list[dict]) -> str:
        payload = json.dumps({"question": question, "claims": claims,
                              "citations": citations, "contradictions": contras})
        r = await client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "system", "content": WRITE},
                      {"role": "user", "content": payload}],
            extra_headers={"x-aig-tag": "research.memo"},
        )
        return r.choices[0].message.content
    
    async def research(question: str) -> str:
        qs = await plan(question)
        batches = await fan_out_search(qs)
        facts = await extract(question, batches)
        contras = await check(facts["claims"])
        return await write(question, facts["claims"], facts["citations"], contras)
    
    if __name__ == "__main__":
        print(asyncio.run(research("What changed in global GPU supply in Q1 2026?")))

When to swap Kimi for Opus or GPT-5.4

Plan, search, extract, and check run great on Kimi — the workload is structured, the traces are short, the cost is negligible. The only step worth upgrading is the memo writer when the quality bar is high: a single Opus 4.7 call for the writing step roughly triples the cost of the run and measurably raises the craft of the prose.

Change one string. `MODEL` stays Kimi for the first four stages; `write()` uses `anthropic/claude-opus-4.7`. Run both side by side with the eval example in this library to decide if the lift is worth it on your questions.
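One way to keep the swap to literally one string is a tiny routing helper. This is a sketch of ours, not part of the example above — `model_for` and the stage names are our own:

```python
MODEL = "moonshot/kimi-k2.6"                 # plan / extract / check stay here
WRITER_MODEL = "anthropic/claude-opus-4.7"   # memo only

def model_for(stage: str) -> str:
    """Route only the memo stage to the premium writer; everything else stays on Kimi."""
    return WRITER_MODEL if stage == "memo" else MODEL
```

`write()` then passes `model=model_for("memo")` while the first four stages keep `model=MODEL`, so an A/B run against all-Kimi is a one-line toggle.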

Ground it in your own corpus

The same pipeline works on a private corpus — swap the web search for a file-index lookup. Kimi K2.6's 256K context holds most company wikis whole, so you don't need a vector store for corpora under a few million tokens.

That's the killer combo: a research agent that cites your internal docs by default and falls back to the open web only when the docs don't answer. All on one key, all metered with `x-aig-tag` so you can track how much of the bill is internal vs external research.

# Replace search_one to hit your corpus first, then the web.
# internal_corpus and web_search are stand-ins for your own adapters.
async def search_one(q: str) -> list[dict]:
    hits = await internal_corpus.search(q, k=5)
    if len(hits) < 3:  # thin internal coverage — fall back to the open web
        hits += await web_search(q)
    return hits

FAQ

Why not use a vector store?

You can, and we have an example that does. But for corpora under roughly 2M tokens, Kimi K2.6's 256K context window means you can read the sources whole — no chunking, no embeddings, no retrieval tuning. Smaller pipeline, fewer failure modes, better citation fidelity.

How much does one run cost?

A 6-query research run with 4 sources per query averages ~$0.04 when everything runs on Kimi K2.6 — and the Kimi part is free on AIgateway through Apr 30, 2026. Upgrade the final memo to Opus 4.7 and the run lands around $0.12.

What web search provider should I use?

Tavily and Serper both have generous free tiers and reliable snippet quality. For a fully open-source pipeline, stand up your own SearXNG instance. The example swaps the provider in one function.

How do I add structured output (JSON)?

Every stage already uses `response_format={'type': 'json_object'}` for plan/extract/check. For the memo, switch to a JSON schema when the downstream consumer is another tool — Kimi and Opus both honor the `json_schema` response-format variant.
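As a sketch, a strict-schema variant for the memo could look like this — the schema name and fields (`memo`, `sources`) are illustrative choices of ours, not part of the pipeline above:

```python
# Hypothetical json_schema response format for the memo stage.
# Shape the properties to whatever your downstream consumer expects.
MEMO_FORMAT = {
    "type": "json_schema",
    "json_schema": {
        "name": "research_memo",
        "schema": {
            "type": "object",
            "properties": {
                "memo": {"type": "string"},
                "sources": {"type": "array", "items": {"type": "string"}},
            },
            "required": ["memo", "sources"],
        },
    },
}
# Pass response_format=MEMO_FORMAT on the write() call instead of free-form text.
```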

Can I stream the memo to the UI?

Yes — set `stream=True` on the `write()` call. The first four stages are short enough that streaming them is noise; streaming only the memo feels like Perplexity in practice.
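A minimal streaming sketch of the memo stage, reusing `client` and `MODEL` from Step 01 — the `on_token` callback (e.g. an SSE push to the browser) is a placeholder of ours:

```python
async def write_streaming(messages: list[dict], on_token) -> str:
    # Stream only the memo; the earlier stages are too short to be worth streaming.
    parts: list[str] = []
    stream = await client.chat.completions.create(
        model=MODEL, messages=messages, stream=True,
        extra_headers={"x-aig-tag": "research.memo"},
    )
    async for chunk in stream:
        delta = chunk.choices[0].delta.content or ""
        if delta:
            parts.append(delta)
            on_token(delta)  # push each token to the UI as it arrives
    return "".join(parts)   # the complete memo, for logging/caching
```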

Can I cache repeated research?

Yes. Identical plan/extract inputs hit the exact-match cache automatically. For semantic similarity — "what changed in GPU supply" vs "Q1 GPU market update" — turn on semantic caching; it saves 20-40% on repeat questions for most teams.

Is this enough for production?

For internal research, yes. For customer-facing research (legal, medical, financial), layer in a guardrail pass and an audit log of every source the memo touches — the `x-aig-tag` header is the anchor point; pair it with AIgateway's replay primitive (Enterprise) to reproduce any memo byte-for-byte.
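The audit log itself can start as plain JSONL. A minimal sketch — the file path and record shape are our own choices, not an AIgateway feature:

```python
import json, time

def audit_memo(question: str, citations: list[dict], tag: str,
               path: str = "memo_audit.jsonl") -> dict:
    # One append-only record per memo: timestamp, routing tag, question,
    # and every source URL the memo touched.
    record = {
        "ts": time.time(),
        "tag": tag,
        "question": question,
        "sources": [c["url"] for c in citations],
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record
```

Call it right after `write()` returns, passing the same `citations` list the extractor produced.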

Can I run this offline?

The orchestration runs locally; the model calls are network. Swap the model slug to a Workers-AI-edge model (`@cf/meta/llama-3.1-8b-instruct`) if you need fully-offline inference, but expect a measurable drop in memo quality.

READY TO BUILD?
Get an AIgateway key in 30 seconds. Free Kimi K2.6 through Apr 30, 2026; everything else is pass-through.
Get your key → · API reference · Kimi K2.6 details
