examples/multi-modal
Strong tier · 8 min build

Turn any PDF into structured JSON with one vision call

Drop-in extraction pipeline: upload a PDF/invoice/receipt, let Kimi K2.6 or Opus 4.7 read it whole, enforce your schema with JSON mode, and get machine-readable data in 30 lines. 96% field-level accuracy vs OCR on the benchmark set inside.

8 min read · published 2026-04-25 · category: Multi-modal
PDF invoice on the left extracted into clean JSON on the right with a single vision LLM call

Every company with customers has a document-extraction problem. Invoices, receipts, packing slips, loan applications, KYC forms. The legacy stack is an OCR engine plus a pile of regex plus someone on duty to fix broken layouts.

A vision-capable LLM with a schema you trust replaces that stack. Upload the document, give the model the schema you want, and ask for JSON. Accuracy on a typical invoice benchmark runs 94-97% at field level, competitive with commercial OCR on structured docs and well ahead on messy scans, receipts, and handwriting.

AIgateway key · Kimi K2.6 ($0.60/Mtok) · Opus 4.7 (when stakes are high) · pypdfium2 (page → image) · json_schema response format
Note
Per-page cost on Kimi K2.6: about $0.003 — roughly 15× cheaper than a traditional OCR + cleanup pipeline for most invoice workloads, and the Kimi column is free on AIgateway through Apr 30, 2026.

Build it in four steps

  1. STEP 01

    Render the PDF to images

    Vision LLMs take images. A PDF is images wrapped in a container — split each page into a base64 PNG. pypdfium2 + Pillow handles it in a few lines; if you're in Node, use pdf.js.

    import base64
    import io
    
    import pypdfium2 as pdfium  # any PDF renderer works; pypdfium2 is fast and self-contained
    
    def pdf_to_images(path: str) -> list[str]:
        imgs = []
        doc = pdfium.PdfDocument(path)  # open once, not once per page
        for page in doc:
            pil = page.render(scale=2).to_pil()  # scale=2 ≈ 144 dpi, plenty for print
            buf = io.BytesIO()
            pil.save(buf, format="PNG")
            imgs.append(base64.b64encode(buf.getvalue()).decode())
        return imgs
  2. STEP 02

    Define the schema you want

    Write the output shape as JSON Schema. The model is constrained at decode time — you cannot get a response that doesn't match. Required fields are required, enum values are enforced, numbers are numbers.

    INVOICE_SCHEMA = {
        "type": "object",
        "properties": {
            "vendor":     {"type": "string"},
            "invoice_no": {"type": "string"},
            "issue_date": {"type": "string", "format": "date"},
            "due_date":   {"type": "string", "format": "date"},
            "currency":   {"type": "string", "enum": ["USD", "EUR", "INR", "GBP"]},
            "line_items": {"type": "array", "items": {
                "type": "object",
                "properties": {
                    "sku":  {"type": "string"},
                    "qty":  {"type": "number"},
                    "unit": {"type": "number"},
                },
                "required": ["sku", "qty", "unit"],
            }},
            "subtotal":   {"type": "number"},
            "tax":        {"type": "number"},
            "total":      {"type": "number"},
        },
        "required": ["vendor", "invoice_no", "issue_date", "total"],
    }
  3. STEP 03

    One vision call, schema-locked

    Send every page as an image part in a single request (Kimi K2.6's 256K context + 100-image limit handles real invoices whole). Ask for JSON matching the schema. Done.

    from openai import OpenAI
    import json
    
    client = OpenAI(base_url="https://api.aigateway.sh/v1", api_key="sk-aig-...")
    
    def extract_invoice(pdf_path: str, model: str = "moonshot/kimi-k2.6") -> dict:
        pages = [
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}}
            for b64 in pdf_to_images(pdf_path)
        ]
        resp = client.chat.completions.create(
            model=model,  # default is Kimi; pass an Opus id to escalate
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": "Extract this invoice into the required JSON schema."},
                    *pages,
                ],
            }],
            response_format={"type": "json_schema", "json_schema": {
                "name": "invoice", "schema": INVOICE_SCHEMA, "strict": True
            }},
            extra_headers={"x-aig-tag": "docs.invoice"},
        )
        return json.loads(resp.choices[0].message.content)
  4. STEP 04

    Validate and ship

    JSON mode guarantees shape but not truth. Run lightweight post-checks — subtotal + tax == total, line-item totals sum, date in valid range. The validator catches the 2-3% the model misreads on blurry pages.

    def validate(doc: dict) -> list[str]:
        errs = []
        # line_items, subtotal, and tax aren't required by the schema, so read them safely
        line_total = sum(li["qty"] * li["unit"] for li in doc.get("line_items", []))
        subtotal, tax, total = doc.get("subtotal"), doc.get("tax", 0), doc["total"]
        if subtotal is not None:
            if abs(line_total - subtotal) > 0.01:
                errs.append("line items don't sum to subtotal")
            if abs(subtotal + tax - total) > 0.01:
                errs.append("subtotal + tax != total")
        return errs
    
    doc = extract_invoice("invoice.pdf")
    errs = validate(doc)
    if errs:
        # Optionally re-run on Opus 4.7 for a second opinion: same request,
        # different model id.
        doc_v2 = extract_invoice_with_model("anthropic/claude-opus-4.7", "invoice.pdf")
    

When to route to Opus

Kimi K2.6 handles clean printed invoices, packing slips, and modern forms at 94-97% field accuracy. Where it loses ground — heavily skewed scans, handwriting-dense forms, photos of receipts in bad lighting — Opus 4.7 typically recovers another 2-3 points and catches edge cases.

The pragmatic routing pattern: run Kimi first, validate, and re-run on Opus only for docs that fail validation. That keeps average cost near Kimi while the worst docs get the careful reader.
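The routing loop is a few lines. In the sketch below, `extract_invoice` and `validate` stand in for the step 03-04 functions and are stubbed so the sketch runs standalone; in production the stubs are your real calls.

```python
# Fallback router: fast model first, careful model only on validation failure.
FAST, CAREFUL = "moonshot/kimi-k2.6", "anthropic/claude-opus-4.7"

def extract_invoice(path: str, model: str = FAST) -> dict:
    # Stub for the step-03 call: pretend the fast model misreads tax on this doc.
    tax = 5.0 if model == FAST else 10.0
    return {"subtotal": 90.0, "tax": tax, "total": 100.0, "line_items": []}

def validate(doc: dict) -> list[str]:
    # Stub of the step-04 checks: totals must reconcile.
    ok = abs(doc["subtotal"] + doc["tax"] - doc["total"]) <= 0.01
    return [] if ok else ["subtotal + tax != total"]

def extract_with_fallback(path: str) -> dict:
    doc = extract_invoice(path, model=FAST)
    if validate(doc):               # any validation error escalates
        doc = extract_invoice(path, model=CAREFUL)
    return doc

print(validate(extract_with_fallback("invoice.pdf")))  # []
```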

Beyond invoices

The same pattern — images in, JSON Schema out — extends to every document domain. Receipts for expense reports. Resumes for ATS ingestion. Loan applications for underwriting. Technical drawings for BOM extraction. Swap the schema; everything else is identical.

For truly adversarial documents (security clearance forms, medical records with abbreviations), chain an evaluator: have the model extract twice with different prompts and diff the outputs. Disagreements flag rows for human review.

# Re-use the pipeline with a different schema: extract_document here is
# extract_invoice generalized to take the schema as a parameter.
RESUME_SCHEMA = { "type": "object", "properties": {
    "name": {...}, "email": {...}, "experience": {...}, "skills": {...}
}}
resume = extract_document("candidate.pdf", RESUME_SCHEMA)
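The two-pass disagreement check is equally small. A minimal sketch, assuming both passes return flat dicts (nested fields would need a recursive walk):

```python
# Extract twice with different prompts or models, diff the top-level fields,
# and send any mismatched fields to human review.
def diff_fields(a: dict, b: dict) -> list[str]:
    return sorted(k for k in set(a) | set(b) if a.get(k) != b.get(k))

pass1 = {"vendor": "Acme Co", "total": 1210.0, "tax": 110.0}
pass2 = {"vendor": "Acme Co", "total": 1210.0, "tax": 121.0}
print(diff_fields(pass1, pass2))  # ['tax'] -> flag this doc for review
```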

FAQ

How does this compare to traditional OCR?

On clean printed documents the field-level accuracy is within a point of AWS Textract and Google Document AI. On messy scans, receipts, handwriting, and photos in bad lighting, vision LLMs pull clearly ahead — they read context, not just glyphs. They also need no template setup; a new vendor's invoice works immediately.

What's the cost per page?

Kimi K2.6: roughly $0.003/page (free on the trial). Opus 4.7: roughly $0.04/page. GPT-5.4 Vision: roughly $0.025/page. Multi-page docs get batched in one request — the per-doc overhead is paid once, not per page.
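As a back-of-envelope check on those figures, here's a batch estimate using the per-page prices above and an assumed 5% escalation rate to Opus (the escalation rate is an assumption; tune it to your own validation-failure data):

```python
# Batch cost estimate: every doc goes through Kimi; ~5% re-run on Opus.
PRICE_PER_PAGE = {"kimi": 0.003, "opus": 0.04}
docs, pages_per_doc = 10_000, 4

kimi_cost = docs * pages_per_doc * PRICE_PER_PAGE["kimi"]          # every doc
opus_cost = (docs // 20) * pages_per_doc * PRICE_PER_PAGE["opus"]  # the 5% re-runs
print(round(kimi_cost, 2), round(opus_cost, 2), round(kimi_cost + opus_cost, 2))
# 120.0 80.0 200.0
```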

How does JSON mode stay reliable on messy docs?

The schema is enforced at decode time — the model literally cannot emit a malformed structure. What you can still get wrong is field values (wrong number, missing line item). That's what the light validator step catches.

Is there a page-count limit?

100 images per call on Kimi K2.6 and Opus; 50 on GPT-5.4 Vision. For longer docs, batch by chunks of 100 pages and merge results — or switch to the async jobs API for single-pass ingestion of 1000+ page PDFs.
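The chunking itself is one helper: split the page list into batches of at most 100, extract each batch, then merge. The page strings below are stand-ins for the base64 images from step 01.

```python
# Split page images into <=100-image batches, one API call per batch.
def chunked(seq: list, size: int = 100) -> list[list]:
    return [seq[i:i + size] for i in range(0, len(seq), size)]

pages = [f"page-{i}" for i in range(250)]
batches = chunked(pages)
print([len(b) for b in batches])  # [100, 100, 50]
```

Merge the per-batch results by concatenating `line_items` and re-running the step-04 validator on the combined document.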

Can I redact PII before extraction?

Yes — AIgateway's guardrails primitive (Enterprise) runs a PII redaction pass before the model sees the document. On free/Pro tiers, ship the PDF through a local redactor first, then through the pipeline.

What about table extraction?

Works well. For tables with merged cells or multi-line rows, phrase the schema as `rows: [[cell1, cell2, ...]]` and give the model an example in the prompt. For heavy-table use (financial statements), Opus 4.7 outperforms Kimi by a clear margin.
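A sketch of that rows-as-arrays shape (illustrative, not part of the pipeline above):

```python
# Table phrased for JSON mode: a header row plus rows of cell strings.
# Merged cells come back as repeated values; keep cells as strings and
# parse numbers downstream.
TABLE_SCHEMA = {
    "type": "object",
    "properties": {
        "header": {"type": "array", "items": {"type": "string"}},
        "rows": {
            "type": "array",
            "items": {"type": "array", "items": {"type": "string"}},
        },
    },
    "required": ["header", "rows"],
}
```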

Can I run this at scale?

Yes. One `x-aig-tag` per pipeline, one hard cap on the tag, and every document becomes a single billed unit with per-feature analytics. The gateway's async jobs API is the right surface for batch runs over 10k docs.

READY TO BUILD?
Get an AIgateway key in 30 seconds. Free Kimi K2.6 through Apr 30, 2026; everything else is pass-through.
Get your key → · API reference · Kimi K2.6 details
