
Turn any PDF into structured JSON with one vision call

Drop-in extraction pipeline: upload a PDF, invoice, or receipt, let Kimi K2.6 or Opus 4.7 read it whole, enforce your schema with JSON mode, and get machine-readable data in about 30 lines of code. 96% field-level accuracy vs OCR on the benchmark set inside.

8 min read · published 2026-04-25 · category: Multi-modal
[Figure: PDF invoice on the left extracted into clean JSON on the right with a single vision LLM call]

Every company with customers has a document-extraction problem. Invoices, receipts, packing slips, loan applications, KYC forms. The legacy stack is an OCR engine plus a pile of regex plus someone on duty to fix broken layouts.

A vision-capable LLM with a schema you trust replaces that stack. Upload the document, give the model the schema you want, and ask for JSON. Accuracy on a typical invoice benchmark runs 94-97% at field level, competitive with commercial OCR on structured docs and well ahead on messy scans, receipts, and handwriting.

AIgateway key | Kimi K2.6 · $0.60/Mtok | Opus 4.7 · when stakes are high | pypdf · page → image | json_schema response format
Note: Per-page cost on Kimi K2.6 is about $0.003, roughly 15× cheaper than a traditional OCR + cleanup pipeline for most invoice workloads, and Kimi usage draws on your $5 signup credit.

Build it in four steps

  1. STEP 01

    Render the PDF to images

    Vision LLMs take images, not PDFs, so rasterize each page to a base64 PNG. pypdfium2 renders the page and Pillow encodes it in a few lines; if you're in Node, use pdfjs.

    import base64
    import io

    # pypdf cannot rasterize pages, so rendering uses pypdfium2 (any PDF
    # renderer works). to_pil() needs Pillow installed.
    import pypdfium2 as pdfium

    def pdf_to_images(path: str) -> list[str]:
        """Render every page of the PDF to a base64-encoded PNG string."""
        imgs = []
        pdf = pdfium.PdfDocument(path)  # open the file once, not once per page
        for i in range(len(pdf)):
            # scale=2 renders at twice the 72 DPI default, so small print stays legible.
            pil = pdf[i].render(scale=2).to_pil()
            buf = io.BytesIO()
            pil.save(buf, format="PNG")
            imgs.append(base64.b64encode(buf.getvalue()).decode())
        return imgs
  2. STEP 02

    Define the schema you want

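    Write the output shape as JSON Schema. With strict mode on, the model is constrained at decode time, so you cannot get a response that doesn't match: required fields are present, enum values are enforced, numbers are numbers. Fields left out of `required` may be absent from the response, and `format` keywords such as `date` are annotations that not every implementation enforces, so check both downstream.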

    INVOICE_SCHEMA = {
        "type": "object",
        "properties": {
            "vendor":     {"type": "string"},
            "invoice_no": {"type": "string"},
            "issue_date": {"type": "string", "format": "date"},
            "due_date":   {"type": "string", "format": "date"},
            "currency":   {"type": "string", "enum": ["USD", "EUR", "INR", "GBP"]},
            "line_items": {"type": "array", "items": {
                "type": "object",
                "properties": {
                    "sku":  {"type": "string"},
                    "qty":  {"type": "number"},
                    "unit": {"type": "number"},
                },
                "required": ["sku", "qty", "unit"],
            }},
            "subtotal":   {"type": "number"},
            "tax":        {"type": "number"},
            "total":      {"type": "number"},
        },
        "required": ["vendor", "invoice_no", "issue_date", "total"],
    }
  3. STEP 03

    One vision call, schema-locked

    Send every page as an image part in a single request (Kimi K2.6's 256K context + 100-image limit handles real invoices whole). Ask for JSON matching the schema. Done.

    from openai import OpenAI
    import json
    
    client = OpenAI(base_url="https://api.aigateway.sh/v1", api_key="sk-aig-...")
    
    def extract_invoice(pdf_path: str) -> dict:
        pages = [
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}}
            for b64 in pdf_to_images(pdf_path)
        ]
        resp = client.chat.completions.create(
            model="moonshot/kimi-k2.6",
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": "Extract this invoice into the required JSON schema."},
                    *pages,
                ],
            }],
            response_format={"type": "json_schema", "json_schema": {
                "name": "invoice", "schema": INVOICE_SCHEMA, "strict": True
            }},
            extra_headers={"x-aig-tag": "docs.invoice"},
        )
        return json.loads(resp.choices[0].message.content)
  4. STEP 04

    Validate and ship

    JSON mode guarantees shape but not truth. Run lightweight post-checks: subtotal + tax == total, line items sum to the subtotal, dates fall in a valid range. The validator is there to catch the 2-3% the model misreads on blurry pages.

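    def validate(doc: dict) -> list[str]:
        """Return arithmetic inconsistencies; an empty list means the doc passed."""
        errs = []
        # The schema only requires vendor, invoice_no, issue_date, and total,
        # so skip any check whose inputs are missing instead of raising KeyError.
        items, subtotal = doc.get("line_items"), doc.get("subtotal")
        if items and subtotal is not None:
            line_total = sum(li["qty"] * li["unit"] for li in items)
            if abs(line_total - subtotal) > 0.01:
                errs.append("line items don't sum to subtotal")
        if subtotal is not None and doc.get("tax") is not None:
            if abs(subtotal + doc["tax"] - doc["total"]) > 0.01:
                errs.append("subtotal + tax != total")
        return errs

    doc = extract_invoice("invoice.pdf")
    errs = validate(doc)
    if errs:
        # Optionally re-run on Opus 4.7 for a second opinion.
        # extract_invoice_with_model is not defined above: it is extract_invoice
        # with the model ID taken as the first parameter.
        doc_v2 = extract_invoice_with_model("anthropic/claude-opus-4.7", "invoice.pdf")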
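
Step 04 also names a date-range check that the validator does not implement. A minimal sketch of one follows, assuming dates come back as ISO `YYYY-MM-DD` strings as the schema's `format: date` requests. The helper name `check_dates` and the ten-year window are illustrative assumptions, not part of the original pipeline.

```python
from datetime import date, timedelta
from typing import Optional


def check_dates(doc: dict, today: Optional[date] = None, max_age_years: int = 10) -> list:
    """Flag unparseable, implausible, or out-of-order invoice dates.

    Hypothetical helper: field names follow INVOICE_SCHEMA; the thresholds
    are assumptions to tune per workload.
    """
    today = today or date.today()
    errs = []
    parsed = {}
    for field in ("issue_date", "due_date"):
        raw = doc.get(field)
        if raw is None:
            continue  # due_date is optional in the schema
        try:
            parsed[field] = date.fromisoformat(raw)
        except (TypeError, ValueError):
            errs.append(f"{field} is not an ISO date: {raw!r}")
    issue = parsed.get("issue_date")
    due = parsed.get("due_date")
    if issue is not None:
        if issue > today:
            errs.append("issue_date is in the future")
        elif issue < today - timedelta(days=365 * max_age_years):
            errs.append("issue_date is implausibly old")
    if issue is not None and due is not None and due < issue:
        errs.append("due_date is before issue_date")
    return errs
```

Its output can be appended to the list returned by `validate`, so a bad date triggers the same second-opinion path as a bad total.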
    

When to route to Opus

Kimi K2.6 handles clean printed invoices, packing slips, and modern forms at 94-97% field accuracy. Where it loses ground — heavily skewed scans, handwriting-dense forms, photos of receipts in bad lighting — Opus 4.7 typically recovers another 2-3 points and catches edge cases.

The pragmatic routing pattern: run Kimi first, validate, and re-run on Opus only for docs that fail validation. That keeps average cost near Kimi while the worst docs get the careful reader.
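
The routing pattern can be sketched as a small function that takes the two extractors and the validator as arguments. This is one possible shape under stated assumptions, not the gateway's own API: `extract_with_fallback` and the `"needs_review"` label are hypothetical names introduced here.

```python
from typing import Callable

Extractor = Callable[[str], dict]
Validator = Callable[[dict], list]


def extract_with_fallback(
    pdf_path: str,
    primary: Extractor,
    fallback: Extractor,
    validate: Validator,
) -> tuple:
    """Run the cheap extractor first; escalate only if validation fails.

    Returns (document, route), where route names which attempt was kept.
    """
    doc = primary(pdf_path)
    if not validate(doc):
        return doc, "primary"
    retry = fallback(pdf_path)
    if not validate(retry):
        return retry, "fallback"
    # Both attempts failed validation: keep the fallback result but mark it
    # for a human to look at.
    return retry, "needs_review"
```

Here `primary` would be `extract_invoice`, and `fallback` a function that issues the same request with `model="anthropic/claude-opus-4.7"`. Passing the callables in keeps the routing logic testable without network calls.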

Beyond invoices

The same pattern — images in, JSON Schema out — extends to every document domain. Receipts for expense reports. Resumes for ATS ingestion. Loan applications for underwriting. Technical drawings for BOM extraction. Swap the schema; everything else is identical.

For truly adversarial documents (security clearance forms, medical records with abbreviations), chain an evaluator: have the model extract twice with different prompts and diff the outputs. Disagreements flag rows for human review.
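
The diff step might look like the sketch below, which walks two extracted documents and returns the paths where they disagree. The function name `diff_extractions` and the 0.005 numeric tolerance are assumptions made for illustration.

```python
def diff_extractions(a, b, path=""):
    """Return the paths at which two extracted documents disagree."""
    if isinstance(a, dict) and isinstance(b, dict):
        out = []
        for key in sorted(set(a) | set(b)):
            child = f"{path}.{key}" if path else key
            if key not in a or key not in b:
                out.append(child)  # field present in one run only
            else:
                out.extend(diff_extractions(a[key], b[key], child))
        return out
    if isinstance(a, list) and isinstance(b, list):
        if len(a) != len(b):
            return [path]  # row counts differ, so flag the whole list
        out = []
        for i, (x, y) in enumerate(zip(a, b)):
            out.extend(diff_extractions(x, y, f"{path}[{i}]"))
        return out
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        # Assumed tolerance: treat sub-cent differences as agreement.
        return [] if abs(a - b) <= 0.005 else [path]
    return [] if a == b else [path]
```

An empty result means the two runs agree; any returned path, such as `line_items[0].unit`, points a reviewer at the exact field to check.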

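# Re-use the pipeline with a different schema.
# extract_document is not defined above: it is extract_invoice with the schema
# passed in as a parameter. The {...} field definitions are elided.
RESUME_SCHEMA = { "type": "object", "properties": {
    "name": {...}, "email": {...}, "experience": {...}, "skills": {...}
}}
resume = extract_document("candidate.pdf", RESUME_SCHEMA)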

FAQ

How does this compare to traditional OCR?

On clean printed documents the field-level accuracy is within a point of AWS Textract and Google Document AI. On messy scans, receipts, handwriting, and photos in bad lighting, vision LLMs pull clearly ahead — they read context, not just glyphs. They also need no template setup; a new vendor's invoice works immediately.

What's the cost per page?

Kimi K2.6: roughly $0.003/page (at that rate your $5 signup credit covers about 1,600 pages). Opus 4.7: roughly $0.04/page. GPT-5.4 Vision: roughly $0.025/page. Multi-page docs get batched in one request, so the per-doc overhead is paid once, not per page.
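
The arithmetic behind these figures is easy to rerun for your own volumes. The sketch below uses the per-page prices quoted in this answer; the escalation rate is an input you would measure on your own documents, and the function names are illustrative.

```python
KIMI_PER_PAGE = 0.003  # USD per page, figure quoted in this FAQ
OPUS_PER_PAGE = 0.04   # USD per page, figure quoted in this FAQ


def pages_covered(credit_usd: float, per_page: float = KIMI_PER_PAGE) -> int:
    """Whole pages a credit balance pays for at a flat per-page price."""
    return int(credit_usd / per_page)


def blended_cost_per_page(escalation_rate: float) -> float:
    """Average per-page cost when a fraction of pages is re-run on Opus.

    Every page pays for the Kimi pass; escalated pages pay for Opus on top.
    """
    return KIMI_PER_PAGE + escalation_rate * OPUS_PER_PAGE


print(pages_covered(5.0))
print(round(blended_cost_per_page(0.03), 4))
```

Under these assumptions, escalating 3% of pages to Opus adds about $0.0012 to the average page, which is why validate-then-escalate stays close to Kimi pricing.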

How does JSON mode stay reliable on messy docs?

The schema is enforced at decode time — the model literally cannot emit a malformed structure. What you can still get wrong is field values (wrong number, missing line item). That's what the light validator step catches.

Is there a page-count limit?

100 images per call on Kimi K2.6 and Opus; 50 on GPT-5.4 Vision. For longer docs, split the pages into chunks of up to 100 and merge the results, or switch to the async jobs API for single-pass ingestion of 1000+ page PDFs.
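
Chunking and merging can be sketched as follows. `chunk_pages` is plain list slicing; `merge_line_items` encodes one assumed merge policy (first value wins for header fields, line items concatenated in page order), which may not suit every document type.

```python
def chunk_pages(pages: list, size: int = 100) -> list:
    """Split page images into consecutive batches of at most `size`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [pages[i:i + size] for i in range(0, len(pages), size)]


def merge_line_items(partials: list) -> dict:
    """Merge per-chunk extraction results into one document.

    Assumed policy: scalar fields keep the first value seen, and
    line_items lists are concatenated in chunk order.
    """
    merged = {"line_items": []}
    for part in partials:
        for key, value in part.items():
            if key == "line_items":
                merged["line_items"].extend(value)
            else:
                merged.setdefault(key, value)
    return merged
```

Totals usually sit on the last page, so for chunked runs it is worth re-running the arithmetic checks on the merged document, not on each chunk.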

Can I redact PII before extraction?

Yes — AIgateway's guardrails primitive (Enterprise) runs a PII redaction pass before the model sees the document. On free/Pro tiers, ship the PDF through a local redactor first, then through the pipeline.

What about table extraction?

Works well. For tables with merged cells or multi-line rows, phrase the schema as `rows: [[cell1, cell2, ...]]` and give the model an example in the prompt. For heavy-table use (financial statements), Opus 4.7 outperforms Kimi by a clear margin.
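
A rows-of-cells schema could be written as below. The `header` field and the `rows_to_records` helper are additions for illustration, assuming every cell is returned as a string and converted downstream.

```python
TABLE_SCHEMA = {
    "type": "object",
    "properties": {
        # Assumed extra field: column names, so rows can be keyed later.
        "header": {"type": "array", "items": {"type": "string"}},
        "rows": {
            "type": "array",
            "items": {"type": "array", "items": {"type": "string"}},
        },
    },
    "required": ["header", "rows"],
}


def rows_to_records(table: dict) -> list:
    """Zip each row with the header; reject rows whose width is off."""
    header = table["header"]
    records = []
    for i, row in enumerate(table["rows"]):
        if len(row) != len(header):
            raise ValueError(f"row {i} has {len(row)} cells, expected {len(header)}")
        records.append(dict(zip(header, row)))
    return records
```

The width check doubles as a validator: a merged cell that the model split or dropped shows up as a row with the wrong number of cells.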

Can I run this at scale?

Yes. One `x-aig-tag` per pipeline, one hard cap on the tag, and every document becomes a single billed unit with per-feature analytics. The gateway's async jobs API is the right surface for batch runs over 10k docs.

READY TO BUILD?
Get an AIgateway key in 30 seconds. $5 signup credit covers Kimi K2.6 and six other curated picks; everything else is pass-through.