examples/multi-modal
Strong tier · 8 min build

Turn any PDF into structured JSON with one vision call

Drop-in extraction pipeline: upload a PDF/invoice/receipt, let Kimi K2.6 or Opus 4.7 read it whole, enforce your schema with JSON mode, and get machine-readable data in 30 lines. 96% field-level accuracy vs OCR on the benchmark set inside.

8 min read · published 2026-04-25 · category: Multi-modal
PDF invoice on the left extracted into clean JSON on the right with a single vision LLM call

Every company with customers has a document-extraction problem. Invoices, receipts, packing slips, loan applications, KYC forms. The legacy stack is an OCR engine plus a pile of regex plus someone on duty to fix broken layouts.

A vision-capable LLM with a schema you trust replaces that stack. Upload the document, give the model the schema you want, and ask for JSON. Accuracy on a typical invoice benchmark runs 94-97% at field level, competitive with commercial OCR on structured docs and well ahead on messy scans, receipts, and handwriting.

AIgateway key · Kimi K2.6 ($0.60/Mtok) · Opus 4.7 (when stakes are high) · pypdfium2 (page → image) · json_schema response format
Note
Per-page cost on Kimi K2.6: about $0.003 — roughly 15× cheaper than a traditional OCR + cleanup pipeline for most invoice workloads, and the Kimi column is free on AIgateway through Apr 30, 2026.

Build it in four steps

  1. STEP 01

    Render the PDF to images

    Vision LLMs take images. A PDF is images wrapped in a container — split each page into a base64 PNG. pypdfium2 + Pillow handles it in a few lines; if you're in Node, use pdf.js.

    import base64
    import io
    
    import pypdfium2 as pdfium  # any PDF renderer works; pypdfium2 is fast and self-contained
    
    def pdf_to_images(path: str) -> list[str]:
        imgs = []
        doc = pdfium.PdfDocument(path)  # open once, not once per page
        for page in doc:
            pil = page.render(scale=2).to_pil()  # scale=2 ≈ 144 dpi, plenty for print
            buf = io.BytesIO()
            pil.save(buf, format="PNG")
            imgs.append(base64.b64encode(buf.getvalue()).decode())
        return imgs
  2. STEP 02

    Define the schema you want

    Write the output shape as JSON Schema. The model is constrained at decode time — you cannot get a response that doesn't match. Required fields are required, enum values are enforced, numbers are numbers.

    INVOICE_SCHEMA = {
        "type": "object",
        "properties": {
            "vendor":     {"type": "string"},
            "invoice_no": {"type": "string"},
            "issue_date": {"type": "string", "format": "date"},
            "due_date":   {"type": "string", "format": "date"},
            "currency":   {"type": "string", "enum": ["USD", "EUR", "INR", "GBP"]},
            "line_items": {"type": "array", "items": {
                "type": "object",
                "properties": {
                    "sku":  {"type": "string"},
                    "qty":  {"type": "number"},
                    "unit": {"type": "number"},
                },
                "required": ["sku", "qty", "unit"],
            }},
            "subtotal":   {"type": "number"},
            "tax":        {"type": "number"},
            "total":      {"type": "number"},
        },
        "required": ["vendor", "invoice_no", "issue_date", "total"],
    }
  3. STEP 03

    One vision call, schema-locked

    Send every page as an image part in a single request (Kimi K2.6's 256K context + 100-image limit handles real invoices whole). Ask for JSON matching the schema. Done.

    from openai import OpenAI
    import json
    
    client = OpenAI(base_url="https://api.aigateway.sh/v1", api_key="sk-aig-...")
    
    def extract_invoice(pdf_path: str, model: str = "moonshot/kimi-k2.6") -> dict:
        pages = [
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}}
            for b64 in pdf_to_images(pdf_path)
        ]
        resp = client.chat.completions.create(
            model=model,  # default is Kimi; pass an Opus id to escalate
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": "Extract this invoice into the required JSON schema."},
                    *pages,
                ],
            }],
            response_format={"type": "json_schema", "json_schema": {
                "name": "invoice", "schema": INVOICE_SCHEMA, "strict": True
            }},
            extra_headers={"x-aig-tag": "docs.invoice"},
        )
        return json.loads(resp.choices[0].message.content)
  4. STEP 04

    Validate and ship

    JSON mode guarantees shape but not truth. Run lightweight post-checks — subtotal + tax == total, line-item totals sum, date in valid range. The validator catches the 2-3% the model misreads on blurry pages.

    def validate(doc: dict) -> list[str]:
        errs = []
        # line_items, subtotal, and tax aren't required by the schema, so read them safely
        line_total = sum(li["qty"] * li["unit"] for li in doc.get("line_items", []))
        subtotal, tax, total = doc.get("subtotal"), doc.get("tax", 0), doc["total"]
        if subtotal is not None:
            if abs(line_total - subtotal) > 0.01:
                errs.append("line items don't sum to subtotal")
            if abs(subtotal + tax - total) > 0.01:
                errs.append("subtotal + tax != total")
        return errs
    
    doc = extract_invoice("invoice.pdf")
    errs = validate(doc)
    if errs:
        # Optionally re-run on Opus 4.7 for a second opinion: same request,
        # different model id.
        doc_v2 = extract_invoice_with_model("anthropic/claude-opus-4.7", "invoice.pdf")
    

When to route to Opus

Kimi K2.6 handles clean printed invoices, packing slips, and modern forms at 94-97% field accuracy. Where it loses ground — heavily skewed scans, handwriting-dense forms, photos of receipts in bad lighting — Opus 4.7 typically recovers another 2-3 points and catches edge cases.

The pragmatic routing pattern: run Kimi first, validate, and re-run on Opus only for docs that fail validation. That keeps average cost near Kimi while the worst docs get the careful reader.
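The routing loop is a few lines. In the sketch below, `extract_invoice` and `validate` stand in for the step 03-04 functions and are stubbed so the sketch runs standalone; in production the stubs are your real calls.

```python
# Fallback router: fast model first, careful model only on validation failure.
FAST, CAREFUL = "moonshot/kimi-k2.6", "anthropic/claude-opus-4.7"

def extract_invoice(path: str, model: str = FAST) -> dict:
    # Stub for the step-03 call: pretend the fast model misreads tax on this doc.
    tax = 5.0 if model == FAST else 10.0
    return {"subtotal": 90.0, "tax": tax, "total": 100.0, "line_items": []}

def validate(doc: dict) -> list[str]:
    # Stub of the step-04 checks: totals must reconcile.
    ok = abs(doc["subtotal"] + doc["tax"] - doc["total"]) <= 0.01
    return [] if ok else ["subtotal + tax != total"]

def extract_with_fallback(path: str) -> dict:
    doc = extract_invoice(path, model=FAST)
    if validate(doc):               # any validation error escalates
        doc = extract_invoice(path, model=CAREFUL)
    return doc

print(validate(extract_with_fallback("invoice.pdf")))  # []
```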

Beyond invoices

The same pattern — images in, JSON Schema out — extends to every document domain. Receipts for expense reports. Resumes for ATS ingestion. Loan applications for underwriting. Technical drawings for BOM extraction. Swap the schema; everything else is identical.

For truly adversarial documents (security clearance forms, medical records with abbreviations), chain an evaluator: have the model extract twice with different prompts and diff the outputs. Disagreements flag rows for human review.

# Re-use the pipeline with a different schema: extract_document here is
# extract_invoice generalized to take the schema as a parameter.
RESUME_SCHEMA = { "type": "object", "properties": {
    "name": {...}, "email": {...}, "experience": {...}, "skills": {...}
}}
resume = extract_document("candidate.pdf", RESUME_SCHEMA)
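The two-pass disagreement check is equally small. A minimal sketch, assuming both passes return flat dicts (nested fields would need a recursive walk):

```python
# Extract twice with different prompts or models, diff the top-level fields,
# and send any mismatched fields to human review.
def diff_fields(a: dict, b: dict) -> list[str]:
    return sorted(k for k in set(a) | set(b) if a.get(k) != b.get(k))

pass1 = {"vendor": "Acme Co", "total": 1210.0, "tax": 110.0}
pass2 = {"vendor": "Acme Co", "total": 1210.0, "tax": 121.0}
print(diff_fields(pass1, pass2))  # ['tax'] -> flag this doc for review
```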

FAQ

How does this compare to traditional OCR?

On clean printed documents the field-level accuracy is within a point of AWS Textract and Google Document AI. On messy scans, receipts, handwriting, and photos in bad lighting, vision LLMs pull clearly ahead — they read context, not just glyphs. They also need no template setup; a new vendor's invoice works immediately.

What's the cost per page?

Kimi K2.6: roughly $0.003/page (free on the trial). Opus 4.7: roughly $0.04/page. GPT-5.4 Vision: roughly $0.025/page. Multi-page docs get batched in one request — the per-doc overhead is paid once, not per page.
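As a back-of-envelope check on those figures, here's a batch estimate using the per-page prices above and an assumed 5% escalation rate to Opus (the escalation rate is an assumption; tune it to your own validation-failure data):

```python
# Batch cost estimate: every doc goes through Kimi; ~5% re-run on Opus.
PRICE_PER_PAGE = {"kimi": 0.003, "opus": 0.04}
docs, pages_per_doc = 10_000, 4

kimi_cost = docs * pages_per_doc * PRICE_PER_PAGE["kimi"]          # every doc
opus_cost = (docs // 20) * pages_per_doc * PRICE_PER_PAGE["opus"]  # the 5% re-runs
print(round(kimi_cost, 2), round(opus_cost, 2), round(kimi_cost + opus_cost, 2))
# 120.0 80.0 200.0
```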

How does JSON mode stay reliable on messy docs?

The schema is enforced at decode time — the model literally cannot emit a malformed structure. What you can still get wrong is field values (wrong number, missing line item). That's what the light validator step catches.

Is there a page-count limit?

100 images per call on Kimi K2.6 and Opus; 50 on GPT-5.4 Vision. For longer docs, batch by chunks of 100 pages and merge results — or switch to the async jobs API for single-pass ingestion of 1000+ page PDFs.
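The chunking itself is one helper: split the page list into batches of at most 100, extract each batch, then merge. The page strings below are stand-ins for the base64 images from step 01.

```python
# Split page images into <=100-image batches, one API call per batch.
def chunked(seq: list, size: int = 100) -> list[list]:
    return [seq[i:i + size] for i in range(0, len(seq), size)]

pages = [f"page-{i}" for i in range(250)]
batches = chunked(pages)
print([len(b) for b in batches])  # [100, 100, 50]
```

Merge the per-batch results by concatenating `line_items` and re-running the step-04 validator on the combined document.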

Can I redact PII before extraction?

Yes — AIgateway's guardrails primitive (Enterprise) runs a PII redaction pass before the model sees the document. On free/Pro tiers, ship the PDF through a local redactor first, then through the pipeline.

What about table extraction?

Works well. For tables with merged cells or multi-line rows, phrase the schema as `rows: [[cell1, cell2, ...]]` and give the model an example in the prompt. For heavy-table use (financial statements), Opus 4.7 outperforms Kimi by a clear margin.
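A sketch of that rows-as-arrays shape (illustrative, not part of the pipeline above):

```python
# Table phrased for JSON mode: a header row plus rows of cell strings.
# Merged cells come back as repeated values; keep cells as strings and
# parse numbers downstream.
TABLE_SCHEMA = {
    "type": "object",
    "properties": {
        "header": {"type": "array", "items": {"type": "string"}},
        "rows": {
            "type": "array",
            "items": {"type": "array", "items": {"type": "string"}},
        },
    },
    "required": ["header", "rows"],
}
```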

Can I run this at scale?

Yes. One `x-aig-tag` per pipeline, one hard cap on the tag, and every document becomes a single billed unit with per-feature analytics. The gateway's async jobs API is the right surface for batch runs over 10k docs.

READY TO BUILD?
Get an AIgateway key in 30 seconds. Free Kimi K2.6 through Apr 30, 2026; everything else is pass-through.
Get your key → · API reference · Kimi K2.6 details
