Drop-in extraction pipeline: upload a PDF/invoice/receipt, let Kimi K2.6 or Opus 4.7 read it whole, enforce your schema with JSON mode, and get machine-readable data in 30 lines. 96% field-level accuracy vs OCR on the benchmark set inside.
Every company with customers has a document-extraction problem. Invoices, receipts, packing slips, loan applications, KYC forms. The legacy stack is an OCR engine plus a pile of regex plus someone on duty to fix broken layouts.
A vision-capable LLM with a schema you trust replaces that stack. Upload the document, give the model the schema you want, and ask for JSON. Accuracy on a typical invoice benchmark runs 94-97% at field level, competitive with commercial OCR on structured docs and well ahead on messy scans, receipts, and handwriting.
Vision LLMs take images. A PDF is images wrapped in a container — split each page into a base64 PNG. pypdfium2 + Pillow handle it in a few lines (pypdf alone can read page metadata but can't rasterize); if you're in Node, use pdf.js.
```python
import base64
import io

import pypdfium2 as pdfium

def pdf_to_images(path: str) -> list[str]:
    """Render each PDF page to a base64-encoded PNG."""
    doc = pdfium.PdfDocument(path)
    imgs = []
    for i in range(len(doc)):
        pil = doc[i].render(scale=2).to_pil()  # scale=2 ≈ 144 DPI; raise it for small print
        buf = io.BytesIO()
        pil.save(buf, format="PNG")
        imgs.append(base64.b64encode(buf.getvalue()).decode())
    return imgs
```

Write the output shape as JSON Schema. The model is constrained at decode time — you cannot get a response that doesn't match: required fields are required, enum values are enforced, numbers are numbers.
```python
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "invoice_no": {"type": "string"},
        "issue_date": {"type": "string", "format": "date"},
        "due_date": {"type": "string", "format": "date"},
        "currency": {"type": "string", "enum": ["USD", "EUR", "INR", "GBP"]},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "sku": {"type": "string"},
                    "qty": {"type": "number"},
                    "unit": {"type": "number"},
                },
                "required": ["sku", "qty", "unit"],
            },
        },
        "subtotal": {"type": "number"},
        "tax": {"type": "number"},
        "total": {"type": "number"},
    },
    "required": ["vendor", "invoice_no", "issue_date", "total"],
}
```

Send every page as an image part in a single request (Kimi K2.6's 256K context + 100-image limit handles real invoices whole). Ask for JSON matching the schema. Done.
```python
import json

from openai import OpenAI

client = OpenAI(base_url="https://api.aigateway.sh/v1", api_key="sk-aig-...")

def extract_invoice(pdf_path: str) -> dict:
    pages = [
        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}}
        for b64 in pdf_to_images(pdf_path)
    ]
    resp = client.chat.completions.create(
        model="moonshot/kimi-k2.6",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Extract this invoice into the required JSON schema."},
                *pages,
            ],
        }],
        response_format={"type": "json_schema", "json_schema": {
            "name": "invoice", "schema": INVOICE_SCHEMA, "strict": True,
        }},
        extra_headers={"x-aig-tag": "docs.invoice"},
    )
    return json.loads(resp.choices[0].message.content)
```

JSON mode guarantees shape but not truth. Run lightweight post-checks — subtotal + tax == total, line items sum to the subtotal, dates in a valid range. The validator catches the 2-3% of fields the model misreads on blurry pages.
```python
def validate(doc: dict) -> list[str]:
    errs = []
    # subtotal, tax, and line_items are optional in the schema, so guard for them.
    items = doc.get("line_items", [])
    if items and "subtotal" in doc:
        line_total = sum(li["qty"] * li["unit"] for li in items)
        if abs(line_total - doc["subtotal"]) > 0.01:
            errs.append("line items don't sum to subtotal")
    if "subtotal" in doc and "tax" in doc:
        if abs(doc["subtotal"] + doc["tax"] - doc["total"]) > 0.01:
            errs.append("subtotal + tax != total")
    return errs

doc = extract_invoice("invoice.pdf")
errs = validate(doc)
if errs:
    # Optionally re-run on Opus 4.7 for a second opinion.
    # (extract_invoice_with_model: extract_invoice with the model name as a parameter.)
    doc_v2 = extract_invoice_with_model("anthropic/claude-opus-4.7", "invoice.pdf")
```
Kimi K2.6 handles clean printed invoices, packing slips, and modern forms at 94-97% field accuracy. Where it loses ground — heavily skewed scans, handwriting-dense forms, photos of receipts in bad lighting — Opus 4.7 typically recovers another 2-3 points and catches edge cases.
The pragmatic routing pattern: run Kimi first, validate, and re-run on Opus only for docs that fail validation. That keeps average cost near Kimi while the worst docs get the careful reader.
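A sketch of that router, with the extractor and validator passed in so it works with any model pair. `extract(model, path)` is assumed to be a model-parameterized variant of `extract_invoice` (like the `extract_invoice_with_model` mentioned above); nothing here is gateway-specific:

```python
def extract_with_fallback(pdf_path, extract, validate,
                          cheap="moonshot/kimi-k2.6",
                          careful="anthropic/claude-opus-4.7"):
    """Cheap model first; escalate only when the validator complains.

    Returns (document, model_that_produced_it).
    """
    doc = extract(cheap, pdf_path)
    if not validate(doc):  # empty error list -> accept the cheap pass
        return doc, cheap
    doc = extract(careful, pdf_path)
    return doc, careful
```

The cost profile follows directly: every document pays the cheap rate, and only the validation failures pay the careful rate on top.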
The same pattern — images in, JSON Schema out — extends to every document domain. Receipts for expense reports. Resumes for ATS ingestion. Loan applications for underwriting. Technical drawings for BOM extraction. Swap the schema; everything else is identical.
For truly adversarial documents (security clearance forms, medical records with abbreviations), chain an evaluator: have the model extract twice with different prompts and diff the outputs. Disagreements flag rows for human review.
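A minimal diff step for that two-pass check: anything the two extraction passes disagree on gets flagged for human review.

```python
def diff_extractions(a: dict, b: dict) -> list[str]:
    """Top-level keys where two extraction passes disagree."""
    keys = set(a) | set(b)
    # A key missing from one pass counts as a disagreement too.
    return sorted(k for k in keys if a.get(k) != b.get(k))
```

In practice you would run `extract_invoice` twice with differently worded prompts and route documents with a non-empty diff to a review queue.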
```python
# Re-use the pipeline with a different schema.
RESUME_SCHEMA = {"type": "object", "properties": {
    "name": {...}, "email": {...}, "experience": {...}, "skills": {...},
}}

resume = extract_document("candidate.pdf", RESUME_SCHEMA)
```

On clean printed documents the field-level accuracy is within a point of AWS Textract and Google Document AI. On messy scans, receipts, handwriting, and photos in bad lighting, vision LLMs pull clearly ahead — they read context, not just glyphs. They also need no template setup; a new vendor's invoice works immediately.
Kimi K2.6: roughly $0.003/page (free on the trial). Opus 4.7: roughly $0.04/page. GPT-5.4 Vision: roughly $0.025/page. Multi-page docs get batched in one request — the per-doc overhead is paid once, not per page.
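A quick expected-cost check for the fallback routing, using the rough per-page rates above. The 3% failure rate in the example is an illustrative assumption, not a measured figure — plug in your validator's observed rate:

```python
def avg_cost_per_doc(pages: int, fail_rate: float,
                     cheap_rate: float = 0.003, careful_rate: float = 0.04) -> float:
    """Expected cost when every doc runs on the cheap model and only
    validation failures re-run on the careful one."""
    return pages * (cheap_rate + fail_rate * careful_rate)

# 3-page invoice, 3% of docs escalated: 3 * (0.003 + 0.03 * 0.04) = $0.0126/doc
```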
The schema is enforced at decode time — the model literally cannot emit a malformed structure. What you can still get wrong is field values (wrong number, missing line item). That's what the light validator step catches.
Image limits: 100 images per call on Kimi K2.6 and Opus, 50 on GPT-5.4 Vision. For longer docs, batch in chunks of 100 pages and merge the results — or switch to the async jobs API for single-pass ingestion of 1000+ page PDFs.
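For PDFs past the per-call image limit, a minimal batching helper (how you merge the per-chunk results is domain-specific and left to you):

```python
def chunks(pages: list, size: int = 100):
    """Split a rendered-page list into per-request batches within the image limit."""
    for i in range(0, len(pages), size):
        yield pages[i:i + size]
```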
PII is handled before extraction: AIgateway's guardrails primitive (Enterprise) runs a redaction pass before the model sees the document. On free/Pro tiers, run the PDF through a local redactor first, then through the pipeline.
Tables work well. For tables with merged cells or multi-line rows, phrase the schema as `rows: [[cell1, cell2, ...]]` and give the model an example in the prompt. For heavy-table use (financial statements), Opus 4.7 outperforms Kimi by a clear margin.
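A sketch of that row-oriented table schema, with each row as an array of cell strings (names are illustrative):

```python
TABLE_SCHEMA = {
    "type": "object",
    "properties": {
        "rows": {
            "type": "array",
            "items": {              # one row = one array of cell strings
                "type": "array",
                "items": {"type": "string"},
            },
        },
    },
    "required": ["rows"],
}
```

Merged cells come back as repeated or empty strings depending on your prompt example, so show the model one annotated row.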
Cost tracking comes for free: one `x-aig-tag` per pipeline, one hard cap on the tag, and every document becomes a single billed unit with per-feature analytics. The gateway's async jobs API is the right surface for batch runs over 10k docs.