PDF Meets LLM — The Tools, Trade-offs, and Pricing of Document Processing

PDF processing was one of the first things I worked on as an AI engineer. Back then it was all about OCR pipelines. Now with multimodal LLMs, you can send a document page as an image and ask the model to understand it. But that doesn’t mean OCR is dead — far from it.

Native vs Scanned — The First Decision
The PDF Processing Toolkit
Redaction Before External Processing
PDF to Image — The LLM Bridge
LLM Document Understanding — Provider Pricing
OCR Services — Traditional Extraction Pricing
LLM Extraction vs OCR — The Trade-off
OCR + LLM — The Best of Both Worlds
Decision Framework

Native vs Scanned — The First Decision

The first decision in any PDF pipeline: is the document native or scanned?

Native PDFs (digitally created) have embedded text — extract it directly, no OCR, no LLM, no cost. Scanned PDFs are just images in a PDF container — you need OCR or a multimodal LLM to read them.

graph LR
    PDF["PDF"] --> CHECK{"Native?"}
    CHECK -->|Yes| TEXT["Direct Text"]
    CHECK -->|No| IMG["Page Images"]
    IMG --> OCR["OCR Service"]
    IMG --> LLM["Multimodal LLM"]
    TEXT --> PIPE["Pipeline"]
    OCR --> PIPE
    LLM --> PIPE

    style PDF fill:#264653,stroke:#264653,color:#fff
    style CHECK fill:#e9c46a,stroke:#e9c46a,color:#000
    style TEXT fill:#2a9d8f,stroke:#2a9d8f,color:#fff
    style IMG fill:#e9c46a,stroke:#e9c46a,color:#000
    style OCR fill:#e76f51,stroke:#e76f51,color:#fff
    style LLM fill:#f4a261,stroke:#f4a261,color:#000
    style PIPE fill:#2d6a4f,stroke:#2d6a4f,color:#fff
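
The "Native?" check in the diagram can be approximated by asking whether each page carries a meaningful amount of embedded text. The sketch below is one way to do it, not a definitive test: the function names and the `min_chars` threshold of 20 are assumptions chosen for illustration, and `page.get_text()` is PyMuPDF's plain-text extraction.

```python
def classify_pages(page_texts, min_chars=20):
    """Label each page 'native' if it has enough embedded text, else 'scanned'."""
    return [
        "native" if len(text.strip()) >= min_chars else "scanned"
        for text in page_texts
    ]


def embedded_page_texts(path):
    """Return the embedded text of each page (empty string for image-only pages)."""
    import fitz  # PyMuPDF, imported here so classify_pages stays dependency-free

    with fitz.open(path) as doc:
        return [page.get_text() for page in doc]


# Usage (hypothetical file name):
# labels = classify_pages(embedded_page_texts("document.pdf"))
```

Classifying per page rather than per file helps with mixed documents, where a few scanned pages are appended to a native PDF. A scanned PDF that already carries an OCR text layer will be labeled native by this check, so the quality of that layer still needs a look.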

The PDF Processing Toolkit

For native PDFs, the Python ecosystem has solid tools:

Tool	Type	Best for	Notes
PyMuPDF [7] (fitz)	Python library	All-in-one (text + manipulation + rendering)	Fast C engine, no external deps
pikepdf [8]	Python library	Low-level PDF surgery, repair, linearization	Built on qpdf, handles corrupted PDFs
pypdf	Python library	Simple merge/split/encrypt	Pure Python, was PyPDF2, lightweight
ReportLab	Python library	Creating PDFs from scratch	Reports, invoices, charts
pdftk	CLI tool	Quick merge/split/rotate/encrypt	The classic, Java dependency
qpdf	CLI tool	Page manipulation, repair, linearization	Lightweight, no Java
Ghostscript	CLI tool	Compression, format conversion, rendering	Powerful but slow for large batches

PyMuPDF returns either plain text or span-level text with bounding boxes plus font name and size, which is critical for structured extraction where spatial position determines meaning. pikepdf repairs damaged PDFs and handles low-level surgery. pypdf is the lightweight option for simple merge/split.

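# PyMuPDF: text with bounding boxes
import fitz  # PyMuPDF

with fitz.open("document.pdf") as doc:
    for page in doc:
        blocks = page.get_text("dict")["blocks"]
        for block in blocks:
            if block["type"] == 0:  # 0 = text block, 1 = image block
                for line in block["lines"]:
                    for span in line["spans"]:
                        # bbox is (x0, y0, x1, y1) in page coordinates
                        text, bbox = span["text"], span["bbox"]
                        print(bbox, span["font"], span["size"], text)

# pikepdf: repair and decrypt
import pikepdf

# Opening and re-saving rewrites the file structure, which fixes many damaged PDFs.
# For an encrypted file, pass password="..." to open(); save() drops encryption by default.
with pikepdf.open("damaged.pdf") as pdf:
    pdf.save("repaired.pdf")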

# CLI tools for shell workflows
pdftk doc1.pdf doc2.pdf cat output merged.pdf
qpdf --empty --pages doc1.pdf 1-5 doc2.pdf 3-10 -- merged.pdf

My rule: PyMuPDF when I need text extraction + manipulation together. pikepdf for corrupted files. pypdf for minimal dependencies. pdftk/qpdf for shell one-liners.

Redaction Before External Processing

When dealing with PII, financial data, or medical records, redact before sending documents to any external service. PyMuPDF’s apply_redactions() actually removes underlying content — not just a black rectangle overlay. Some naive approaches just draw over text, which is still extractable. Redact first, extract second.

PDF to Image — The LLM Bridge

Converting pages to images is essential for feeding documents to multimodal LLMs or OCR services:

import fitz
doc = fitz.open("document.pdf")
page = doc[0]
pix = page.get_pixmap(dpi=300)
image_bytes = pix.tobytes("png")
# Send to any LLM or OCR service

Both Textract [1] and Azure Document Intelligence [2] support batch document processing, but can be slow for large docs. When you don’t need cross-page layout analysis, send pages individually as images for better parallelism and error handling.
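
One way to get that per-page parallelism and error isolation is a thread pool that records failures by page index instead of aborting the whole document. This is a sketch under assumptions: `process_pages` is a hypothetical helper, and `extract` stands in for whatever OCR or LLM call you use on a single page image.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_pages(page_images, extract, max_workers=8):
    """Run extract(image_bytes) on every page; return (results, errors) keyed by page index."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(extract, img): i for i, img in enumerate(page_images)}
        for future in as_completed(futures):
            index = futures[future]
            try:
                results[index] = future.result()
            except Exception as exc:  # keep going; one bad page should not sink the batch
                errors[index] = exc
    return results, errors
```

Only the pages listed in `errors` need a retry, which is the main practical gain over submitting a 200-page document as one job. Threads fit here because the work is network-bound; keep `max_workers` under the provider's rate limit.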

LLM Document Understanding — Provider Pricing

Every major LLM provider supports image/document input, but pricing varies wildly. Important: you pay for both input tokens (the image) and output tokens (the extracted text). Most comparisons only show input cost, which is misleading.

Assuming ~500 output tokens per page when extracting text as markdown:

Provider	Model	Input $/M	Output $/M	Input tokens/page	Total per 1K pages
Google [3]	Gemini Flash 2.5	$0.30	$2.50	~250-500	~$1.35-1.40
OpenAI [5]	GPT-4o-mini	$0.15	$0.60	~765-1,105	~$0.41-0.47
OpenAI [5]	GPT-4o	$2.50	$10.00	~765-1,105	~$6.90-7.75
Anthropic [4]	Claude Haiku 4.5	$1.00	$5.00	~1,500-3,000	~$4.00-5.50
Anthropic [4]	Claude Sonnet 4.6	$3.00	$15.00	~1,500-3,000	~$12.00-16.50

OpenAI divides images into 512×512 tiles in high detail mode — 170 tokens/tile + 85 base. A typical page (~1024×1024) is ~765 tokens. Low detail: flat 85 tokens.
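
The tile arithmetic is easy to reproduce. The sketch below uses the 170-per-tile and 85-base figures from the paragraph above; the two resize steps (fit within 2048×2048, then shrink so the shortest side is at most 768) are my reading of OpenAI's vision guide [6] and should be treated as an assumption, as should the function name.

```python
import math


def openai_image_tokens(width, height, detail="high"):
    """Estimate image input tokens from pixel dimensions using the tile rule."""
    if detail == "low":
        return 85
    # Step 1 (assumed): downscale to fit inside a 2048x2048 square
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # Step 2 (assumed): downscale so the shortest side is at most 768 px
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    # Step 3: count 512x512 tiles, 170 tokens each, plus 85 base tokens
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 170 * tiles + 85


print(openai_image_tokens(1024, 1024))  # 4 tiles -> 765
print(openai_image_tokens(1024, 2048))  # 6 tiles -> 1105
```

The two printed values are the ends of the ~765-1,105 range in the table: four tiles for a roughly square page, six for a tall one.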

Anthropic extracts text AND converts each page to an image — you pay for both. A 50-page document can consume 75,000-150,000 tokens just in context.

Gemini treats each PDF page as one image with a fixed token cost, which gives it the lowest input token count per page of the providers above.
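
The "Total per 1K pages" column is just input cost plus output cost, and it is worth recomputing with your own token counts. A small sketch, using the prices and the 500-output-token assumption from the table; the function name is hypothetical.

```python
def cost_per_1k_pages(input_price_per_m, output_price_per_m,
                      input_tokens_per_page, output_tokens_per_page=500):
    """Dollar cost of processing 1,000 pages, given prices per million tokens."""
    pages = 1000
    input_cost = pages * input_tokens_per_page * input_price_per_m / 1_000_000
    output_cost = pages * output_tokens_per_page * output_price_per_m / 1_000_000
    return input_cost + output_cost


# GPT-4o at 765 input tokens/page: 1.9125 input + 5.00 output
print(round(cost_per_1k_pages(2.50, 10.00, 765), 2))
# Claude Sonnet at 1,500 input tokens/page: 4.50 input + 7.50 output
print(round(cost_per_1k_pages(3.00, 15.00, 1500), 2))
```

For most rows the output side is the larger share of the total, which is why input-only comparisons mislead. Cutting output, for example by asking for three fields instead of full markdown, moves the bill more than shaving image tokens.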

The OmniAI OCR benchmark [11] tested 9 providers on 1,000 documents. Gemini Flash achieved the best character error rate (CER) among multimodal LLMs at 15%, vs 25% for GPT-4o. Traditional OCR still leads on pure accuracy, but the gap has narrowed, especially for printed text.
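
For reference, CER is conventionally the character-level edit distance between the extracted text and the ground truth, divided by the ground-truth length. The sketch below implements that textbook definition so you can score your own samples; how the benchmark computes its numbers is not described here, so this is an assumption about the metric in general, not about their method.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (0 if ref_char == hyp_char else 1)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def cer(reference, hypothesis):
    """Character error rate relative to the reference text (lower is better)."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Scoring a few dozen hand-checked pages from your own documents with this usually says more about which provider fits than a general benchmark does.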

OCR Services — Traditional Extraction Pricing

AWS Textract [1] (per 1,000 pages, US region):

Feature	First 1M pages	After 1M pages
Detect Text (OCR only)	$1.50	$0.60
Tables	$15.00	$10.00
Forms (key-value pairs)	$50.00	$30.00
Queries (custom questions)	$25.00	$15.00
Tables + Forms + Queries	$90.00	$55.00

Azure Document Intelligence [2] (per 1,000 pages):

Model	Price per 1,000 pages
Read (OCR text extraction)	$1.50
Layout (text + tables + structure)	$10.00
Prebuilt (invoices, receipts, IDs)	$10.00
Custom extraction	$25.00

Gemini Flash 2.5 at ~$1.35/1K is comparable to basic OCR ($1.50/1K), but you get document understanding, not just raw text. GPT-4o-mini at ~$0.41/1K is the cheapest overall. Claude Sonnet at ~$12-16.50/1K is 8-11x more expensive than basic OCR.

LLM Extraction vs OCR — The Trade-off

Gemini’s document understanding [3] is impressive for the price:

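import os
import google.generativeai as genai

# Assumes the API key is in the GOOGLE_API_KEY environment variable
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

model = genai.GenerativeModel("gemini-2.5-flash")
# image_bytes is the PNG rendered from the page in the PDF to Image section
response = model.generate_content([
    "Extract all text from this document page in markdown format.",
    {"mime_type": "image/png", "data": image_bytes}
])
print(response.text)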

But there’s a catch: hallucination. LLMs sometimes add content that isn’t there, misread numbers, or reformat in meaning-changing ways. OCR has no hallucination risk — it either reads the character correctly or it doesn’t.

OCR + LLM — The Best of Both Worlds

The approach that actually works best for information extraction: combine OCR and LLM. Instead of asking the LLM to both read and understand the document (image → LLM), split the responsibilities: OCR handles reading, LLM handles understanding.

graph LR
    IMG["Page Image"] --> OCR["OCR Service"]
    OCR --> TXT["Accurate Text"]
    TXT --> LLM["LLM"]
    LLM --> OUT["Structured Data"]

    style IMG fill:#264653,stroke:#264653,color:#fff
    style OCR fill:#e76f51,stroke:#e76f51,color:#fff
    style TXT fill:#e9c46a,stroke:#e9c46a,color:#000
    style LLM fill:#2a9d8f,stroke:#2a9d8f,color:#fff
    style OUT fill:#2d6a4f,stroke:#2d6a4f,color:#fff

The naive approach sends the image directly to an LLM — it does OCR and reasoning in one shot. When it fails, you don’t know which step failed. Was the text misread, or was the logic wrong?

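# Naive: image → LLM (OCR + reasoning in one shot)
response = model.generate_content([
    "Extract invoice number, date, and total.",
    {"mime_type": "image/png", "data": image_bytes}
])
# Risk: misread characters, hallucinated fields

# Better: OCR → text → LLM (separated concerns)
import boto3
from openai import OpenAI

textract_client = boto3.client("textract")
client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Textract returns a list of blocks; keep the LINE blocks as plain text
result = textract_client.detect_document_text(Document={"Bytes": image_bytes})
ocr_text = "\n".join(
    block["Text"] for block in result["Blocks"] if block["BlockType"] == "LINE"
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Extract structured data from this OCR text."},
        {"role": "user", "content": f"Extract invoice number, date, total:\n\n{ocr_text}"},
    ]
)
print(response.choices[0].message.content)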

OCR gives you reliable text (no hallucination). LLM operates on text (which it’s great at) instead of pixels (where it stumbles). And text tokens are cheaper than image tokens.

Approach	OCR cost	LLM cost	Total per 1K pages	Accuracy
Image → LLM (naive)	$0	~$0.23-16.50	~$0.23-16.50	Moderate (hallucination risk)
OCR → LLM (combined)	$1.50	~$0.05-0.50	~$1.55-2.00	High (no vision errors)
OCR → LLM + structured output	$1.50	~$0.10-1.00	~$1.60-2.50	Highest (validated schema)

The sweet spot: basic OCR ($1.50/1K) + GPT-4o-mini for reasoning (~$1.55-2.00 total per 1K pages). For native PDFs, replace OCR with direct text extraction (free).
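
The "validated schema" row can mean several things. One cheap interpretation, sketched below, is to parse the LLM's JSON into a fixed set of fields and then check that every value appears verbatim in the OCR text. The `Invoice` fields and the `parse_invoice` helper are hypothetical names for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Invoice:
    invoice_number: str
    date: str
    total: str


def parse_invoice(llm_json, ocr_text):
    """Parse the LLM's JSON reply and reject values that are not in the OCR text."""
    data = json.loads(llm_json)
    # A missing field raises KeyError, which is the schema check
    invoice = Invoice(
        invoice_number=str(data["invoice_number"]),
        date=str(data["date"]),
        total=str(data["total"]),
    )
    # Grounding check: each value must occur verbatim in what OCR actually read
    ungrounded = [name for name, value in asdict(invoice).items() if value not in ocr_text]
    if ungrounded:
        raise ValueError(f"fields not found in OCR text: {ungrounded}")
    return invoice
```

The verbatim check is deliberately strict: it catches a hallucinated total, but it also rejects a date the model reformatted. Prompting the model to copy values exactly as written, and normalizing afterwards, keeps the check usable.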

Decision Framework

Need	Approach	Cost per 1K pages	Why
Exact text from native PDFs	PyMuPDF / pypdf (direct)	Free	No OCR needed, perfect fidelity
Summarize or quick understanding	Image → Gemini Flash 2.5 or GPT-4o-mini	~$0.41-1.35	Cheap, good enough when exact text isn’t critical
Exact text from scanned docs	Textract or Azure (Read)	$1.50	Reliable OCR, no hallucination
Robust information extraction	OCR → LLM (text, not image)	~$1.55-2.00	Best trade-off: OCR accuracy + LLM reasoning
Table extraction	Textract or Azure (Layout)	$10-15	Structured output with positions
Complex understanding	Image → Claude Sonnet or GPT-4o	~$7-17	Best reasoning, most expensive
Forms and key-value pairs	Textract or Azure (Forms)	$10-50	Accurate but expensive
Compliance-critical	OCR + human review	$1.50-50	Zero hallucination risk

Always check if the PDF is native first. If it is, you get perfect text for free. For scanned documents, choose based on accuracy needs and budget — LLM for understanding, OCR for fidelity.
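
As a rough illustration, the core of the table collapses into a few branches. This is a simplification under my own assumptions: the function and its three flags are hypothetical, and it leaves out the table, forms, and compliance rows, which need the specialized OCR features.

```python
def choose_approach(is_native: bool, needs_reasoning: bool, exact_text: bool) -> str:
    """Pick a pipeline shape from three yes/no questions about the job."""
    if is_native:
        reader = "direct text"  # embedded text is free and exact
    elif exact_text or not needs_reasoning:
        reader = "OCR"  # scanned, and fidelity matters
    else:
        return "image -> LLM"  # scanned, approximate text is acceptable
    return f"{reader} -> LLM" if needs_reasoning else reader


print(choose_approach(is_native=False, needs_reasoning=True, exact_text=True))
```

The example call is the robust information extraction case from the table: a scanned document where both fidelity and reasoning matter, so OCR reads and the LLM interprets.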

What’s your PDF processing stack? Are you using LLM-based extraction, or sticking with traditional OCR?

References:

[1] “Amazon Textract — Pricing.” AWS.
[2] “Azure Document Intelligence — Pricing.” Microsoft Azure.
[3] “Gemini Developer API — Pricing.” Google AI.
[4] “Vision — Claude API.” Anthropic.
[5] “Pricing.” OpenAI.
[6] “Images and Vision.” OpenAI.
[7] “PyMuPDF Documentation.” Artifex.
[8] “pikepdf Documentation.” pikepdf.
[9] “Amazon Textract — Features.” AWS.
[10] “Document Intelligence — Layout Model.” Microsoft Learn.
[11] “OmniAI OCR Benchmark.” OmniAI.