Dquan’s LLM Notes

Harnesses Aren’t Portable — Why Each CLI Agent Has Its Own

2026-04-17T00:00:00+00:00

Someone told me recently: “you need to read the output — generating text doesn’t mean anything.” That line stuck. If I plug the OpenAI API into Claude Code, it will absolutely produce tokens. The loop runs, tool calls fire, files get written. But whether any of that is useful — whether the work is actually good — nobody knows until someone reads it. Tokens flowing is not the same as work getting done.

Agent = Model + Harness
Three CLIs, Three Harnesses
The Tools Are Model-Coupled
Sandboxes Assume a Trust Profile
The Hard Evidence: 59.6%
Why This Matters in Practice

Agent = Model + Harness

This is the practical edge of a framing that LangChain [1] has been pushing: Agent = Model + Harness. “A harness is every piece of code, configuration, and execution logic that isn’t the model itself. The model contains the intelligence and the harness is the system that makes that intelligence useful.” The harness is the system prompt, the tool set, the sandbox, the context management, the middleware.

And here’s the part that matters for anyone building with these tools today: harnesses aren’t portable between models. Each CLI coding agent — Claude Code, Gemini CLI, OpenAI Codex CLI — is a harness that was co-designed with a specific model. Swap the model out and you get tokens without the quality.
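To make the framing concrete, here is a minimal sketch of "Agent = Model + Harness" as data. The class names, the `tuned_for` field, and the model labels are all hypothetical, invented for illustration; none of this comes from any CLI's actual code.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Harness:
    """Everything that is not the model: prompt, tools, and tuning history."""
    system_prompt: str
    tools: Dict[str, Callable[[str], str]] = field(default_factory=dict)
    tuned_for: str = "unknown"  # hypothetical: the model the harness was iterated against


@dataclass
class Agent:
    model: str
    harness: Harness

    def is_co_designed(self) -> bool:
        # A foreign model still produces tokens; this only flags the mismatch.
        return self.model == self.harness.tuned_for


harness = Harness(system_prompt="You are a coding agent.", tuned_for="model-a")
native = Agent(model="model-a", harness=harness)
swapped = Agent(model="model-b", harness=harness)
print(native.is_co_designed(), swapped.is_co_designed())
```

Both agents run the same loop. Only one of them is running a harness that was shaped around its own failure modes.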

Three CLIs, Three Harnesses

On the surface these tools all do the same thing: you type, the agent edits files and runs shell commands. Underneath, the architectures are deeply different.

Dimension	Claude Code	Gemini CLI	OpenAI Codex CLI
Signature tool	`TodoWrite`, Skills, sub-agents (Task)	Google Search grounding as a built-in tool	`apply_patch` (V4A patch format)
Memory convention	`CLAUDE.md` re-injected each session	`GEMINI.md`	AGENTS.md, terse system prompts
Sandbox	Permission modes + Auto mode (Sonnet classifier approves safe calls)	Trusted Folders, optional container	Mandatory: Seatbelt (macOS) / Bubblewrap+Landlock (Linux)
Context strategy	Auto-compact at ~92%, tiered drop/summarize	Leans on 1M-token Gemini 3 window	Assumes reasoning-model internal scratchpad; terse external context
Hooks/extensibility	`PreToolUse`, `PostToolUse`, `SessionStart`, MCP, slash commands	MCP, media-gen MCPs (Imagen/Veo/Lyria)	Approval policies: `untrusted`, `on-request`, `never`

Each of these choices is tuned for one specific model’s strengths and quirks.

The Tools Are Model-Coupled

The clearest evidence is the tools themselves. apply_patch in Codex isn’t a generic edit tool. It’s a V4A-style structured patch format [2] that GPT-5-Codex and the o-series were explicitly post-trained to emit correctly. Hand the same tool description to Claude or Gemini and you’ll sometimes get syntactically valid patches, but those models weren’t trained to reason in that format.

Claude’s TodoWrite is the mirror image. It technically does nothing [3] — it’s a no-op that just writes the plan into the conversation. But Claude models are trained to use it as an externalization anchor during long tasks. Drop TodoWrite into a harness around a non-Claude model and it becomes dead weight. The tool only works because the model knows how to use it.
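A rough sketch of what a "no-op" planning tool looks like from the harness side. The function name, the todo fields, and the message shape are assumptions for illustration, not Claude Code's real schema.

```python
import json
from typing import Dict, List


def todo_write(todos: List[Dict[str, str]]) -> str:
    """Hypothetical no-op planning tool.

    It has no side effects: it only echoes the plan back so that it lands
    in the conversation as a tool result the model can re-read later.
    """
    return json.dumps({"todos": todos}, indent=2)


conversation: List[Dict[str, str]] = []
plan = [
    {"content": "Read failing test", "status": "completed"},
    {"content": "Patch parser", "status": "in_progress"},
]
conversation.append(
    {"role": "tool", "name": "TodoWrite", "content": todo_write(plan)}
)
print(conversation[-1]["content"])
```

All of the value sits on the model side: the tool is only useful if the model has learned to write the plan, come back to it, and update statuses as it goes.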

Gemini CLI has Google Search as a first-class grounding tool. Claude Code and Codex don’t — they have WebFetch and WebSearch wrappers, but nothing like native grounding where the model was trained to interleave search calls with generation.

Model-coupled tools (the ones you can't port):

Claude Code    → TodoWrite, Skills, sub-agent Task tool
Gemini CLI     → Google Search grounding
Codex CLI      → apply_patch (V4A format)

Sandboxes Assume a Trust Profile

Sandboxes aren’t just security; they’re a bet about how the model behaves. Codex makes its sandbox mandatory because reasoning models are trusted to plan long autonomous sequences, and the sandbox exists precisely so the human doesn’t need to approve every step. Claude Code’s Auto mode goes the other way: it uses a Sonnet classifier [4] to decide which tool calls are safe to auto-approve. That harness-level design choice works only because Sonnet is the one classifying, so you can’t lift the component out and run it in a different harness without bringing the model with it.

Gemini CLI’s lighter “Trusted Folders” model reflects a different assumption again — that long context (1M tokens) carries enough of the workspace state that per-call approval adds less value than it costs in friction.

The Hard Evidence: 59.6%

Principles transfer across harnesses. Performance numbers don’t. LangChain ran the experiment directly: they took Claude Opus 4.6 and plugged it into Codex’s harness. The result was 59.6% on Terminal Bench 2.0 [5] — worse than Codex’s own model on the same harness. Their own explanation: “we didn’t run the same Improvement Loop with Claude.” The harness had been iteratively tuned against GPT-5 Codex’s specific failure modes. A different model hits different failure modes, which the harness was never shaped to catch.

This is the quiet version of the “generating ≠ doing” point. The foreign model’s loop completed. Tokens were produced. Tools got called. The score dropped 7+ points anyway.

Why This Matters in Practice

graph LR
    M["Model"] --> H["Harness"]
    H --> O["Output"]
    O -.read.-> V["Verify"]
    V -.tune.-> H

    style M fill:#264653,color:#fff
    style H fill:#2a9d8f,color:#fff
    style O fill:#e9c46a,color:#000
    style V fill:#e76f51,color:#fff

Three things follow from this:

There is no one-size-fits-all harness. Every CLI agent on the market today is a co-design. Picking Claude Code isn’t just picking Anthropic’s model — it’s picking a tool set, prompt style, and sandbox philosophy that were tuned together.
Model swaps need harness re-tuning. If you want to run a different model through a CLI you love, expect to re-run the improvement loop — new failure modes, new middleware, new system prompt. You’re not swapping a part, you’re rebuilding the scaffolding.
Read the output. The loop completing is not the signal. Tokens are not quality. The only way to know if your agent is actually working is to look at what it produced and evaluate it against the task — which, ironically, is the same rule that applies to the humans using these tools.

The uncomfortable implication: if you’ve been running evals by counting successful tool calls or passing unit tests that the agent itself wrote, you might be measuring the harness convincing itself, not the work getting done.
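The gap between "the loop ran" and "the work is good" is easy to show with two metrics side by side. This is a minimal sketch: the `Run` fields and the numbers are made up for illustration, and "held-out tests" means checks the agent neither saw nor wrote.

```python
from dataclasses import dataclass


@dataclass
class Run:
    tool_calls_ok: int
    tool_calls_total: int
    held_out_tests_passed: int  # checks the agent never saw or wrote
    held_out_tests_total: int


def loop_health(run: Run) -> float:
    """Fraction of tool calls that succeeded: measures the harness loop."""
    return run.tool_calls_ok / run.tool_calls_total


def task_quality(run: Run) -> float:
    """Fraction of independent checks passed: measures the actual work."""
    return run.held_out_tests_passed / run.held_out_tests_total


# Illustrative numbers only.
run = Run(tool_calls_ok=40, tool_calls_total=40,
          held_out_tests_passed=3, held_out_tests_total=10)
print(f"loop health: {loop_health(run):.0%}, task quality: {task_quality(run):.0%}")
```

A run like this looks perfect if you only count tool calls, and the second number is the one that tells you whether anything useful happened.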

What harness-level change has made the biggest quality difference for you — a middleware, a sandbox policy, a system-prompt tweak? I’d love to hear what actually moved the needle.

References:

[1] Harrison Chase. “The Anatomy of an Agent Harness.” LangChain Blog 2025.
[2] OpenAI. “Codex CLI.” GitHub 2025.
[3] Harrison Chase. “Deep Agents.” LangChain Blog 2025.
[4] “Claude Code Auto Permission Guide.” SmartScope 2025.
[5] LangChain Team. “Improving Deep Agents with Harness Engineering.” LangChain Blog 2025.
[6] OpenAI. “Codex Sandboxing.” OpenAI Developer Docs 2025.
[7] Google. “gemini-cli.” GitHub 2025.

Anatomy of a Claude Code Session — What’s Built-in, What’s Configurable, and What You Control

2026-04-05T00:00:00+00:00

Every time you launch Claude Code, a small orchestra of context layers assembles before you type a single character. The system prompt loads. Built-in tools register. Your CLAUDE.md files get read. Skills discover themselves. MCP servers connect. Memory loads from previous sessions. Most of this is invisible — and understanding it tells you where to invest your customization effort.

The Full Stack at a Glance
Layer 1: The System Prompt
Layer 2: Built-in Tools
Layer 3: Environment Context
Layer 4: CLAUDE.md — The Highest-Leverage Layer
Layer 5: Settings (settings.json)
Layer 6: Hooks — Deterministic Automation (Opt-in)
Layer 7: Skills — Custom Slash Commands
Layer 8: MCP Servers — External Tool Integrations (Opt-in)
Layer 9: Memory — Claude’s Persistent Notes
How the Layers Compose
Practical Decision Framework
The Meta-Principle

The Full Stack at a Glance

Here’s the layered architecture:

Layer	Who controls it	Default?	Can disable?
System prompt	Anthropic	Always on	No
Built-in tools	Anthropic + you (permissions)	Always on	Partially (permission modes)
Environment context	Auto-detected	Always on	No
CLAUDE.md files	You (team + personal)	Loaded when present	Don’t create the file
Settings (settings.json)	You	Defaults apply	Yes — override per key
Hooks	You	Off (opt-in)	Yes — remove from settings
Skills	You	Discovered when present	Yes — `user-invocable: false`
MCP servers	You	Off (opt-in)	Yes — remove from settings
Memory	Claude (you direct)	On when configured	Yes — delete memory files

Layer 1: The System Prompt

This is the foundation — a large set of instructions that Anthropic injects at the start of every session. You never see it directly (unless you ask Claude to reflect on it [1], or extract it through prompt injection research). It covers:

Core behavioral instructions — how to approach tasks, when to ask for confirmation, how to handle errors
Tool usage guidelines — “use Read instead of cat,” “use Grep instead of rg,” specific protocols for each tool
Safety rules — what to refuse, how to handle destructive operations, when to confirm before acting
Git protocols — detailed step-by-step instructions for commits and PRs (this is why Claude Code’s git workflow is so consistent)
Output style — “go straight to the point,” “keep responses short and concise,” “don’t add features beyond what was asked”
Security awareness — OWASP top 10, command injection prevention, credential handling

The system prompt is substantial — thousands of tokens of carefully tuned instructions. It’s why Claude Code behaves differently from raw Claude in the API. You can’t modify it, but understanding it explains a lot of Claude Code’s default behaviors. For example, the reason Claude Code always uses Read instead of cat isn’t a learned preference — it’s an explicit instruction in the system prompt.

One important detail: the system prompt includes a “do not overdo it” philosophy. It explicitly says not to add docstrings, comments, or type annotations to code you didn’t change. Not to add error handling for impossible scenarios. Not to create abstractions for one-time operations. If you’ve noticed Claude Code being more restrained than raw Claude, this is why.

Layer 2: Built-in Tools

Claude Code ships with a fixed set of tools that are always available:

Tool	Purpose
`Read`	Read files (text, images, PDFs, notebooks)
`Write`	Create new files or complete rewrites
`Edit`	Exact string replacements in existing files
`Bash`	Execute shell commands
`Grep`	Search file contents (ripgrep-powered)
`Glob`	Find files by pattern
`Agent`	Spawn sub-agents for complex tasks
`Skill`	Invoke user-defined skills
`ToolSearch`	Discover deferred/MCP tools

These tools are always registered — you can’t remove them. But you control access through permission modes [2]:

Mode	Behavior
`default`	Prompts for approval on potentially risky operations
`auto`	Auto-approves most operations (still blocks destructive ones)
`plan`	Requires plan approval before execution
`bypassPermissions`	No prompts (use with caution)

You can also set per-tool permissions in settings.json:

{
  "permissions": {
    "allow": ["Read", "Grep", "Glob"],
    "deny": ["Bash(rm *)"]
  }
}

The permission system is granular — you can allow Bash generally but deny specific patterns like Bash(rm -rf) or Bash(git push --force).
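To illustrate how allow and deny rules of this shape could be evaluated, here is a small sketch. The matching logic (shell-style wildcards, deny checked before allow, "ask" as the fallback) is my assumption about reasonable semantics, not Claude Code's actual matcher.

```python
import re
from fnmatch import fnmatchcase
from typing import List

# A rule is either "Tool" or "Tool(pattern)".
RULE = re.compile(r"^(?P<tool>\w+)(?:\((?P<pattern>.*)\))?$")


def rule_matches(rule: str, tool: str, arg: str = "") -> bool:
    m = RULE.match(rule)
    if not m or m["tool"] != tool:
        return False
    pattern = m["pattern"]
    # A bare tool name matches every call to that tool.
    return pattern is None or fnmatchcase(arg, pattern)


def decide(tool: str, arg: str, allow: List[str], deny: List[str]) -> str:
    # Assumption: deny rules win over allow rules.
    if any(rule_matches(r, tool, arg) for r in deny):
        return "deny"
    if any(rule_matches(r, tool, arg) for r in allow):
        return "allow"
    return "ask"  # fall back to prompting the user


allow = ["Read", "Grep", "Glob"]
deny = ["Bash(rm *)"]
print(decide("Bash", "rm -rf build", allow, deny))
print(decide("Read", "src/app.py", allow, deny))
print(decide("Bash", "ls", allow, deny))
```

The three calls show the three outcomes: a denied pattern, an allowed tool, and an unlisted command that falls through to a prompt.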

Layer 3: Environment Context

Before you type anything, Claude Code auto-detects and injects environment information:

Working directory and whether it’s a git repo
Current git branch, recent commits, and git status
Platform (macOS, Linux), shell (bash, zsh), OS version
Current model name and context window size
Current date

This is why Claude Code knows your branch name, can reference recent commits, and adapts commands to your OS. It’s always on, always fresh per session.
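You can approximate this kind of snapshot yourself with the standard library. The sketch below is only an illustration of the idea: the function names and the dictionary keys are hypothetical, and Claude Code's real detection logic is not public in this form.

```python
import datetime
import os
import platform
import subprocess
from typing import Optional


def git_branch(cwd: str) -> Optional[str]:
    """Return the current branch name, or None if git cannot report one."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "--abbrev-ref", "HEAD"],
            cwd=cwd, capture_output=True, text=True, timeout=5,
        )
    except (OSError, subprocess.SubprocessError):
        return None  # git is missing or did not respond in time
    return result.stdout.strip() if result.returncode == 0 else None


def environment_context() -> dict:
    cwd = os.getcwd()
    branch = git_branch(cwd)
    return {
        "cwd": cwd,
        "is_git_repo": branch is not None,  # simplification: needs at least one commit
        "branch": branch,
        "platform": platform.system(),
        "shell": os.environ.get("SHELL", "unknown"),
        "date": datetime.date.today().isoformat(),
    }


print(environment_context())
```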

Layer 4: CLAUDE.md — The Highest-Leverage Layer

This is where you should invest most of your effort. Claude Code reads markdown instruction files from multiple locations, layered by scope:

Scope	Location	Shared with team?	Loaded when?
User (global)	`~/.claude/CLAUDE.md`	No	Every session
Project	`./CLAUDE.md`	Yes (committed)	When in project dir
Project local	`./CLAUDE.local.md`	No (gitignored)	When in project dir
Rules	`.claude/rules/*.md`	Yes (committed)	Based on file paths

All of these load automatically when they exist. You “disable” them by not having the file. The layering means:

Team conventions go in ./CLAUDE.md — committed, version-controlled, reviewed in PRs
Personal preferences go in ~/.claude/CLAUDE.md or ./CLAUDE.local.md
Topic-specific rules go in .claude/rules/ — loaded on demand based on which files Claude is working with
Path-specific rules use frontmatter to scope activation:

# .claude/rules/api-conventions.md
---
paths:
  - "src/api/**/*.ts"
---

All API endpoints must validate input with Zod schemas.

Practical recommendation from experienced users [3]: keep the main CLAUDE.md under 200 lines. Push details into rules files.
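For intuition on how `paths:` scoping could select rule files, here is a minimal glob matcher. The translation rules (`**/` spans zero or more directories, `*` stays within one path segment) are a common convention that I am assuming here; Claude Code's own matcher may behave differently, and the `rules` mapping is a hypothetical stand-in for parsed frontmatter.

```python
import re
from typing import Dict, List


def glob_to_regex(pattern: str) -> "re.Pattern":
    """Translate a simple glob into a regex (supports *, ** and **/ only)."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            out.append("(?:.*/)?")  # zero or more directories
            i += 3
        elif pattern.startswith("**", i):
            out.append(".*")        # anything, including slashes
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")     # anything within one path segment
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(out))


def active_rules(path: str, rules: Dict[str, List[str]]) -> List[str]:
    """Return the rule files whose path globs match the file being edited."""
    return [
        name for name, globs in rules.items()
        if any(glob_to_regex(g).fullmatch(path) for g in globs)
    ]


rules = {"api-conventions.md": ["src/api/**/*.ts"]}
print(active_rules("src/api/users/get.ts", rules))
print(active_rules("src/web/app.ts", rules))
```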

Layer 5: Settings (settings.json)

The meta-configuration layer. Settings control permissions, hooks, MCP servers, model preferences, and more. They follow the same global/project layering:

Scope	Location
Global	`~/.claude/settings.json`
Project (shared)	`.claude/settings.json`
Project (local)	`.claude/settings.local.json`

Settings merge, with project settings taking precedence over global ones. This is the control plane for hooks, MCP, and permissions.
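The precedence idea can be sketched as a layered merge where later layers win. This is a simplification under stated assumptions: it merges top-level keys only, and the keys and values shown are hypothetical examples rather than a statement of how Claude Code resolves every setting.

```python
def merge_settings(*layers: dict) -> dict:
    """Shallow merge: each later layer overrides keys from earlier ones."""
    merged: dict = {}
    for layer in layers:
        merged.update(layer)
    return merged


# Hypothetical contents of the global and project settings files.
global_settings = {"model": "sonnet", "theme": "dark"}
project_settings = {"model": "opus"}

effective = merge_settings(global_settings, project_settings)
print(effective)
```

The project value for `model` replaces the global one, while `theme` passes through untouched because the project never set it.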

Layer 6: Hooks — Deterministic Automation (Opt-in)

Hooks [4] are shell commands that execute at specific points in Claude’s workflow. Completely opt-in — nothing runs until you configure it.

Hook Event	When it fires	Use case
`PreToolUse`	Before a tool runs	Block unsafe operations, enforce tool preferences
`PostToolUse`	After a tool succeeds	Auto-format, lint, log changes
`UserPromptSubmit`	When you send a prompt	Inject context, route to skills
`Stop`	When Claude finishes responding	Run tests, type-check, verify builds
`SubagentStop`	When a sub-agent finishes	Validate sub-agent output

The key distinction: hooks are deterministic. You’re not asking Claude to remember to run the linter — you’re guaranteeing it runs. A PreToolUse hook can block an operation (exit code 2) or modify it.

Default: nothing. Pure opt-in. But once configured, hooks are the most reliable enforcement mechanism — more reliable than instructions in CLAUDE.md, because they execute as code, not as suggestions.
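Here is a sketch of the decision logic a PreToolUse guardrail might use to stop force pushes. The `tool_name` and `tool_input.command` field names are my assumptions about the payload, so check the hooks documentation before relying on them; the sample event is invented.

```python
import json
import re

# Matches "git push" followed by --force (or variants) or a standalone -f.
FORCE_PUSH = re.compile(r"\bgit\s+push\b.*(--force\b|\s-f\b)")


def exit_code_for(event: dict) -> int:
    """Return 2 to block the tool call, 0 to let it proceed."""
    if event.get("tool_name") != "Bash":
        return 0
    command = event.get("tool_input", {}).get("command", "")
    return 2 if FORCE_PUSH.search(command) else 0


# Invented sample payload, standing in for what arrives on stdin.
sample = json.loads(
    '{"tool_name": "Bash", "tool_input": {"command": "git push --force origin main"}}'
)
print(exit_code_for(sample))
```

In a real hook script you would read the event with `json.load(sys.stdin)`, print a reason to stderr, and call `sys.exit()` with the returned code.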

Layer 7: Skills — Custom Slash Commands

Skills [5] are markdown files that create slash commands. They live in .claude/skills//SKILL.md (project) or ~/.claude/skills//SKILL.md (personal).

Skills are discovered automatically when the directory exists. Fine-grained activation controls:

Control	What it does	Default
`user-invocable: true`	Shows in slash menu for manual invocation	true
`disable-model-invocation: true`	Prevents Claude from auto-activating	false
`paths: ["src/**"]`	Only activates for matching files	none (always available)
`context: fork`	Runs in isolated sub-agent	false (runs in main context)

Skills are model-invocable by default: if your skill’s description matches a user request, Claude may auto-activate it. This is powerful but not fully reliable [3], which is why the UserPromptSubmit hook pattern exists as a deterministic fallback.

Layer 8: MCP Servers — External Tool Integrations (Opt-in)

MCP servers [6] extend Claude Code with external tools — browser automation, database access, Figma integration, Jira, and anything with an MCP adapter.

Purely opt-in. Configured in settings.json:

{
  "mcpServers": {
    "playwright": {
      "command": "npx",
      "args": ["@playwright/mcp@latest"]
    }
  }
}

Once configured, MCP tools appear alongside built-in tools. Claude discovers them via ToolSearch (deferred loading saves context tokens). MCP servers can be scoped globally or per-project.

Layer 9: Memory — Claude’s Persistent Notes

The memory system [7] stores Claude’s notes across sessions in ~/.claude/projects//memory/.

Memory is a hybrid: the system is always available, but it only contains content once Claude (or you) has written to it. The first 200 lines of MEMORY.md load automatically at session start. Memory types:

Type	What it captures	Example
`user`	Your role, preferences, expertise	“Senior engineer, prefers terse output”
`feedback`	Corrections and confirmations	“Don’t mock the database in tests”
`project`	Ongoing work context	“Merge freeze starts April 5”
`reference`	Pointers to external systems	“Bugs tracked in Linear project INGEST”
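The "first 200 lines" behavior is easy to picture as a bounded read. This sketch is an assumption about the shape of that logic, with a hypothetical function name and a temporary file standing in for a real MEMORY.md.

```python
import tempfile
from pathlib import Path


def load_memory_head(memory_file: Path, max_lines: int = 200) -> str:
    """Return at most max_lines lines from the file, or "" if it is missing."""
    if not memory_file.exists():
        return ""
    with memory_file.open(encoding="utf-8") as fh:
        # zip stops after max_lines without reading the rest of the file.
        lines = [line for _, line in zip(range(max_lines), fh)]
    return "".join(lines)


with tempfile.TemporaryDirectory() as tmp:
    memory_file = Path(tmp) / "MEMORY.md"
    memory_file.write_text(
        "".join(f"note {i}\n" for i in range(250)), encoding="utf-8"
    )
    head = load_memory_head(memory_file)

print(len(head.splitlines()))
```

The practical consequence: anything past line 200 is invisible at session start, so the most important notes belong near the top.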

How the Layers Compose

graph LR
    SP["System Prompt\n(behavior)"] --> CC["Claude Code\nSession"]
    TOOLS["Built-in Tools\n(capabilities)"] --> CC
    ENV["Environment\n(situation)"] --> CC
    CMD["CLAUDE.md\n(conventions)"] --> CC
    HOOKS["Hooks\n(enforcement)"] --> CC
    SKILLS["Skills\n(workflows)"] --> CC
    MCP["MCP Servers\n(integrations)"] --> CC
    MEM["Memory\n(learnings)"] --> CC

    style SP fill:#264653,stroke:#264653,color:#fff
    style TOOLS fill:#264653,stroke:#264653,color:#fff
    style ENV fill:#264653,stroke:#264653,color:#fff
    style CMD fill:#2a9d8f,stroke:#2a9d8f,color:#fff
    style HOOKS fill:#e76f51,stroke:#e76f51,color:#fff
    style SKILLS fill:#f4a261,stroke:#f4a261,color:#000
    style MCP fill:#e9c46a,stroke:#e9c46a,color:#000
    style MEM fill:#e9c46a,stroke:#e9c46a,color:#000
    style CC fill:#40916c,stroke:#40916c,color:#fff

Dark teal nodes are Anthropic-controlled. Green/orange/yellow nodes are yours.

Practical Decision Framework

You want to…	Put it in…	Why
Set coding conventions for the team	`./CLAUDE.md` or `.claude/rules/`	Version-controlled, shared
Set personal preferences	`~/.claude/CLAUDE.md`	Applies to all projects
Guarantee a check runs	Hooks in `settings.json`	Deterministic, not a suggestion
Create a reusable workflow	Skills (`.claude/skills/`)	Discoverable, parameterized
Add external tool access	MCP servers in `settings.json`	Extends capability
Record project learnings	Memory (let Claude save it)	Persists across sessions
Block dangerous operations	PreToolUse hook or permission deny rules	Hard enforcement
Scope rules to file types	`.claude/rules/*.md` with `paths:` frontmatter	Loads only when relevant

The Meta-Principle

Use instructions (CLAUDE.md) for guidance, hooks for enforcement, skills for workflows, and MCP for capabilities. Instructions can be ignored under pressure. Hooks can’t.

The best Claude Code setups treat these layers as complementary, not competing. CLAUDE.md sets the direction. Hooks enforce the guardrails. Skills provide the playbooks. Memory accumulates the context. And the system prompt — the layer you can’t touch — provides the behavioral foundation that makes all of it work.

Don’t fight the system prompt — work with it. If you want Claude to be more autonomous, adjust permission modes rather than writing “never ask for confirmation” in CLAUDE.md. If you want stricter enforcement, use hooks rather than emphatic instructions.

What layer has been most impactful for your workflow?

References:

[1] Simon Willison. “Claude Code’s system prompt.” simonwillison.net 2025.
[2] “Claude Code — Settings.” Anthropic.
[3] u/JokeGold5455. “Claude Code is a Beast — Tips from 6 Months of Hardcore Use.” r/ClaudeCode 2025.
[4] “Claude Code — Hooks.” Anthropic.
[5] “Claude Code — Skills.” Anthropic.
[6] “Claude Code — MCP Servers.” Anthropic.
[7] “Claude Code — Memory, CLAUDE.md, and .claude/rules.” Anthropic.
[8] “Claude Code — Overview.” Anthropic.
[9] “Claude Code — Sub-agents.” Anthropic.

Claude Code as Your Team’s Knowledge Layer — CLAUDE.md, Hooks, Skills, and the Onboarding Problem

2026-04-03T00:00:00+00:00

Think about what happens when a new developer joins your team. There’s a knowledge transfer session — someone walks them through the architecture, the coding conventions, the “we tried X but it didn’t work” stories. They spend weeks absorbing tribal knowledge that lives in people’s heads and Slack threads.

Now think about what happens when you start a new Claude Code session. It reads your CLAUDE.md, loads your hooks, discovers your skills, checks its memory. In seconds, it has context that took the new developer weeks to build. The interesting part isn’t that Claude Code can do this — it’s that the infrastructure you build for Claude Code is the exact same infrastructure your team needs for onboarding.

The Mapping
CLAUDE.md — Your Coding Conventions, Version-Controlled
Hooks — Deterministic Enforcement, Not Suggestions
Skills — Your Team’s Playbooks
Memory — What Claude Tells Itself
The Documentation Lifecycle — Why This Actually Works
Practical Setup Guide
The Meta-Insight

The Mapping

Much of what I’ll describe here was inspired by a fantastic Reddit post by u/JokeGold5455 [1] — a software engineer who solo-rewrote a ~100K LOC internal tool into ~300K LOC over 6 months using Claude Code on the $200/month Max plan. They extracted their patterns into an open-source showcase repo [2] that’s one of the best references I’ve found for production Claude Code usage. I’ll build on their patterns and connect them to the official Claude Code documentation [3].

This post is about treating Claude Code not just as a coding assistant, but as a forcing function for documenting and enforcing your team’s knowledge.

Team Need	Traditional Approach	Claude Code Feature
Coding conventions	Wiki page (often stale)	CLAUDE.md
Code review standards	Reviewer memory	Hooks (pre/post tool)
Common workflows	Tribal knowledge	Skills (slash commands)
Architecture context	Onboarding doc (outdated by week 2)	Memory + Rules
“Don’t do X” guardrails	PR review comments	PreToolUse hooks

CLAUDE.md — Your Coding Conventions, Version-Controlled

CLAUDE.md [4] is a markdown file that Claude reads at the start of every session. Put it in your repo root, and it becomes your project’s instruction manual. The key insight is the layering system:

Scope	Location	Who writes it	Shared?
Project	`./CLAUDE.md`	Team (committed to repo)	Yes
Rules	`.claude/rules/*.md`	Team (committed to repo)	Yes
Local	`./CLAUDE.local.md`	Individual (gitignored)	No
User	`~/.claude/CLAUDE.md`	Individual	No

The project CLAUDE.md is your team’s coding conventions — committed, version-controlled, reviewed in PRs just like code. When someone updates the convention, everyone (including Claude) gets it in the next pull. Rules files let you split by topic: code-style.md, testing.md, security.md. Path-specific rules mean your API team’s conventions only load when Claude is working on API files:

# .claude/rules/api-conventions.md
---
paths:
  - "src/api/**/*.ts"
---

All API endpoints must:
- Validate input with Zod schemas
- Return standard error format { error: string, code: number }
- Include request ID in response headers

CLAUDE.local.md is for personal preferences that don’t belong in the shared repo. The layering means team standards and personal preferences coexist without conflicts.

Keep CLAUDE.md under 200 lines. It loads into every session, and bloated instructions reduce adherence. Move detailed content into .claude/rules/ files [5] — they load on demand based on what files Claude is working with.

Hooks — Deterministic Enforcement, Not Suggestions

Hooks [6] are shell commands that execute at specific points in Claude’s workflow — deterministic automation, not suggestions. You’re not asking Claude to remember to run the linter, you’re guaranteeing it runs.

{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [
          {
            "type": "command",
            "command": "npx prettier --write \"$(cat /dev/stdin | jq -r '.tool_input.file_path')\""
          }
        ]
      }
    ]
  }
}

The hook types map to different team needs:

Hook Event	When it fires	Team use case
PreToolUse	Before a tool runs	Block unsafe operations, enforce tool preferences
PostToolUse	After a tool succeeds	Auto-format, lint, log changes
UserPromptSubmit	When you send a prompt	Inject context, activate relevant skills
Stop	When Claude finishes responding	Run tests, check types, verify builds

The UserPromptSubmit hook is particularly clever [2]. It reads your prompt, matches it against keyword/regex patterns, and injects relevant skill suggestions before Claude processes the prompt. This solves the problem that skills don’t auto-activate reliably — the hook makes activation deterministic.
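The routing idea can be sketched in a few lines. The trigger table, skill names, and wording of the injected hint are hypothetical, and I am assuming the hook's stdout is what gets added to the prompt context; the showcase repo's real implementation is more elaborate.

```python
import re
from typing import List

# Hypothetical keyword/regex triggers, one entry per skill.
SKILL_TRIGGERS = {
    "backend-dev": re.compile(r"\b(api|endpoint|migration)\b", re.IGNORECASE),
    "deploy": re.compile(r"\b(deploy|release)\b", re.IGNORECASE),
}


def suggest_skills(prompt: str) -> List[str]:
    """Return the skills whose trigger pattern appears in the prompt."""
    return [name for name, pat in SKILL_TRIGGERS.items() if pat.search(prompt)]


def context_to_inject(prompt: str) -> str:
    """Build the hint a hook could print so the model sees it with the prompt."""
    skills = suggest_skills(prompt)
    if not skills:
        return ""
    return "Relevant skills for this request: " + ", ".join(skills)


print(context_to_inject("Add a new API endpoint for invoices"))
```

Because the match is plain regex, the same prompt always produces the same suggestion, which is the whole point of moving activation out of the model's judgment.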

One important caveat from u/JokeGold5455’s experience [1]: be careful with PostToolUse hooks that modify files. Each file modification triggers a system reminder with the diff, which consumes context tokens. A Prettier hook that runs on every edit can eat 160K tokens in just a few rounds. Use Stop hooks for non-blocking checks instead.

Hooks receive JSON on stdin and communicate back through exit codes: 0 means proceed, 2 means block. This lets you build guardrails:

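#!/usr/bin/env python3
# .claude/hooks/check-env-files.py
# PreToolUse guardrail: block edits to .env files, but allow .env.example.
import json
import os
import sys

event = json.load(sys.stdin)  # the hook payload arrives as JSON on stdin
file_path = event.get("tool_input", {}).get("file_path", "")
name = os.path.basename(file_path)

# Compare the basename so unrelated paths such as config/.environment.ts
# are not blocked; this still covers .env, .env.local, and prod.env.
is_env_file = name.endswith(".env") or name.startswith(".env.")
if is_env_file and name != ".env.example":
    print("Blocked: do not edit .env files directly.", file=sys.stderr)
    sys.exit(2)  # exit code 2 blocks the tool call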

Skills — Your Team’s Playbooks

Skills [7] are markdown files that extend Claude’s capabilities with custom slash commands. Think of them as your team’s playbooks — documented procedures that Claude follows step by step.

# .claude/skills/deploy/SKILL.md
---
name: deploy
description: Deploy to staging or production environment
disable-model-invocation: true
---

Deploy to $1 environment:

1. Run the test suite: `npm test`
2. Build the project: `npm run build`
3. Check for uncommitted changes
4. If deploying to production, require explicit confirmation
5. Run: `./scripts/deploy.sh $1`
6. Verify health check at the deployed URL

The recommended pattern [2] keeps each skill under 500 lines with progressive disclosure:

.claude/skills/
  backend-dev/
    SKILL.md              # Overview + navigation (<500 lines)
    resources/
      api-patterns.md     # Deep dive on API patterns
      db-migrations.md    # Database migration guide
      error-handling.md   # Error handling conventions

Claude loads the main SKILL.md first, then pulls resource files only when needed. This reportedly improved token efficiency by 40-60% compared to monolithic skill files.

Key configuration options:

Option	Purpose	Example
`disable-model-invocation: true`	Only you can trigger (deploys, commits)	Prevents accidental deploys
`user-invocable: false`	Only Claude can trigger	Background knowledge
`context: fork`	Runs in isolated subagent context	Heavy tasks that won’t bloat main context
`paths: ["src/**/*.ts"]`	Only loads for matching files	Language-specific conventions

Skills can inject dynamic context using shell commands:

---
name: pr-review
description: Review current pull request
---

## PR Context
- Diff: !`gh pr diff`
- Files changed: !`gh pr diff --name-only`

Review this PR for: code style, test coverage, security issues.

Memory — What Claude Tells Itself

The memory system [4] is Claude’s own notes — things it learns during conversations and persists for future sessions. It’s stored in ~/.claude/projects//memory/ and automatically loaded at session start.

This is different from CLAUDE.md:

What	Where to put it	Why
“Always use 2-space indent”	CLAUDE.md	Team convention, you define it
“This project uses direnv, run `direnv allow`”	Memory	Claude learned it, reminds itself
“User prefers terse responses”	Memory	Personal preference, not a project rule
“API routes live in src/api/handlers/”	CLAUDE.md or rules/	Project structure, should be explicit

The Documentation Lifecycle — Why This Actually Works

The traditional problem with team documentation is that it goes stale. Someone writes an architecture doc, it’s accurate for a month, then the code drifts and nobody updates the doc. This is why tribal knowledge exists — the real conventions live in people’s heads because the written docs can’t be trusted.

Claude Code changes this dynamic because the documentation isn’t just for humans — it’s for your coding assistant too. When CLAUDE.md is wrong, Claude does the wrong thing. When a hook is misconfigured, builds break. When a skill is outdated, workflows fail. The documentation has an immediate feedback loop with code production, which means it actually gets maintained.

The dev-docs pattern from u/JokeGold5455 [1] makes this explicit. For every large task, create three files:

dev/active/feature-name/
  plan.md      # Strategic plan — what we're building and why
  context.md   # Key files, decisions, dependencies
  tasks.md     # Checklist of remaining work

These survive context resets and session boundaries. When you hit 15% remaining context, update the dev docs, compact, and say “continue” — Claude picks up where it left off.
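If you want to stamp out the three files consistently, a small scaffold helps. This is a sketch, not part of the original workflow: the function name and the template text are placeholders I made up, and the demo writes into a temporary directory instead of a real repo.

```python
import tempfile
from pathlib import Path

# Placeholder starter content for each dev doc.
TEMPLATES = {
    "plan.md": "# Plan\n\nWhat we are building and why.\n",
    "context.md": "# Context\n\nKey files, decisions, dependencies.\n",
    "tasks.md": "# Tasks\n\n- [ ] First task\n",
}


def scaffold_dev_docs(root: Path, feature: str) -> Path:
    """Create dev/active/<feature>/ with the three docs, keeping existing files."""
    target = root / "dev" / "active" / feature
    target.mkdir(parents=True, exist_ok=True)
    for name, body in TEMPLATES.items():
        path = target / name
        if not path.exists():  # never clobber docs from an earlier session
            path.write_text(body, encoding="utf-8")
    return target


with tempfile.TemporaryDirectory() as tmp:
    created = scaffold_dev_docs(Path(tmp), "feature-name")
    print(sorted(p.name for p in created.iterdir()))
```

The existence check matters more than it looks: these files are the state that survives compaction, so a rerun should never reset them.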

graph LR
    CLAUDE["CLAUDE.md\n(conventions)"] --> SESSION["Claude Session"]
    HOOKS["Hooks\n(enforcement)"] --> SESSION
    SKILLS["Skills\n(playbooks)"] --> SESSION
    MEMORY["Memory\n(learnings)"] --> SESSION
    DEVDOCS["Dev Docs\n(active context)"] --> SESSION

    SESSION --> CODE["Code Changes"]
    SESSION --> UPDATE["Update Docs"]
    UPDATE --> CLAUDE
    UPDATE --> SKILLS
    UPDATE --> DEVDOCS

    style CLAUDE fill:#264653,stroke:#264653,color:#fff
    style HOOKS fill:#e76f51,stroke:#e76f51,color:#fff
    style SKILLS fill:#2a9d8f,stroke:#2a9d8f,color:#fff
    style MEMORY fill:#e9c46a,stroke:#e9c46a,color:#000
    style DEVDOCS fill:#f4a261,stroke:#f4a261,color:#000
    style SESSION fill:#40916c,stroke:#40916c,color:#fff
    style CODE fill:#2d6a4f,stroke:#2d6a4f,color:#fff
    style UPDATE fill:#2d6a4f,stroke:#2d6a4f,color:#fff

Practical Setup Guide

Step 1: Start with CLAUDE.md. Document your build commands, test commands, coding conventions. Keep it under 200 lines. Commit it. This alone is worth doing even if you never use Claude Code — it’s the onboarding doc your team already needs.

Step 2: Add rules for specifics. Split into .claude/rules/ by topic. Use path-specific rules for different parts of the codebase.

Step 3: Add hooks for enforcement. Start with Stop hooks for formatting and type checking. Add PreToolUse guardrails for things that should never happen — editing .env files, running destructive commands.

Step 4: Build skills for repeating workflows. Deploy procedures, PR review checklists, debugging runbooks.

Step 5: Let memory accumulate naturally. Don’t pre-populate it. Let Claude learn your project’s quirks over time.

The diet103/claude-code-infrastructure-showcase [2] repo is an excellent reference — a real-world extraction from 6 months of production use on a ~300K LOC TypeScript project. The author (u/JokeGold5455 on Reddit) spent $1,200 total over 6 months on the Max plan and shared their complete methodology [1].

The Meta-Insight

The infrastructure you build to make Claude Code effective is the same infrastructure that makes your team effective. CLAUDE.md is your coding standards doc. Hooks are your CI checks running locally. Skills are your team’s runbooks. Memory is institutional knowledge.

If you keep these up to date — and you will, because Claude breaks when they’re wrong — you’ve solved the documentation problem not by trying harder to write docs, but by making documentation a load-bearing part of your development workflow. The new developer who joins your team doesn’t just get a stale wiki page. They get a working system that actively guides their coding assistant toward the team’s conventions.

The coding assistant isn’t just a tool. It’s a forcing function for the practices you already know you should have.

How are you structuring your Claude Code setup for team workflows?

References:

[1] u/JokeGold5455. “Claude Code is a Beast — Tips from 6 Months of Hardcore Use.” r/ClaudeCode 2025.
[2] diet103. “claude-code-infrastructure-showcase.” GitHub.
[3] “Claude Code Overview.” Anthropic.
[4] “Claude Code — Memory, CLAUDE.md, and .claude/rules.” Anthropic.
[5] “Organize Instructions with .claude/rules.” Anthropic.
[6] “Claude Code — Hooks.” Anthropic.
[7] “Claude Code — Skills.” Anthropic.
[8] “Claude Code — Sub-agents.” Anthropic.
[9] “Claude Code — Settings.” Anthropic.
[10] “Claude Code — Context Window Management.” Anthropic.
[11] “Claude Code — Best Practices.” Anthropic.

Knowledge Graph RAG — The Promise of Structured Retrieval and the Hidden Cost of Building It

2026-04-03T00:00:00+00:00

My thesis was on knowledge graph embeddings, so when GraphRAG started trending I was genuinely excited. Finally, knowledge graphs getting the attention they deserve in the LLM era. But having lived in that world, I also know what people aren’t talking about: the cost of actually building and maintaining a knowledge graph from scratch.

What Is a Knowledge Graph?
Why Traditional RAG Falls Short on Global Questions
The GraphRAG Approach
The Hidden Cost of Building the Knowledge Graph
The Maintenance Problem — What Happens When Knowledge Changes
What About Knowledge Graphs for Source Code?
When to Use GraphRAG

What Is a Knowledge Graph?

A knowledge graph is a graph where nodes are entities and edges are relations between them. The canonical example is a triple: (Albert Einstein, bornIn, Ulm). Millions of these triples form a structured representation of knowledge; Wikidata alone describes over 100 million items with billions of triples. The key property is that knowledge is explicit and traversable. You can follow links, reason over paths, and answer multi-hop questions that are hard to answer from flat text.

graph LR
    E["Einstein"] -->|bornIn| U["Ulm"]
    E -->|field| P["Physics"]
    E -->|won| NP["Nobel Prize 1921"]
    NP -->|category| P
    U -->|country| DE["Germany"]

    style E fill:#264653,stroke:#264653,color:#fff
    style U fill:#2a9d8f,stroke:#2a9d8f,color:#fff
    style P fill:#e9c46a,stroke:#e9c46a,color:#000
    style NP fill:#f4a261,stroke:#f4a261,color:#000
    style DE fill:#2d6a4f,stroke:#2d6a4f,color:#fff
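
To show what "traversable" means in practice, here is a minimal sketch that stores the triples from the diagram in plain Python and answers a two-hop question. The `build_index` and `follow` helpers are illustrative names, not part of any graph library.

```python
from collections import defaultdict

# Triples copied from the diagram above: (head, relation, tail).
TRIPLES = [
    ("Einstein", "bornIn", "Ulm"),
    ("Einstein", "field", "Physics"),
    ("Einstein", "won", "Nobel Prize 1921"),
    ("Nobel Prize 1921", "category", "Physics"),
    ("Ulm", "country", "Germany"),
]


def build_index(triples):
    """Index tails by (head, relation) so each hop is a dictionary lookup."""
    index = defaultdict(list)
    for head, relation, tail in triples:
        index[(head, relation)].append(tail)
    return index


def follow(index, start, relations):
    """Follow a path of relations from `start`; return every entity reached."""
    frontier = [start]
    for relation in relations:
        frontier = [t for node in frontier for t in index.get((node, relation), [])]
    return frontier


index = build_index(TRIPLES)
# "In which country was Einstein born?" is a bornIn hop followed by a country hop.
birth_country = follow(index, "Einstein", ["bornIn", "country"])
print(birth_country)  # ['Germany']
```

No single triple states that Einstein was born in Germany; the answer only exists as a path, which is exactly what flat text retrieval struggles with.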

Why Traditional RAG Falls Short on Global Questions

Traditional RAG works like this: chunk documents, embed them into vectors, retrieve the top-k most similar chunks for a query. It works well for factual lookups — “what does this API return?” or “what’s the refund policy?” But it falls apart on questions that require synthesizing information across many documents. Try asking “what are the main themes in this dataset?” or “how are these companies connected?” — vector similarity doesn’t help because no single chunk contains the answer.
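
The retrieve step can be sketched in a few lines. This is a toy version: `embed` is a bag-of-words counter standing in for a real embedding model, and the chunks and queries are made up for illustration.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:k]


chunks = [
    "Refunds are issued within 14 days of purchase.",
    "The API returns a JSON object with an id field.",
    "Our office is closed on public holidays.",
]
lookup_query = "When are refunds issued after purchase?"
lookup_hit = top_k(lookup_query, chunks, k=1)

# A global question shares almost nothing with any single chunk.
theme_scores = [cosine(embed("What are the main themes?"), embed(c)) for c in chunks]
```

The lookup query lands squarely on one chunk, while the thematic question only gets weak, near-random matches. A real embedding model is far better than word counts, but the structural problem is the same: top-k can only return chunks, and the answer to a global question isn't in any one of them.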

The GraphRAG Approach

This is exactly what GraphRAG (Edge et al., 2024) [1] addresses. The core idea: build a knowledge graph from your corpus, detect communities of related entities using the Leiden algorithm [3], generate summaries for each community, then use those summaries to answer global questions via map-reduce.

graph LR
    DOC["Documents"] --> CHUNK["Chunk"]
    CHUNK --> EXT["Extract Entities"]
    EXT --> KG["Knowledge Graph"]
    KG --> COM["Detect Communities"]
    COM --> SUM["Summarize Communities"]
    SUM --> QA["Map-Reduce QA"]

    style DOC fill:#264653,stroke:#264653,color:#fff
    style CHUNK fill:#e76f51,stroke:#e76f51,color:#fff
    style EXT fill:#f4a261,stroke:#f4a261,color:#000
    style KG fill:#e9c46a,stroke:#e9c46a,color:#000
    style COM fill:#2a9d8f,stroke:#2a9d8f,color:#fff
    style SUM fill:#40916c,stroke:#40916c,color:#fff
    style QA fill:#2d6a4f,stroke:#2d6a4f,color:#fff
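
The final map-reduce step is the least familiar part, so here is a rough sketch of its shape. Everything LLM-related is stubbed out: `map_answer` scores a summary by word overlap where a real pipeline would ask an LLM for a partial answer and a helpfulness score, and `reduce_answers` concatenates where a real pipeline would make one more LLM call. The summaries and the word budget are invented for illustration.

```python
def map_answer(summary, question):
    """Stand-in for an LLM call: score a summary by word overlap with the question."""
    q_words = set(question.lower().split())
    score = len(q_words & set(summary.lower().split()))
    return {"score": score, "answer": summary}


def reduce_answers(partials, max_words=30):
    """Keep useful partial answers, best first, until the word budget is used up."""
    kept, used = [], 0
    for partial in sorted(partials, key=lambda p: p["score"], reverse=True):
        words = len(partial["answer"].split())
        if partial["score"] == 0 or used + words > max_words:
            continue  # drop unhelpful answers and anything over budget
        kept.append(partial["answer"])
        used += words
    return " ".join(kept)


community_summaries = [
    "Community A covers banking failures and regulators.",
    "Community B covers sports results.",
    "Community C covers banking startups and venture funding.",
]
question = "what themes involve banking"
global_answer = reduce_answers([map_answer(s, question) for s in community_summaries])
```

The point of the design is that every community gets a chance to contribute, so the answer draws on the whole corpus instead of the handful of chunks nearest to the query.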

The results are compelling. On their benchmarks (~1M token datasets), GraphRAG outperformed vector RAG on comprehensiveness (72-83% win rate) and diversity (62-82% win rate) using LLM-as-judge evaluation. Vector RAG still won on directness — it gives more concise, pointed answers for specific questions. This makes sense: vector RAG is great at finding the needle, GraphRAG is great at describing the haystack.

Metric	GraphRAG vs Vector RAG	What it means
Comprehensiveness	72-83% win	Covers more aspects of the answer
Diversity	62-82% win	Provides more varied perspectives
Directness	Vector RAG wins	More concise for specific questions
Empowerment	Mixed	Depends on whether quotes or summaries help more

The Hidden Cost of Building the Knowledge Graph

Here’s what it actually takes to go from documents to a usable knowledge graph:

Step 1: Entity and Relationship Extraction. An LLM reads every chunk and extracts entities and their relationships. The Neo4j implementation [2] using LangChain’s LLMGraphTransformer with GPT-4o extracted ~13,000 entities and ~16,000 relationships from 2,000 news articles. Cost: ~$30, time: ~35 minutes with 10 parallel workers. And this is a small dataset.

# LangChain's LLMGraphTransformer (simplified)
from langchain_experimental.graph_transformers import LLMGraphTransformer
from langchain_openai import ChatOpenAI

# `documents` is assumed to be a list of LangChain Document objects (the chunked corpus)
transformer = LLMGraphTransformer(llm=ChatOpenAI(model="gpt-4o"))
graph_documents = transformer.convert_to_graph_documents(documents)
# Each document → nodes (entities) + relationships (edges)

Step 2: Entity Resolution. This is the step the original paper mentions but doesn’t ship code for — and it’s arguably the hardest. The same entity appears with different names: “Silicon Valley Bank”, “Silicon_Valley_Bank”, “SVB”. You need to deduplicate them. The Neo4j blog’s approach: compute text embeddings, build a KNN graph (cosine similarity > 0.95), find connected components, filter by edit distance, then LLM verification for final merge decisions. Even with all that, it still has failure modes for dates, abbreviations, and domain-specific terms.
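
Here is a simplified sketch of the merge logic, to make the failure modes visible. It is not the Neo4j blog's code: I've swapped the embedding KNN stage for edit distance on normalized names and left out the LLM verification, so only the "candidate pairs, then connected components" skeleton remains. The threshold of 3 edits is an assumption for this example.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def normalize(name):
    return name.lower().replace("_", " ").strip()


def resolve(names, max_distance=3):
    """Group names whose normalized forms are within `max_distance` edits."""
    parent = {name: name for name in names}

    def find(name):
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path compression
            name = parent[name]
        return name

    # Every close pair becomes an edge; union-find yields connected components.
    for a, b in combinations(names, 2):
        if edit_distance(normalize(a), normalize(b)) <= max_distance:
            parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(groups.values())


clusters = resolve(["Silicon Valley Bank", "Silicon_Valley_Bank", "SVB", "OpenAI"])
print(clusters)
# [['OpenAI'], ['SVB'], ['Silicon Valley Bank', 'Silicon_Valley_Bank']]
```

The underscore variant merges, but “SVB” stays on its own: string distance can't see that an abbreviation refers to the same entity. That is the gap the embedding and LLM stages are meant to close, and where they still sometimes fail.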

Step 3: Community Detection. Run the Leiden algorithm to partition the graph into hierarchical communities, that is, clusters of closely related entities. On the paper’s podcast dataset the hierarchy ranged from 34 communities at the coarsest level to 1,310 at the finest. Not every level adds meaningful information: the Neo4j blog found levels 3 and 4 differed by only 4 communities.

Step 4: Community Summarization. An LLM generates natural language summaries for each community, bottom-up through the hierarchy. That’s potentially thousands of LLM calls. The paper’s indexing step took 281 minutes for ~1M tokens of source documents.

Pipeline Step	What it does	Cost/Complexity
Entity Extraction	LLM reads every chunk	~$30 for 2K articles (GPT-4o)
Entity Resolution	Deduplicate entities	Multi-step pipeline, domain-dependent
Community Detection	Cluster related entities	Needs graph DB (Neo4j [5] + GDS plugin)
Community Summarization	LLM summarizes each community	Potentially thousands of LLM calls
Total indexing time	End to end	~281 min for ~1M tokens (paper)

The Maintenance Problem — What Happens When Knowledge Changes

If you have a fixed, curated knowledge graph — like Wikidata or a domain-specific ontology that rarely changes — GraphRAG works beautifully. The graph is your ground truth, you run community detection once, generate summaries, and you’re done. Query-time is efficient: root-level community summaries use 9-43x fewer tokens than processing raw text.

But if you’re building the knowledge graph from your own documents — which is the whole point of GraphRAG for most use cases — every update is painful:

Adding new documents means re-extracting entities, but now you need to resolve them against the existing graph. Is “OpenAI” in the new document the same “OpenAI” already in the graph? Probably yes, but what about “GPT-5” vs “GPT 5” vs “the new GPT model”? Every new entity needs link prediction against the full graph.

Entity deduplication gets harder as the graph grows. With 13,000 entities, pairwise comparison is already expensive. At 100K+ entities, you need approximate methods (LSH, blocking strategies), each with its own failure modes.

Community structure shifts. Adding a few hundred nodes can completely reorganize communities, invalidating existing summaries. Do you re-run Leiden on the full graph? Only on affected subgraphs?

Summary staleness. Even if you detect which communities changed, regenerating summaries means more LLM calls. And if higher-level summaries depend on lower-level ones (they do — the paper uses bottom-up summarization), a change at the leaf level cascades through the entire hierarchy.
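
To put numbers on the deduplication point above: the count of unordered pairs among n entities is n(n-1)/2, and blocking shrinks it by only comparing entities that share a key. The first-letter key below is a deliberately naive assumption, and the entity names are made up.

```python
from collections import defaultdict


def pair_count(n):
    """Number of unordered pairs among n entities: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def blocked_pair_count(names, key=lambda name: name[:1].lower()):
    """Pairs compared when only entities sharing a blocking key are compared."""
    blocks = defaultdict(int)
    for name in names:
        blocks[key(name)] += 1
    return sum(pair_count(size) for size in blocks.values())


print(pair_count(13_000))   # 84493500
print(pair_count(100_000))  # 4999950000

names = ["OpenAI", "Oracle", "Neo4j", "Nvidia", "Anthropic"]
print(pair_count(len(names)), blocked_pair_count(names))  # 10 2
```

Blocking cuts the work, but it also builds in blind spots: with a first-letter key, “GPT-5” and “the new GPT model” land in different blocks and are never compared.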

graph LR
    NEW["New Documents"] --> EXT["Re-extract Entities"]
    EXT --> RES["Resolve Against\nExisting Graph"]
    RES --> LINK["Link Prediction"]
    LINK --> DEDUP["Entity Dedup\n(full graph)"]
    DEDUP --> RECOM["Re-detect\nCommunities"]
    RECOM --> RESUM["Re-summarize\n(cascade)"]

    style NEW fill:#e76f51,stroke:#e76f51,color:#fff
    style EXT fill:#f4a261,stroke:#f4a261,color:#000
    style RES fill:#e9c46a,stroke:#e9c46a,color:#000
    style LINK fill:#e9c46a,stroke:#e9c46a,color:#000
    style DEDUP fill:#f4a261,stroke:#f4a261,color:#000
    style RECOM fill:#e76f51,stroke:#e76f51,color:#fff
    style RESUM fill:#e76f51,stroke:#e76f51,color:#fff

This is the fundamental tension: GraphRAG converts unstructured data into structured data, and structured data is harder to update than unstructured data. With vector RAG, adding a new document is trivial — chunk it, embed it, append to the index. With GraphRAG, adding a new document means potentially restructuring your entire knowledge representation.

What About Knowledge Graphs for Source Code?

There’s another place where knowledge graphs have been proposed recently: source code. Projects like CodexGraph (Liu et al., 2024) [9] and GraphCoder [10] build knowledge graphs from codebases — extracting entities like functions, classes, and modules, with edges for calls, imports, inheritance, and type relationships — then use graph retrieval to give LLMs better repository-level context.

The idea sounds appealing: code is full of relationships, and understanding a function often means understanding what it calls, what calls it, and what types it uses. A knowledge graph could capture all of that.

But here’s my issue: code is already structured data. Unlike natural language documents, source code has a formal grammar. We already have tools that parse it perfectly:

Tool	What it provides	Maintenance cost
AST (tree-sitter)	Complete syntactic structure of every file	Zero — deterministic parse
LSP	Go-to-definition, find-references, call hierarchy, type info	Zero — runs on-demand from source
Package managers	Dependency graphs (pip, npm, cargo)	Zero — reads lockfiles
CodeQL / Semgrep	Data flow, taint tracking, control flow graphs	Near-zero — static analysis

These tools give you the exact same entities and relations that a code knowledge graph would extract — functions, classes, call graphs, import chains, type hierarchies — but with perfect accuracy, zero LLM cost, and no maintenance burden. An AST is a lossless representation of code structure. An LLM-extracted knowledge graph is a lossy, probabilistic approximation of the same thing.
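
As a small demonstration, here is a sketch that pulls a call graph out of source code with a deterministic parse and no LLM. I'm using Python's built-in `ast` module as a stand-in for tree-sitter, and the sample source is invented for the example.

```python
import ast

SOURCE = '''
def load(path):
    return open(path).read()

def parse(path):
    text = load(path)
    return text.split()

def main():
    return parse("data.txt")
'''


def call_graph(source):
    """Map each top-level function to the sorted names it calls directly."""
    graph = {}
    for node in ast.parse(source).body:
        if isinstance(node, ast.FunctionDef):
            callees = {
                call.func.id
                for call in ast.walk(node)
                if isinstance(call, ast.Call) and isinstance(call.func, ast.Name)
            }
            graph[node.name] = sorted(callees)
    return graph


graph = call_graph(SOURCE)
print(graph)  # {'load': ['open'], 'parse': ['load'], 'main': ['parse']}
```

This sketch only records calls to plain names; method calls such as `text.split()` need type information to resolve, which is the part a language server supplies. Either way, the same input produces the same graph on every run.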

Claude Code’s LSP integration [11] is a good example: it can jump to definitions, find all references, and traverse call hierarchies in real time, directly from the source. No graph database, no entity extraction pipeline, no community detection. Just the language server doing what it was designed to do.

graph LR
    CODE["Source Code"] --> AST["AST\n(tree-sitter)"]
    CODE --> LSP["LSP\n(definitions, refs)"]
    CODE --> DEP["Dependency\nGraph"]
    CODE --> SA["Static Analysis\n(CodeQL)"]

    KG_APPROACH["LLM-Extracted\nCode KG"]

    AST -->|"lossless, free"| SAME["Same Relations"]
    LSP -->|"real-time, free"| SAME
    DEP -->|"exact, free"| SAME
    SA -->|"precise, free"| SAME
    KG_APPROACH -->|"lossy, $$$"| SAME

    style CODE fill:#264653,stroke:#264653,color:#fff
    style AST fill:#2a9d8f,stroke:#2a9d8f,color:#fff
    style LSP fill:#2a9d8f,stroke:#2a9d8f,color:#fff
    style DEP fill:#2a9d8f,stroke:#2a9d8f,color:#fff
    style SA fill:#2a9d8f,stroke:#2a9d8f,color:#fff
    style KG_APPROACH fill:#e76f51,stroke:#e76f51,color:#fff
    style SAME fill:#e9c46a,stroke:#e9c46a,color:#000

And the maintenance problem is even worse for code than for documents. Codebases change constantly — every commit modifies functions, adds files, changes call patterns. If maintaining a knowledge graph for a slowly-changing document corpus is already painful, imagine maintaining one for a codebase with dozens of commits per day. Every refactor, every rename, every new dependency would require re-extraction and re-resolution.

The one argument for code KGs is cross-repository or cross-language reasoning — “which services depend on this shared library?” or “how does the Python backend connect to the TypeScript frontend?” LSP doesn’t cross language boundaries, and package managers don’t trace internal function calls across repos. But even here, tools like Sourcegraph [12] solve this with SCIP-based code intelligence [13] — deterministic, not probabilistic.

My take: knowledge graphs make sense when you’re dealing with unstructured data that has no inherent structure. Documents, research papers, news articles — these genuinely benefit from having structure imposed on them. But code already has structure. Building a knowledge graph on top of code is building a lossy approximation of something you can already access losslessly. It’s solving a problem that’s already solved.

When to Use GraphRAG

Scenario	Recommendation	Why
Static knowledge base, global queries	GraphRAG	Upfront cost amortized, global queries excel
Rapidly changing documents	Vector RAG	Update cost too high for GraphRAG
Specific factual lookups	Vector RAG	No need for global synthesis
Existing curated KG (Wikidata, domain ontology)	KG-augmented RAG	Skip the construction step entirely
Mixed: some global, some specific	Hybrid	Vector for specific, graph for thematic

The honest answer for most teams: if you already have a knowledge graph, absolutely use it for retrieval. If you need to build one from scratch for GraphRAG, think very carefully about whether the maintenance cost is worth the improvement over vector RAG. The benchmarks are real — GraphRAG genuinely outperforms on global questions. But benchmarks run on static datasets. Production systems don’t stay static.

Tools like LangChain’s LLMGraphTransformer [4], Neo4j [5], and Microsoft’s GraphRAG library [6] have made the initial construction more accessible. But accessible construction doesn’t mean accessible maintenance.

My take: the future of GraphRAG is in incremental graph updates — methods that can add new knowledge without restructuring the entire graph. Some work is happening here (incremental community detection [7], streaming knowledge graph construction [8]), but it’s still early. Until incremental updates are solved, GraphRAG is best suited for corpora that change infrequently and are queried frequently with global, thematic questions.

What’s your experience with knowledge graphs in production? Are you building them from scratch or leveraging existing ones?

References:

[1] Edge et al. “From Local to Global: A Graph RAG Approach to Query-Focused Summarization.” arXiv 2024.
[2] Bratanic, T. “Implementing ‘From Local to Global’ GraphRAG with Neo4j and LangChain.” Neo4j Blog 2024.
[3] Traag, V. A. et al. “From Louvain to Leiden: guaranteeing well-connected communities.” Scientific Reports 2019.
[4] “How to construct knowledge graphs.” LangChain Documentation.
[5] “Neo4j Graph Database.” Neo4j.
[6] “GraphRAG.” Microsoft GitHub.
[7] Banerjee, P. et al. “Incremental Community Detection in Distributed Dynamic Graph.” arXiv 2023.
[8] Chuang, Y. et al. “Streaming Knowledge Graph Construction.” arXiv 2023.
[9] Liu et al. “CodexGraph: Bridging Large Language Models and Code Repositories via Code Graph Databases.” arXiv 2024.
[10] Liu et al. “GraphCoder: Enhancing Repository-Level Code Completion via Code Context Graph-based Retrieval and Language Model.” arXiv 2024.
[11] “Claude Code Overview.” Anthropic.
[12] “Sourcegraph — Code Intelligence Platform.” Sourcegraph.
[13] “Announcing SCIP — a better code indexing format.” Sourcegraph Blog.

Prompt Engineering — Why It Works, Not Just How

2026-04-03T00:00:00+00:00

There are hundreds of posts about how to write better prompts. This isn’t one of them. This post is about why prompts work — what’s happening mathematically when you add a system prompt, give few-shot examples, or describe the problem context. Once you understand the mechanism, the “tips and tricks” become obvious consequences.

What an LLM Actually Does
Why System Prompts Change Everything
The Authority Hierarchy — Why System Prompts Are Special
Decoder-Only Architecture — Why It Matters
Why Few-Shot Learning Works — The Mathematics
Prompt Techniques — Each One Mapped to Its Mechanism
From Handcrafted to Automated — My Journey
Cross-Provider Comparison
Context Is the Mechanism — From Prompts to Claude Code
The Bottom Line

What an LLM Actually Does

When you send a prompt, every word gets split into tokens, each token gets mapped to an embedding vector, and these vectors flow through dozens of transformer layers — each with multi-head attention and feed-forward networks — until the model produces a probability distribution over the entire vocabulary for the next token. It picks one (with some randomness), appends it, and repeats.

Stephen Wolfram’s deep dive [1] frames this beautifully: the model has learned a compressed representation of the “linguistic manifold” — the lower-dimensional surface in token-space where meaningful text lives. Your prompt defines the starting point on this manifold, and the model follows the most likely trajectory forward.

This is fundamentally different from deterministic systems like linear regression. Even with fixed weights, multiple sources of randomness exist:

Source	What happens	Why it exists
Temperature sampling	Logits divided by T before softmax: $P(\text{token}_i) = \frac{e^{z_i/T}}{\sum_j e^{z_j/T}}$	T→0 approaches greedy decoding (deterministic, often repetitive); T>0 allows creative variation
Top-p sampling [2]	Select smallest token set whose cumulative probability exceeds p	Adapts to model confidence
Top-k sampling	Truncate to k highest-probability tokens, renormalize	Prevents sampling from nonsensical tail
Hardware non-determinism	GPU floating-point is non-associative: (a+b)+c ≠ a+(b+c)	Parallel matrix multiplications sum in different orders

The takeaway: an LLM is a probabilistic system. Every prompting technique is about shifting the probability distribution toward the outputs you want.
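
The table rows can be sketched directly in plain Python. The logits are made-up numbers; the functions follow the definitions in the table and are not any provider's actual implementation.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """P(token_i) = exp(z_i / T) / sum_j exp(z_j / T), for T > 0."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    keep = set(sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k])
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]


def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    kept = [q if i in keep else 0.0 for i, q in enumerate(probs)]
    total = sum(kept)
    return [q / total for q in kept]


logits = [2.0, 1.0, 0.1]
sharp = softmax_with_temperature(logits, temperature=0.5)  # peaked distribution
flat = softmax_with_temperature(logits, temperature=2.0)   # flatter distribution

# Floating-point addition is not associative, so summation order matters.
order_matters = (0.1 + 0.2) + 0.3 != 0.1 + (0.2 + 0.3)
print(order_matters)  # True
```

Lower temperature concentrates probability on the top token, higher temperature spreads it out, and the two filters simply zero out the tail before sampling.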

Why System Prompts Change Everything

The INSTRUCTOR paper (Su et al., 2022) [3] gives direct empirical evidence. They trained a single embedding model that produces different embeddings for the same text depending on a prefixed instruction. “The weather is nice today” embedded with “Represent the sentiment:” produces a completely different vector than with “Represent the topic:”. Same weights, same input, different geometric location in embedding space.

This happens because instruction tokens participate in self-attention with input tokens. The instruction acts as a learned projection selector — it tells the model which aspects of the input to focus on. A system prompt doesn’t just “bias” the output — it fundamentally changes the internal representations.

Think of it like an exam. If a student sees “Chapter 3: Thermodynamics — use formulas 3.1-3.4,” they immediately activate relevant knowledge and constrain the search space. The system prompt is the exam header.

The Authority Hierarchy — Why System Prompts Are Special

There’s a deeper reason system prompts work, beyond embeddings. Both OpenAI and Anthropic publish specs that define an authority hierarchy for messages:

Level	OpenAI Model Spec [24]	Anthropic Soul Document [25]
Highest	Platform (model spec rules)	Anthropic (hardcoded behaviors)
High	Developer (system prompt)	Operator (system prompt)
Medium	User (user messages)	User (user messages)
Low	Guideline (default behaviors)	Softcoded defaults
None	Tool outputs, quoted text	Untrusted content

These aren’t just documentation — they’re training documents. Anthropic’s soul document [25] (23,000 words, up from 2,700 in their 2023 constitution) defines Claude’s character during training via Constitutional AI. Anthropic even publishes the system prompts [27] used for claude.ai — you can see exactly what shapes Claude’s default behavior.

The hierarchy is baked in through RLHF/RLAIF. Developer/operator messages are binding constraints that user messages cannot override. OpenAI calls this a “chain of command.”

This also explains why structured prompts resist prompt injection. Quoted text, JSON, XML, and tool outputs have no authority by default. And Claude was specifically trained on XML data [28], making XML tags particularly effective — they activate learned patterns where content between matched tags has a clear semantic role.

Decoder-Only Architecture — Why It Matters

OpenAI and Anthropic use decoder-only transformers [20]. The system prompt, user message, and model response are all part of a single token sequence processed left-to-right:

[system tokens] [user tokens] [assistant tokens →→→ generated one at a time]

Each token attends to all previous tokens via causal masking. This is autoregressive generation:

\[P(x_1, \ldots, x_n) = \prod_{t=1}^{n} P(x_t \mid x_1, \ldots, x_{t-1})\]
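
A toy sketch of both ideas: the causal mask, and the chain-rule product from the formula. The bigram table is a hypothetical stand-in; a real transformer conditions each token on the whole prefix, not just the previous token.

```python
import math


def causal_mask(n):
    """mask[t][s] is True when position t may attend to position s (s <= t)."""
    return [[s <= t for s in range(n)] for t in range(n)]


# Hypothetical toy model: next-token probabilities given only the previous token.
BIGRAM = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 1.0},
}


def sequence_log_prob(tokens):
    """log P(x_1..x_n) = sum over t of log P(x_t | previous tokens)."""
    log_prob, previous = 0.0, "<s>"
    for token in tokens:
        log_prob += math.log(BIGRAM[previous][token])
        previous = token
    return log_prob


mask = causal_mask(3)
prob = math.exp(sequence_log_prob(["the", "cat", "sat"]))  # 0.6 * 0.5 * 1.0
```

The mask is lower-triangular: position 0 sees only itself, position 2 sees everything before it. Summing log-probabilities instead of multiplying raw probabilities is the usual way to avoid underflow on long sequences.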

This differs from BERT-style [21] models that use bidirectional attention with masked language modeling + next sentence prediction. RoBERTa [4] later showed NSP doesn’t help. Decoder-only won because: (1) causal masking naturally supports generation, (2) every token provides a training signal (vs. 15% in MLM), (3) better scaling behavior.

The implication: your prompt is literally part of the sequence being “generated.” The model treats it as the beginning of a text it’s continuing. This is why format matters — you’re writing the first chapter and asking the model to write the next one consistently.

Why Few-Shot Learning Works — The Mathematics

When you include examples in your prompt, the transformer effectively runs an optimization algorithm on those examples during its forward pass.

Akyürek et al. (2022) [5] showed that transformer layers can implement algorithms equivalent to gradient descent within their forward pass. Von Oswald et al. (2022) [6] made this precise: a single linear self-attention layer can implement one step of gradient descent on in-context examples. Attention keys encode inputs, values encode prediction errors, and the weighted sum computes a gradient update. For that linear setting this isn’t a metaphor; it’s a mathematical equivalence by construction.
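
The simplest case of this equivalence can be checked numerically. The sketch below assumes linear regression with scalar targets, weights initialized at zero, and attention without softmax; with zero weights the prediction errors are just the targets (up to sign), which is why the values are built from y. It is a special case for intuition, not the full construction from the paper, and the data points are made up.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def gd_prediction(xs, ys, query, lr):
    """One gradient step on mean squared error from w = 0, then predict."""
    n, dim = len(xs), len(query)
    # Gradient of (1 / 2n) * sum((w.x - y)^2) at w = 0 is -(1 / n) * sum(y * x).
    grad = [-sum(y * x[d] for x, y in zip(xs, ys)) / n for d in range(dim)]
    weights = [-lr * g for g in grad]
    return dot(weights, query)


def linear_attention_prediction(xs, ys, query, lr):
    """Softmax-free attention: keys = x_i, values = lr * y_i / n, query = x_q."""
    n = len(xs)
    return sum((lr * y / n) * dot(x, query) for x, y in zip(xs, ys))


xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # in-context inputs
ys = [2.0, -1.0, 1.0]                      # in-context targets
query = [0.5, 2.0]

gd = gd_prediction(xs, ys, query, lr=0.1)
attn = linear_attention_prediction(xs, ys, query, lr=0.1)
```

Both routes compute the same sum, (lr / n) · Σ yᵢ (xᵢ · x_q), so `gd` and `attn` agree up to floating-point rounding. The in-context examples play the role of a training set, and the forward pass plays the role of the update.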

Garg et al. (2022) [7] extended this: transformers match optimal algorithms for each function class — OLS for linear regression, Lasso for sparse regression — learned implicitly through pretraining.

The surprising finding: Min et al. (2022) [8] tested few-shot examples with random wrong labels. Performance dropped only modestly. What mattered most:

The input-label format/structure (what shape the answer should take)
The distribution of inputs (what domain we’re in)
The label space (what the possible outputs are)

Examples activate the right “task circuit” in the model. They’re more like a function signature than training data. Mechanistically, Olsson et al. (2022) [9] identified “induction heads” — attention patterns that copy patterns from earlier in the context and apply them to the query.

Prompt Techniques — Each One Mapped to Its Mechanism

Technique	What it does	Why it works (mechanism)	Best for
Chain of Thought [26]	“Think step by step”	Intermediate tokens become retrievable context, trading sequence length for computation depth	Multi-step reasoning, math
XML/Structured Tags	Wrap content in named XML tags	Attention boundary signals from HTML/XML training data; separates instructions from data	Complex prompts, injection defense
Role Assignment	“You are an expert X”	Shifts conditional distribution: $P(\text{tokens} \mid \text{expert}) \neq P(\text{tokens} \mid \text{generic})$	Domain-specific tasks
Diverse Few-Shot	3-5 varied examples	Triangulates the task by varying irrelevant dimensions; prevents overfitting to surface features	Classification, extraction
Prompt Chaining	Break into subtask pipeline	Focused context per step; errors caught between steps instead of propagating	Complex multi-step tasks
Self-Consistency [23]	Sample N times, majority vote	Errors are random (different wrong answers), correct reasoning converges	Reasoning, math
Self-Critique	“Review your output for X”	Verification easier than generation; reading allows holistic attention over full output	Code review, fact-checking
Negative → Positive	“Don’t use jargon” → “Use plain language”	Attention has no negation operator; mentioning forbidden concepts activates them	Style, tone, format

Chain of Thought deserves special attention. Transformers are constant-depth computation graphs — each token gets the same number of layers. Without CoT, a model must compress multi-step reasoning into a single forward pass. With CoT, each intermediate result becomes retrievable context for subsequent computation.

Negative prompting is particularly interesting. When you write “Don’t mention competitors,” the tokens “mention” and “competitors” receive attention and activate related representations — increasing their probability. Anthropic [13] recommends positive framing: instead of “Don’t be verbose,” say “Respond in exactly 3 sentences.”
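
The Self-Consistency row in the table is also easy to make concrete. This sketch covers only the voting step; the sampled answers are hypothetical and would normally come from N independent generations at T>0.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common final answer and its share of the samples."""
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)


# Hypothetical final answers extracted from five sampled reasoning chains.
sampled_answers = ["42", "42", "41", "42", "17"]
winner, agreement = majority_vote(sampled_answers)
print(winner, agreement)  # 42 0.6
```

The wrong answers disagree with each other while the right one repeats, which is the whole mechanism. The agreement share doubles as a rough confidence signal.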

From Handcrafted to Automated — My Journey

When I worked on text2sql, the early approach was static few-shot: hardcode 3-5 example question-SQL pairs and hope they cover enough patterns. It worked for simple queries but fell apart on anything the examples didn’t closely resemble.

The next step was adaptive few-shot — a simple RAG system for examples. Embed all example pairs, retrieve the most similar ones per query. The intuition maps directly to the research: relevant examples activate the most relevant task circuits.

# Static few-shot: same examples for every query
user_query = "How many orders were placed today?"  # example input
prompt = f"""Convert to SQL:
Q: How many users? SQL: SELECT COUNT(*) FROM users
Q: List all orders SQL: SELECT * FROM orders
Q: {user_query} SQL:"""

# Adaptive few-shot: retrieve relevant examples per query.
# `vector_store` and `format_examples` are placeholders for your own
# example index and formatter; they are not defined here.
similar_examples = vector_store.search(user_query, top_k=3)
prompt = f"""Convert to SQL:
{format_examples(similar_examples)}
Q: {user_query} SQL:"""

Later, Anthropic released their Prompt Improver [10] — a tool that restructures prompts with XML tags, chain-of-thought instructions, and enhanced examples. OpenAI [11] has their Prompt Optimizer. Google [12] provides detailed documentation but no automated tool.

Now with models like Claude Sonnet 4.6, my workflow is: describe the problem, give examples, pull the official guidance [13], and let the model iterate. The models write better prompts than I do. But this only works because the models are now capable enough.

Cross-Provider Comparison

Aspect	Anthropic [13]	OpenAI [11]	Google [12]
Signature technique	XML tags for structure	Delimiters + role hierarchy	Few-shot always recommended
Reasoning control	Adaptive thinking with effort parameter	Reasoning models think internally	Explicit planning + self-critique
Prompt optimization	Prompt Improver [10]	Prompt Optimizer	AI Studio (manual)
Long context	Data at top, query at bottom	RAG + reference text	Context first, questions at end

Despite different approaches, all three converge on the same core: be specific, provide examples, structure input, give context. This makes sense — all three use decoder-only transformers governed by the same mechanisms.

Context Is the Mechanism — From Prompts to Claude Code

Before the LLM era, using Google effectively required the same skill: state your problem clearly, constrain the scope, evaluate results. “My code doesn’t work” returns garbage. “Python pandas merge KeyError left_on column not found” returns the exact answer. Input quality determines output quality.

Context is the mechanism. Without it, the model operates in its prior — the average of everything it’s seen. With context, you narrow the search space to where useful answers live.

This is why context management in Claude Code [14] matters so much. Claude Code’s system prompt is a modular, conditionally-assembled system [29] — ~2.5K tokens for the base prompt, 14-17K for tool definitions, plus CLAUDE.md, rules, memory, and skills layered on top. When it reads your CLAUDE.md [15], loads rules [16], and checks memory [15] — it’s building a prompt. The context window [17] is the constraint. This is why best practices [18] recommend keeping CLAUDE.md under 200 lines and using path-specific rules. It’s prompt engineering at the infrastructure level.

graph LR
    SYSTEM["System Prompt\n(architecture)"] --> CONTEXT["Context Window"]
    CLAUDE_MD["CLAUDE.md\n(conventions)"] --> CONTEXT
    RULES["Rules\n(path-specific)"] --> CONTEXT
    MEMORY["Memory\n(learned)"] --> CONTEXT
    FILES["Current Files\n(problem)"] --> CONTEXT

    CONTEXT --> ATTENTION["Multi-Head\nAttention"]
    ATTENTION --> OUTPUT["Output\nDistribution"]
    OUTPUT --> RESULT["Generated Code"]

    style SYSTEM fill:#264653,stroke:#264653,color:#fff
    style CLAUDE_MD fill:#264653,stroke:#264653,color:#fff
    style RULES fill:#2a9d8f,stroke:#2a9d8f,color:#fff
    style MEMORY fill:#e9c46a,stroke:#e9c46a,color:#000
    style FILES fill:#f4a261,stroke:#f4a261,color:#000
    style CONTEXT fill:#40916c,stroke:#40916c,color:#fff
    style ATTENTION fill:#e76f51,stroke:#e76f51,color:#fff
    style OUTPUT fill:#e9c46a,stroke:#e9c46a,color:#000
    style RESULT fill:#2d6a4f,stroke:#2d6a4f,color:#fff

The Bottom Line

Prompt engineering isn’t a bag of tricks. It’s applied understanding of how transformers process sequences. Every technique that works is a consequence of the architecture — embeddings, attention, and next-token prediction. Understanding the architecture means you can invent new techniques when existing ones don’t fit your problem.

What’s your approach — do you still handcraft, or have you moved to letting the model iterate?

References:

[1] Wolfram, S. “What Is ChatGPT Doing … and Why Does It Work?” 2023.
[2] Holtzman et al. “The Curious Case of Neural Text Degeneration.” ICLR 2020.
[3] Su et al. “One Embedder, Any Task: Instruction-Finetuned Text Embeddings.” ACL 2023.
[4] Liu et al. “RoBERTa: A Robustly Optimized BERT Pretraining Approach.” 2019.
[5] Akyürek et al. “What learning algorithm is in-context learning?” ICLR 2023.
[6] Von Oswald et al. “Transformers Learn In-Context by Gradient Descent.” ICML 2023.
[7] Garg et al. “What Can Transformers Learn In-Context?” NeurIPS 2022.
[8] Min et al. “Rethinking the Role of Demonstrations.” EMNLP 2022.
[9] Olsson et al. “In-context Learning and Induction Heads.” 2022.
[10] “Prompt Improver.” Anthropic.
[11] “Prompt Engineering Guide.” OpenAI.
[12] “Prompting Strategies.” Google AI.
[13] “Prompt Engineering Overview.” Anthropic.
[14] “Claude Code Overview.” Anthropic.
[15] “Claude Code — Memory.” Anthropic.
[16] “Organize Instructions with .claude/rules.” Anthropic.
[17] “Claude Code — Context Window.” Anthropic.
[18] “Claude Code — Best Practices.” Anthropic.
[19] “Claude Code — Skills.” Anthropic.
[20] Vaswani et al. “Attention Is All You Need.” NeurIPS 2017.
[21] Devlin et al. “BERT: Pre-training of Deep Bidirectional Transformers.” NAACL 2019.
[22] Radford et al. “Language Models are Unsupervised Multitask Learners.” OpenAI 2019.
[23] Wang et al. “Self-Consistency Improves Chain of Thought Reasoning in Language Models.” ICLR 2023.
[24] “OpenAI Model Spec.” OpenAI 2025.
[25] “About Claude.” Anthropic.
[26] Wei et al. “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.” NeurIPS 2022.
[27] “System Prompts — Release Notes.” Anthropic.
[28] “Use XML Tags to Structure Your Prompt.” Anthropic.
[29] “How Claude Code Works.” Anthropic.

Skills vs Custom Commands in Claude Code — When to Use Which

2026-04-03T00:00:00+00:00

If you’ve been building workflows in Claude Code, you’ve probably noticed two ways to create slash commands: Skills (.claude/skills//SKILL.md) and Custom Commands (.claude/commands/.md). They both create /name in the slash menu. They both accept $ARGUMENTS. So what’s the difference, and when should you use each?

The Short Answer
What They Share
Where Skills Pull Ahead
Supporting Files — The Most Practical Difference
Auto-Invocation — The Most Interesting Difference
Subagent Execution — Solving Context Bloat
Dynamic Context Injection
When to Still Use Custom Commands
Decision Flowchart
Migration Example
The Bottom Line

The Short Answer

Custom commands are the legacy format. Skills are the current standard and a strict superset. The official docs [1] are explicit about this: “Custom commands have been merged into skills. Your existing .claude/commands/ files keep working. Skills add optional features.” But “strict superset” doesn’t mean “always use the complex option.”

Both are markdown files with optional YAML frontmatter. Both create slash commands. Both support $ARGUMENTS substitution. Both can live at project scope (.claude/) or personal scope (~/.claude/). If all you need is a simple prompt template — a /review that injects your team’s review checklist, a /deploy that runs a deployment script — both work identically:

# Works the same as .claude/commands/review.md
# OR .claude/skills/review/SKILL.md
---
name: review
description: Review code against team standards
disable-model-invocation: true
---

Review the current changes against these standards:
1. All functions have type annotations
2. No hardcoded secrets
3. Tests cover the happy path and one edge case

Where Skills Pull Ahead

The divergence starts when you need anything beyond a simple prompt template:

Capability	Custom Commands	Skills
Simple slash command	Yes	Yes
YAML frontmatter	Yes (subset)	Yes (full)
`$ARGUMENTS` substitution	Yes	Yes
Supporting files (templates, scripts)	No — single file only	Yes — full directory
Auto-invocation by Claude	No	Yes — description matching
Subagent execution (`context: fork`)	No	Yes
Tool access control (`allowed-tools`)	No	Yes
Dynamic context injection (!`cmd`)	No	Yes
Path-specific activation (`paths:`)	No	Yes
Model/effort override	No	Yes
Live discovery (edit without restart)	No	Yes
Invocation control	Limited	Full (`disable-model-invocation`, `user-invocable`)

Supporting Files — The Most Practical Difference

A custom command is a single .md file. A skill is a directory. This means a skill can include templates, examples, scripts, and reference docs alongside the main SKILL.md:

.claude/skills/api-design/
  SKILL.md              # Main instructions (<500 lines)
  resources/
    patterns.md         # API design patterns reference
    error-codes.md      # Standard error code catalog
    template.ts         # Starter template for new endpoints

Claude loads SKILL.md first, then pulls resource files on demand. This is the progressive disclosure pattern from u/JokeGold5455’s showcase [2] — keep the entry point under 500 lines and let Claude dig deeper only when needed. With custom commands, you’d have to cram everything into one file or reference files by path and hope Claude reads them.

Auto-Invocation — The Most Interesting Difference

Skills have a description field that Claude uses to decide whether to load the skill automatically. If your skill says description: Explains code with diagrams and analogies, and the user asks “how does this work?”, Claude may auto-load it without anyone typing /explain. Custom commands only activate when explicitly invoked.

This is powerful but comes with a caveat from practical experience [2]: auto-invocation isn’t 100% reliable. That’s why the UserPromptSubmit hook pattern exists — a hook that matches your prompt against keywords and injects skill suggestions deterministically. If you depend on auto-invocation for critical workflows, back it up with a hook.
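To make the hook idea concrete, here is a minimal sketch of the matching logic in Python. The keyword map and the function names are hypothetical, and it assumes the hook receives a JSON payload with a `prompt` field and that whatever the hook prints is added to Claude’s context; check the hooks docs [3] before relying on either detail.

```python
import json

# Hypothetical keyword map: skill name -> phrases that should suggest it.
SKILL_KEYWORDS = {
    "explain": ["how does", "explain", "walk me through"],
    "deep-research": ["research", "investigate"],
}

def suggest_skills(prompt: str) -> list[str]:
    """Return the skills whose phrases appear in the prompt (case-insensitive)."""
    lowered = prompt.lower()
    return [
        skill
        for skill, phrases in SKILL_KEYWORDS.items()
        if any(phrase in lowered for phrase in phrases)
    ]

def run_hook(raw_input: str) -> str:
    """Turn the hook's JSON input into a suggestion line (empty if nothing matches)."""
    payload = json.loads(raw_input)
    matches = suggest_skills(payload.get("prompt", ""))
    if not matches:
        return ""
    return "Relevant skills for this request: " + ", ".join(matches)

# A real hook script would end with: print(run_hook(sys.stdin.read()))
print(run_hook(json.dumps({"prompt": "How does the auth flow work?"})))
```

The point of the design is that plain substring matching is deterministic: the same prompt always yields the same suggestion, which is exactly what description-based auto-invocation can’t promise.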

You can also go the other direction: disable-model-invocation: true means only the user can trigger it (good for /deploy). user-invocable: false means only Claude can trigger it (background knowledge that shouldn’t clutter the slash menu).

Subagent Execution — Solving Context Bloat

context: fork is a skill-only feature that solves the context window problem. Heavy tasks — deep code research, large file analysis, comprehensive reviews — can bloat your main conversation context. With context: fork, the skill runs in an isolated subagent:

# .claude/skills/deep-research/SKILL.md
---
name: deep-research
description: Thoroughly research a codebase topic
context: fork
agent: Explore
allowed-tools: Read Grep Glob
---

Research $ARGUMENTS thoroughly:
1. Find all relevant files
2. Read and analyze the code
3. Summarize findings with specific file:line references

The subagent does the heavy lifting, returns a summary, and your main context stays clean.

Dynamic Context Injection

The !`command` syntax runs shell commands before Claude sees the prompt:

---
name: pr-review
description: Review the current pull request
---

## Current PR Context
- Diff: !`gh pr diff`
- Changed files: !`gh pr diff --name-only`
- PR description: !`gh pr view --json body -q .body`

Review against team standards...

The shell commands execute when the skill loads, and their output becomes part of the prompt. No custom commands equivalent exists.
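Conceptually, the substitution is a template pass that runs before the model sees anything. The sketch below is my own illustration of that idea, not Claude Code’s implementation; `render_prompt` and the regular expression are hypothetical names.

```python
import re
import subprocess

# Matches !`command` placeholders like the ones in the skill body above.
INJECTION = re.compile(r"!`([^`]+)`")

def render_prompt(template: str) -> str:
    """Replace each !`cmd` placeholder with the command's stdout."""
    def run(match: re.Match) -> str:
        completed = subprocess.run(
            match.group(1), shell=True, capture_output=True, text=True, check=False
        )
        return completed.stdout.strip()
    return INJECTION.sub(run, template)

print(render_prompt("- Greeting: !`echo hello`"))
```

Because the commands run through a shell, anything interpolated this way should come from a file you control, never from untrusted input.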

When to Still Use Custom Commands

Almost never for new work. But two cases are reasonable:

Migration cost. If you have a .claude/commands/ directory full of working commands, there’s no urgency to migrate. They’ll keep working. Migrate when you need a skill-specific feature for that particular command.

Simplicity preference. If you’re building a quick personal command — a /scratch that opens a scratchpad, a /standup that formats your daily update — the single-file format is slightly more convenient than creating a directory with a SKILL.md inside it. The difference is trivial, but it’s there.

Decision Flowchart

graph LR
    START["New slash\ncommand"] --> Q1{"Need supporting\nfiles?"}
    Q1 -->|Yes| SKILL["Use Skill"]
    Q1 -->|No| Q2{"Need auto-\ninvocation?"}
    Q2 -->|Yes| SKILL
    Q2 -->|No| Q3{"Need subagent\nor tool control?"}
    Q3 -->|Yes| SKILL
    Q3 -->|No| Q4{"Team\nworkflow?"}
    Q4 -->|Yes| SKILL
    Q4 -->|No| EITHER["Either works\n(Skill preferred)"]

    style START fill:#264653,stroke:#264653,color:#fff
    style Q1 fill:#e9c46a,stroke:#e9c46a,color:#000
    style Q2 fill:#e9c46a,stroke:#e9c46a,color:#000
    style Q3 fill:#e9c46a,stroke:#e9c46a,color:#000
    style Q4 fill:#e9c46a,stroke:#e9c46a,color:#000
    style SKILL fill:#2a9d8f,stroke:#2a9d8f,color:#fff
    style EITHER fill:#f4a261,stroke:#f4a261,color:#000
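
The same flowchart can be written as a tiny function, which is handy if you want to lint a repo for commands that should become skills. This is only a sketch; the function and parameter names are mine.

```python
def choose_format(
    supporting_files: bool = False,
    auto_invocation: bool = False,
    subagent_or_tool_control: bool = False,
    team_workflow: bool = False,
) -> str:
    """Walk the flowchart: any 'yes' answer lands on a skill."""
    if supporting_files or auto_invocation or subagent_or_tool_control or team_workflow:
        return "skill"
    return "either (skill preferred)"

print(choose_format(team_workflow=True))
print(choose_format())
```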

Migration Example

A custom command at .claude/commands/deploy.md:

---
description: Deploy to environment
disable-model-invocation: true
---

Deploy $ARGUMENTS:
1. Run tests
2. Build
3. Deploy to $1

The skill version at .claude/skills/deploy/SKILL.md:

---
name: deploy
description: Deploy to staging or production environment
disable-model-invocation: true
allowed-tools: Bash(npm *) Bash(./scripts/deploy*)
---

Deploy to $1 environment:
1. Run test suite: `npm test`
2. Build: `npm run build`
3. Check for uncommitted changes
4. If deploying to production, require explicit confirmation
5. Run: `./scripts/deploy.sh $1`
6. Verify health check

## Environment configs
!`cat ./deploy/environments.json`

The skill version adds tool access control (only specific bash commands allowed), dynamic context injection (environment configs loaded at invocation time), and lives in a directory where you could add a resources/runbook.md later.

If a skill and a custom command share the same name, the skill takes precedence. So you can migrate incrementally — create the skill, verify it works, then delete the old command file.

The Bottom Line

Custom commands are training wheels. They got the concept right — markdown files that create slash commands — but skills are the evolved version with the full feature set. For new work, default to skills. For existing commands, migrate when you need a feature that commands can’t provide.

The real power isn’t in either format individually. It’s in combining skills with hooks [3] for deterministic activation, with CLAUDE.md [4] for conventions, and with memory [4] for learned context. Skills are one piece of the knowledge layer I described in my previous post [5] — the playbook component that turns tribal knowledge into executable workflows.

What’s your experience been with skills vs commands? Have you found cases where the simpler format is genuinely better?

References:

[1] “Claude Code — Skills.” Anthropic.
[2] diet103. “claude-code-infrastructure-showcase.” GitHub.
[3] “Claude Code — Hooks.” Anthropic.
[4] “Claude Code — Memory, CLAUDE.md, and .claude/rules.” Anthropic.
[5] Dang, Q. “Claude Code as Your Team’s Knowledge Layer.” Community Contributor Posts 2026.

Why I Chose arq and RQ Over Celery for LLM Workloads

2026-04-02T00:00:00+00:00

If you’re building LLM-powered applications with FastAPI, you need a task queue. LLM API calls are slow — 2 to 30 seconds per request. You can’t block your web server on that. But the default answer in the Python world has always been Celery, and for LLM workloads, Celery is overkill.

LLM Workloads Are I/O Bound
Celery vs RQ vs arq
Memory Footprint
Rate Limiting LLM APIs
Why I Use Both arq and RQ
FastAPI Integration
When to Actually Use Celery
The Bottom Line

LLM Workloads Are I/O Bound

The first thing to understand is that LLM workloads are fundamentally I/O bound. You’re not doing heavy computation — you’re waiting for an HTTP response from OpenAI, Anthropic, or your self-hosted model. The CPU is idle while you wait. This changes everything about what you need from a task queue.

Celery was designed for a different world. It was built for CPU-bound tasks — image processing, data crunching, report generation. It uses multiprocessing by default, spawning separate OS processes for each worker. That makes sense when you need CPU isolation. But for I/O-bound LLM calls, you’re paying the memory overhead of multiple processes just to… wait on network responses.

Aspect	CPU-Bound (Celery’s sweet spot)	I/O-Bound (LLM calls)
Bottleneck	CPU computation	Network latency
Concurrency model	Multiprocessing (OS processes)	Async I/O or threading
Memory per worker	High (each process = full Python runtime)	Low (coroutines share one process)
Typical task duration	Milliseconds to seconds	2-30 seconds
Scaling strategy	More CPU cores	More concurrent connections
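
A quick way to see the I/O-bound point is to simulate the waiting. In this sketch `fake_llm_call` is a stand-in I made up that sleeps instead of calling a provider, and the 0.2 second latency is arbitrary.

```python
import asyncio
import time

async def fake_llm_call(i: int, latency: float = 0.2) -> str:
    # Stand-in for an HTTP request to an LLM provider; the CPU is idle while awaiting.
    await asyncio.sleep(latency)
    return f"response-{i}"

async def run_batch(n: int = 50) -> tuple[list[str], float]:
    start = time.perf_counter()
    # All n coroutines wait concurrently inside one process.
    results = await asyncio.gather(*(fake_llm_call(i) for i in range(n)))
    return list(results), time.perf_counter() - start

results, elapsed = asyncio.run(run_batch())
print(f"{len(results)} calls finished in {elapsed:.2f}s")
```

Run sequentially, 50 calls at 0.2 seconds each would take about 10 seconds. Run concurrently, the batch takes roughly as long as one call, with no extra processes.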

Celery vs RQ vs arq

Feature	Celery	RQ (Redis Queue)	arq
Broker	Redis, RabbitMQ, SQS, etc.	Redis only	Redis only
Concurrency	Multiprocessing, eventlet, gevent	Multiprocessing (1 task per worker)	Native async/await
Async support	No native async (gevent/eventlet as workaround)	No (sync only)	First-class
Dependencies	Heavy (~15 transitive deps)	Minimal (~3 deps)	Minimal (~2 deps)
Setup complexity	High (broker config, result backend, serializer, etc.)	Low	Low
Rate limiting	Built-in (per-task)	Manual	Manual (but async makes it natural)
Retry logic	Built-in, configurable	Built-in, basic	Built-in, configurable
Monitoring	Flower (separate service)	rq-dashboard	arq’s built-in health checks
Task routing	Advanced (multiple queues, priority)	Basic (named queues)	Basic (named queues)
Periodic tasks	Celery Beat (separate process)	rq-scheduler (separate)	Built-in cron support
Learning curve	Steep	Gentle	Gentle

Here’s what the setup looks like for each:

# Celery — lots of configuration
from celery import Celery

app = Celery('tasks', broker='redis://localhost:6379/0')
app.conf.update(
    result_backend='redis://localhost:6379/0',
    task_serializer='json',
    result_serializer='json',
    accept_content=['json'],
    task_routes={'tasks.score': {'queue': 'llm'}},
    task_default_rate_limit='10/m',
)

@app.task(bind=True, max_retries=3, retry_backoff=True)
def score_response(self, text):
    # This is sync — Celery runs it in a subprocess
    result = openai_client.chat.completions.create(...)
    return result.choices[0].message.content

# RQ — simple and straightforward
from redis import Redis
from rq import Queue, Retry

redis_conn = Redis()
q = Queue('llm', connection=redis_conn)

def score_response(text):
    # Plain sync function
    result = openai_client.chat.completions.create(...)
    return result.choices[0].message.content

# Enqueue
job = q.enqueue(score_response, text, retry=Retry(max=3, interval=60))

# arq — async-native, fits naturally with FastAPI
from arq import create_pool
from arq.connections import RedisSettings

async def score_response(ctx, text):
    # Native async — no thread pool, no subprocess
    result = await async_openai_client.chat.completions.create(...)
    return result.choices[0].message.content

class WorkerSettings:
    functions = [score_response]
    redis_settings = RedisSettings()
    max_jobs = 50  # 50 concurrent async tasks in ONE process

Notice the difference: arq [1] runs 50 concurrent LLM calls in a single process because they’re all just awaiting network I/O. Celery [3], with its default prefork pool, would need 50 worker processes for the same concurrency, and so would RQ [2], since each RQ worker handles one job at a time.

One important note: Celery still has no native async/await support as of 2025. The async support issue (GitHub #6552) has been open since 2020 and keeps getting deferred. You can use gevent or eventlet as workarounds, or third-party packages like celery-aio-pool, but these are hacks around a fundamentally sync architecture. arq was built async from day one — by Samuel Colvin, the same person behind Pydantic.

Memory Footprint

The memory difference is significant in practice:

Setup	Concurrency	Memory usage	Processes
Celery (prefork, default)	50 tasks	~2.5 GB (50 × ~50 MB)	50
Celery (gevent)	50 tasks	~500 MB (1 process + greenlets)	1
RQ	50 tasks	~2.5 GB (50 × ~50 MB)	50
arq	50 tasks	~80 MB (1 process, 50 coroutines)	1

These are rough numbers, but the order of magnitude is real. When you’re deploying on a single VPS or a small Kubernetes pod, this matters.

Rate Limiting LLM APIs

Every LLM provider has rate limits [4] — requests per minute, tokens per minute, sometimes both. If you blast 100 concurrent requests, you’ll get 429 errors. You need to throttle.

Celery has built-in rate limiting (rate_limit='10/m'), but it’s per-worker, not global. If you have 5 workers each set to 10/minute, you’re actually doing 50/minute. You need a separate mechanism for global rate limiting.

With arq, since everything runs in one process with async, you can use a simple in-process sliding-window limiter:

import asyncio
import time
from collections import deque

class RateLimiter:
    """Allow at most max_per_minute acquisitions in any rolling window."""

    def __init__(self, max_per_minute: int, window: float = 60.0):
        self.max_per_minute = max_per_minute
        self.window = window
        self.timestamps: deque[float] = deque()
        self.lock = asyncio.Lock()

    async def acquire(self):
        # The lock serializes waiters so they are admitted in arrival order
        async with self.lock:
            while True:
                now = time.monotonic()
                # Drop timestamps that have left the window
                while self.timestamps and self.timestamps[0] <= now - self.window:
                    self.timestamps.popleft()
                if len(self.timestamps) < self.max_per_minute:
                    self.timestamps.append(now)
                    return
                # Window is full: sleep until the oldest entry expires
                await asyncio.sleep(self.timestamps[0] + self.window - now)

rate_limiter = RateLimiter(max_per_minute=50)

async def score_response(ctx, text):
    await rate_limiter.acquire()
    result = await async_openai_client.chat.completions.create(...)
    return result

Because an arq worker is a single async process, this in-process rate limiter actually works, as long as you run one worker. With Celery’s multiprocessing (or several arq workers), you’d need Redis-based distributed rate limiting, which adds complexity.

Why I Use Both arq and RQ

arq is my default for LLM API calls — scoring, summarization, embeddings, anything that’s an async HTTP call to an LLM provider. The async-native design means I get high concurrency with minimal resources, and it fits perfectly with FastAPI’s async ecosystem.

RQ I use for simpler background tasks that are sync by nature — sending emails, generating PDF reports, running database migrations, cleanup jobs. Tasks where I don’t need high concurrency and the simplicity of “just write a regular function” is the priority.

graph LR
    API["FastAPI"] --> R["Redis"]
    R --> ARQ["arq Worker"]
    R --> RQW["RQ Worker"]
    ARQ --> LLM["LLM APIs"]
    RQW --> SYNC["Sync Tasks"]

    style API fill:#264653,stroke:#264653,color:#fff
    style R fill:#e76f51,stroke:#e76f51,color:#fff
    style ARQ fill:#2a9d8f,stroke:#2a9d8f,color:#fff
    style RQW fill:#e9c46a,stroke:#e9c46a,color:#000
    style LLM fill:#2d6a4f,stroke:#2d6a4f,color:#fff
    style SYNC fill:#f4a261,stroke:#f4a261,color:#000

Both share the same Redis instance. FastAPI enqueues to whichever queue fits the task. No RabbitMQ, no Celery Beat process, no Flower monitoring server. Just Redis, which I already need for caching and session storage.

FastAPI Integration

The integration with FastAPI [6] is clean:

from fastapi import FastAPI
from arq import create_pool
from arq.connections import RedisSettings
from arq.jobs import Job, JobStatus

app = FastAPI()

@app.on_event("startup")
async def startup():
    # One shared Redis pool for enqueueing and job lookups
    app.state.arq = await create_pool(RedisSettings())

@app.post("/score")
async def score(text: str):
    job = await app.state.arq.enqueue_job("score_response", text)
    return {"job_id": job.job_id}

@app.get("/score/{job_id}")
async def get_score(job_id: str):
    # Rebuild a handle to the job from its ID
    job = Job(job_id, app.state.arq)
    if await job.status() == JobStatus.complete:
        return {"score": await job.result()}
    return {"status": "processing"}

No sync/async bridge. No thread pool executor wrapping. The whole stack is async end-to-end: FastAPI → Redis → arq → async LLM client [7].

When to Actually Use Celery

Celery isn’t dead — it’s just not the right tool for every job:

Use case	Best choice	Why
LLM API calls (scoring, summarization)	arq	Async I/O, high concurrency, low memory
Simple background jobs (email, cleanup)	RQ	Dead simple, sync is fine
CPU-heavy tasks (image processing, ML training)	Celery	Multiprocessing isolates CPU work
Complex workflows (chaining, fan-out, chord)	Celery	Built-in primitives for task composition
Multi-broker (RabbitMQ + Redis + SQS)	Celery	Only option with multi-broker support
Enterprise with existing Celery infra	Celery	Migration cost isn’t worth it

The pattern I’ve settled on: arq for I/O-bound LLM work, RQ for simple sync tasks, and Celery only if I genuinely need its workflow primitives or multi-broker support.

The Bottom Line

If you’re already running FastAPI + Redis (which most LLM apps are), arq adds almost zero operational complexity. It’s just another async process reading from the same Redis. Compare that to Celery, which wants its own broker, result backend, Beat scheduler, and Flower dashboard.

The LLM ecosystem is I/O-bound by nature. Your tools should reflect that.

What task queue setup are you using for LLM workloads? Have you found Celery worth the overhead, or have you moved to something lighter?

References:

[1] “arq — Job queues and RPC in python with asyncio and redis.” Samuel Colvin.
[2] “RQ: Simple job queues for Python.” RQ Project.
[3] “Celery — Distributed Task Queue.” Celery Project.
[4] “Rate Limiting.” OpenAI.
[5] “Anthropic API Rate Limits.” Anthropic.
[6] “FastAPI Background Tasks.” FastAPI.
[7] “asyncio — Asynchronous I/O.” Python.

How to Make LLM Output Consistent — Lessons from Building a Scoring System

2026-04-02T00:00:00+00:00

If you’ve worked with LLMs long enough, you’ve hit this problem: you run the same prompt twice and get different results. For a chatbot, that’s fine. For a scoring system where you need reliable, repeatable judgments? It’s a real problem.

I’ve worked on a scoring system that uses an LLM as a judge. Here’s everything I’ve learned about making LLM output consistent.

Temperature Is Not Enough
Audit Your Prompt for Conflicts
Detailed Rubrics Per Score Level
Ensemble: Multiple Calls, Aggregate
Chain-of-Thought Before Scoring
Known Biases in LLM Scoring
Putting It All Together

Temperature Is Not Enough

The first thing most people reach for is temperature. Set it to 0, problem solved, right? Not quite.

Temperature=0 means greedy decoding: the model always picks the highest-probability token. It’s the most deterministic setting available, but it’s not truly deterministic. Floating-point addition is not associative, and parallel GPU reductions can sum the same values in a different order from run to run. The rounding then differs slightly, which can flip the result when two tokens have near-identical probabilities.
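
You can reproduce the rounding effect on a CPU in a few lines. This is a toy illustration, not a model of a real GPU kernel: the "logits" are made-up numbers, and `greedy_pick` is a hypothetical helper standing in for greedy decoding.

```python
# The same three numbers, summed in two different orders.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)
print(left_to_right == right_to_left)  # False

def greedy_pick(logits: dict[str, float]) -> str:
    # Greedy decoding: take the highest-scoring token; ties go to the first one.
    return max(logits, key=logits.get)

# Token "8" gets its score from a sum whose order varies between runs.
run_a = greedy_pick({"7": 0.6, "8": left_to_right})
run_b = greedy_pick({"7": 0.6, "8": right_to_left})
print(run_a, run_b)  # 8 7
```

A difference in the last bit is enough to change which token wins, and once one token differs, everything generated after it can diverge.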

OpenAI introduced a seed parameter [8] in late 2023. When you set seed + temperature=0, they aim for deterministic outputs and return a system_fingerprint. But their docs explicitly say it’s “best effort.” Backend changes, model updates, load balancing across different hardware — all can break reproducibility. In practice, users report 85-95% reproducibility, not 100%.

Anthropic doesn’t expose a seed parameter at all. Temperature=0 with greedy decoding is the best you get.

Parameter	What it does	Deterministic?
temperature=0	Greedy decoding, always picks top token	Nearly, but GPU non-determinism remains
temperature=0 + seed (OpenAI)	Best-effort determinism with fingerprint tracking	~85-95% reproducible
top_p=1 + temperature=0	top_p has no effect at temp 0	Same as temperature=0
Low temperature (0.1-0.3)	Reduces randomness while keeping some diversity	No, but useful for ensembles

Bottom line: temperature helps, but alone it’s not enough for a reliable scoring system.

Audit Your Prompt for Conflicts

The second and most overlooked thing is prompt quality. If your instructions have contradictions or ambiguity, the model will be inconsistent — not because it’s random, but because it’s interpreting unclear guidance differently each time.

	Ambiguous Prompt	Clear Prompt
Criteria	“Score the quality”	“Score 1-5 based on accuracy, completeness, clarity”
Examples	None	2-3 anchor examples with scores and explanations
Score range	“Rate 0-10”	Explicit description per level (see below)
Result	Model interprets differently each call	Model follows consistent criteria
Consistency	Low	High

Check for conflicts between your system prompt and tool descriptions. If the system prompt says “be strict” and a tool description says “be lenient,” the model is stuck. Also check between your rubric criteria — if criterion A rewards brevity and criterion B rewards thoroughness, the model will oscillate.

Detailed Rubrics Per Score Level

The third technique is what made the biggest difference: detailed rubrics with per-score-level descriptions.

If you tell the model “score from 0 to 10,” you’ll get inconsistent results. The model’s idea of a 6 versus a 7 is fuzzy. But if you define exactly what each score range means, consistency improves dramatically.

The Prometheus paper (Kim et al., ICLR 2024) [4] showed this rigorously — providing explicit score-level descriptions significantly outperformed generic “rate from 1-5” prompts.

Technique	Impact on consistency
Detailed per-level rubric	High — the single most effective technique
2-3 anchor examples with explanations	High — few-shot calibration teaches the scale
Narrower scale (1-5 vs 1-10)	Medium — less ambiguity between adjacent scores
Independent sub-criteria scored separately	Medium — reduces conflation of different quality aspects
Boundary examples (“this is a 3, this is a 4 because…”)	High — resolves edge cases
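
Here is a rough sketch of what a per-level rubric can look like when assembled in code. The level descriptions are hypothetical placeholders for a single "accuracy" criterion; the wording you need depends entirely on your domain.

```python
# Hypothetical level descriptions for a single "accuracy" criterion.
ACCURACY_LEVELS = {
    1: "Central claims are wrong or unsupported.",
    2: "Several factual errors that change the meaning.",
    3: "Mostly correct, with minor errors that do not change the conclusion.",
    4: "Correct, with only trivial imprecision.",
    5: "Fully correct and every claim is verifiable.",
}

def build_rubric(criterion: str, levels: dict[int, str]) -> str:
    """Render one explicit line per score level so adjacent scores are distinguishable."""
    lines = [f"Score the response on {criterion} using this scale:"]
    for score in sorted(levels):
        lines.append(f"Score {score}: {levels[score]}")
    lines.append("Reason through the evidence first, then give exactly one score.")
    return "\n".join(lines)

RUBRIC = build_rubric("accuracy", ACCURACY_LEVELS)
print(RUBRIC)
```

Keeping the levels in a dictionary makes it easy to version the rubric and to score each sub-criterion with its own scale.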

Ensemble: Multiple Calls, Aggregate

The fourth technique is ensemble — instead of trusting a single call, run multiple calls and aggregate.

graph LR
    subgraph "Single Call"
        S1["One LLM call"] --> S2["Score: 7"]
    end
    subgraph "Ensemble (3 calls)"
        E1["Call 1: Score 7"] --> AGG["Aggregate"]
        E2["Call 2: Score 8"] --> AGG
        E3["Call 3: Score 7"] --> AGG
        AGG --> E4["Final: 7 (median)"]
    end

    style S2 fill:#e9c46a,stroke:#e9c46a,color:#000
    style E4 fill:#2a9d8f,stroke:#2a9d8f,color:#fff

Aggregation method	Best for	Notes
Mean	Continuous scores	Simple but outlier-sensitive
Median	Continuous scores	Robust to outliers, preferred
Majority vote	Categorical (pass/fail, A/B/C)	Best for discrete judgments
Trimmed mean	Continuous, high stakes	Drop highest and lowest, average the rest

Three calls capture most of the variance reduction. When ensembling, use a small positive temperature (0.2-0.3); at temperature 0 you’d get nearly the same answer N times, so the extra calls add little. Multi-model panels (GPT-4 + Claude + Gemini) reduce shared biases.
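
The aggregation methods in the table are a few lines each with the standard library. The sample scores are made up, and `trimmed_mean` and `majority_vote` are helper names I chose for this sketch.

```python
import statistics
from collections import Counter

def trimmed_mean(scores: list[float]) -> float:
    """Drop the single highest and lowest score, then average the rest."""
    if len(scores) < 3:
        raise ValueError("need at least 3 scores to trim both ends")
    return statistics.mean(sorted(scores)[1:-1])

def majority_vote(labels: list[str]) -> str:
    """Return the most common label."""
    return Counter(labels).most_common(1)[0][0]

scores = [7, 8, 7, 2, 7]  # one call went badly wrong
print(statistics.mean(scores))    # 6.2
print(statistics.median(scores))  # 7
print(trimmed_mean(scores))       # 7
print(majority_vote(["pass", "fail", "pass"]))  # pass
```

The single outlier drags the mean down to 6.2, while the median and trimmed mean stay at 7. That is the practical reason to prefer them.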

Chain-of-Thought Before Scoring

Chain-of-thought before scoring improves consistency significantly. The G-Eval paper [3] showed reasoning before scoring improved correlation with human judgments — Spearman from ~0.38 to ~0.51. The key: reasoning must come before the score, not after. Otherwise it’s post-hoc rationalization.

The optimal pattern: chain-of-thought reasoning + structured output for the final score.

Here’s what that looks like with Instructor:

from pydantic import BaseModel, Field
import instructor
from openai import OpenAI

class EvaluationStep(BaseModel):
    criterion: str
    observation: str
    score: int = Field(ge=0, le=5)

class Evaluation(BaseModel):
    chain_of_thought: list[EvaluationStep] = Field(
        description="Evaluate each criterion BEFORE assigning final score"
    )
    final_score: int = Field(ge=0, le=10)
    summary: str

client = instructor.from_openai(OpenAI())
result = client.chat.completions.create(
    model="gpt-4o",
    response_model=Evaluation,
    temperature=0,
    messages=[
        {"role": "system", "content": RUBRIC},
        {"role": "user", "content": f"Evaluate: {response_text}"},
    ],
)

And for the ensemble:

import statistics

def score_with_ensemble(text, n_calls=3, temperature=0.2):
    scores = []
    for _ in range(n_calls):
        result = client.chat.completions.create(
            model="gpt-4o",
            response_model=Evaluation,
            temperature=temperature,
            messages=[
                {"role": "system", "content": RUBRIC},
                {"role": "user", "content": f"Evaluate: {text}"},
            ],
        )
        scores.append(result.final_score)
    return statistics.median(scores)

Known Biases in LLM Scoring

Be aware of known biases in LLM scoring:

Bias	What happens	Mitigation
Position bias	Prefers the first response in pairwise comparison	Swap order, average both results
Verbosity bias	Rates longer responses higher, even if redundant	Instruct judge to ignore length
Self-preference bias	Rates its own model’s output ~10% higher	Use a different model as judge
Format/style bias	Prefers markdown, bullet points over plain text	Normalize formatting before judging
Anchoring bias	Hints about expected quality skew the score	Remove metadata, anonymize outputs
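
The position-bias mitigation is the easiest one to show in code. In this sketch the judge is any callable that returns the probability that the first response is better; `first_slot_fan` is a toy judge I invented with nothing but position bias.

```python
from typing import Callable

# judge(first, second) -> probability that the response shown first is better.
Judge = Callable[[str, str], float]

def debiased_preference(judge: Judge, a: str, b: str) -> float:
    """Estimate P(a beats b) by judging both orders and averaging."""
    a_shown_first = judge(a, b)
    a_shown_second = 1.0 - judge(b, a)
    return (a_shown_first + a_shown_second) / 2

def first_slot_fan(first: str, second: str) -> float:
    # Toy judge with pure position bias: it always favors the first slot.
    return 0.7

print(round(debiased_preference(first_slot_fan, "answer A", "answer B"), 2))  # 0.5
```

A judge that only cares about position ends up at 0.5 after swapping, which is the correct "no preference" verdict. It costs twice the calls per comparison.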

Putting It All Together

timeline
    title Building a Consistent LLM Scoring System
    section Foundation
        Step 1 : Set temperature=0 or low (0.1-0.3 for ensemble)
                : Remove randomness as much as possible
    section Prompt Quality
        Step 2 : Audit prompt for conflicts and ambiguity
                : Ensure system prompt, tools, rubric are aligned
        Step 3 : Write detailed per-score-level rubric
                : Add 2-3 anchor examples with explanations
                : Use narrow scales (1-5) or decomposed sub-criteria
    section Reliability
        Step 4 : Chain-of-thought before scoring
                : Reasoning influences the score, not post-hoc
        Step 5 : Structured output for final score
                : JSON schema with score + reasoning fields
    section Robustness
        Step 6 : Ensemble 3-5 calls, aggregate by median
                : Consider multi-model panel for high stakes
        Step 7 : Monitor score distributions for drift over time
                : Model updates can shift calibration

Each layer adds consistency. You don’t need all of them for every use case — but for a production scoring system, I’d use at least steps 1-5.

What techniques are you using for LLM consistency? Have you run into the same issues?

References:

[1] Zheng et al. “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.” NeurIPS 2023.
[2] Wang et al. “Self-Consistency Improves Chain of Thought Reasoning in Language Models.” ICLR 2023.
[3] Liu et al. “G-Eval: NLG Evaluation using GPT-4 with Chain-of-Thought and a Form-Filling Paradigm.” 2023.
[4] Kim et al. “Prometheus: Inducing Fine-Grained Evaluation Capability in Language Models.” ICLR 2024.
[5] Wang et al. “Large Language Models are not Fair Evaluators.” ACL 2024.
[6] Chan et al. “ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate.” 2023.
[7] Wallace et al. “The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions.” OpenAI, 2024.
[8] “Text Generation — Seed Parameter.” OpenAI.

PDF Meets LLM — The Tools, Trade-offs, and Pricing of Document Processing

2026-04-02T00:00:00+00:00

PDF processing was one of the first things I worked on as an AI engineer. Back then it was all about OCR pipelines. Now with multimodal LLMs, you can send a document page as an image and ask the model to understand it. But that doesn’t mean OCR is dead — far from it.

Native vs Scanned — The First Decision
The PDF Processing Toolkit
Redaction Before External Processing
PDF to Image — The LLM Bridge
LLM Document Understanding — Provider Pricing
OCR Services — Traditional Extraction Pricing
LLM Extraction vs OCR — The Trade-off
OCR + LLM — The Best of Both Worlds
Decision Framework

Native vs Scanned — The First Decision

The first decision in any PDF pipeline: is the document native or scanned?

Native PDFs (digitally created) have embedded text — extract it directly, no OCR, no LLM, no cost. Scanned PDFs are just images in a PDF container — you need OCR or a multimodal LLM to read them.
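
A simple heuristic for that check is to look at how much text each page yields when extracted directly. The sketch below works on already-extracted page strings (for example from PyMuPDF’s `page.get_text()`); the 20-character threshold and the extra "mixed" label are my assumptions, not a standard.

```python
def classify_pdf(page_texts: list[str], min_chars_per_page: int = 20) -> str:
    """Label a document 'native', 'scanned', or 'mixed' from per-page extracted text."""
    if not page_texts:
        raise ValueError("document has no pages")
    has_text = [len(text.strip()) >= min_chars_per_page for text in page_texts]
    if all(has_text):
        return "native"
    if not any(has_text):
        return "scanned"
    return "mixed"  # e.g. a digital report with scanned appendices

print(classify_pdf(["Invoice 2024-001, total due: $1,250.00", "Terms and conditions apply."]))
print(classify_pdf(["", "  "]))
```

Documents that come back "mixed" are worth routing page by page: direct extraction where text exists, OCR or a multimodal LLM for the rest.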

graph LR
    PDF["PDF"] --> CHECK{"Native?"}
    CHECK -->|Yes| TEXT["Direct Text"]
    CHECK -->|No| IMG["Page Images"]
    IMG --> OCR["OCR Service"]
    IMG --> LLM["Multimodal LLM"]
    TEXT --> PIPE["Pipeline"]
    OCR --> PIPE
    LLM --> PIPE

    style PDF fill:#264653,stroke:#264653,color:#fff
    style CHECK fill:#e9c46a,stroke:#e9c46a,color:#000
    style TEXT fill:#2a9d8f,stroke:#2a9d8f,color:#fff
    style IMG fill:#e9c46a,stroke:#e9c46a,color:#000
    style OCR fill:#e76f51,stroke:#e76f51,color:#fff
    style LLM fill:#f4a261,stroke:#f4a261,color:#000
    style PIPE fill:#2d6a4f,stroke:#2d6a4f,color:#fff

The PDF Processing Toolkit

For native PDFs, the Python ecosystem has solid tools:

Tool	Type	Best for	Notes
PyMuPDF [7] (fitz)	Python library	All-in-one (text + manipulation + rendering)	Fast C engine, no external deps
pikepdf [8]	Python library	Low-level PDF surgery, repair, linearization	Built on qpdf, handles corrupted PDFs
pypdf	Python library	Simple merge/split/encrypt	Pure Python, was PyPDF2, lightweight
ReportLab	Python library	Creating PDFs from scratch	Reports, invoices, charts
pdftk	CLI tool	Quick merge/split/rotate/encrypt	The classic, Java dependency
qpdf	CLI tool	Page manipulation, repair, linearization	Lightweight, no Java
Ghostscript	CLI tool	Compression, format conversion, rendering	Powerful but slow for large batches

PyMuPDF gives you plain text or line-by-line with bounding boxes (position, font, size) — critical for structured extraction where spatial position determines meaning. pikepdf repairs damaged PDFs and handles low-level surgery. pypdf is the lightweight option for simple merge/split.

# PyMuPDF — text with bounding boxes
import fitz
doc = fitz.open("document.pdf")
for page in doc:
    blocks = page.get_text("dict")["blocks"]
    for block in blocks:
        if block["type"] == 0:  # 0 = text block (1 = image block)
            for line in block["lines"]:
                for span in line["spans"]:
                    text, bbox = span["text"], span["bbox"]

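# pikepdf: repair by opening and re-saving (qpdf rebuilds the file structure)
# For encrypted files, pass password="..." to pikepdf.open()
import pikepdf
pdf = pikepdf.open("damaged.pdf")
pdf.save("repaired.pdf")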

# CLI tools for shell workflows
pdftk doc1.pdf doc2.pdf cat output merged.pdf
qpdf --empty --pages doc1.pdf 1-5 doc2.pdf 3-10 -- merged.pdf

My rule: PyMuPDF when I need text extraction + manipulation together. pikepdf for corrupted files. pypdf for minimal dependencies. pdftk/qpdf for shell one-liners.

Redaction Before External Processing

When dealing with PII, financial data, or medical records, redact before sending documents to any external service. PyMuPDF’s apply_redactions() actually removes underlying content — not just a black rectangle overlay. Some naive approaches just draw over text, which is still extractable. Redact first, extract second.

PDF to Image — The LLM Bridge

Converting pages to images is essential for feeding documents to multimodal LLMs or OCR services:

import fitz
doc = fitz.open("document.pdf")
page = doc[0]
pix = page.get_pixmap(dpi=300)
image_bytes = pix.tobytes("png")
# Send to any LLM or OCR service

Both Textract [1] and Azure Document Intelligence [2] support batch document processing, but can be slow for large docs. When you don’t need cross-page layout analysis, send pages individually as images for better parallelism and error handling.
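Here is a minimal sketch of the per-page approach, using only the standard library. `process_pages` and `fake_ocr` are hypothetical names; in a real pipeline `process_page` would wrap whichever OCR or LLM call you use. The point is that each page succeeds or fails on its own.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_pages(pages, process_page, max_workers=8):
    """Run process_page on every page image concurrently.

    Returns (results, errors), both keyed by page index, so one failed
    page never discards the pages that succeeded.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_page, img): i for i, img in enumerate(pages)}
        for future in as_completed(futures):
            index = futures[future]
            try:
                results[index] = future.result()
            except Exception as exc:  # record the failure and keep going
                errors[index] = exc
    return results, errors


def fake_ocr(image_bytes):
    # Hypothetical stand-in for a real OCR or LLM call.
    if not image_bytes:
        raise ValueError("empty page image")
    return f"{len(image_bytes)} bytes processed"


results, errors = process_pages([b"page-1", b"", b"page-three"], fake_ocr)
print(sorted(results))  # indices of pages that succeeded
print(sorted(errors))   # indices of pages to retry
```

Threads fit here because the work is network-bound. The `errors` dict gives you an exact retry list instead of a failed batch.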

LLM Document Understanding — Provider Pricing

Every major LLM provider supports image/document input, but pricing varies wildly. Important: you pay for both input tokens (the image) and output tokens (the extracted text). Most comparisons only show input cost, which is misleading.

Assuming ~500 output tokens per page when extracting text as markdown:

Provider	Model	Input $/M	Output $/M	Input tokens/page	Total per 1K pages
Google [3]	Gemini Flash 2.5	$0.30	$2.50	~250-500	~$1.35-1.40
OpenAI [5]	GPT-4o-mini	$0.15	$0.60	~765-1,105	~$0.41-0.47
OpenAI [5]	GPT-4o	$2.50	$10.00	~765-1,105	~$6.90-7.75
Anthropic [4]	Claude Haiku 4.5	$1.00	$5.00	~1,500-3,000	~$4.00-5.50
Anthropic [4]	Claude Sonnet 4.6	$3.00	$15.00	~1,500-3,000	~$12.00-16.50
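The totals in the table are plain arithmetic, and it is worth being able to rerun them when prices change. A small sketch, assuming the prices and token counts from the table and the ~500 output tokens per page stated above (`cost_per_1k_pages` is a name I made up for illustration):

```python
def cost_per_1k_pages(input_price, output_price, input_tokens, output_tokens=500):
    """Dollar cost for 1,000 pages; prices are dollars per million tokens."""
    pages = 1_000
    input_cost = pages * input_tokens * input_price / 1_000_000
    output_cost = pages * output_tokens * output_price / 1_000_000
    return input_cost + output_cost


# (input $/M, output $/M, low and high input tokens per page), copied from the table
models = {
    "Gemini Flash 2.5": (0.30, 2.50, 250, 500),
    "GPT-4o-mini": (0.15, 0.60, 765, 1_105),
    "GPT-4o": (2.50, 10.00, 765, 1_105),
    "Claude Haiku 4.5": (1.00, 5.00, 1_500, 3_000),
    "Claude Sonnet 4.6": (3.00, 15.00, 1_500, 3_000),
}

for name, (inp, out, low, high) in models.items():
    low_cost = cost_per_1k_pages(inp, out, low)
    high_cost = cost_per_1k_pages(inp, out, high)
    print(f"{name}: ${low_cost:.2f} to ${high_cost:.2f} per 1K pages")
```

Note how much of each total comes from output tokens: for Gemini Flash the output side is $1.25 of the total, which is why input-only comparisons mislead.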

OpenAI divides images into 512×512 tiles in high detail mode — 170 tokens/tile + 85 base. A typical page (~1024×1024) is ~765 tokens. Low detail: flat 85 tokens.
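The tile rule can be sketched in a few lines. This uses only the numbers quoted above (512×512 tiles, 170 tokens per tile, 85 base). It assumes the width and height you pass in are the dimensions after any resizing the API applies before tiling, which is not modeled here. `high_detail_tokens` is a hypothetical helper name.

```python
import math


def high_detail_tokens(width, height, tile=512, per_tile=170, base=85):
    """Token estimate from the tile rule: base cost plus a cost per 512x512 tile.

    Assumption: width and height are the post-resize dimensions; the
    resizing step itself is not modeled.
    """
    tiles = math.ceil(width / tile) * math.ceil(height / tile)
    return base + per_tile * tiles


print(high_detail_tokens(1024, 1024))  # 4 tiles -> 765
print(high_detail_tokens(1024, 1536))  # 6 tiles -> 1105
```

The two results match the ~765-1,105 range in the table: a roughly square page lands on 4 tiles, a taller one on 6.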

Anthropic extracts text AND converts each page to an image — you pay for both. A 50-page document can consume 75,000-150,000 tokens just in context.

Gemini treats each PDF page as one image with a fixed token cost, which gives it the lowest input token count per page in the table above.

The OmniAI OCR benchmark [11] tested 9 providers on 1,000 documents. Gemini Flash achieved the best character error rate (CER, lower is better) among multimodal LLMs at 15%, vs 25% for GPT-4o. Traditional OCR still leads on pure accuracy, but the gap has narrowed, especially for printed text.
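For anyone who wants to measure this on their own documents: CER is usually defined as the character-level edit distance between the extracted text and the ground truth, divided by the length of the ground truth. The sketch below assumes that standard definition. I have not checked how the benchmark itself computes its numbers.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def cer(reference, hypothesis):
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# One misread character (zero read as the letter O) in a 13-character string
print(f"{cer('Invoice 10482', 'Invoice 1O482'):.3f}")
```

A single wrong character in an invoice number is a tiny CER and still a wrong invoice number, so the aggregate score does not tell you which fields to trust.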

OCR Services — Traditional Extraction Pricing

AWS Textract [1] (per 1,000 pages, US region):

Feature	First 1M pages	After 1M pages
Detect Text (OCR only)	$1.50	$0.60
Tables	$15.00	$10.00
Forms (key-value pairs)	$50.00	$30.00
Queries (custom questions)	$25.00	$15.00
Tables + Forms + Queries	$90.00	$55.00

Azure Document Intelligence [2] (per 1,000 pages):

Model	Price per 1,000 pages
Read (OCR text extraction)	$1.50
Layout (text + tables + structure)	$10.00
Prebuilt (invoices, receipts, IDs)	$10.00
Custom extraction	$25.00

Gemini Flash 2.5 at ~$1.35/1K is comparable to basic OCR ($1.50/1K), but you get document understanding, not just raw text. GPT-4o-mini at ~$0.41/1K is the cheapest overall. Claude Sonnet at ~$12-16.50/1K is 8-11x more expensive than basic OCR.

LLM Extraction vs OCR — The Trade-off

Gemini’s document understanding [3] is impressive for the price:

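import os
import google.generativeai as genai

# Assumes the API key is in the GEMINI_API_KEY environment variable
genai.configure(api_key=os.environ["GEMINI_API_KEY"])

model = genai.GenerativeModel("gemini-2.5-flash")
# image_bytes: PNG bytes of one page, e.g. from pix.tobytes("png")
response = model.generate_content([
    "Extract all text from this document page in markdown format.",
    {"mime_type": "image/png", "data": image_bytes}
])
print(response.text)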

But there’s a catch: hallucination. LLMs sometimes add content that isn’t there, misread numbers, or reformat in meaning-changing ways. OCR has no hallucination risk — it either reads the character correctly or it doesn’t.

OCR + LLM — The Best of Both Worlds

The approach that actually works best for information extraction: combine OCR and LLM. Instead of asking the LLM to both read and understand the document (image → LLM), split the responsibilities: OCR handles reading, LLM handles understanding.

graph LR
    IMG["Page Image"] --> OCR["OCR Service"]
    OCR --> TXT["Accurate Text"]
    TXT --> LLM["LLM"]
    LLM --> OUT["Structured Data"]

    style IMG fill:#264653,stroke:#264653,color:#fff
    style OCR fill:#e76f51,stroke:#e76f51,color:#fff
    style TXT fill:#e9c46a,stroke:#e9c46a,color:#000
    style LLM fill:#2a9d8f,stroke:#2a9d8f,color:#fff
    style OUT fill:#2d6a4f,stroke:#2d6a4f,color:#fff

The naive approach sends the image directly to an LLM — it does OCR and reasoning in one shot. When it fails, you don’t know which step failed. Was the text misread, or was the logic wrong?

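# Naive: image → LLM (OCR + reasoning in one shot)
response = model.generate_content([
    "Extract invoice number, date, and total.",
    {"mime_type": "image/png", "data": image_bytes}
])
# Risk: misread characters, hallucinated fields

# Better: OCR → text → LLM (separated concerns)
import boto3
from openai import OpenAI

textract_client = boto3.client("textract")
client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Textract returns a list of blocks; keep the LINE blocks as plain text
result = textract_client.detect_document_text(Document={"Bytes": image_bytes})
ocr_text = "\n".join(
    block["Text"] for block in result["Blocks"] if block["BlockType"] == "LINE"
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Extract structured data from this OCR text."},
        {"role": "user", "content": f"Extract invoice number, date, total:\n\n{ocr_text}"},
    ]
)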

OCR gives you reliable text (no hallucination). LLM operates on text (which it’s great at) instead of pixels (where it stumbles). And text tokens are cheaper than image tokens.

Approach	OCR cost	LLM cost	Total per 1K pages	Accuracy
Image → LLM (naive)	$0	~$0.41-16.50	~$0.41-16.50	Moderate (hallucination risk)
OCR → LLM (combined)	$1.50	~$0.05-0.50	~$1.55-2.00	High (no vision errors)
OCR → LLM + structured output	$1.50	~$0.10-1.00	~$1.60-2.50	Highest (validated schema)

The sweet spot: basic OCR ($1.50/1K) + GPT-4o-mini for reasoning (~$1.55-2.00 total per 1K pages). For native PDFs, replace OCR with direct text extraction (free).
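The "structured output" row in the table relies on checking what the LLM returns before trusting it. A minimal sketch of that check: parse the JSON, require the expected fields, and confirm that each value appears verbatim in the OCR text. The field names and the `validate_extraction` helper are hypothetical, and the verbatim match is a deliberately strict assumption (a date the model reformats would be flagged too).

```python
import json

REQUIRED_FIELDS = ("invoice_number", "date", "total")  # hypothetical schema


def validate_extraction(llm_json, ocr_text):
    """Parse the LLM reply and return (data, problems)."""
    try:
        data = json.loads(llm_json)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return None, ["expected a JSON object"]

    problems = []
    for field in REQUIRED_FIELDS:
        value = data.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing field: {field}")
        elif value not in ocr_text:
            # Not present verbatim in the OCR text, so the model may have
            # invented or reformatted it.
            problems.append(f"not found in OCR text: {field}={value!r}")
    return data, problems


ocr_text = "INVOICE INV-2041\nDate: 2026-03-14\nTotal due: 1,280.00"
good = '{"invoice_number": "INV-2041", "date": "2026-03-14", "total": "1,280.00"}'
bad = '{"invoice_number": "INV-2041", "date": "2026-03-14", "total": "1,230.00"}'

print(validate_extraction(good, ocr_text)[1])  # no problems
print(validate_extraction(bad, ocr_text)[1])   # total is flagged
```

This check is only possible because OCR and reasoning are separated: with image → LLM there is no independent text to compare against.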

Decision Framework

Need	Approach	Cost per 1K pages	Why
Exact text from native PDFs	PyMuPDF / pypdf (direct)	Free	No OCR needed, perfect fidelity
Summarize or quick understanding	Image → Gemini Flash 2.5 or GPT-4o-mini	~$0.41-1.35	Cheap, good enough when exact text isn’t critical
Exact text from scanned docs	Textract or Azure (Read)	$1.50	Reliable OCR, no hallucination
Robust information extraction	OCR → LLM (text, not image)	~$1.55-2.00	Best trade-off: OCR accuracy + LLM reasoning
Table extraction	Textract or Azure (Layout)	$10-15	Structured output with positions
Complex understanding	Image → Claude Sonnet or GPT-4o	~$7-17	Best reasoning, most expensive
Forms and key-value pairs	Textract or Azure (Forms)	$10-50	Accurate but expensive
Compliance-critical	OCR + human review	$1.50-50	Zero hallucination risk

Always check if the PDF is native first. If it is, you get perfect text for free. For scanned documents, choose based on accuracy needs and budget — LLM for understanding, OCR for fidelity.
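One way to sketch the "is it native?" check is as a heuristic over the text you get from direct extraction. The function below takes one extracted string per page (for example, what PyMuPDF's `page.get_text()` returns) and calls the PDF native if most pages contain a reasonable amount of text. `is_native_pdf` and both thresholds are my own assumptions, so tune them on your documents.

```python
def is_native_pdf(page_texts, min_chars=50, min_fraction=0.8):
    """Heuristic: treat a PDF as native if most pages yield real text.

    page_texts holds one extracted string per page. Both thresholds are
    assumptions, not values taken from any library.
    """
    if not page_texts:
        return False
    pages_with_text = sum(len(text.strip()) >= min_chars for text in page_texts)
    return pages_with_text / len(page_texts) >= min_fraction


print(is_native_pdf(["Lorem ipsum dolor sit amet. " * 10] * 5))  # text on every page
print(is_native_pdf(["", " \n", ""]))                            # scanned: no text layer
```

Mixed documents exist (native pages with a few scanned inserts), so in practice you may want to make this decision per page rather than per file.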

What’s your PDF processing stack? Are you using LLM-based extraction, or sticking with traditional OCR?

References:

[1] “Amazon Textract — Pricing.” AWS.
[2] “Azure Document Intelligence — Pricing.” Microsoft Azure.
[3] “Gemini Developer API — Pricing.” Google AI.
[4] “Vision — Claude API.” Anthropic.
[5] “Pricing.” OpenAI.
[6] “Images and Vision.” OpenAI.
[7] “PyMuPDF Documentation.” Artifex.
[8] “pikepdf Documentation.” pikepdf.
[9] “Amazon Textract — Features.” AWS.
[10] “Document Intelligence — Layout Model.” Microsoft Learn.
[11] “OmniAI OCR Benchmark.” OmniAI.

Prompt Caching — The Hidden Layer That Saves You Money and Time

2026-04-02T00:00:00+00:00

If you’re building LLM-powered applications and not thinking about prompt caching, you’re probably paying more than you need to. This is one of those features that doesn’t get enough attention compared to model capabilities, but it has a direct impact on cost and latency.

Let me walk through what I’ve learned.

Provider Comparison
Deep Dive: Each Provider
Practical Takeaways

Every time you make an API call to an LLM, you’re sending the full prompt: system instructions, tool definitions, conversation history, and the latest user message. For a multi-turn conversation with a detailed system prompt and 20+ tools, that prefix can be thousands of tokens — and you’re paying for all of them on every single call. In agentic workflows where the model makes multiple tool calls per turn, this adds up fast.

Prompt caching solves this. The idea is simple: if the beginning of your prompt hasn’t changed since the last call, don’t reprocess it. Cache it and reuse it.

Provider Comparison

Here’s how the three major providers compare:

Feature	Anthropic	OpenAI	Google Gemini
Launch	Aug 2024	Oct 2024	Jun 2024
Mode	Explicit (manual breakpoints)	Implicit (automatic)	Explicit (cached resource)
Min tokens	1,024 - 2,048	1,024	32,768
TTL	5 min (refreshes on hit)	~5-60 min (automatic)	Configurable (default 1hr)
Write cost	+25% surcharge	No surcharge	Standard
Read discount	90% off	50% off	~75% off
Max breakpoints	4 per request	N/A	N/A
Best for	Agentic workflows, many tools	Zero-config simplicity	Massive contexts (docs, codebases)

And here’s how the prompt layers map to caching priority — the most stable content sits at the top, the most variable at the bottom:

Layer	Stability	Cache behavior	Change =
1. System Prompt	Highest	Cached first	Invalidates everything
2. Tool Definitions	High	Cached after system	Invalidates tools + messages
3. Message History	Growing	Older messages cached	Only new messages re-processed
4. Latest User Message	None	Never cached	Changes every turn

graph LR
    A["System Prompt"] --> B["Tools"]
    B --> C["Messages"]
    C --> D["User Input"]

    style A fill:#2d6a4f,stroke:#1b4332,color:#fff
    style B fill:#40916c,stroke:#2d6a4f,color:#fff
    style C fill:#74c69d,stroke:#40916c,color:#000
    style D fill:#d8f3dc,stroke:#74c69d,color:#000

The green gradient shows stability: dark green (most stable, cached first) to light (most variable, never cached). Change anything early, and everything after it is invalidated.
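To make the invalidation rule concrete, here is a toy model that treats a prompt as an ordered list of segments and counts how many leading segments two requests share. Real providers match on tokens, not on named segments, so this is only an illustration; `cached_prefix_length` is a hypothetical helper.

```python
def cached_prefix_length(previous, current):
    """Number of leading segments that are identical in both requests."""
    length = 0
    for old, new in zip(previous, current):
        if old != new:
            break
        length += 1
    return length


first = ["system v1", "tools v1", "msg 1", "msg 2"]
appended = ["system v1", "tools v1", "msg 1", "msg 2", "msg 3"]
edited_system = ["system v2", "tools v1", "msg 1", "msg 2", "msg 3"]

print(cached_prefix_length(first, appended))       # 4: the whole old request is reused
print(cached_prefix_length(first, edited_system))  # 0: nothing is reused
```

Appending a message keeps all four earlier segments reusable. Editing the first segment drops the reusable prefix to zero, even though everything after it is unchanged.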

Here’s when each provider launched:

timeline
    title Prompt Caching Timeline
    section Google
        Jun 2024 : Context Caching for Gemini
                 : Explicit cached resource
                 : Min 32,768 tokens
                 : Configurable TTL (default 1hr)
                 : Storage cost per hour
    section Anthropic
        Aug 2024 : Prompt Caching
                 : Manual cache_control breakpoints
                 : Min 1,024 tokens
                 : 5-min TTL (refreshes on hit)
                 : 90% read discount
                 : 25% write surcharge
    section OpenAI
        Oct 2024 : Automatic Prompt Caching
                 : Zero configuration
                 : Min 1,024 tokens
                 : 50% read discount
                 : No write surcharge

The cost impact over multiple requests — say you have a 5,000-token cached prefix (system + tools) and make 10 API calls:

Request	Anthropic (cached prefix cost)	OpenAI (cached prefix cost)
#1 (cold)	5,000 × 1.25x = 6,250 token-equivalents	5,000 × 1.0x = 5,000
#2 (warm)	5,000 × 0.1x = 500	5,000 × 0.5x = 2,500
#3-10 (warm)	500 each × 8 = 4,000	2,500 each × 8 = 20,000
Total (10 calls)	10,750 (vs 50,000 without caching)	27,500 (vs 50,000)
Savings	~78% off	~45% off

Anthropic’s higher write surcharge pays for itself after just 2 requests. By request 10, the 90% read discount dominates.
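The table above is easy to reproduce, which helps when you want to plug in your own prefix size or call count. A sketch, assuming the first call is a cache write and every later call is a hit (`prefix_cost` is a hypothetical helper, and the multipliers are the ones from the comparison table):

```python
def prefix_cost(prefix_tokens, calls, write_multiplier, read_multiplier):
    """Token-equivalents billed for the cached prefix over a run of calls.

    Assumes one cache write on the first call and a cache hit on every
    later call.
    """
    if calls < 1:
        return 0.0
    return prefix_tokens * (write_multiplier + read_multiplier * (calls - 1))


uncached = 5_000 * 10
anthropic = prefix_cost(5_000, 10, write_multiplier=1.25, read_multiplier=0.10)
openai = prefix_cost(5_000, 10, write_multiplier=1.00, read_multiplier=0.50)

print(f"Anthropic: {anthropic:,.0f} of {uncached:,} token-equivalents")
print(f"OpenAI:    {openai:,.0f} of {uncached:,} token-equivalents")

# Break-even check: after two calls, caching already beats no caching
print(prefix_cost(5_000, 2, 1.25, 0.10) < 5_000 * 2)
```

The same function shows why the gap widens with call count: every extra call adds 500 token-equivalents on the Anthropic side and 2,500 on the OpenAI side.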

Now let me go deeper into each provider.

Deep Dive: Each Provider

Google was actually first to ship this, launching Context Caching for Gemini [5] in June 2024. But it’s designed for a different use case — very large contexts (minimum 32,768 tokens) that persist for hours. You create a cached resource explicitly and reference it across requests. It comes with a storage cost per hour, so it makes sense when you’re doing many requests against the same large document or codebase.

Anthropic introduced prompt caching [1] in August 2024, and for me this is where it got interesting. Their approach is manual and explicit. You mark specific points in your prompt with cache_control breakpoints. The system caches everything from the start of the prompt up to each breakpoint. On the next request, if the prefix up to a breakpoint is byte-for-byte identical, you get a cache hit.

The structure follows the natural order of a prompt:

First, the system prompt. This is the most stable part — your instructions, persona, rules. It sits at the very beginning and almost never changes between requests.

Second, tool definitions. If you have tools configured, their descriptions go next. These also tend to be stable across a conversation.

Third, messages. The conversation history, oldest first. As the conversation grows, the older messages form a stable prefix.

Fourth, the latest user message. This changes every turn, so it’s almost never cached.

Here’s what a well-structured Anthropic API request looks like with cache breakpoints:

{
  "model": "claude-sonnet-4-20250514",
  "max_tokens": 4096,
  "system": [
    {
      "type": "text",
      "text": "You are an AI assistant with access to many tools...",
      "cache_control": {"type": "ephemeral"}
    }
  ],
  "tools": [
    {"name": "search", "description": "...", "input_schema": {"...": "..."}},
    {
      "name": "write_file",
      "description": "...",
      "input_schema": {"...": "..."},
      "cache_control": {"type": "ephemeral"}
    }
  ],
  "messages": [
    {"role": "user", "content": "Find and fix the bug in auth.py"}
  ]
}

Notice: cache_control goes on the last item in each block — the last system text block, the last tool.
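If you build requests in code, that placement rule is simple to automate. The helper below is a sketch that works on a plain request dict shaped like the JSON above. It does not call any SDK, and `add_cache_breakpoints` is a hypothetical name.

```python
import copy

EPHEMERAL = {"type": "ephemeral"}


def add_cache_breakpoints(request):
    """Return a copy with cache_control on the last system block and last tool."""
    marked = copy.deepcopy(request)
    for section in ("system", "tools"):
        items = marked.get(section)
        if isinstance(items, list) and items:
            items[-1]["cache_control"] = dict(EPHEMERAL)
    return marked


request = {
    "system": [{"type": "text", "text": "You are an AI assistant with access to many tools..."}],
    "tools": [{"name": "search"}, {"name": "write_file"}],
    "messages": [{"role": "user", "content": "Find and fix the bug in auth.py"}],
}

marked = add_cache_breakpoints(request)
print(marked["tools"][-1])  # last tool now carries cache_control
print(marked["tools"][0])   # earlier tools are untouched
```

Working on a deep copy keeps the original request unchanged, so the same base dict can be reused across turns.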

This ordering matters because caching is prefix-based. If you change something early — say you modify the system prompt — everything downstream is invalidated. That’s why you want the most stable content at the front and the most variable content at the end.

Anthropic’s pricing makes the economics clear: cache writes cost 25% more than normal input tokens, but cache reads are 90% cheaper. So you pay a small premium on the first call, then save dramatically on every subsequent call. The break-even on the cached prefix comes at the second request: one write plus one read costs 1.25x + 0.1x = 1.35x the normal prefix price, versus 2x without caching. From there, savings on the cached prefix approach 90% as the number of requests grows.

There are some constraints. You need at least 1,024 tokens for Claude 3.5 Sonnet and Opus, or 2,048 for Haiku. You get up to 4 breakpoints per request. And the cache has a 5-minute TTL that refreshes on each hit — so as long as requests keep coming, the cache stays warm.
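The "refreshes on each hit" behavior is the part that is easy to misread, so here is a toy simulation. It models only the TTL rule described above with explicit timestamps in seconds. It is not how the service is implemented, and `PrefixCache` is a hypothetical class.

```python
class PrefixCache:
    """Toy model of a cache whose TTL is refreshed on every hit."""

    def __init__(self, ttl_seconds=300):
        self.ttl_seconds = ttl_seconds
        self.expires_at = {}  # prefix key -> expiry timestamp

    def request(self, key, now):
        """Return True on a cache hit; write or refresh the entry either way."""
        hit = self.expires_at.get(key, float("-inf")) > now
        self.expires_at[key] = now + self.ttl_seconds
        return hit


cache = PrefixCache()
print(cache.request("prefix", now=0))    # False: cold write
print(cache.request("prefix", now=240))  # True: hit, TTL now runs to 540
print(cache.request("prefix", now=500))  # True: still warm because of the refresh
print(cache.request("prefix", now=900))  # False: 400 s of silence, entry expired
```

The third call lands 500 seconds after the first write and still hits. What matters is the gap between consecutive requests, not the age of the original write.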

The latency improvement is significant too. Anthropic reports up to 85% reduction in time-to-first-token for long prompts. In agentic workflows where the model might make 5-10 tool calls in a row, each one reusing the same system prompt and tools, this is a real difference.

OpenAI followed in October 2024 [3] with a different philosophy: automatic caching. No breakpoints, no configuration. The system detects when the first 1,024+ tokens of a prompt match a previous request and caches automatically, checking in 128-token increments after that.
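The 1,024-token minimum and the 128-token increments translate into a simple estimate of how many tokens can be served from cache for a given matching prefix. This sketch assumes the matched length is rounded down to the nearest increment, which is my reading of the rule and not something spelled out above. `openai_cached_tokens` is a hypothetical helper.

```python
def openai_cached_tokens(matching_prefix_tokens, minimum=1024, step=128):
    """Cached-token estimate from the 1,024 minimum and 128-token increments.

    Assumption: the matched prefix is rounded down to a multiple of step.
    """
    if matching_prefix_tokens < minimum:
        return 0
    return (matching_prefix_tokens // step) * step


for n in (900, 1024, 1100, 1300, 5000):
    print(n, openai_cached_tokens(n))
```

Under that assumption, a 900-token prefix gets nothing, and a 1,100-token prefix is cached only up to 1,024 tokens. Short system prompts therefore see little or no benefit.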

The trade-off is different. OpenAI gives you a 50% discount on cache hits with no write surcharge. Less aggressive savings than Anthropic’s 90%, but also no upfront cost penalty. You just structure your prompts well and caching happens transparently.

OpenAI explicitly recommends the same prompt ordering — static content like system instructions and tool definitions at the beginning, variable content like user-specific data at the end. Same principle, just automated.

Practical Takeaways

The practical takeaway is about prompt architecture. Once you understand that caching is prefix-based, you start designing your prompts differently:

Keep your system prompt stable. Don’t inject dynamic data into it unless necessary. Version it carefully. Any change invalidates everything.

Put tool definitions before messages. Tools change less frequently than conversation content. If your tools are deterministically ordered (same order every request), the prefix stays stable through the tools layer.

Append, don’t rewrite. For multi-turn conversations, always append new messages. Don’t restructure the history. The older messages form a stable prefix that gets cached.
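A cheap way to catch accidental prefix changes is to fingerprint the stable part of each request and log it. The sketch below sorts tools by name and serializes with a canonical JSON encoding, so two requests that should share a prefix produce the same hash. This is a debugging aid of my own, not something the providers expose. They match on tokens, so an identical fingerprint is a good sign but not a guarantee of a cache hit.

```python
import hashlib
import json


def prefix_fingerprint(system_prompt, tools, messages):
    """Hash a canonical serialization of the system prompt, tools and history."""
    ordered_tools = sorted(tools, key=lambda tool: tool["name"])
    prefix = {"system": system_prompt, "tools": ordered_tools, "messages": messages}
    canonical = json.dumps(prefix, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


tools = [{"name": "write_file", "description": "..."}, {"name": "search", "description": "..."}]
history = [{"role": "user", "content": "Find and fix the bug in auth.py"}]

a = prefix_fingerprint("You are a coding assistant.", tools, history)
b = prefix_fingerprint("You are a coding assistant.", list(reversed(tools)), history)
c = prefix_fingerprint("You are a coding assistant. Today is Monday.", tools, history)

print(a == b)  # True: tool order no longer matters
print(a == c)  # False: dynamic data in the system prompt changed the prefix
```

If the fingerprint changes between two turns of the same conversation, something upstream is rewriting the prefix, and that is usually where the cache misses come from.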

This is also why understanding the caching layer matters when you’re choosing between providers or optimizing costs. If you have long, stable system prompts with many tools (think: agentic applications), Anthropic’s 90% read discount is extremely aggressive. If you want zero-configuration simplicity and moderate savings, OpenAI’s automatic approach is easier to adopt. If you’re working with massive contexts (entire codebases, long documents), Google’s context caching with configurable TTL might be the right fit.

Most developers I talk to think about prompt engineering in terms of what to say. But how you structure the prompt — what goes where, what stays stable — is just as important for production systems. Caching turns prompt architecture into a cost and performance lever.

Are you using prompt caching in production? I’d love to hear how it’s affected your costs.

References:

[1] “Prompt Caching.” Anthropic.
[2] “Prompt Caching with Claude.” Anthropic Blog, Aug 2024.
[3] “Prompt Caching.” OpenAI.
[4] “API Prompt Caching.” OpenAI Blog, Oct 2024.
[5] “Context Caching.” Google Gemini.
