Prompt Caching — The Hidden Layer That Saves You Money and Time
If you’re building LLM-powered applications and not thinking about prompt caching, you’re probably paying more than you need to. This is one of those features that doesn’t get enough attention compared to model capabilities, but it has a direct impact on cost and latency.
Let me walk through what I’ve learned.
Every time you make an API call to an LLM, you’re sending the full prompt: system instructions, tool definitions, conversation history, and the latest user message. For a multi-turn conversation with a detailed system prompt and 20+ tools, that prefix can be thousands of tokens — and you’re paying for all of them on every single call. In agentic workflows where the model makes multiple tool calls per turn, this adds up fast.
Prompt caching solves this. The idea is simple: if the beginning of your prompt hasn’t changed since the last call, don’t reprocess it. Cache it and reuse it.
Provider Comparison
Here’s how the three major providers compare:
| Feature | Anthropic | OpenAI | Google Gemini |
|---|---|---|---|
| Launch | Aug 2024 | Oct 2024 | Jun 2024 |
| Mode | Explicit (manual breakpoints) | Implicit (automatic) | Explicit (cached resource) |
| Min tokens | 1,024 - 2,048 | 1,024 | 32,768 |
| TTL | 5 min (refreshes on hit) | ~5-60 min (automatic) | Configurable (default 1hr) |
| Write cost | +25% surcharge | No surcharge | Standard |
| Read discount | 90% off | 50% off | ~75% off |
| Max breakpoints | 4 per request | N/A | N/A |
| Best for | Agentic workflows, many tools | Zero-config simplicity | Massive contexts (docs, codebases) |
And here’s how the prompt layers map to caching priority — the most stable content sits at the top, the most variable at the bottom:
| Layer | Stability | Cache behavior | Effect of a change |
|---|---|---|---|
| 1. System Prompt | Highest | Cached first | Invalidates it and everything after it |
| 2. Tool Definitions | High | Cached after system | Invalidates tools + messages |
| 3. Message History | Growing | Older messages cached | Only new messages re-processed |
| 4. Latest User Message | None | Never cached | Nothing invalidated (changes every turn) |
graph LR
    A["System Prompt"] --> B["Tools"]
    B --> C["Messages"]
    C --> D["User Input"]
    style A fill:#2d6a4f,stroke:#1b4332,color:#fff
    style B fill:#40916c,stroke:#2d6a4f,color:#fff
    style C fill:#74c69d,stroke:#40916c,color:#000
    style D fill:#d8f3dc,stroke:#74c69d,color:#000
The green gradient shows stability: dark green (most stable, cached first) to light (most variable, never cached). Change anything early, and everything after it is invalidated.
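To make that invalidation rule concrete, here is a minimal sketch that models a prefix cache as a chain of hashes, one per layer. It only illustrates prefix-based keying and is not how any provider actually implements its cache. The `layer_keys` and `matching_layers` helpers and the sample layer contents are hypothetical.

```python
import hashlib

def layer_keys(layers):
    """Return one cache key per layer, each covering all content up to that layer."""
    keys = []
    running = hashlib.sha256()
    for content in layers:
        running.update(content.encode("utf-8"))
        running.update(b"\x00")  # separator so layer boundaries affect the key
        # copy() so the running hash keeps accumulating for later layers
        keys.append(running.copy().hexdigest())
    return keys

def matching_layers(old_layers, new_layers):
    """Count how many leading layers would still hit the cache."""
    hits = 0
    for old_key, new_key in zip(layer_keys(old_layers), layer_keys(new_layers)):
        if old_key != new_key:
            break
        hits += 1
    return hits

# Hypothetical layer contents, ordered as in the table above
base = ["SYSTEM: be concise", "TOOLS: search, write_file", "HISTORY: turn 1", "USER: hi"]
new_user = base[:3] + ["USER: fix the bug"]       # only the last layer changes
new_system = ["SYSTEM: be verbose"] + base[1:]    # the first layer changes

print(matching_layers(base, new_user))    # 3
print(matching_layers(base, new_system))  # 0
```

Changing the last layer leaves three layers reusable. Changing the first leaves none, even though the other three are byte-identical.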
Here’s when each provider launched:
timeline
    title Prompt Caching Timeline
    section Google
        Jun 2024 : Context Caching for Gemini
                 : Explicit cached resource
                 : Min 32,768 tokens
                 : Configurable TTL (default 1hr)
                 : Storage cost per hour
    section Anthropic
        Aug 2024 : Prompt Caching
                 : Manual cache_control breakpoints
                 : Min 1,024 tokens
                 : 5-min TTL (refreshes on hit)
                 : 90% read discount
                 : 25% write surcharge
    section OpenAI
        Oct 2024 : Automatic Prompt Caching
                 : Zero configuration
                 : Min 1,024 tokens
                 : 50% read discount
                 : No write surcharge
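To see the cost impact over multiple requests, say you have a 5,000-token cached prefix (system + tools) and make 10 API calls, each arriving before the cache expires. Costs are in input-token equivalents at the base price: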
| Request | Anthropic (cached prefix cost) | OpenAI (cached prefix cost) |
|---|---|---|
| #1 (cold) | 5,000 × 1.25x = 6,250 token-equivalents | 5,000 × 1.0x = 5,000 |
| #2 (warm) | 5,000 × 0.1x = 500 | 5,000 × 0.5x = 2,500 |
| #3-10 (warm) | 500 each × 8 = 4,000 | 2,500 each × 8 = 20,000 |
| Total (10 calls) | 10,750 (vs 50,000 without caching) | 27,500 (vs 50,000) |
| Savings | ~78% off | ~45% off |
Anthropic’s write surcharge pays for itself by the second request: 6,250 + 500 = 6,750 token-equivalents, against 10,000 without caching. By request 10, the 90% read discount dominates.
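If you want to plug in your own numbers, the arithmetic in the table can be sketched in a few lines of Python. The multipliers are the ones quoted in this post. The function names are my own, and the model assumes the first call is a cache write and every later call is a hit, which will not hold if requests are spaced further apart than the TTL.

```python
def prefix_cost(prefix_tokens, calls, write_multiplier, read_multiplier):
    """Cumulative cost of the cached prefix, in base-price input-token equivalents.

    Assumes call 1 writes the cache and every later call reads from it.
    """
    if calls <= 0:
        return 0.0
    first = prefix_tokens * write_multiplier
    rest = prefix_tokens * read_multiplier * (calls - 1)
    return first + rest

def break_even_call(write_multiplier, read_multiplier):
    """First call number at which cumulative cached cost drops below uncached cost."""
    if read_multiplier >= 1:
        return None  # reads are not discounted, so caching never pays off
    calls = 1
    while write_multiplier + read_multiplier * (calls - 1) >= calls:
        calls += 1
    return calls

PREFIX = 5_000
CALLS = 10
uncached = PREFIX * CALLS

anthropic = prefix_cost(PREFIX, CALLS, write_multiplier=1.25, read_multiplier=0.10)
openai = prefix_cost(PREFIX, CALLS, write_multiplier=1.00, read_multiplier=0.50)

for name, cost in [("Anthropic", anthropic), ("OpenAI", openai)]:
    saving = 1 - cost / uncached
    print(f"{name}: {cost:,.0f} vs {uncached:,} uncached ({saving:.1%} saved)")

print("Anthropic break-even call:", break_even_call(1.25, 0.10))
```

This reproduces the 10,750 and 27,500 totals from the table. It only covers the cached prefix, so the uncached tail of each request (new messages, output tokens) still has to be added on top.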
Now let me go deeper into each provider.
Deep Dive: Each Provider
Google was actually first to ship this, launching Context Caching for Gemini [5] in June 2024. But it’s designed for a different use case — very large contexts (minimum 32,768 tokens) that persist for hours. You create a cached resource explicitly and reference it across requests. It comes with a storage cost per hour, so it makes sense when you’re doing many requests against the same large document or codebase.
Anthropic introduced prompt caching [1] in August 2024, and for me this is where it got interesting. Their approach is manual and explicit. You mark specific points in your prompt with cache_control breakpoints. The system caches everything from the start of the prompt up to each breakpoint. On the next request, if the prefix up to a breakpoint is byte-for-byte identical, you get a cache hit.
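The structure follows the natural order of a prompt, from most stable to most variable:
First, the system prompt. This is the most stable part: your instructions, persona, rules. It almost never changes between requests.
Second, tool definitions. If you have tools configured, their descriptions are part of the stable prefix too. These also tend to be stable across a conversation.
Third, messages. The conversation history, oldest first. As the conversation grows, the older messages form a stable prefix.
Fourth, the latest user message. This changes every turn, so it is never a cache hit on the turn it first appears.
One caveat: Anthropic’s API assembles the cacheable prefix in the order tools, system, messages, regardless of how the fields are ordered in your JSON. A changed tool definition therefore invalidates the cached system prompt as well.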
Here’s what a well-structured Anthropic API request looks like with cache breakpoints:
{
  "model": "claude-sonnet-4-20250514",
  "max_tokens": 4096,
  "system": [
    {
      "type": "text",
      "text": "You are an AI assistant with access to many tools...",
      "cache_control": {"type": "ephemeral"}
    }
  ],
  "tools": [
    {"name": "search", "description": "...", "input_schema": {"...": "..."}},
    {
      "name": "write_file",
      "description": "...",
      "input_schema": {"...": "..."},
      "cache_control": {"type": "ephemeral"}
    }
  ],
  "messages": [
    {"role": "user", "content": "Find and fix the bug in auth.py"}
  ]
}
Notice: cache_control goes on the last item in each block — the last system text block, the last tool.
This ordering matters because caching is prefix-based. If you change something early — say you modify the system prompt — everything downstream is invalidated. That’s why you want the most stable content at the front and the most variable content at the end.
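The placement rule (breakpoint on the last item of each block) is easy to apply mechanically. Here is a small sketch that builds the same request body as a plain Python dict. It does not call any SDK. The `with_breakpoints` helper is hypothetical, and the `{"type": "object"}` schemas are placeholders standing in for the elided ones above.

```python
import copy

CACHE_MARK = {"type": "ephemeral"}

def with_breakpoints(system_blocks, tools):
    """Return copies of both lists with cache_control on the last item of each."""
    system_out = copy.deepcopy(system_blocks)
    tools_out = copy.deepcopy(tools)
    for items in (system_out, tools_out):
        # Drop stale marks so repeated calls never accumulate extra breakpoints
        for item in items:
            item.pop("cache_control", None)
        if items:
            items[-1]["cache_control"] = dict(CACHE_MARK)
    return system_out, tools_out

system = [{"type": "text", "text": "You are an AI assistant with access to many tools..."}]
tools = [
    # Placeholder schemas; real tools would describe their parameters here
    {"name": "search", "description": "...", "input_schema": {"type": "object"}},
    {"name": "write_file", "description": "...", "input_schema": {"type": "object"}},
]

system_marked, tools_marked = with_breakpoints(system, tools)
request = {
    "model": "claude-sonnet-4-20250514",
    "max_tokens": 4096,
    "system": system_marked,
    "tools": tools_marked,
    "messages": [{"role": "user", "content": "Find and fix the bug in auth.py"}],
}

# Only the last tool carries a breakpoint
print([t["name"] for t in request["tools"] if "cache_control" in t])
```

Working on copies keeps the original tool list free of markers, so you can reorder or extend it without hunting for a stale breakpoint in the middle.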
Anthropic’s pricing makes the economics clear: cache writes cost 25% more than normal input tokens, but cache reads are 90% cheaper. So you pay a small premium on the first call, then save dramatically on every subsequent call. At those rates the cached prefix breaks even on the second request (1.25 + 0.1 = 1.35 units of prefix cost versus 2.0 uncached), as long as that request arrives before the cache expires. From the third request on, cumulative savings on the prefix pass 50% and approach 90% as hits accumulate.
There are some constraints. A prompt must reach a minimum length to be cached: 1,024 tokens for Claude 3.5 Sonnet and Claude 3 Opus, or 2,048 for Haiku. You get up to 4 breakpoints per request. And the cache has a 5-minute TTL that refreshes on each hit, so as long as requests keep coming, the cache stays warm.
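The refresh-on-hit behavior is worth internalizing, because it means request spacing matters more than total session length. Here is a toy model of it. The `RefreshingCache` class is purely illustrative and says nothing about how the real cache is built. Timestamps are passed in explicitly so the example is deterministic.

```python
class RefreshingCache:
    """Toy model of a prefix cache whose TTL restarts on every hit."""

    def __init__(self, ttl_seconds=300):
        self.ttl = ttl_seconds
        self.expires_at = {}  # prefix key -> expiry timestamp in seconds

    def request(self, key, now):
        """Return 'hit' or 'write' for a request arriving at time `now`."""
        expiry = self.expires_at.get(key)
        outcome = "hit" if expiry is not None and now < expiry else "write"
        # Both a write and a hit (re)start the TTL window
        self.expires_at[key] = now + self.ttl
        return outcome

cache = RefreshingCache(ttl_seconds=300)
arrivals = [0, 200, 400, 600, 1000]  # seconds; the last gap is longer than 5 minutes
print([cache.request("system+tools", t) for t in arrivals])
# ['write', 'hit', 'hit', 'hit', 'write']
```

The first four requests span ten minutes yet stay warm, because no gap exceeds the TTL. The 400-second pause before the last one forces a fresh (surcharged) write.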
The latency improvement is significant too. Anthropic reports up to 85% reduction in time-to-first-token for long prompts. In agentic workflows where the model might make 5-10 tool calls in a row, each one reusing the same system prompt and tools, this is a real difference.
OpenAI followed in October 2024 [3] with a different philosophy: automatic caching. No breakpoints, no configuration. The system detects when the first 1,024 or more tokens of a prompt match a recent request and caches automatically, with the matched length growing in 128-token increments beyond that.
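One way to read the 1,024-token minimum and the 128-token increments is as a rounding rule on the length of the matching prefix. The sketch below encodes that reading. It is my interpretation of the rule as described here, not OpenAI's actual implementation, and `cacheable_tokens` is a hypothetical helper.

```python
MIN_CACHEABLE = 1024  # shortest prefix eligible for caching
INCREMENT = 128       # granularity of matches beyond the minimum

def cacheable_tokens(matching_prefix_tokens):
    """Tokens eligible for a cache hit, given how many leading tokens match."""
    if matching_prefix_tokens < MIN_CACHEABLE:
        return 0
    extra_steps = (matching_prefix_tokens - MIN_CACHEABLE) // INCREMENT
    return MIN_CACHEABLE + extra_steps * INCREMENT

for n in (1000, 1024, 1200, 2000):
    print(n, cacheable_tokens(n))
# 1000 0
# 1024 1024
# 1200 1152
# 2000 1920
```

Under this reading, a prefix that falls just short of 1,024 tokens gets no discount at all, which is a reason to check the token count of a system prompt that hovers near the threshold.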
The trade-off is different. OpenAI gives you a 50% discount on cache hits with no write surcharge. Less aggressive savings than Anthropic’s 90%, but also no upfront cost penalty. You just structure your prompts well and caching happens transparently.
OpenAI explicitly recommends the same prompt ordering — static content like system instructions and tool definitions at the beginning, variable content like user-specific data at the end. Same principle, just automated.
Practical Takeaways
The practical takeaway is about prompt architecture. Once you understand that caching is prefix-based, you start designing your prompts differently:
Keep your system prompt stable. Don’t inject dynamic data into it unless necessary. Version it carefully. Any change invalidates the cached prefix from that point onward.
Put tool definitions before messages. Tools change less frequently than conversation content. If your tools are deterministically ordered (same order every request), the prefix stays stable through the tools layer.
Append, don’t rewrite. For multi-turn conversations, always append new messages. Don’t restructure the history. The older messages form a stable prefix that gets cached.
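Two of these habits, deterministic tool ordering and append-only history, are easy to check in code. The sketch below renders a request into a list of strings and measures how much of the previous turn's rendering survives. Sorting tools by name and serializing with sorted keys are my own choices for getting a stable order, and the helper names are hypothetical.

```python
import json

def serialize_prefix(system_prompt, tools, messages):
    """Render a request as a list of strings, one per prefix element, in a stable order."""
    ordered_tools = sorted(tools, key=lambda tool: tool["name"])
    parts = [system_prompt]
    # sort_keys keeps the JSON identical regardless of dict insertion order
    parts += [json.dumps(tool, sort_keys=True) for tool in ordered_tools]
    parts += [json.dumps(msg, sort_keys=True) for msg in messages]
    return parts

def shared_prefix_len(a, b):
    """Number of leading elements that are identical in both renderings."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count

system_prompt = "You are a coding assistant."
tools_a = [{"name": "search", "description": "..."},
           {"name": "write_file", "description": "..."}]
tools_b = list(reversed(tools_a))  # same tools, different arrival order
history = [{"role": "user", "content": "Find the bug"},
           {"role": "assistant", "content": "Looking at auth.py"}]

turn1 = serialize_prefix(system_prompt, tools_a, history)
appended = serialize_prefix(
    system_prompt, tools_b, history + [{"role": "user", "content": "Now fix it"}])
rewritten = serialize_prefix(
    system_prompt, tools_b, [{"role": "user", "content": "Summary: bug found"}])

print(shared_prefix_len(turn1, appended))   # 5
print(shared_prefix_len(turn1, rewritten))  # 3
```

Appending keeps all five elements of the previous turn reusable, even though the tools arrived in a different order. Replacing the history with a summary stops the match at the tools, so every message is reprocessed.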
This is also why understanding the caching layer matters when you’re choosing between providers or optimizing costs. If you have long, stable system prompts with many tools (think: agentic applications), Anthropic’s 90% read discount is extremely aggressive. If you want zero-configuration simplicity and moderate savings, OpenAI’s automatic approach is easier to adopt. If you’re working with massive contexts (entire codebases, long documents), Google’s context caching with configurable TTL might be the right fit.
Most developers I talk to think about prompt engineering in terms of what to say. But how you structure the prompt — what goes where, what stays stable — is just as important for production systems. Caching turns prompt architecture into a cost and performance lever.
Are you using prompt caching in production? I’d love to hear how it’s affected your costs.
References:
[1] “Prompt Caching.” Anthropic.
[2] “Prompt Caching with Claude.” Anthropic Blog, Aug 2024.
[3] “Prompt Caching.” OpenAI.
[4] “API Prompt Caching.” OpenAI Blog, Oct 2024.
[5] “Context Caching.” Google Gemini.