Prompt Caching
Prompt caching dramatically reduces cost and latency for agents that reuse the same large context across multiple turns — a long system prompt, a big tool list, or a growing conversation history.
Buddy AI builds caching in at the model layer, so you activate it with a single flag rather than manually inserting provider-specific headers.
Quick start
from buddy.agent import Agent
from buddy.models.anthropic import Claude
from buddy.models.openai import OpenAIChat
# Anthropic — explicit cache breakpoints injected automatically
agent = Agent(
model=Claude(id="claude-opus-4-5"),
cache_prompt=True,
)
# OpenAI — server-side automatic caching (no flag needed for caching itself,
# but setting cache_prompt=True surfaces cache hit/miss metrics)
agent = Agent(
model=OpenAIChat(id="gpt-4o"),
cache_prompt=True,
)
That's it. No changes to your prompts, tools, or run loop.
How it works by provider
Anthropic (Claude)
Anthropic's Prompt Caching API lets you mark up to 4 cache breakpoints per request. Buddy automatically places them on the segments that change least often, in priority order:
| Breakpoint | Flag | Default |
|---|---|---|
| System prompt | cache_system |
✅ enabled |
| Tools list | cache_tools |
✅ enabled |
| Stable conversation history | cache_history |
✅ enabled |
A 5-minute ephemeral TTL is used by default. You can extend it to 1 hour:
from buddy.models.cache import PromptCacheConfig
agent = Agent(
model=Claude(id="claude-opus-4-5"),
cache_prompt=True,
model=Claude(
id="claude-opus-4-5",
prompt_cache_config=PromptCacheConfig(
enabled=True,
cache_system=True,
cache_tools=True,
cache_history=True,
ttl="1h", # requires claude-3-5-sonnet-20241022+
),
),
)
Minimum cacheable prefix: 1 024 tokens for Sonnet; 2 048 for Haiku. Short prompts won't be cached even with the flag set — Anthropic silently ignores breakpoints that fall below the minimum.
Cost savings
Anthropic charges ~10 % of the normal input-token price for a cache hit and ~125 % for the one-time cache write. On a 10 000-token system prompt with 20 turns you save roughly 90 % of prompt-token costs after the first turn.
OpenAI (GPT-4o, GPT-4o-mini, o-series)
OpenAI caches the longest matching prefix of any repeated request automatically — no client-side header required. The minimum cacheable prefix is 1 024 tokens and the cache key is based on exact token sequence.
Setting cache_prompt=True on an OpenAI agent:
- Does not change the API request (no
cache_controlheaders are sent). - Ensures
cached_tokensfrom the response usage details are captured inRunResponse.metricsandsession_metrics.
Tips for maximising OpenAI cache hits:
- Keep the system prompt and tool list stable across turns.
- Use a fixed
seedvalue to improve prefix consistency. - Minimise changes to early messages in the conversation history.
Fine-grained control with PromptCacheConfig
For cases where you only want to cache some segments, pass a
PromptCacheConfig directly to the model:
from buddy.models.cache import PromptCacheConfig
from buddy.models.anthropic import Claude
model = Claude(
id="claude-3-5-sonnet-20241022",
prompt_cache_config=PromptCacheConfig(
enabled=True,
cache_system=True, # cache the (often large) system prompt
cache_tools=False, # tools change per request — don't cache
cache_history=True, # cache stable history prefix
ttl="ephemeral", # "ephemeral" (5 min) or "1h"
),
)
agent = Agent(model=model, cache_prompt=True)
Reading cache metrics
Cache token counts are available in RunResponse.metrics:
response = agent.run("Summarise the document.")
metrics = response.metrics
print("Input tokens :", sum(metrics.get("input_tokens", [])))
print("Cache write tokens:", sum(metrics.get("cache_write_tokens", [])))
print("Cache read tokens :", sum(metrics.get("cached_tokens", [])))
Or on the cumulative session_metrics after multiple turns:
sm = agent.session_metrics
print(f"Total cached tokens this session: {sm.cached_tokens}")
print(f"Total cache writes this session : {sm.cache_write_tokens}")
Example: long multi-turn conversation
from buddy.agent import Agent
from buddy.models.anthropic import Claude
SYSTEM = """
You are an expert legal assistant. The following is the full text of the
contract you must analyse (50 pages, ~40 000 tokens):
[... full contract text ...]
"""
agent = Agent(
model=Claude(id="claude-opus-4-5"),
system_prompt=SYSTEM,
cache_prompt=True, # system + tools + history all cached
add_history_to_messages=True,
num_history_runs=10,
)
for question in [
"What are the termination clauses?",
"Summarise the payment terms.",
"Are there any non-compete provisions?",
]:
r = agent.run(question)
print(r.content)
# From turn 2 onward the 40 000-token system prompt is served from cache
# at ~10 % of the normal input price.
Compatibility matrix
| Provider | Automatic server-side | Explicit breakpoints | 1-hour TTL |
|---|---|---|---|
| Anthropic (Claude 3+) | ❌ | ✅ | ✅ (Sonnet 20241022+) |
| OpenAI (GPT-4o+) | ✅ | ❌ | N/A |
| AWS Bedrock (Claude) | ❌ | ✅ (inherits) | ✅ |
| Google Gemini | automatic context caching | — | session-based |
| Others | — | — | — |
FAQ
Does enabling cache_prompt change the model's output?
No. Cache hits return identical outputs to uncached calls — the cache stores
the processed KV representation of the tokens, not a generated response.
Will it slow down the first request? The very first request incurs a small cache write overhead (~125 % of normal input price on Anthropic). From the second request onward you save ~90 %.
What happens if the context changes between turns? Anthropic matches the longest unchanged prefix. If you add a new user message, the system + tools + history before it are still served from cache; only the new tokens are processed fresh.
Can I use cache_prompt in a streaming response?
Yes — invoke_stream / ainvoke_stream apply the same breakpoints.