Token Efficiency
Buddy AI gives you several layered techniques to reduce token usage, API costs, and latency — from provider-native prompt caching to automatic context compression. All techniques compose: you can enable any combination.
Technique 1 — Prompt Caching
The fastest, cheapest win. See the dedicated Prompt Caching guide.
# Works on both Claude and GPT-4o — one flag
agent = Agent(model=Claude(id="claude-opus-4-5"), cache_prompt=True)
Technique 2 — Proxy-Aware Caching
When your company routes LLM traffic through a proxy (LiteLLM, OpenRouter, Azure AI Gateway), Buddy automatically injects the right cache headers so the proxy's own caching layer fires too — even for Claude models served through an OpenAI-compatible endpoint.
from buddy.models.openai import OpenAIChat
# Claude behind a LiteLLM proxy
model = OpenAIChat(
id="anthropic/claude-opus-4-5",
base_url="http://0.0.0.0:4000/v1",
api_key="sk-my-litellm-key",
proxy_type="litellm", # injects {"cache": {"no-cache": False}}
cache_prompt=True,
)
# Any model behind OpenRouter
model = OpenAIChat(
id="anthropic/claude-opus-4-5",
base_url="https://openrouter.ai/api/v1",
api_key="sk-or-v1-...",
proxy_type="openrouter", # injects X-OpenRouter-Cache header
cache_prompt=True,
)
# Azure AI Management
model = OpenAIChat(
id="gpt-4o",
base_url="https://my-gateway.azure-api.net/openai",
proxy_type="azure", # injects x-ms-cache-enabled header
cache_prompt=True,
)
Supported proxy_type values:
| Value | Cache mechanism | Reference |
|---|---|---|
"litellm" |
Redis / in-memory exact-match cache | LiteLLM caching docs |
"openrouter" |
OpenRouter context caching | OpenRouter docs |
"azure" |
Azure API Management cache policy | Azure APIM |
None (default) |
Native OpenAI server-side cache | Automatic |
Direct Claude key? Use the
Claudemodel directly — it sends explicitcache_controlblocks to Anthropic and is more efficient than going through a proxy.
Technique 3 — Token Budget Manager
Set a hard token ceiling on every request. Buddy counts tokens before sending, warns you as you approach the limit, and automatically drops the oldest history turns to keep requests within budget.
from buddy.agent import Agent
from buddy.models.openai import OpenAIChat
from buddy.models.token_budget import TokenBudgetConfig
agent = Agent(
model=OpenAIChat(id="gpt-4o"),
add_history_to_messages=True,
num_history_runs=20, # keep lots of history...
token_budget=TokenBudgetConfig(
max_input_tokens=100_000, # ...but never exceed this
target_input_tokens=80_000, # trim until we're here (default: 90 %)
auto_compress=True, # drop oldest turns automatically
warn_at=0.85, # log a warning at 85 % usage
count_method="approx", # "approx" (fast) or "tiktoken" (exact)
max_tool_result_tokens=2_000,# also trim oversized tool outputs
),
)
Reading budget metrics
response = agent.run("...")
budget = (response.extra_data or {}).get("token_budget", {})
print("Estimated input tokens:", budget.get("estimated_tokens"))
print("After compression: ", budget.get("tokens_after"))
print("Turns dropped: ", budget.get("turns_dropped"))
print("Tokens trimmed: ", budget.get("tokens_trimmed"))
TokenBudgetConfig reference
| Field | Default | Description |
|---|---|---|
max_input_tokens |
100_000 |
Hard ceiling on input tokens per request |
target_input_tokens |
90 % of max | Compress history down to this level |
auto_compress |
True |
Drop oldest history turns automatically |
warn_at |
0.85 |
Warn fraction (0–1); None to silence |
count_method |
"approx" |
"approx" (len//4) or "tiktoken" |
max_tool_result_tokens |
2_000 |
Cap per tool result; None to disable |
Technique 4 — Automatic Tool-Result Trimming
Long tool outputs (web search results, file reads, API responses) can cost
thousands of tokens per call. When a TokenBudgetConfig is set, Buddy trims
any tool result that exceeds max_tool_result_tokens before appending it to
the message list, and adds a brief notice so the model knows the output was cut.
# Aggressive trimming — useful for search/scrape tools
token_budget=TokenBudgetConfig(
max_input_tokens=128_000,
max_tool_result_tokens=1_000, # ~4 KB per tool call
)
Tool results that fit under the limit are passed through unchanged.
Technique 5 — History Truncation (num_history_runs)
The simplest dial: num_history_runs (default 3) controls how many past
conversation turns are appended to each prompt. Combine with token budget for
best results:
agent = Agent(
model=...,
add_history_to_messages=True,
num_history_runs=5, # include last 5 turns
token_budget=TokenBudgetConfig(
max_input_tokens=60_000,
auto_compress=True, # drop turns if 5 is still too many
),
)
Technique 6 — Session Summaries (instead of raw history)
For very long sessions, Buddy can summarize past context instead of including raw messages. This keeps the history compact while preserving meaning:
agent = Agent(
model=...,
enable_session_summaries=True, # generate a rolling summary
add_history_to_messages=False, # don't include raw turns
add_session_summary_references=True, # inject the summary instead
)
Combining all techniques
from buddy.agent import Agent
from buddy.models.anthropic import Claude
from buddy.models.token_budget import TokenBudgetConfig
agent = Agent(
model=Claude(id="claude-opus-4-5"),
# --- Prompt caching (saves ~90 % on repeated system/tool prefix) ---
cache_prompt=True,
# --- History ---
add_history_to_messages=True,
num_history_runs=10,
# --- Token budget (auto-compress + tool trimming) ---
token_budget=TokenBudgetConfig(
max_input_tokens=180_000, # Claude's 200k window - output reserve
target_input_tokens=140_000,
auto_compress=True,
warn_at=0.80,
max_tool_result_tokens=1_500,
),
)
Proxy + Caching decision tree
Using Claude?
├── Direct Anthropic key
│ └── Use Claude model + cache_prompt=True ← BEST: explicit cache_control
└── Through a proxy
├── LiteLLM → OpenAIChat(proxy_type="litellm", cache_prompt=True)
├── OpenRouter → OpenAIChat(proxy_type="openrouter", cache_prompt=True)
└── Azure → OpenAIChat(proxy_type="azure", cache_prompt=True)
Using GPT-4o?
└── OpenAI server-side auto-caching (no flag needed)
└── cache_prompt=True surfaces metrics in RunResponse
Cost impact at a glance
| Technique | Typical saving | Best for |
|---|---|---|
| Prompt caching (Anthropic) | 60–90 % of prompt cost | Long system prompts, big tool lists |
| Prompt caching (OpenAI auto) | 50 % of cached prefix | Repeated prompts / same user |
| Proxy caching | Full response reuse | Identical queries |
| Token budget compression | Prevents runaway costs | Long multi-turn sessions |
| Tool-result trimming | 30–80 % of tool tokens | Search, scraping, large API responses |
| Session summaries | 70–95 % of history tokens | Very long (50+ turn) sessions |