Skip to content

Token Efficiency

Buddy AI gives you several layered techniques to reduce token usage, API costs, and latency — from provider-native prompt caching to automatic context compression. All techniques compose: you can enable any combination.


Technique 1 — Prompt Caching

The fastest, cheapest win. See the dedicated Prompt Caching guide.

# Works on both Claude and GPT-4o — one flag
agent = Agent(model=Claude(id="claude-opus-4-5"), cache_prompt=True)

Technique 2 — Proxy-Aware Caching

When your company routes LLM traffic through a proxy (LiteLLM, OpenRouter, Azure AI Gateway), Buddy automatically injects the right cache headers so the proxy's own caching layer fires too — even for Claude models served through an OpenAI-compatible endpoint.

from buddy.models.openai import OpenAIChat

# Claude behind a LiteLLM proxy
model = OpenAIChat(
    id="anthropic/claude-opus-4-5",
    base_url="http://0.0.0.0:4000/v1",
    api_key="sk-my-litellm-key",
    proxy_type="litellm",   # injects {"cache": {"no-cache": False}}
    cache_prompt=True,
)

# Any model behind OpenRouter
model = OpenAIChat(
    id="anthropic/claude-opus-4-5",
    base_url="https://openrouter.ai/api/v1",
    api_key="sk-or-v1-...",
    proxy_type="openrouter",  # injects X-OpenRouter-Cache header
    cache_prompt=True,
)

# Azure AI Management
model = OpenAIChat(
    id="gpt-4o",
    base_url="https://my-gateway.azure-api.net/openai",
    proxy_type="azure",     # injects x-ms-cache-enabled header
    cache_prompt=True,
)

Supported proxy_type values:

Value Cache mechanism Reference
"litellm" Redis / in-memory exact-match cache LiteLLM caching docs
"openrouter" OpenRouter context caching OpenRouter docs
"azure" Azure API Management cache policy Azure APIM
None (default) Native OpenAI server-side cache Automatic

Direct Claude key? Use the Claude model directly — it sends explicit cache_control blocks to Anthropic and is more efficient than going through a proxy.


Technique 3 — Token Budget Manager

Set a hard token ceiling on every request. Buddy counts tokens before sending, warns you as you approach the limit, and automatically drops the oldest history turns to keep requests within budget.

from buddy.agent import Agent
from buddy.models.openai import OpenAIChat
from buddy.models.token_budget import TokenBudgetConfig

agent = Agent(
    model=OpenAIChat(id="gpt-4o"),
    add_history_to_messages=True,
    num_history_runs=20,             # keep lots of history...
    token_budget=TokenBudgetConfig(
        max_input_tokens=100_000,    # ...but never exceed this
        target_input_tokens=80_000,  # trim until we're here (default: 90 %)
        auto_compress=True,          # drop oldest turns automatically
        warn_at=0.85,                # log a warning at 85 % usage
        count_method="approx",       # "approx" (fast) or "tiktoken" (exact)
        max_tool_result_tokens=2_000,# also trim oversized tool outputs
    ),
)

Reading budget metrics

response = agent.run("...")
budget = (response.extra_data or {}).get("token_budget", {})
print("Estimated input tokens:", budget.get("estimated_tokens"))
print("After compression:     ", budget.get("tokens_after"))
print("Turns dropped:         ", budget.get("turns_dropped"))
print("Tokens trimmed:        ", budget.get("tokens_trimmed"))

TokenBudgetConfig reference

Field Default Description
max_input_tokens 100_000 Hard ceiling on input tokens per request
target_input_tokens 90 % of max Compress history down to this level
auto_compress True Drop oldest history turns automatically
warn_at 0.85 Warn fraction (0–1); None to silence
count_method "approx" "approx" (len//4) or "tiktoken"
max_tool_result_tokens 2_000 Cap per tool result; None to disable

Technique 4 — Automatic Tool-Result Trimming

Long tool outputs (web search results, file reads, API responses) can cost thousands of tokens per call. When a TokenBudgetConfig is set, Buddy trims any tool result that exceeds max_tool_result_tokens before appending it to the message list, and adds a brief notice so the model knows the output was cut.

# Aggressive trimming — useful for search/scrape tools
token_budget=TokenBudgetConfig(
    max_input_tokens=128_000,
    max_tool_result_tokens=1_000,   # ~4 KB per tool call
)

Tool results that fit under the limit are passed through unchanged.


Technique 5 — History Truncation (num_history_runs)

The simplest dial: num_history_runs (default 3) controls how many past conversation turns are appended to each prompt. Combine with token budget for best results:

agent = Agent(
    model=...,
    add_history_to_messages=True,
    num_history_runs=5,              # include last 5 turns
    token_budget=TokenBudgetConfig(
        max_input_tokens=60_000,
        auto_compress=True,          # drop turns if 5 is still too many
    ),
)

Technique 6 — Session Summaries (instead of raw history)

For very long sessions, Buddy can summarize past context instead of including raw messages. This keeps the history compact while preserving meaning:

agent = Agent(
    model=...,
    enable_session_summaries=True,   # generate a rolling summary
    add_history_to_messages=False,   # don't include raw turns
    add_session_summary_references=True,  # inject the summary instead
)

Combining all techniques

from buddy.agent import Agent
from buddy.models.anthropic import Claude
from buddy.models.token_budget import TokenBudgetConfig

agent = Agent(
    model=Claude(id="claude-opus-4-5"),

    # --- Prompt caching (saves ~90 % on repeated system/tool prefix) ---
    cache_prompt=True,

    # --- History ---
    add_history_to_messages=True,
    num_history_runs=10,

    # --- Token budget (auto-compress + tool trimming) ---
    token_budget=TokenBudgetConfig(
        max_input_tokens=180_000,    # Claude's 200k window - output reserve
        target_input_tokens=140_000,
        auto_compress=True,
        warn_at=0.80,
        max_tool_result_tokens=1_500,
    ),
)

Proxy + Caching decision tree

Using Claude?
├── Direct Anthropic key
│   └── Use Claude model + cache_prompt=True  ← BEST: explicit cache_control
└── Through a proxy
    ├── LiteLLM  → OpenAIChat(proxy_type="litellm", cache_prompt=True)
    ├── OpenRouter → OpenAIChat(proxy_type="openrouter", cache_prompt=True)
    └── Azure    → OpenAIChat(proxy_type="azure", cache_prompt=True)

Using GPT-4o?
└── OpenAI server-side auto-caching (no flag needed)
    └── cache_prompt=True surfaces metrics in RunResponse

Cost impact at a glance

Technique Typical saving Best for
Prompt caching (Anthropic) 60–90 % of prompt cost Long system prompts, big tool lists
Prompt caching (OpenAI auto) 50 % of cached prefix Repeated prompts / same user
Proxy caching Full response reuse Identical queries
Token budget compression Prevents runaway costs Long multi-turn sessions
Tool-result trimming 30–80 % of tool tokens Search, scraping, large API responses
Session summaries 70–95 % of history tokens Very long (50+ turn) sessions