Skip to content

Performance

Practical knobs for making agents faster and cheaper. Every option below maps to a real parameter in the buddy package.

Pick the right model

Model choice dominates both latency and cost. Use a small, fast model for simple tasks and reserve large models for hard ones.

from buddy.models.openai import OpenAIChat

fast = OpenAIChat(id="gpt-4o-mini")   # quick, cheap
strong = OpenAIChat(id="gpt-4o")       # slower, more capable

For zero-network-latency development, run locally with Ollama (from buddy.models.ollama import Ollama).

Limit conversation history

When history is enabled, only the last num_history_runs runs are included (default 3). Fewer runs means smaller prompts and lower cost.

agent = Agent(
    model=OpenAIChat(id="gpt-4o-mini"),
    add_history_to_messages=True,
    num_history_runs=2,   # keep prompts small
)

History is off by default

add_history_to_messages defaults to False. Only enable it (and keep num_history_runs low) when the task actually needs prior turns.

Cap tool calls

Bound how many tools a single run may invoke with tool_call_limit. This prevents runaway loops and keeps latency predictable.

agent = Agent(
    model=OpenAIChat(id="gpt-4o"),
    tools=[...],
    tool_call_limit=3,
)

Stream for perceived latency

Streaming does not make generation faster, but it shows tokens as they arrive, which dramatically improves perceived responsiveness in UIs.

for event in agent.run("Write a long explanation.", stream=True):
    if event.content:
        print(event.content, end="", flush=True)

Go async for concurrency

Use arun() to run many agents concurrently instead of serially — ideal for batch processing or serving multiple requests.

import asyncio

async def main():
    results = await asyncio.gather(
        agent.arun("Question 1"),
        agent.arun("Question 2"),
        agent.arun("Question 3"),
    )
    for r in results:
        print(r.content)

asyncio.run(main())

Reuse sessions and knowledge

Knob Where Effect
cache_session Agent(cache_session=True) (default) Keeps the session in memory to avoid re-reading from storage each run.
num_documents AgentKnowledge(num_documents=5) Fewer retrieved chunks → smaller prompts. Lower it for speed.
knowledge.load(recreate=False) knowledge base Reuse an existing index instead of rebuilding it every run.
from buddy.knowledge.pdf import PDFKnowledgeBase
from buddy.vectordb.chroma import ChromaDb

kb = PDFKnowledgeBase(
    path="docs/",
    vector_db=ChromaDb(collection="docs", persistent_client=True),
    num_documents=3,   # retrieve fewer, more-relevant chunks
)
kb.load(recreate=False)  # build once, reuse afterwards

Measure before optimizing

Read RunResponse.metrics to see token usage and timing per run, then tune the knobs above where they matter most. See Debugging.