Skip to content

Documents & Chunking

Before text can be embedded, it must be read into Document objects and chunked into passages small enough to embed and retrieve. Both steps live in buddy.document.

Pipeline position

reader → chunking strategy → embedder → vector db. Readers produce Documents; chunking strategies split each Document into a list of smaller Documents. If no strategy is set, the reader falls back to FixedSizeChunking.

Readers

A reader subclasses Reader (buddy.document.reader.base) — a dataclass with chunk=True, chunk_size=5000, a list of separators, and an optional chunking_strategy. Each reader implements read() / async_read() and returns List[Document].

Reader Module Reads
PDFReader, PDFImageReader pdf_reader Local PDF files
PDFUrlReader, PDFUrlImageReader pdf_reader PDFs from a URL
TextReader text_reader Plain text files
CSVReader, CSVUrlReader csv_reader CSV files / CSV URLs
DocxReader docx_reader Word .doc / .docx
JSONReader json_reader JSON files
MarkdownReader markdown_reader Markdown files
URLReader url_reader Web pages by URL
WebsiteReader website_reader Crawled websites
ArxivReader arxiv_reader arXiv search results
YouTubeReader youtube_reader YouTube transcripts
FirecrawlReader firecrawl_reader Firecrawl scrapes
from buddy.document.reader import Reader  # base class

Most of the time you do not instantiate a reader directly — each knowledge base ships with a sensible default reader (see Knowledge sources).

Chunking strategies

All strategies subclass ChunkingStrategy (buddy.document.chunking.strategy) and implement chunk(document) -> List[Document].

Strategy Class Key parameters Behavior
Fixed FixedSizeChunking chunk_size=5000, overlap=0 Fixed-size windows, avoids splitting mid-word
Recursive RecursiveChunking chunk_size=5000, overlap=0 Splits at natural breakpoints (\n, .)
Document DocumentChunking chunk_size=5000, overlap=0 Groups whole paragraphs (\n\n) up to the size limit
Markdown MarkdownChunking chunk_size=5000, overlap=0 Structure-aware split by headers/sections
Semantic SemanticChunking embedder, chunk_size=5000, similarity_threshold=0.5 Embedding-similarity boundaries
Agentic AgenticChunking model, max_chunk_size=5000 An LLM picks breakpoints
Row RowChunking skip_header=False, clean_rows=True One chunk per line/row (tabular data)

Optional dependencies

SemanticChunking requires chonkie (pip install chonkie) and MarkdownChunking requires unstructured (pip install unstructured markdown). Both raise an ImportError on import if the dependency is missing.

Choosing a strategy

from buddy.knowledge.text import TextKnowledgeBase
from buddy.document.chunking.recursive import RecursiveChunking
from buddy.vectordb.chroma import ChromaDb

knowledge = TextKnowledgeBase(
    path="notes/",
    vector_db=ChromaDb(collection="notes"),
    chunking_strategy=RecursiveChunking(chunk_size=1200, overlap=100),
)
knowledge.load()

Setting chunking_strategy on the knowledge base propagates it to the reader (an AgentKnowledge validator wires the reader's chunking_strategy if it is unset).

Practical guidance

  • Prose / mixed docsRecursiveChunking or DocumentChunking keep sentences and paragraphs intact.
  • MarkdownMarkdownChunking preserves headers and sections.
  • CSV / logsRowChunking keeps each record addressable.
  • Highest retrieval qualitySemanticChunking groups text by meaning, at the cost of extra embedding calls.
  • Maximum control over costFixedSizeChunking (the default) is deterministic and dependency-free.