Documents & Chunking
Before text can be embedded, it must be read into Document objects and
chunked into passages small enough to embed and retrieve. Both steps live in
buddy.document.
Pipeline position
reader → chunking strategy → embedder → vector db. Readers produce
Documents; chunking strategies split each Document into a list of smaller
Documents. If no strategy is set, the reader falls back to
FixedSizeChunking.
Readers
A reader subclasses Reader (buddy.document.reader.base) — a dataclass with
chunk=True, chunk_size=5000, a list of separators, and an optional
chunking_strategy. Each reader implements read() / async_read() and returns
List[Document].
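The reader contract and the fixed-size fallback can be sketched with standard-library stand-ins. Nothing below is imported from buddy: `Doc`, `FixedWindows`, `StringReader`, and the `read(name, text)` signature are hypothetical names chosen for illustration, and the real `Reader` may differ in its details.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Doc:
    """Hypothetical stand-in for a Document: a name plus its text."""
    name: str
    content: str


class FixedWindows:
    """Hypothetical stand-in for a fixed-size chunking strategy."""

    def __init__(self, chunk_size: int = 5000) -> None:
        self.chunk_size = chunk_size

    def chunk(self, document: Doc) -> List[Doc]:
        text = document.content
        # Slice the text into consecutive windows of chunk_size characters.
        return [
            Doc(name=f"{document.name}_{i // self.chunk_size + 1}",
                content=text[i:i + self.chunk_size])
            for i in range(0, len(text), self.chunk_size)
        ]


@dataclass
class StringReader:
    """Hypothetical reader carrying the fields described above."""
    chunk: bool = True
    chunk_size: int = 5000
    # Placeholder only: the real default separators are not listed here.
    separators: List[str] = field(default_factory=list)
    chunking_strategy: Optional[FixedWindows] = None

    def read(self, name: str, text: str) -> List[Doc]:
        document = Doc(name=name, content=text)
        if not self.chunk:
            return [document]
        # Fall back to fixed-size windows when no strategy was supplied.
        strategy = self.chunking_strategy or FixedWindows(self.chunk_size)
        return strategy.chunk(document)
```

With `StringReader(chunk_size=10)`, a 25-character input comes back as three documents of 10, 10, and 5 characters; with `chunk=False` the input comes back as a single document.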
| Reader | Module | Reads |
|---|---|---|
| PDFReader, PDFImageReader | pdf_reader | Local PDF files |
| PDFUrlReader, PDFUrlImageReader | pdf_reader | PDFs from a URL |
| TextReader | text_reader | Plain text files |
| CSVReader, CSVUrlReader | csv_reader | CSV files / CSV URLs |
| DocxReader | docx_reader | Word .doc / .docx |
| JSONReader | json_reader | JSON files |
| MarkdownReader | markdown_reader | Markdown files |
| URLReader | url_reader | Web pages by URL |
| WebsiteReader | website_reader | Crawled websites |
| ArxivReader | arxiv_reader | arXiv search results |
| YouTubeReader | youtube_reader | YouTube transcripts |
| FirecrawlReader | firecrawl_reader | Firecrawl scrapes |
Most of the time you do not instantiate a reader directly — each knowledge base ships with a sensible default reader (see Knowledge sources).
Chunking strategies
All strategies subclass ChunkingStrategy
(buddy.document.chunking.strategy) and implement
chunk(document) -> List[Document].
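As a rough illustration of the `chunk(document) -> List[Document]` contract, the sketch below defines a stand-in base class and one subclass that groups paragraphs. `Doc`, `Strategy`, and `ParagraphGrouping` are hypothetical names, not buddy classes, and the grouping logic is an assumption about how such a strategy could work rather than the library's implementation.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List


@dataclass
class Doc:
    """Hypothetical stand-in for a Document."""
    content: str


class Strategy(ABC):
    """Hypothetical stand-in for the ChunkingStrategy base class."""

    @abstractmethod
    def chunk(self, document: Doc) -> List[Doc]:
        ...


class ParagraphGrouping(Strategy):
    """Pack whole paragraphs into chunks, aiming for chunk_size characters.

    A single paragraph longer than chunk_size is kept whole, not split.
    """

    def __init__(self, chunk_size: int = 5000) -> None:
        self.chunk_size = chunk_size

    def chunk(self, document: Doc) -> List[Doc]:
        paragraphs = [p.strip() for p in document.content.split("\n\n") if p.strip()]
        chunks: List[Doc] = []
        current = ""
        for paragraph in paragraphs:
            candidate = f"{current}\n\n{paragraph}" if current else paragraph
            if current and len(candidate) > self.chunk_size:
                # Adding this paragraph would overflow: close the current chunk.
                chunks.append(Doc(current))
                current = paragraph
            else:
                current = candidate
        if current:
            chunks.append(Doc(current))
        return chunks
```

Because every strategy exposes the same single method, a reader can swap one for another without changing how it calls them.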
| Strategy | Class | Key parameters | Behavior |
|---|---|---|---|
| Fixed | FixedSizeChunking | chunk_size=5000, overlap=0 | Fixed-size windows, avoids splitting mid-word |
| Recursive | RecursiveChunking | chunk_size=5000, overlap=0 | Splits at natural breakpoints (\n, .) |
| Document | DocumentChunking | chunk_size=5000, overlap=0 | Groups whole paragraphs (\n\n) up to the size limit |
| Markdown | MarkdownChunking | chunk_size=5000, overlap=0 | Structure-aware split by headers/sections |
| Semantic | SemanticChunking | embedder, chunk_size=5000, similarity_threshold=0.5 | Embedding-similarity boundaries |
| Agentic | AgenticChunking | model, max_chunk_size=5000 | An LLM picks breakpoints |
| Row | RowChunking | skip_header=False, clean_rows=True | One chunk per line/row (tabular data) |
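To show how chunk_size and overlap interact, here is a minimal sketch of fixed-size windowing that backs up to the previous space instead of cutting a word. The function name `fixed_size_chunks` and its exact boundary rules are assumptions for illustration; FixedSizeChunking itself may handle edge cases differently.

```python
from typing import List


def fixed_size_chunks(text: str, chunk_size: int = 5000, overlap: int = 0) -> List[str]:
    """Split text into windows of at most chunk_size characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    chunks: List[str] = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        # If the window would end inside a word, back up to the last space.
        if end < len(text) and text[end] != " ":
            cut = text.rfind(" ", start, end)
            if cut > start:
                end = cut
        chunks.append(text[start:end])
        if end == len(text):
            break
        # Step back by `overlap` characters, but always move forward.
        start = max(end - overlap, start + 1)
    return chunks
```

With overlap=0 the chunks concatenate back to the original text. With overlap=2 each chunk after the first begins with the last two characters of the one before it, which helps retrieval when a relevant sentence straddles a boundary.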
Optional dependencies
SemanticChunking requires chonkie (pip install chonkie) and
MarkdownChunking requires unstructured (pip install unstructured markdown).
Both raise an ImportError on import if the dependency is missing.
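Since the ImportError surfaces at import time, it can be useful to check for an optional dependency before choosing a strategy. The sketch below uses only the standard library's importlib; `is_available` and `require` are hypothetical helper names, not part of buddy.

```python
import importlib
import importlib.util
from types import ModuleType


def is_available(module_name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


def require(module_name: str, install_hint: str) -> ModuleType:
    """Import a module, or raise ImportError that includes an install hint."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"`{module_name}` is not installed. Install it with `{install_hint}`."
        ) from exc
```

For example, `is_available("chonkie")` could gate whether an application offers SemanticChunking at all, falling back to a dependency-free strategy otherwise.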
Choosing a strategy
from buddy.knowledge.text import TextKnowledgeBase
from buddy.document.chunking.recursive import RecursiveChunking
from buddy.vectordb.chroma import ChromaDb

knowledge = TextKnowledgeBase(
    path="notes/",
    vector_db=ChromaDb(collection="notes"),
    chunking_strategy=RecursiveChunking(chunk_size=1200, overlap=100),
)
knowledge.load()
Setting chunking_strategy on the knowledge base propagates it to the reader (an
AgentKnowledge validator wires the reader's chunking_strategy if it is unset).
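The propagation rule can be sketched with plain dataclasses. This is an approximation: it uses `__post_init__` in place of the actual validator, reads "if it is unset" as referring to the reader's own strategy, and `SketchReader` and `SketchKnowledgeBase` are hypothetical stand-ins, not the AgentKnowledge code.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SketchReader:
    """Hypothetical reader; the strategy is a plain string for brevity."""
    chunking_strategy: Optional[str] = None


@dataclass
class SketchKnowledgeBase:
    """Hypothetical knowledge base that owns a reader."""
    reader: SketchReader = field(default_factory=SketchReader)
    chunking_strategy: Optional[str] = None

    def __post_init__(self) -> None:
        # Only fill the reader's strategy when the reader has none of its own.
        if self.chunking_strategy is not None and self.reader.chunking_strategy is None:
            self.reader.chunking_strategy = self.chunking_strategy
```

Under this reading, a strategy set explicitly on a reader takes precedence over the one set on the knowledge base.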
Practical guidance
- Prose / mixed docs: RecursiveChunking or DocumentChunking keep sentences and paragraphs intact.
- Markdown: MarkdownChunking preserves headers and sections.
- CSV / logs: RowChunking keeps each record addressable.
- Highest retrieval quality: SemanticChunking groups text by meaning, at the cost of extra embedding calls.
- Maximum control over cost: FixedSizeChunking (the default) is deterministic and dependency-free.
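To make "each record addressable" concrete for CSV and log data, the sketch below emits one chunk per line along with its line number. The function `row_chunks` is hypothetical, and the meaning given to clean_rows here (collapse whitespace, drop empty rows) is an assumption; RowChunking may define it differently.

```python
from typing import List, Tuple


def row_chunks(
    text: str, skip_header: bool = False, clean_rows: bool = True
) -> List[Tuple[int, str]]:
    """Return (line_number, row) pairs, one per record."""
    rows: List[Tuple[int, str]] = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if skip_header and line_number == 1:
            continue
        if clean_rows:
            # Assumed cleaning: collapse runs of whitespace, skip blank rows.
            line = " ".join(line.split())
            if not line:
                continue
        rows.append((line_number, line))
    return rows
```

Keeping the original line number with each chunk means a retrieved row can be traced back to its exact position in the source file.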