Data Preparation

Before training, your raw files are turned into clean text chunks by DataProcessor (buddy.train.data_processor). You rarely call it directly — train_model runs it for you — but understanding what it accepts helps you assemble a good dataset.

Just point it at a folder

DataProcessor traverses a directory recursively and tries to read every file as text. There is no required format or schema: drop your .txt, .md, .csv, .json, .pdf, and .docx files into a folder and train on it.

What it accepts

DataProcessor.process_directory(path) walks the directory and, for each file:

Skips hidden files, files over 100 MB, and known binary formats (images, archives, executables, etc.) detected via magic-number signatures.
Extracts text using a specialized reader when available:
- PDF via pdfplumber, falling back to PyPDF2
- DOCX via python-docx (paragraphs and tables)
Otherwise reads the file as text, auto-detecting the encoding.

Encoding detection

Text files don't need to be UTF-8. The processor first tries chardet, then falls back through a long list of encodings (utf-8, utf-16, latin-1, cp1252, shift_jis, gb2312, big5, and more), and finally decodes with errors="ignore" as a last resort. Files it genuinely cannot read are skipped.

How it cleans and chunks

Each extracted text is normalized by _clean_text:

Control characters are stripped.
Repeated spaces/tabs collapse to a single space.
Three-or-more blank lines collapse to a double newline.
Whitespace-only lines are removed.

Texts shorter than min_text_length (default 10 characters) are dropped. Longer texts are split on word boundaries into chunks no larger than max_text_length (default 10000 characters).

from buddy.train import DataProcessor

processor = DataProcessor(min_text_length=10, max_text_length=10000)
data = processor.process_directory("/path/to/data")

print(f"Texts: {len(data.texts)}")
print(f"Stats: {data.stats}")

The `ProcessedData` result

process_directory returns a ProcessedData dataclass with three fields:

Field	Type	Contents
`texts`	`List[str]`	The cleaned, chunked text used for training
`metadata`	`List[Dict]`	Per-chunk info: source file, file type, encoding, chunk index, char/word counts
`stats`	`Dict`	Totals: `processed_files`, `skipped_files`, `total_characters`, `avg_text_length`, plus encoding/file-type distributions

Saving and reloading

You can persist processed data to JSON and load it later — useful for reproducible runs or inspecting what was extracted:

processor.save_processed_data(data, "processed.json")
data = processor.load_processed_data("processed.json")

Training is unsupervised text

Fine-tuning here is causal language modeling over your raw text — there is no instruction/response labeling step. To bias the model toward a conversational style, include text that already reads like the dialogue you want.

Next steps

Model Training — turn ProcessedData into a trained model