Data Preparation
Before training, your raw files are turned into clean text chunks by
DataProcessor (buddy.train.data_processor). You rarely call it directly —
train_model runs it for you — but understanding what it accepts helps you
assemble a good dataset.
Just point it at a folder
DataProcessor traverses a directory recursively and tries to read
every file as text. There is no required format or schema: drop your
.txt, .md, .csv, .json, .pdf, and .docx files into a folder and
train on it.
What it accepts
DataProcessor.process_directory(path) walks the directory and, for each file:
- Skips hidden files, files over 100 MB, and known binary formats (images, archives, executables, etc.) detected via magic-number signatures.
- Extracts text using a specialized reader when available:
  - PDF via pdfplumber, falling back to PyPDF2
  - DOCX via python-docx (paragraphs and tables)
- Otherwise reads the file as text, auto-detecting the encoding.
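The skip rules above can be pictured with a small stand-alone sketch. This is not DataProcessor's actual code: the function name should_skip, the signature list, the 100 MiB reading of "100 MB", and the choice to check the .pdf/.docx extension before the signature test are all assumptions made for illustration. The extension check matters in a sketch like this because a DOCX file is a ZIP container and would otherwise match the archive signature.

```python
from pathlib import Path

# Assumption: "100 MB" is read here as 100 MiB.
MAX_BYTES = 100 * 1024 * 1024

# A few well-known magic numbers; the real list is presumably longer.
BINARY_SIGNATURES = (
    b"\x89PNG\r\n\x1a\n",  # PNG image
    b"\xff\xd8\xff",       # JPEG image
    b"GIF87a",             # GIF image
    b"GIF89a",             # GIF image
    b"PK\x03\x04",         # ZIP archive
    b"\x1f\x8b",           # gzip archive
    b"\x7fELF",            # ELF executable
)

# Formats that have a dedicated reader and must not be rejected as binary.
SPECIAL_READERS = {".pdf", ".docx"}


def should_skip(path: Path) -> bool:
    """Return True if a file would be skipped under the rules described above."""
    if path.name.startswith("."):
        return True  # hidden file
    if path.stat().st_size > MAX_BYTES:
        return True  # too large
    if path.suffix.lower() in SPECIAL_READERS:
        return False  # handled by a specialized reader
    with path.open("rb") as fh:
        head = fh.read(16)
    return any(head.startswith(sig) for sig in BINARY_SIGNATURES)
```

Running a check like this over a folder before training is a quick way to see which files would be ignored.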
Encoding detection
Text files don't need to be UTF-8. The processor first tries chardet, then
falls back through a long list of encodings (utf-8, utf-16, latin-1,
cp1252, shift_jis, gb2312, big5, and more), and finally decodes with
errors="ignore" as a last resort. Files it genuinely cannot read are skipped.
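A minimal sketch of this detect-then-fall-back strategy is shown here. The function name decode_bytes and the exact order of the fallback list are assumptions, not the library's code. chardet is treated as optional so the sketch also runs without it. latin-1 is placed last because it maps every possible byte to a character and therefore never fails.

```python
from typing import List, Tuple

try:
    import chardet  # optional third-party detector
except ImportError:
    chardet = None

# Hypothetical ordering; latin-1 goes last because it accepts any byte sequence.
FALLBACK_ENCODINGS = ["utf-8", "utf-16", "cp1252", "shift_jis", "gb2312", "big5", "latin-1"]


def decode_bytes(raw: bytes) -> Tuple[str, str]:
    """Decode raw bytes, returning (text, label of the encoding that worked)."""
    candidates: List[str] = []
    if chardet is not None:
        guess = chardet.detect(raw).get("encoding")
        if guess:
            candidates.append(guess)
    candidates.extend(FALLBACK_ENCODINGS)

    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            continue  # wrong guess or unknown codec name: try the next one

    # Last resort, mirroring the errors="ignore" step described above.
    # With latin-1 in the list this line is not reached in this sketch.
    return raw.decode("utf-8", errors="ignore"), "utf-8 (errors ignored)"
```

For example, `decode_bytes("hi".encode("utf-16"))` yields the text "hi": UTF-8 decoding fails on the byte-order mark, so a later candidate is used.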
How it cleans and chunks
Each extracted text is normalized by _clean_text:
- Control characters are stripped.
- Repeated spaces/tabs collapse to a single space.
- Three-or-more blank lines collapse to a double newline.
- Whitespace-only lines are removed.
Texts shorter than min_text_length (default 10 characters) are dropped.
Longer texts are split on word boundaries into chunks no larger than
max_text_length (default 10000 characters).
from buddy.train import DataProcessor
processor = DataProcessor(min_text_length=10, max_text_length=10000)
data = processor.process_directory("/path/to/data")
print(f"Texts: {len(data.texts)}")
print(f"Stats: {data.stats}")
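To make the cleaning and chunking rules concrete, here is a stand-alone approximation. It is not the library's _clean_text. The names clean_text and chunk_text, the regular expressions, and the reading of "whitespace-only lines are removed" as "emptied before blank-line runs are collapsed" are all assumptions.

```python
import re
from typing import List


def clean_text(text: str) -> str:
    """Approximate the normalization rules listed above."""
    text = text.replace("\r\n", "\n")
    # Strip control characters, keeping tab (\x09), newline (\x0a), and CR (\x0d).
    text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]", "", text)
    # Collapse runs of spaces and tabs to a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Empty out whitespace-only lines and trim the rest.
    text = "\n".join(line.strip() for line in text.split("\n"))
    # Collapse three or more consecutive newlines to a double newline.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


def chunk_text(text: str, max_len: int = 10000) -> List[str]:
    """Split text on word boundaries into chunks of at most max_len characters."""
    chunks: List[str] = []
    current = ""
    for piece in re.findall(r"\S+\s*", text):  # a word plus its trailing whitespace
        word = piece.rstrip()
        if current and len(current) + len(word) > max_len:
            chunks.append(current.rstrip())
            current = ""
        while len(word) > max_len:  # a single oversized word is hard-split
            chunks.append(word[:max_len])
            word, piece = word[max_len:], piece[max_len:]
        current += piece
    if current.strip():
        chunks.append(current.rstrip())
    return chunks
```

With `max_len=9`, the string "one two three four" splits into "one two", "three", and "four": no word is cut, and no chunk exceeds the limit.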
The ProcessedData result
process_directory returns a ProcessedData dataclass with three fields:
| Field | Type | Contents |
|---|---|---|
| texts | List[str] | The cleaned, chunked text used for training |
| metadata | List[Dict] | Per-chunk info: source file, file type, encoding, chunk index, char/word counts |
| stats | Dict | Totals: processed_files, skipped_files, total_characters, avg_text_length, plus encoding/file-type distributions |
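As a rough illustration of how two of the stats totals relate to texts, the sketch below recomputes them from a list of chunks. The helper summarize is hypothetical. The key names come from the table, but whether the library calculates them exactly this way is an assumption.

```python
from typing import Dict, List


def summarize(texts: List[str]) -> Dict[str, float]:
    """Recompute two of the totals named in the table from a list of chunks."""
    total = sum(len(t) for t in texts)
    return {
        "total_characters": total,
        # Guard against an empty dataset to avoid dividing by zero.
        "avg_text_length": total / len(texts) if texts else 0.0,
    }
```

Comparing such a recomputation against data.stats is one way to sanity-check a dataset after editing it by hand.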
Saving and reloading
You can persist processed data to JSON and load it later — useful for reproducible runs or inspecting what was extracted:
processor.save_processed_data(data, "processed.json")
data = processor.load_processed_data("processed.json")
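For readers curious what such a round trip involves, here is a sketch using only the standard library. ProcessedDataSketch, save_sketch, and load_sketch are stand-ins invented for illustration. The JSON layout that save_processed_data actually writes is not documented here and may differ.

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Any, Dict, List


@dataclass
class ProcessedDataSketch:
    """Hypothetical stand-in with the three fields described above."""
    texts: List[str] = field(default_factory=list)
    metadata: List[Dict[str, Any]] = field(default_factory=list)
    stats: Dict[str, Any] = field(default_factory=dict)


def save_sketch(data: ProcessedDataSketch, path: str) -> None:
    # ensure_ascii=False keeps non-ASCII text readable in the saved file.
    payload = json.dumps(asdict(data), ensure_ascii=False, indent=2)
    Path(path).write_text(payload, encoding="utf-8")


def load_sketch(path: str) -> ProcessedDataSketch:
    raw = json.loads(Path(path).read_text(encoding="utf-8"))
    return ProcessedDataSketch(**raw)
```

Because the file is plain JSON, it can be opened in any editor to inspect exactly which chunks were extracted.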
Training is unsupervised text
Fine-tuning here is causal language modeling over your raw text — there is no instruction/response labeling step. To bias the model toward a conversational style, include text that already reads like the dialogue you want.
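One way to prepare such text is to render conversations as plain transcripts and save them as .txt files in the data folder. The sketch below is only an illustration: render_dialogue and the "Speaker: text" layout are assumptions, and no particular dialogue format is required.

```python
from typing import List, Tuple


def render_dialogue(turns: List[Tuple[str, str]]) -> str:
    """Render (speaker, utterance) pairs as a plain-text transcript."""
    lines = [f"{speaker}: {utterance.strip()}" for speaker, utterance in turns]
    return "\n".join(lines) + "\n"


transcript = render_dialogue([
    ("User", "How do I reset my password?"),
    ("Assistant", "Open Settings, choose Account, then select Reset password."),
])
```

Writing each transcript to its own file keeps conversations separate when the folder is processed.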
Next steps
- Model Training: turn ProcessedData into a trained model