Text Chunking for RAG and LLMs: Size, Overlap, and Boundaries That Actually Work
By Safe Local Tools Editorial
RAG fails quietly when chunks are the wrong size. Too small, you lose context; too large, embeddings dilute meaning and blow the token budget. Random 512-character splits cut sentences in half and bury the one paragraph that answered the user’s question.
This guide explains chunking for retrieval-augmented generation, overlap trade-offs, structure-aware splitting, and how to experiment with Safe Local Tools text chunking locally before you send proprietary docs to an embedding API.

What chunking is doing in a RAG pipeline
Typical flow:
- Ingest documents (PDF, HTML, tickets).
- Split into chunks.
- Embed each chunk into a vector index.
- At query time, retrieve top-k chunks.
- Stuff chunks into the LLM prompt.
Chunk quality caps answer quality. No model upgrade fixes retrieval that never saw the right paragraph.
Token vs character chunk sizes
Embedding APIs typically bill by tokens, while storage layers count bytes, so express chunk sizes in tokens. Rules of thumb:
| Content type | Starting chunk (tokens) | Notes |
|---|---|---|
| Technical docs with headings | 400–800 | Respect sections |
| Chat logs | 200–400 | Speaker boundaries |
| Legal prose | 800–1200 | Overlap critical |
| Code | Function/class level | Do not split mid-block |
Measure on your corpus—legal text differs from API reference.
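Because characters, bytes, and tokens diverge, it can help to print all three for sample passages before picking a size. The sketch below is only an illustration: `size_report` is a hypothetical helper, and the whitespace word count is a crude stand-in for tokens, so use your embedding model's own tokenizer for real measurements.

```python
def size_report(text):
    """Report three different 'sizes' of the same passage."""
    return {
        "characters": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        # Crude proxy only; a real tokenizer will give different counts.
        "whitespace_words": len(text.split()),
    }


english = size_report("chunk size")
chinese = size_report("检索增强生成")
print(english)
print(chinese)
```

The six-character Chinese string counts as a single whitespace "word" but takes 18 UTF-8 bytes, which is why one character-based limit behaves differently across content types.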
Overlap: why 10–20% helps
Overlap means chunk N shares tail sentences with chunk N+1. Benefits:
- Sentences spanning boundaries appear whole in at least one chunk.
- Embeddings capture local context duplicated near edges.
Costs:
- More storage and embedding spend.
- Duplicate hits in retrieval unless you dedupe.
Start with 10–15% overlap of chunk size; tune recall/precision on a labeled question set.
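A minimal sketch of a sliding window with overlap, assuming the text has already been tokenized into a list. The function name `chunk_with_overlap` and the use of whitespace-split words as "tokens" are illustrative assumptions, not part of any particular library.

```python
def chunk_with_overlap(tokens, chunk_size=512, overlap=64):
    """Split a token list into windows; each shares `overlap` tokens with the next."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


# Whitespace-split words stand in for real tokens here.
words = "one two three four five six seven eight nine ten".split()
for chunk in chunk_with_overlap(words, chunk_size=4, overlap=1):
    print(chunk)
```

With `chunk_size=4` and `overlap=1`, the last word of each window reappears as the first word of the next, so a phrase that straddles a boundary stays whole in at least one chunk.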
Structure-aware splitting beats raw length
Prefer boundaries in order:
- Markdown/HTML headings (##)
- Paragraph breaks (double newline)
- Sentence endings
- Whitespace
- Hard character cut (last resort)
For code, split on functions; for tables, keep rows together or serialize row-wise with headers repeated.
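The boundary priority above can be sketched as a recursive splitter that only falls back to a coarser rule when a piece is still too long. This is a simplified illustration under stated assumptions: sizes are measured in characters rather than tokens, only Markdown headings are recognized, and `split_structured` is a hypothetical name.

```python
import re

# Boundary patterns in priority order, mirroring the list above.
SEPARATORS = [
    r"\n(?=#{1,6} )",    # newline directly before a Markdown heading
    r"\n\s*\n",          # paragraph break
    r"(?<=[.!?])\s+",    # whitespace after a sentence ending
    r"\s+",              # any whitespace
]


def split_structured(text, max_chars=1000, level=0):
    """Split on the highest-priority boundary that yields small enough pieces."""
    text = text.strip()
    if not text:
        return []
    if len(text) <= max_chars:
        return [text]
    if level >= len(SEPARATORS):
        # Last resort: hard character cut.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    chunks = []
    for piece in re.split(SEPARATORS[level], text):
        # Pieces that already fit are returned unchanged by the recursive call.
        chunks.extend(split_structured(piece, max_chars, level + 1))
    return chunks


doc = (
    "# Intro\nShort intro paragraph.\n\n"
    "## Details\nFirst sentence here. Second sentence here.\n\n"
    "Another paragraph."
)
for chunk in split_structured(doc, max_chars=40):
    print(repr(chunk))
```

The sketch never merges small pieces back together, so a real pipeline would likely add a packing step to bring chunks up to the target size.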
Metadata you should attach to each chunk
Store alongside vectors:
- source_file, page, section_title
- updated_at for freshness boosts
- acl or tenant id for multi-tenant search
- chunk_index / total_chunks for reconstruction
Retrieval without metadata forces the model to guess provenance—bad for citations.
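One way to keep these fields consistent is a small record type built at split time. The sketch below uses the field names listed above; the `ChunkMetadata` class, the `attach_metadata` helper, and the choice of a plain `tenant_id` string instead of a full ACL are assumptions for illustration.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass(frozen=True)
class ChunkMetadata:
    source_file: str
    updated_at: str                  # ISO 8601 string, usable for freshness boosts
    tenant_id: str                   # stand-in for an ACL in multi-tenant search
    chunk_index: int
    total_chunks: int
    page: Optional[int] = None
    section_title: Optional[str] = None


def attach_metadata(chunks, source_file, updated_at, tenant_id, section_title=None):
    """Pair each chunk text with a metadata dict ready to store beside its vector."""
    total = len(chunks)
    return [
        {
            "text": text,
            "metadata": asdict(ChunkMetadata(
                source_file=source_file,
                updated_at=updated_at,
                tenant_id=tenant_id,
                chunk_index=i,
                total_chunks=total,
                section_title=section_title,
            )),
        }
        for i, text in enumerate(chunks)
    ]


records = attach_metadata(
    ["First chunk.", "Second chunk."],
    source_file="handbook.md",
    updated_at="2024-01-01T00:00:00Z",
    tenant_id="tenant-a",
    section_title="Onboarding",
)
print(records[1]["metadata"])
```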
Evaluation: stop guessing, start measuring
Build 30–50 real user questions with gold passages labeled. Track:
- Recall@k — gold chunk in top k?
- MRR — rank of first correct chunk
- Answer faithfulness with LLM-judge or human review
Change one variable at a time (chunk size, overlap, or splitter), and log the embedding model version with every run.
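The two retrieval metrics are small enough to compute by hand. In this sketch, Recall@k follows the definition above (is any gold chunk in the top k?), and the data layout, a list of `(ranked_ids, gold_ids)` pairs, is an assumed format rather than a standard one.

```python
def recall_at_k(ranked_ids, gold_ids, k):
    """1.0 if any gold chunk id appears in the top k results, else 0.0."""
    return 1.0 if any(cid in gold_ids for cid in ranked_ids[:k]) else 0.0


def reciprocal_rank(ranked_ids, gold_ids):
    """1 / rank of the first correct chunk, or 0.0 if none was retrieved."""
    for rank, cid in enumerate(ranked_ids, start=1):
        if cid in gold_ids:
            return 1.0 / rank
    return 0.0


def evaluate(results, k=5):
    """Average both metrics over (ranked_ids, gold_ids) pairs, one per question."""
    if not results:
        raise ValueError("results must not be empty")
    n = len(results)
    return {
        "recall_at_k": sum(recall_at_k(r, g, k) for r, g in results) / n,
        "mrr": sum(reciprocal_rank(r, g) for r, g in results) / n,
    }


results = [
    (["c3", "c1", "c9"], {"c1"}),   # gold chunk retrieved at rank 2
    (["c4", "c5", "c6"], {"c7"}),   # gold chunk never retrieved
]
print(evaluate(results, k=2))
```

Running the same function before and after a splitter change gives a like-for-like comparison on the labeled set.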
Common failure modes
- Chunks include navigation chrome — menus pollute embeddings.
- Boilerplate repeated — footers dominate similarity.
- No language detection — mixed locales in one index.
- Huge tables flattened — numbers align poorly semantically.
- Stale docs — retrieval returns deprecated API versions.
Preprocess: strip templates, dedupe headers, version your index.
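As one possible preprocessing step, lines that recur across most documents (menus, footers) can be dropped before splitting. This is a rough heuristic sketch: `strip_repeated_lines` and the 50% threshold are assumptions, and exact-match line comparison will miss boilerplate that varies slightly per page.

```python
from collections import Counter


def strip_repeated_lines(documents, max_doc_fraction=0.5):
    """Drop lines that appear in more than `max_doc_fraction` of the documents."""
    doc_lines = [[line.strip() for line in doc.splitlines()] for doc in documents]
    doc_frequency = Counter()
    for lines in doc_lines:
        # Count each distinct non-empty line once per document.
        doc_frequency.update({line for line in lines if line})
    limit = max_doc_fraction * len(documents)
    cleaned = []
    for lines in doc_lines:
        # Blank lines have frequency 0, so paragraph breaks are kept.
        kept = [line for line in lines if doc_frequency[line] <= limit]
        cleaned.append("\n".join(kept).strip())
    return cleaned


pages = [
    "Home | Docs | Login\nInstall the CLI.\nCopyright Example Corp",
    "Home | Docs | Login\nRotate API keys monthly.\nCopyright Example Corp",
    "Home | Docs | Login\nBilling runs nightly.\nCopyright Example Corp",
]
print(strip_repeated_lines(pages))
```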
Hybrid search still needs good chunks
Teams add BM25 + vectors. Keyword search likes rare tokens intact; vectors like semantic neighborhoods. Chunks that split product codes (PT-2048-A) break both.
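A simple guard against splitting identifiers is to move any proposed cut back to the nearest whitespace. The helper below is a hypothetical sketch for character-based cuts; it treats any run of non-whitespace characters as one unit and falls back to the original cut when no earlier whitespace exists.

```python
def safe_cut(text, index):
    """Shift a proposed cut left so it never lands inside a non-whitespace run."""
    if index <= 0 or index >= len(text):
        return max(0, min(index, len(text)))
    cut = index
    # A cut is "inside" a token when the characters on both sides are non-whitespace.
    while cut > 0 and not text[cut - 1].isspace() and not text[cut].isspace():
        cut -= 1
    # If the token starts at position 0, fall back to the hard cut.
    return cut if cut > 0 else index


text = "Replace part PT-2048-A before restart"
cut = safe_cut(text, 17)          # 17 falls inside "PT-2048-A"
print(repr(text[:cut]), repr(text[cut:]))
```

After the adjustment, the product code lands intact in the second piece, so both keyword and vector search see the full identifier.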
Token budget after retrieval
Even perfect retrieval must fit the model window. If k=8 chunks × 600 tokens = 4,800 tokens before the user question, plan margins for system prompts and answers.
If the budget is tight, retrieve broadly with a cheap bi-encoder stage, then re-rank and keep only the top 3–5 chunks.
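The budgeting arithmetic can be written down once and reused. In this sketch every number is a hypothetical example (window size, system prompt length, answer reserve), and `max_chunks_in_budget` is an illustrative helper name.

```python
def max_chunks_in_budget(window, system_tokens, question_tokens,
                         answer_reserve, chunk_tokens):
    """How many retrieved chunks fit once fixed prompt costs are subtracted."""
    if chunk_tokens <= 0:
        raise ValueError("chunk_tokens must be positive")
    available = window - system_tokens - question_tokens - answer_reserve
    return max(available // chunk_tokens, 0)


# Retrieved context alone: 8 chunks of 600 tokens each.
print(8 * 600)  # 4800

# Hypothetical 8,192-token window with 1,700 tokens of fixed overhead.
print(max_chunks_in_budget(8192, 500, 200, 1000, 600))

# The same overhead in a hypothetical 4,096-token window.
print(max_chunks_in_budget(4096, 500, 200, 1000, 600))
```

Under these assumed numbers the larger window holds 10 chunks and the smaller one only 3, which is where re-ranking down to a few chunks starts to matter.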
Privacy and local experimentation
Corporate playbooks should not be uploaded to random chunking demos. Safe Local Tools processes text in the browser so you can tune sizes on sensitive drafts offline.
Chunking recipes by use case
Internal wiki: heading-based, 600 tokens, 80-token overlap, drop nav.
Support tickets: one chunk per message thread segment; store the ticket id as metadata.
Contracts: clause headings if available; higher overlap; human review on answers.
API docs: per endpoint section; include method + path in chunk prefix.
When to skip RAG entirely
Short FAQs that fit in context do not need vectors. Chunking adds moving parts, so if the total corpus is under roughly 8k tokens for your model, place the whole corpus directly in the prompt instead.
Iteration mindset
Chunking is hyperparameter tuning for text. Version your splitter config in git (chunk_size=512, overlap=64, splitter=markdown). Deploy changes with re-embed jobs scheduled off-peak.
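One way to make a splitter config traceable is to hash it and store the fingerprint with every vector. The sketch below reuses the example values above; the `SplitterConfig` class, the `embedding_model` field, and its placeholder value are assumptions for illustration.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class SplitterConfig:
    chunk_size: int = 512
    overlap: int = 64
    splitter: str = "markdown"
    embedding_model: str = "example-embedding-v1"  # hypothetical model name

    def fingerprint(self):
        """Short, stable hash of the config, suitable for tagging an index."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


current = SplitterConfig()
candidate = SplitterConfig(overlap=96)
print(current.fingerprint(), candidate.fingerprint())
print("re-embed needed:", current.fingerprint() != candidate.fingerprint())
```

Comparing fingerprints gives a cheap check for whether an index was built with the config currently in git.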
Multilingual corpora
Mixed English/Chinese docs need language-aware splitting or separate indexes per language. Cross-lingual retrieval models exist but cost more; do not assume English chunk sizes fit CJK paragraphs.
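A rough way to route documents to per-language indexes is to measure the share of CJK characters. This sketch is a heuristic, not language detection: it checks only four common Unicode blocks, and the `pick_index` helper, the index names, and the 0.3 threshold are all hypothetical.

```python
# Common CJK blocks only; extension blocks are ignored in this sketch.
CJK_RANGES = [
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0xAC00, 0xD7AF),  # Hangul Syllables
]


def cjk_ratio(text):
    """Fraction of non-whitespace characters that fall in the ranges above."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    cjk = sum(
        1 for ch in chars
        if any(low <= ord(ch) <= high for low, high in CJK_RANGES)
    )
    return cjk / len(chars)


def pick_index(text, threshold=0.3):
    """Choose an index name based on the CJK share of the text."""
    return "cjk" if cjk_ratio(text) >= threshold else "default"


print(pick_index("Install guide for the CLI"))
print(pick_index("安装指南 v2"))
```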
PDF ingestion realities
PDF text extraction order may not match visual reading order. Chunk the extracted stream only after validating column layouts; consider re-OCR for scanned PDFs instead of embedding garbage characters.
Citation UX in the product
When answers quote chunks, show section titles and links back to source anchors. Users trust answers they can verify—chunk metadata powers that UI.
Semantic vs fixed windows
Fixed windows are predictable and fast. Semantic chunkers use embeddings or model-detected boundaries to split where the topic shifts; they handle messy PDFs better but are slower and nondeterministic. Hybrid pipelines split at fixed sizes first, then merge tiny neighboring chunks that fall below a minimum size.
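The merge step of such a hybrid pipeline might look like the sketch below. It assumes character-based sizes, and the `merge_small_chunks` name, the `max_chars` cap, and the default thresholds are illustrative choices.

```python
def merge_small_chunks(chunks, min_chars=200, max_chars=2000, joiner="\n\n"):
    """Fold chunks shorter than `min_chars` into their left neighbor when it fits."""
    merged = []
    for chunk in chunks:
        if merged:
            combined = merged[-1] + joiner + chunk
            too_small = len(merged[-1]) < min_chars or len(chunk) < min_chars
            # The cap keeps a run of tiny chunks from growing without bound.
            if too_small and len(combined) <= max_chars:
                merged[-1] = combined
                continue
        merged.append(chunk)
    return merged


pieces = ["a" * 300, "b" * 50, "c" * 40, "d" * 300]
result = merge_small_chunks(pieces, min_chars=100, max_chars=1000)
print([len(p) for p in result])
```

Here the two short pieces are absorbed into the first chunk, while the final 300-character chunk stays separate because neither neighbor is below the minimum.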
Answer-time compression of chunks
Even good chunks may be verbose. Some teams run a cheap model to summarize each chunk at index time (“summary field”) and embed both full text and summary vectors. Query routing picks the appropriate field.
Negative examples to label in eval sets
Collect user questions that retrieved wrong chunks—those become gold negatives when tuning overlap. Without negatives, teams overfit to demo questions that never appear in production.
Cost control on re-embeds
Re-chunking 10M chunks because overlap changed by 5% is expensive. Version embedding configs; batch jobs with checkpointing; compare recall deltas on a holdout set before full reindex.
Parent-child chunk linking
Store small chunks for retrieval but link each to a larger “parent” span shown to the model after selection. Users get precise search hits with enough surrounding context to answer “why” questions—not just definitional sentences.
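A minimal sketch of the linking, assuming fixed character windows for both levels. The id scheme (`p0`, `p1`, ...) and the function names are hypothetical, and a real pipeline would cut parents on structural boundaries rather than raw offsets.

```python
def build_parent_child(text, parent_size=1200, child_size=300):
    """Cut large parent spans, then small child chunks that point back to them."""
    parents, children = {}, []
    for p_idx, p_start in enumerate(range(0, len(text), parent_size)):
        parent_id = f"p{p_idx}"
        parent_text = text[p_start:p_start + parent_size]
        parents[parent_id] = parent_text
        for c_start in range(0, len(parent_text), child_size):
            children.append({
                "parent_id": parent_id,
                "text": parent_text[c_start:c_start + child_size],
            })
    return parents, children


def expand_to_parents(hits, parents):
    """Replace retrieved child hits with unique parent spans, keeping rank order."""
    seen, expanded = set(), []
    for hit in hits:
        parent_id = hit["parent_id"]
        if parent_id not in seen:
            seen.add(parent_id)
            expanded.append(parents[parent_id])
    return expanded


parents, children = build_parent_child("x" * 2500, parent_size=1000, child_size=250)
hits = [children[0], children[1], children[5]]   # two hits share parent p0
context = expand_to_parents(hits, parents)
print(len(parents), len(children), len(context))
```

Only the child texts are embedded; deduplicating by parent keeps the prompt from repeating the same span when several children of one parent are retrieved.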
Tool-calling and chunk boundaries
Agents that call tools may receive JSON larger than a single chunk should hold. Split tool outputs with the same rules as static docs, or summarize tool JSON before embedding.
Human-in-the-loop review queues
Support teams approving answers should see which chunk IDs were used. If reviewers frequently override, those chunks are misaligned—feed overrides back into chunk tuning.
Regulatory retention
Chunks inherit document retention policies. If GDPR erasure removes a source file, delete associated vectors and metadata in the same job—orphan vectors are a compliance leak.
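The erasure path is easier to reason about when deletion is keyed on the same `source_file` metadata stored with each chunk. The in-memory store below is a toy stand-in for a real vector database, whose delete-by-filter API will differ; the class and method names are hypothetical.

```python
class InMemoryVectorStore:
    """Toy store that keeps each vector and its metadata in one record."""

    def __init__(self):
        self._records = {}  # chunk_id -> {"vector": ..., "metadata": ...}

    def add(self, chunk_id, vector, metadata):
        self._records[chunk_id] = {"vector": vector, "metadata": metadata}

    def delete_by_source(self, source_file):
        """Remove every vector and metadata record derived from one source file."""
        doomed = [
            chunk_id for chunk_id, record in self._records.items()
            if record["metadata"].get("source_file") == source_file
        ]
        for chunk_id in doomed:
            del self._records[chunk_id]
        return doomed  # ids to log as evidence for the erasure request

    def __len__(self):
        return len(self._records)


store = InMemoryVectorStore()
store.add("c1", [0.1, 0.2], {"source_file": "contract_a.pdf"})
store.add("c2", [0.3, 0.4], {"source_file": "contract_a.pdf"})
store.add("c3", [0.5, 0.6], {"source_file": "handbook.md"})
removed = store.delete_by_source("contract_a.pdf")
print(removed, len(store))
```

Because vector and metadata live in one record, a single delete removes both, which avoids the orphan-vector case described above.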
Try local splits before you embed
Bad chunks are the silent killer of RAG demos that “worked in the notebook.” Fix retrieval first; then argue about which LLM sounds smarter.
When you want to preview chunk boundaries on a long document without uploading it, try the Text Chunker.