2026-05-15 · 6 min read · Updated 2026-05-16

Text Chunking for RAG and LLMs: Size, Overlap, and Boundaries That Actually Work

By Safe Local Tools Editorial

RAG fails quietly when chunks are the wrong size. Too small, you lose context; too large, embeddings dilute meaning and blow the token budget. Random 512-character splits cut sentences in half and bury the one paragraph that answered the user’s question.

This guide explains chunking for retrieval-augmented generation, overlap trade-offs, structure-aware splitting, and how to experiment with Safe Local Tools text chunking locally before you send proprietary docs to an embedding API.

What chunking is doing in a RAG pipeline

Typical flow:

Ingest documents (PDF, HTML, tickets).
Split into chunks.
Embed each chunk into a vector index.
At query time, retrieve top-k chunks.
Stuff chunks into the LLM prompt.

Chunk quality caps answer quality. No model upgrade fixes retrieval that never saw the right paragraph.

Token vs character chunk sizes

Embedding APIs usually bill and limit input by tokens, while storage is measured in bytes, so size chunks in tokens rather than characters. Rules of thumb:

Content type	Starting chunk (tokens)	Notes
Technical docs with headings	400–800	Respect sections
Chat logs	200–400	Speaker boundaries
Legal prose	800–1200	Overlap critical
Code	Function/class level	Do not split mid-block

Measure on your corpus—legal text differs from API reference.

Overlap: why 10–20% helps

Overlap means chunk N shares tail sentences with chunk N+1. Benefits:

Sentences spanning boundaries appear whole in at least one chunk.
Text near a boundary is embedded with context from both sides, because it appears in both neighboring chunks.

Costs:

More storage and embedding spend.
Duplicate hits in retrieval unless you dedupe.

Start with 10–15% overlap of chunk size; tune recall/precision on a labeled question set.
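A sliding window with overlap can be sketched as follows. This is a minimal illustration, not a production splitter: the function name `chunk_with_overlap` is hypothetical, and whitespace-separated words stand in for the output of a real tokenizer.

```python
def chunk_with_overlap(tokens, chunk_size, overlap):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


# Assumption: words stand in for tokens; swap in your tokenizer's output.
words = [f"w{i}" for i in range(10)]
chunks = chunk_with_overlap(words, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

With a chunk size of 4 and an overlap of 1, the last token of each window is repeated as the first token of the next, which is the duplication that costs extra storage and embedding spend.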

Structure-aware splitting beats raw length

Prefer boundaries in order:

Markdown/HTML headings (##)
Paragraph breaks (double newline)
Sentence endings
Whitespace
Hard character cut (last resort)

For code, split on functions; for tables, keep rows together or serialize row-wise with headers repeated.
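The boundary preference order above could be implemented roughly like this. It is a sketch under stated assumptions: sizes are measured in characters for simplicity, the patterns only cover Markdown-style headings, and `split_by_structure` and `BOUNDARIES` are illustrative names rather than any library's API.

```python
import re

# Boundary patterns, coarsest first, with the string used to rejoin parts.
BOUNDARIES = [
    (re.compile(r"\n(?=#{1,6} )"), "\n"),   # Markdown headings
    (re.compile(r"\n{2,}"), "\n\n"),        # paragraph breaks
    (re.compile(r"(?<=[.!?])\s+"), " "),    # sentence endings
    (re.compile(r"\s+"), " "),              # any whitespace
]


def split_by_structure(text, max_chars, level=0):
    """Pack parts greedily at the coarsest boundary; recurse on oversized parts."""
    text = text.strip()
    if not text:
        return []
    if len(text) <= max_chars:
        return [text]
    if level == len(BOUNDARIES):
        # Last resort: hard character cut.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    pattern, joiner = BOUNDARIES[level]
    chunks, current = [], ""
    for part in pattern.split(text):
        part = part.strip()
        if not part:
            continue
        candidate = f"{current}{joiner}{part}" if current else part
        if len(candidate) <= max_chars:
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= max_chars:
            current = part
        else:
            # This part is too large on its own: try the next finer boundary.
            chunks.extend(split_by_structure(part, max_chars, level + 1))
    if current:
        chunks.append(current)
    return chunks


doc = (
    "## Install\n\nRun the installer. Then reboot.\n\n"
    "## Usage\n\nCall the API with a key. Check the response code."
)
for chunk in split_by_structure(doc, max_chars=60):
    print(repr(chunk))
```

With a 60-character limit each heading stays attached to its section. A tighter limit forces the splitter down to paragraph, sentence, and whitespace boundaries before it ever cuts mid-word.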

Metadata you should attach to each chunk

Store alongside vectors:

source_file, page, section_title
updated_at for freshness boosts
acl or tenant id for multi-tenant search
chunk_index / total_chunks for reconstruction

Retrieval without metadata forces the model to guess provenance—bad for citations.
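One way to keep these fields together is a small record type stored next to each vector. The sketch below is illustrative: the class name `ChunkRecord`, the `citation` helper, and the sample values are all hypothetical, and `tenant_id` stands in for whatever ACL scheme your index uses.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass(frozen=True)
class ChunkRecord:
    text: str
    source_file: str
    section_title: str
    chunk_index: int
    total_chunks: int
    page: Optional[int] = None
    updated_at: Optional[str] = None  # ISO 8601 date string, used for freshness
    tenant_id: Optional[str] = None   # used to filter results per tenant


def citation(record):
    """Build a human-readable provenance label from the stored metadata."""
    page = f", p. {record.page}" if record.page is not None else ""
    return f"{record.source_file}{page}, section '{record.section_title}'"


# Hypothetical sample data for illustration only.
record = ChunkRecord(
    text="Rotate API keys every 90 days.",
    source_file="security-handbook.pdf",
    section_title="Key rotation",
    chunk_index=3,
    total_chunks=12,
    page=7,
    tenant_id="acme",
)
payload = asdict(record)  # plain dict, ready to store alongside the vector
print(citation(record))
```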

Evaluation: stop guessing, start measuring

Build 30–50 real user questions with gold passages labeled. Track:

Recall@k: is a gold chunk in the top k results?
MRR: mean reciprocal rank, the average of 1/rank of the first correct chunk across questions
Answer faithfulness, judged by an LLM or by human review

Change one variable at a time (chunk size, overlap, or splitter), and log the embedding model version with each run.
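The two retrieval metrics are short enough to compute by hand. In this sketch, Recall@k is the hit-style variant (1 if any gold chunk is in the top k), and the chunk ids and function names are made up for illustration.

```python
def recall_at_k(ranked_ids, gold_ids, k):
    """1.0 if any gold chunk appears in the top k results, else 0.0."""
    return 1.0 if set(ranked_ids[:k]) & set(gold_ids) else 0.0


def reciprocal_rank(ranked_ids, gold_ids):
    """1/rank of the first gold chunk, or 0.0 if none was retrieved."""
    gold = set(gold_ids)
    for rank, chunk_id in enumerate(ranked_ids, start=1):
        if chunk_id in gold:
            return 1.0 / rank
    return 0.0


def evaluate(runs, k):
    """Average both metrics over (ranked_ids, gold_ids) pairs, one per question."""
    if not runs:
        raise ValueError("need at least one labeled question")
    n = len(runs)
    return {
        "recall_at_k": sum(recall_at_k(r, g, k) for r, g in runs) / n,
        "mrr": sum(reciprocal_rank(r, g) for r, g in runs) / n,
    }


# Hypothetical results for two labeled questions.
runs = [
    (["c3", "c1", "c9"], ["c1"]),  # gold chunk found at rank 2
    (["c4", "c5", "c6"], ["c8"]),  # gold chunk never retrieved
]
print(evaluate(runs, k=2))
```

Running the same question set before and after a splitter change gives a direct comparison, as long as the embedding model and k are held constant.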

Common failure modes

Chunks include navigation chrome — menus pollute embeddings.
Boilerplate repeated — footers dominate similarity.
No language detection — mixed locales in one index.
Huge tables flattened — numbers align poorly semantically.
Stale docs — retrieval returns deprecated API versions.

Preprocess: strip templates, dedupe headers, version your index.
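A simple way to catch navigation chrome and repeated footers is to drop lines that recur across most documents. The sketch below assumes boilerplate shows up as identical lines; the function name and the 0.8 threshold are illustrative choices to tune on your corpus.

```python
from collections import Counter


def strip_repeated_lines(docs, min_fraction=0.8):
    """Remove lines that appear in at least min_fraction of the documents."""
    if len(docs) < 2:
        return list(docs)  # nothing to compare against
    counts = Counter()
    for doc in docs:
        # Count each distinct line once per document.
        counts.update({line.strip() for line in doc.splitlines() if line.strip()})
    threshold = min_fraction * len(docs)
    boilerplate = {line for line, n in counts.items() if n >= threshold}
    cleaned = []
    for doc in docs:
        kept = [l for l in doc.splitlines() if l.strip() not in boilerplate]
        cleaned.append("\n".join(kept))
    return cleaned


# Hypothetical scraped pages sharing a nav bar and a footer.
pages = [
    "Home | Docs | Login\nInstall guide\n(c) Example Corp",
    "Home | Docs | Login\nAPI reference\n(c) Example Corp",
    "Home | Docs | Login\nChangelog\n(c) Example Corp",
]
print(strip_repeated_lines(pages))
```

Exact-match counting misses boilerplate that varies slightly per page (dates, user names), so treat this as a first pass rather than a complete cleaner.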

Hybrid search still needs good chunks

Many teams combine BM25 with vector search. Keyword search needs rare tokens kept intact; vectors need coherent semantic neighborhoods. Chunks that split product codes (PT-2048-A) break both.
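If a splitter falls back to a hard cut, it can at least move the cut off the middle of a token. The helper below is a hypothetical sketch that treats any run of non-whitespace characters as one token, which keeps codes like PT-2048-A whole.

```python
def safe_cut(text, cut):
    """Return a cut position near `cut` that does not fall inside a token."""
    if cut <= 0 or cut >= len(text):
        return max(0, min(cut, len(text)))
    if text[cut].isspace() or text[cut - 1].isspace():
        return cut  # already on a token boundary
    left = cut
    while left > 0 and not text[left - 1].isspace():
        left -= 1
    if left > 0:
        return left  # start of the token the cut landed in
    # The token starts at position 0, so move right to its end instead.
    right = cut
    while right < len(text) and not text[right].isspace():
        right += 1
    return right


text = "Replace part PT-2048-A before shipping."
cut = safe_cut(text, 17)  # position 17 falls inside the product code
print(repr(text[:cut]), repr(text[cut:]))
```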

Token budget after retrieval

Even perfect retrieval must fit the model window. With k=8 chunks of 600 tokens each, retrieval alone consumes 4,800 tokens before the user question, so leave margin for the system prompt and the answer.

If the budget is tight, retrieve broadly with a cheap bi-encoder stage, then re-rank and keep only the top 3–5 chunks.
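The budget arithmetic is worth writing down once so it can be rerun when k or the chunk size changes. Only k=8 and 600 tokens per chunk come from the example in this section; the 8,192-token window and the system, question, and answer allowances below are assumptions chosen for illustration.

```python
def tokens_remaining(context_window, k, chunk_tokens, system_tokens,
                     question_tokens, answer_reserve):
    """Tokens left after retrieved chunks and fixed prompt parts are counted."""
    retrieved = k * chunk_tokens
    fixed = system_tokens + question_tokens + answer_reserve
    return context_window - retrieved - fixed


def max_chunks_that_fit(context_window, chunk_tokens, system_tokens,
                        question_tokens, answer_reserve):
    """Largest k that still leaves room for the fixed prompt parts."""
    fixed = system_tokens + question_tokens + answer_reserve
    return max(0, (context_window - fixed) // chunk_tokens)


# Assumed values: 8,192-token window, 500 system, 200 question, 1,000 answer.
budget = dict(system_tokens=500, question_tokens=200, answer_reserve=1000)
print(tokens_remaining(8192, k=8, chunk_tokens=600, **budget))  # all 8 chunks
print(tokens_remaining(8192, k=5, chunk_tokens=600, **budget))  # after re-ranking
print(max_chunks_that_fit(8192, chunk_tokens=600, **budget))
```

A negative result from `tokens_remaining` means the prompt will not fit, and either k, the chunk size, or the answer reserve has to shrink.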

Privacy and local experimentation

Corporate playbooks should not be uploaded to random chunking demos. Safe Local Tools processes text in the browser so you can tune sizes on sensitive drafts offline.

Chunking recipes by use case

Internal wiki: heading-based, 600 tokens, 80-token overlap, drop nav.

Support tickets: one chunk per message thread segment; store the ticket id as metadata.

Contracts: clause headings if available; higher overlap; human review on answers.

API docs: per endpoint section; include method + path in chunk prefix.

When to skip RAG entirely

Short FAQs that fit in context do not need vectors. Chunking adds moving parts, so if the total corpus is under roughly 8k tokens for your model, put the whole corpus directly in the prompt instead.

Iteration mindset

Chunking is hyperparameter tuning for text. Version your splitter config in git (chunk_size=512, overlap=64, splitter=markdown). Deploy changes with re-embed jobs scheduled off-peak.
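A versioned config is easier to track if it also produces a stable fingerprint that can be stored with each index build. The sketch below reuses the three example settings from the paragraph above; the class name and the choice of a truncated SHA-256 digest are assumptions, not a required scheme.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class SplitterConfig:
    chunk_size: int = 512
    overlap: int = 64
    splitter: str = "markdown"

    def fingerprint(self):
        """Short, stable hash of the settings, suitable as an index tag."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


current = SplitterConfig()
candidate = SplitterConfig(chunk_size=600, overlap=80)

# A changed fingerprint signals that the corpus needs a re-embed job.
needs_reembed = current.fingerprint() != candidate.fingerprint()
print(current.fingerprint(), candidate.fingerprint(), needs_reembed)
```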

Multilingual corpora

Mixed English/Chinese docs need language-aware splitting or separate indexes per language. Cross-lingual retrieval models exist but cost more; do not assume English chunk sizes fit CJK paragraphs.
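A crude script check is often enough to route documents to per-language indexes before splitting. The sketch below counts characters in a few common CJK Unicode blocks; the 0.3 threshold and the index names `cjk` and `latin` are hypothetical, and a real pipeline would likely use a proper language detector.

```python
# Unicode ranges: kana, CJK unified ideographs, Hangul syllables.
CJK_RANGES = [(0x3040, 0x30FF), (0x4E00, 0x9FFF), (0xAC00, 0xD7AF)]


def is_cjk(ch):
    code_point = ord(ch)
    return any(lo <= code_point <= hi for lo, hi in CJK_RANGES)


def cjk_fraction(text):
    """Share of non-whitespace characters that fall in the CJK ranges above."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if is_cjk(c)) / len(chars)


def pick_index(text, threshold=0.3):
    """Route a document to a per-script index (names are illustrative)."""
    return "cjk" if cjk_fraction(text) >= threshold else "latin"


print(pick_index("Chunking guide for retrieval pipelines"))
print(pick_index("RAG 检索增强生成 pipeline"))
```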

PDF ingestion realities

PDF text extraction order may not match visual reading order. Chunk the extracted stream only after validating column layouts; consider re-OCR for scanned PDFs instead of embedding garbage characters.

Citation UX in the product

When answers quote chunks, show section titles and links back to source anchors. Users trust answers they can verify—chunk metadata powers that UI.

Semantic vs fixed windows

Fixed windows are predictable and fast. Semantic chunkers use embeddings or model-detected boundaries to split where the topic shifts; they handle messy PDFs better but are slower and can be nondeterministic. Hybrid pipelines split at fixed sizes first, then merge neighboring chunks that fall below a minimum size.
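The merge step of a hybrid pipeline can be sketched in a few lines. This version measures size in characters and adds a `max_chars` cap so merged chunks cannot grow without bound; both parameters and the function name are assumptions for illustration.

```python
def merge_small_chunks(chunks, min_chars, max_chars, joiner="\n"):
    """Merge a chunk into its predecessor when either one is below min_chars."""
    merged = []
    for chunk in chunks:
        if merged:
            prev = merged[-1]
            too_small = len(prev) < min_chars or len(chunk) < min_chars
            fits = len(prev) + len(joiner) + len(chunk) <= max_chars
            if too_small and fits:
                merged[-1] = prev + joiner + chunk
                continue
        merged.append(chunk)
    return merged


pieces = [
    "Intro.",
    "A long paragraph about setup steps.",
    "Ok.",
    "Another long paragraph about usage.",
]
for chunk in merge_small_chunks(pieces, min_chars=10, max_chars=80):
    print(repr(chunk))
```

Because the merge only looks at adjacent chunks and never reorders them, joining the output reproduces the original text, which keeps the step deterministic and easy to test.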
