
Text Chunking for RAG and LLMs: Size, Overlap, and Boundaries That Actually Work

By Safe Local Tools Editorial

RAG fails quietly when chunks are the wrong size. Too small, you lose context; too large, embeddings dilute meaning and blow the token budget. Random 512-character splits cut sentences in half and bury the one paragraph that answered the user’s question.

This guide explains chunking for retrieval-augmented generation, overlap trade-offs, structure-aware splitting, and how to experiment with Safe Local Tools text chunking locally before you send proprietary docs to an embedding API.


What chunking is doing in a RAG pipeline

Typical flow:

  1. Ingest documents (PDF, HTML, tickets).
  2. Split into chunks.
  3. Embed each chunk into a vector index.
  4. At query time, retrieve top-k chunks.
  5. Stuff chunks into the LLM prompt.

Chunk quality caps answer quality. No model upgrade fixes retrieval that never saw the right paragraph.

Token vs character chunk sizes

Embedding APIs often bill by tokens, while storage APIs count bytes, so be explicit about which unit your chunk size is measured in. Rules of thumb:

| Content type | Starting chunk (tokens) | Notes |
|---|---|---|
| Technical docs with headings | 400–800 | Respect sections |
| Chat logs | 200–400 | Speaker boundaries |
| Legal prose | 800–1200 | Overlap critical |
| Code | Function/class level | Do not split mid-block |

Measure on your corpus—legal text differs from API reference.

Overlap: why 10–20% helps

Overlap means chunk N shares tail sentences with chunk N+1. Benefits:

  • Sentences spanning boundaries appear whole in at least one chunk.
  • Each chunk's embedding includes some context from just across the boundary.

Costs:

  • More storage and embedding spend.
  • Duplicate hits in retrieval unless you dedupe.

Start with 10–15% overlap of chunk size; tune recall/precision on a labeled question set.
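As a rough illustration of the overlap idea, here is a minimal sliding-window sketch. It assumes whitespace-separated words as a stand-in for real tokens; the function name `chunk_with_overlap` and the sample sizes are illustrative, not part of any particular library.

```python
def chunk_with_overlap(tokens, chunk_size, overlap):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


# Assumption: words approximate tokens; swap in your tokenizer's output.
words = "one two three four five six seven eight nine ten".split()
chunks = chunk_with_overlap(words, chunk_size=4, overlap=1)
```

With these toy values each window repeats the last word of the previous one. In practice you would pass tokenizer output and an overlap of roughly 10–15% of `chunk_size`.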

Structure-aware splitting beats raw length

Prefer boundaries in order:

  1. Markdown/HTML headings (##)
  2. Paragraph breaks (double newline)
  3. Sentence endings
  4. Whitespace
  5. Hard character cut (last resort)

For code, split on functions; for tables, keep rows together or serialize row-wise with headers repeated.
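The boundary preference above can be sketched as a small recursive splitter. This is a simplified sketch under stated assumptions: sizes are measured in characters rather than tokens, the regular expressions are illustrative approximations of "heading", "paragraph", and "sentence", and the name `split_structured` is hypothetical.

```python
import re

# (pattern, joiner) pairs, strongest boundary first.
BOUNDARIES = [
    (r"\n(?=#{1,6} )", "\n"),     # Markdown headings
    (r"\n\s*\n", "\n\n"),         # paragraph breaks
    (r"(?<=[.!?])\s+", " "),      # sentence endings
    (r"\s+", " "),                # any whitespace
]


def split_structured(text, max_chars, level=0):
    """Split on the strongest boundary available, packing pieces up to max_chars."""
    text = text.strip()
    if not text:
        return []
    if len(text) <= max_chars:
        return [text]
    if level == len(BOUNDARIES):
        # Last resort: hard character cut.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    pattern, joiner = BOUNDARIES[level]
    chunks, current = [], ""
    for piece in re.split(pattern, text):
        piece = piece.strip()
        if not piece:
            continue
        candidate = current + joiner + piece if current else piece
        if len(candidate) <= max_chars:
            current = candidate  # keep packing pieces into the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= max_chars:
            current = piece
        else:
            # Piece is still too large: retry with the next, weaker boundary.
            chunks.extend(split_structured(piece, max_chars, level + 1))
    if current:
        chunks.append(current)
    return chunks


doc = (
    "# Intro\nShort intro paragraph.\n\n"
    "## Details\nFirst sentence here. Second sentence follows. Third one ends it."
)
sections = split_structured(doc, max_chars=60)
```

The sketch only falls back to a weaker boundary when a piece is still too large, so a short section stays whole while a long one is cut at sentence ends. Code and tables need their own handling and are not covered here.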

Metadata you should attach to each chunk

Store alongside vectors:

  • source_file, page, section_title
  • updated_at for freshness boosts
  • acl or tenant id for multi-tenant search
  • chunk_index / total_chunks for reconstruction

Retrieval without metadata forces the model to guess provenance—bad for citations.
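One way to keep these fields together is a small record type stored next to each vector. The sketch below is illustrative: the class name `ChunkRecord`, the `tenant_id` field name, and the `citation` helper are assumptions, and a real vector store will have its own metadata schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChunkRecord:
    text: str
    source_file: str
    section_title: str
    chunk_index: int         # zero-based position within the source
    total_chunks: int
    updated_at: str          # ISO 8601 date string, used for freshness boosts
    tenant_id: str           # assumed name for the ACL / tenant field
    page: Optional[int] = None


def citation(rec):
    """Build a human-readable provenance string from chunk metadata."""
    loc = f", p. {rec.page}" if rec.page is not None else ""
    return (
        f'{rec.source_file}{loc}, "{rec.section_title}" '
        f"(chunk {rec.chunk_index + 1}/{rec.total_chunks})"
    )


rec = ChunkRecord(
    text="Refunds are processed within 14 days.",
    source_file="handbook.pdf",
    section_title="Refunds",
    chunk_index=2,
    total_chunks=10,
    updated_at="2024-05-01",
    tenant_id="acme",
    page=14,
)
label = citation(rec)
```

Because provenance travels with the chunk, the prompt or UI can cite the source directly instead of asking the model to infer it.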

Evaluation: stop guessing, start measuring

Build 30–50 real user questions with gold passages labeled. Track:

  • Recall@k: is a gold chunk in the top k?
  • MRR (mean reciprocal rank): the average of 1/rank of the first correct chunk, taken across questions
  • Answer faithfulness, via an LLM judge or human review

Change one variable at a time (chunk size, overlap, or splitter), and log the embedding model version too.
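The two retrieval metrics are simple enough to compute by hand. The sketch below assumes each evaluation run is a pair of (ranked chunk IDs returned by retrieval, gold chunk IDs for that question); the function names and sample IDs are illustrative.

```python
def recall_at_k(ranked_ids, gold_ids, k):
    """1.0 if any gold chunk appears in the top k results, else 0.0."""
    return 1.0 if set(ranked_ids[:k]) & set(gold_ids) else 0.0


def reciprocal_rank(ranked_ids, gold_ids):
    """1/rank of the first gold chunk, or 0.0 if none was retrieved."""
    gold = set(gold_ids)
    for rank, chunk_id in enumerate(ranked_ids, start=1):
        if chunk_id in gold:
            return 1.0 / rank
    return 0.0


def evaluate(runs, k=5):
    """Average both metrics over a list of (ranked_ids, gold_ids) pairs."""
    n = len(runs)
    return {
        f"recall@{k}": sum(recall_at_k(r, g, k) for r, g in runs) / n,
        "mrr": sum(reciprocal_rank(r, g) for r, g in runs) / n,
    }


runs = [
    (["c3", "c1", "c9"], ["c1"]),  # gold chunk found at rank 2
    (["c4", "c5", "c6"], ["c7"]),  # gold chunk never retrieved
]
scores = evaluate(runs, k=2)
```

Running the same question set before and after a splitter change gives a like-for-like comparison. Answer faithfulness still needs a judge model or a human reviewer and is not covered by this sketch.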

Common failure modes

  1. Chunks include navigation chrome — menus pollute embeddings.
  2. Boilerplate repeated — footers dominate similarity.
  3. No language detection — mixed locales in one index.
  4. Huge tables flattened — numbers align poorly semantically.
  5. Stale docs — retrieval returns deprecated API versions.

Preprocess: strip templates, dedupe headers, version your index.
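A simple way to catch repeated headers and footers is to drop lines that show up in most documents. This is a heuristic sketch, not a full template stripper: the 0.8 threshold and the function name `strip_repeated_lines` are assumptions to tune on your own corpus.

```python
from collections import Counter


def strip_repeated_lines(docs, min_fraction=0.8):
    """Drop lines that appear in at least min_fraction of the documents."""
    if len(docs) < 2:
        return list(docs)  # nothing to compare against
    doc_lines = [[ln.strip() for ln in d.splitlines()] for d in docs]
    counts = Counter()
    for lines in doc_lines:
        counts.update({ln for ln in lines if ln})  # count each line once per doc
    threshold = min_fraction * len(docs)
    boilerplate = {ln for ln, c in counts.items() if c >= threshold}
    return [
        "\n".join(ln for ln in lines if ln not in boilerplate)
        for lines in doc_lines
    ]


pages = [
    "Home | Docs | Login\nInstall the CLI.\n(c) Example Corp",
    "Home | Docs | Login\nConfigure auth.\n(c) Example Corp",
    "Home | Docs | Login\nRotate keys.\n(c) Example Corp",
]
cleaned = strip_repeated_lines(pages)
```

Exact-match counting misses boilerplate that varies slightly per page (dates, usernames), so treat this as a first pass before chunking.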

Hybrid search still needs good chunks

Many teams combine BM25 keyword search with vector search. Keyword search needs rare tokens intact; vectors need coherent semantic neighborhoods. A chunk boundary that splits a product code (PT-2048-A) breaks both.
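One hedged way to protect identifiers is to nudge a proposed cut position so it never lands inside one. The pattern below is an assumed format that happens to match codes shaped like PT-2048-A; replace it with whatever your real identifiers look like. `safe_cut` is a hypothetical helper name.

```python
import re

# Assumption: codes look like two or more capitals, a dash, digits, optional suffixes.
CODE_PATTERN = re.compile(r"\b[A-Z]{2,}-\d+(?:-[A-Z0-9]+)*\b")


def safe_cut(text, pos, pattern=CODE_PATTERN):
    """Return pos, moved back to the start of any protected match it falls inside."""
    for match in pattern.finditer(text):
        if match.start() < pos < match.end():
            return match.start()  # cut before the identifier instead of through it
        if match.start() >= pos:
            break  # later matches cannot contain pos
    return pos


sentence = "Replace part PT-2048-A before shipping."
cut = safe_cut(sentence, 17)  # 17 falls inside "PT-2048-A"
left, right = sentence[:cut], sentence[cut:]
```

A splitter can call this on every candidate boundary so both the keyword index and the embedding see the full code.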

Token budget after retrieval

Even perfect retrieval must fit the model window. With k=8 chunks of 600 tokens each, retrieval alone consumes 4,800 tokens before the user question, so leave margin for the system prompt and the answer.

If the budget is tight, retrieve broadly with a cheap bi-encoder stage, then re-rank down to the top 3–5 chunks.
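The budget arithmetic can be made explicit with a small helper. The window size and the system, question, and answer reserves below are made-up example numbers, not recommendations, and `fit_chunks_to_budget` is a hypothetical name.

```python
def fit_chunks_to_budget(ranked_chunks, context_window, system_tokens,
                         question_tokens, answer_reserve):
    """Keep the highest-ranked chunks that fit the remaining token budget.

    ranked_chunks: list of (chunk_id, token_count), best first.
    """
    budget = context_window - system_tokens - question_tokens - answer_reserve
    kept, used = [], 0
    for chunk_id, n_tokens in ranked_chunks:
        if used + n_tokens > budget:
            break  # stop at the first chunk that would overflow
        kept.append(chunk_id)
        used += n_tokens
    return kept, used


# Eight retrieved chunks of 600 tokens each, as in the example above.
ranked = [(f"c{i}", 600) for i in range(8)]

# Assumed numbers: a 4,096-token window with 300 + 100 + 700 tokens reserved.
kept_small, used_small = fit_chunks_to_budget(ranked, 4096, 300, 100, 700)

# The same chunks in an assumed 8,192-token window.
kept_large, used_large = fit_chunks_to_budget(ranked, 8192, 300, 100, 700)
```

In the smaller window only four of the eight chunks fit, which is the case where re-ranking to a shorter list matters most.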

Privacy and local experimentation

Corporate playbooks should not be uploaded to random chunking demos. Safe Local Tools processes text in the browser so you can tune sizes on sensitive drafts offline.

Chunking recipes by use case

Internal wiki: heading-based, 600 tokens, 80-token overlap, drop nav.

Support tickets: one chunk per message-thread segment; attach the ticket ID as metadata.

Contracts: clause headings if available; higher overlap; human review on answers.

API docs: per endpoint section; include method + path in chunk prefix.

When to skip RAG entirely

Short FAQs that fit in context do not need vectors. Chunking adds moving parts, so if the total corpus is under roughly 8k tokens for your model, place the whole corpus directly in the prompt instead.

Iteration mindset

Chunking is hyperparameter tuning for text. Version your splitter config in git (chunk_size=512, overlap=64, splitter=markdown). Deploy changes with re-embed jobs scheduled off-peak.
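A versioned config is easier to track if every index records a fingerprint of the settings that produced it. This is a minimal sketch: the `embedding_model` value is a placeholder, and the class name and 12-character hash length are arbitrary choices.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class SplitterConfig:
    chunk_size: int = 512
    overlap: int = 64
    splitter: str = "markdown"
    embedding_model: str = "example-embedding-v1"  # hypothetical model name

    def fingerprint(self):
        """Stable short hash of the config, suitable for tagging an index build."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


current = SplitterConfig()
candidate = SplitterConfig(overlap=96)
needs_reembed = current.fingerprint() != candidate.fingerprint()
```

Storing the fingerprint with each vector makes it obvious which chunks are stale after a config change and which re-embed job produced them.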

Multilingual corpora

Mixed English/Chinese docs need language-aware splitting or separate indexes per language. Cross-lingual retrieval models exist but cost more; do not assume English chunk sizes fit CJK paragraphs.

PDF ingestion realities

PDF text extraction order may not match visual reading order. Chunk the extracted stream only after validating column layouts; consider re-OCR for scanned PDFs instead of embedding garbage characters.

Citation UX in the product

When answers quote chunks, show section titles and links back to source anchors. Users trust answers they can verify—chunk metadata powers that UI.

Semantic vs fixed windows

Fixed windows are predictable and fast. Semantic chunkers use embeddings or model-detected boundaries to split where the topic shifts: better for messy PDFs, but slower and nondeterministic. Hybrid pipelines split at a fixed size first, then merge tiny neighboring chunks that fall below a minimum size.
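The merge step of such a hybrid pipeline might look like the sketch below. It assumes chunks are plain strings measured in characters and joined with a single space; `merge_small_chunks` and the thresholds are illustrative.

```python
def merge_small_chunks(chunks, min_chars, max_chars):
    """Merge a chunk into its left neighbor when either is below min_chars."""
    merged = []
    for chunk in chunks:
        too_small = merged and (
            len(merged[-1]) < min_chars or len(chunk) < min_chars
        )
        fits = merged and len(merged[-1]) + 1 + len(chunk) <= max_chars
        if too_small and fits:
            merged[-1] = merged[-1] + " " + chunk
        else:
            merged.append(chunk)
    return merged


pieces = [
    "Intro.",
    "This is a reasonably long paragraph of text.",
    "Ok.",
    "Another long paragraph that stands alone fine.",
]
merged = merge_small_chunks(pieces, min_chars=10, max_chars=100)
```

Here the two short fragments are absorbed into the first paragraph while the last paragraph stays separate. A chunk stays small only when merging would exceed `max_chars`.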

Index-time compression of chunks

Even good chunks may be verbose. Some teams run a cheap model to summarize each chunk at index time (“summary field”) and embed both full text and summary vectors. Query routing picks the appropriate field.

Negative examples to label in eval sets

Collect user questions that retrieved wrong chunks—those become gold negatives when tuning overlap. Without negatives, teams overfit to demo questions that never appear in production.

Cost control on re-embeds

Re-chunking 10M chunks because overlap changed by 5% is expensive. Version embedding configs; batch jobs with checkpointing; compare recall deltas on a holdout set before full reindex.

Parent-child chunk linking

Store small chunks for retrieval but link each to a larger “parent” span shown to the model after selection. Users get precise search hits with enough surrounding context to answer “why” questions—not just definitional sentences.
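The linking can be as simple as a lookup from child ID to parent ID applied after retrieval. The dictionaries below stand in for whatever store you use; all names and sample texts are illustrative.

```python
def expand_to_parents(hit_child_ids, child_to_parent, parents):
    """Map ranked child hits to parent spans, keeping rank order, no duplicates."""
    seen, spans = set(), []
    for child_id in hit_child_ids:
        parent_id = child_to_parent[child_id]
        if parent_id not in seen:
            seen.add(parent_id)
            spans.append(parents[parent_id])
    return spans


# Hypothetical index: small child chunks point at larger parent sections.
parents = {
    "p1": "Full text of the Refunds section ...",
    "p2": "Full text of the Shipping section ...",
}
child_to_parent = {"c1": "p1", "c2": "p1", "c3": "p2"}

# Retrieval returned two children of the same parent, then one of another.
context_spans = expand_to_parents(["c2", "c1", "c3"], child_to_parent, parents)
```

Deduplicating by parent also keeps overlapping child hits from spending the token budget on the same passage twice.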

Tool-calling and chunk boundaries

Agents that call tools may get back JSON larger than a single chunk should hold. Split tool outputs with the same rules as static docs, or summarize the tool JSON before embedding.

Human-in-the-loop review queues

Support teams approving answers should see which chunk IDs were used. If reviewers frequently override, those chunks are misaligned—feed overrides back into chunk tuning.

Regulatory retention

Chunks inherit document retention policies. If GDPR erasure removes a source file, delete associated vectors and metadata in the same job—orphan vectors are a compliance leak.
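As a sketch of "same job" deletion, the example below uses an in-memory dictionary as a stand-in for a vector store. Real stores expose their own delete-by-filter or delete-by-ID calls, so treat `erase_source` and the record layout as assumptions.

```python
def erase_source(index, source_file):
    """Delete every chunk derived from source_file; return the removed IDs.

    index: dict of chunk_id -> {"vector": [...], "metadata": {...}}.
    """
    doomed = [
        chunk_id
        for chunk_id, record in index.items()
        if record["metadata"]["source_file"] == source_file
    ]
    for chunk_id in doomed:
        del index[chunk_id]  # the vector and its metadata go together
    return doomed


index = {
    "c1": {"vector": [0.1, 0.2], "metadata": {"source_file": "alice.pdf"}},
    "c2": {"vector": [0.3, 0.4], "metadata": {"source_file": "alice.pdf"}},
    "c3": {"vector": [0.5, 0.6], "metadata": {"source_file": "policy.md"}},
}
removed = erase_source(index, "alice.pdf")
```

Returning the removed IDs gives the erasure job something to log. This only works if every chunk carries its `source_file` in metadata.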

Try local splits before you embed

Bad chunks are the silent killer of RAG demos that “worked in the notebook.” Fix retrieval first; then argue about which LLM sounds smarter.

When you want to preview chunk boundaries on a long document without uploading it, try the Safe Local Tools Text Chunker.