RAG Chunking: Recursive Character Splitting
Recursive character splitting is the most practical chunking strategy for real documents — it respects natural boundaries like paragraphs and sentences, falls back gracefully, and uses overlap to preserve cross-boundary context.
This is Part 15 of the AI Agents series. Parts 13–14 covered fixed-size and sentence-based chunking. This post covers recursive character splitting, one of the most widely used chunking strategies in production RAG systems.
1. Two reasons chunking is non-negotiable
Embedding quality. A 10,000-word document covering politics, sports, and finance produces one embedding that tries to represent all three topics simultaneously. That vector is diluted — it doesn’t focus on anything well. Retrieval against it will be imprecise. Smaller, focused chunks produce sharper vectors.
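Context window limits. Every embedding model has a maximum input length, measured in tokens. What happens past that limit depends on the stack: hosted APIs such as OpenAI's typically reject the request with an error, while local libraries such as sentence-transformers truncate the input by default and silently drop everything after the limit. Either way, the excess text never makes it into the embedding.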
| Model | Max tokens |
|---|---|
| all-MiniLM-L6-v2 | 512 |
| Cohere embed v3 | 512 |
| MPNet | 512 |
| text-embedding-3-small (OpenAI) | 8,192 |
If your document has 50,000 tokens and your model handles 512, you have no choice but to chunk. The question is how.
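As a rough back-of-the-envelope check, the minimum number of chunks follows directly from the two token counts. The sketch below assumes every chunk is filled exactly to the limit, which real splitters rarely achieve; `min_chunks` is a hypothetical helper written for this post, not part of any library.

```python
import math


def min_chunks(doc_tokens: int, max_tokens: int, overlap: int = 0) -> int:
    """Lower bound on the number of chunks for a document.

    Each chunk holds at most max_tokens. With overlap, every chunk after
    the first contributes only (max_tokens - overlap) new tokens.
    """
    if overlap >= max_tokens:
        raise ValueError("overlap must be smaller than max_tokens")
    if doc_tokens <= max_tokens:
        return 1
    stride = max_tokens - overlap
    return 1 + math.ceil((doc_tokens - max_tokens) / stride)


print(min_chunks(50_000, 512))      # 98
print(min_chunks(50_000, 512, 50))  # 109
```

In practice you will get more chunks than this bound, because splitting on natural boundaries leaves most chunks partly empty.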
2. Recursive character splitting: the strategy
Recursive character splitting uses a priority-ordered list of separators. It tries each one in sequence and only falls back to the next if the resulting chunks are still too large.
Default priority order:
- "\n\n": paragraph breaks (highest priority)
- "\n": line breaks
- " ": word boundaries
- "": individual characters (last resort)
The algorithm works like this:
Given text T and target chunk_size:
1. Try splitting T on "\n\n"
   └─ If all pieces fit within chunk_size → done
   └─ If some pieces are still too large → recursively apply to those pieces using "\n"
2. Try splitting oversized pieces on "\n"
   └─ If all pieces fit → done
   └─ Still too large → recurse with " "
3. Try splitting on " "
   └─ Still too large → split on characters (guarantees termination)
4. Merge small adjacent pieces back together up to chunk_size
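The key insight: it tries to keep semantically coherent units intact, preferring paragraphs, then lines, then words. Sentence boundaries are not in the default separator list, so a sentence survives only if it fits inside one of those units. A hard character split is forced only when there’s no other choice.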
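The steps above can be sketched in plain Python. This is a minimal illustration of the idea, not LangChain's actual implementation: it ignores overlap, drops empty pieces, and the names `recursive_split` and `merge` are made up for this example.

```python
def merge(pieces, sep, chunk_size):
    """Greedily join adjacent pieces with sep while the result still fits."""
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard split every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    pieces = []
    for part in text.split(sep):
        if len(part) <= chunk_size:
            if part:
                pieces.append(part)
        else:
            # Still too large: fall back to the next separator in the list.
            pieces.extend(recursive_split(part, chunk_size, rest))
    return merge(pieces, sep, chunk_size)


text = "aaa bbb\n\nccc ddd eee fff\n\nggg"
print(recursive_split(text, 10))
# ['aaa bbb', 'ccc ddd', 'eee fff', 'ggg']
```

The first and last paragraphs fit and stay whole; only the oversized middle paragraph is broken at word boundaries.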
3. Chunk overlap
Splitting at any boundary — even a good one like a paragraph — can separate a reference from what it refers to. A paragraph might start with “This approach…” where “this” refers to something in the previous paragraph.
Overlap carries the tail of each chunk into the start of the next, preserving cross-boundary context.
Without overlap:
chunk 1: "...Sun rises in the east."
chunk 2: "It provides energy for all life on Earth."
With overlap (8 chars from end of chunk 1):
chunk 1: "...Sun rises in the east."
chunk 2: "he east. It provides energy for all life on Earth."
The overlap means that even if a sentence is cut, the following chunk has enough prior context to produce a coherent embedding.
How much overlap? Experiment with your data. A common starting point is 10–20% of chunk size. For chunk_size=500, try chunk_overlap=50 to 100.
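To make the mechanics concrete, here is a minimal sketch of character-level overlap that reproduces the Sun example. It is a simplification: `add_overlap` is a hypothetical helper that copies raw characters, so the prefix can start mid-word, and it lets chunks grow past their original size, whereas a real splitter budgets the overlap inside chunk_size.

```python
def add_overlap(chunks, overlap, joiner=" "):
    """Prefix every chunk after the first with the tail of the previous chunk."""
    if overlap <= 0:
        return list(chunks)
    out = list(chunks[:1])
    for prev, cur in zip(chunks, chunks[1:]):
        # Take the tail from the original previous chunk, not the overlapped one.
        out.append(prev[-overlap:] + joiner + cur)
    return out


chunks = [
    "The Sun rises in the east.",
    "It provides energy for all life on Earth.",
]
for chunk in add_overlap(chunks, 8):
    print(repr(chunk))

# The 10-20% rule of thumb for chunk_size=500:
chunk_size = 500
print(chunk_size * 10 // 100, chunk_size * 20 // 100)  # 50 100
```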
4. Implementation with LangChain
Writing this from scratch is possible but tedious: the merge step has to track separators, overlap, and oversized pieces all at once, and small mistakes there produce chunks that exceed the limit or lose text. LangChain’s RecursiveCharacterTextSplitter implements this logic for you.
pip install langchain-text-splitters
from langchain_text_splitters import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,
    chunk_overlap=30,
    length_function=len,  # character count; use tiktoken for token count
)
document = """
Nerchuko was founded in 2024 by Ravi Kumar and Priya Singh.
The company operates in the AI education space.

Employees are entitled to 12 casual leaves and 6 sick leaves per year.
Leaves do not carry forward to the next calendar year.
Unused leaves are forfeited at year end.

Work hours are 9 AM to 6 PM Monday through Friday.
Flexible start times between 8 and 10 AM are permitted with manager approval.
Remote work is allowed Monday, Thursday, and Friday.
"""
chunks = splitter.split_text(document.strip())
for i, chunk in enumerate(chunks):
    print(f"[{i}] ({len(chunk)} chars)\n{chunk}\n")
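LangChain’s splitter will:
- First try to split on "\n\n" (paragraph boundaries)
- If a paragraph is still over 200 chars, split on "\n" within it
- Carry up to 30 characters of overlap between consecutive chunks, built from whole trailing pieces (a piece longer than chunk_overlap is not repeated)
- Merge adjacent pieces back together only while the combined length stays within chunk_size. It has no notion of topic, so two short neighbouring paragraphs can share a chunk; lower chunk_size if that hurts retrieval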
5. Using token count instead of character count
Character count is a rough proxy. What the embedding model actually cares about is token count, and the two diverge in both directions: a single Chinese character is one character but can be several tokens, while a long English word such as “tokenization” is 12 characters but usually only a few tokens.
For precise control, use a tokenizer as the length_function:
import tiktoken
from langchain_text_splitters import RecursiveCharacterTextSplitter
tokenizer = tiktoken.get_encoding("cl100k_base")
def token_length(text: str) -> int:
    return len(tokenizer.encode(text))

splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,     # 100 tokens per chunk
    chunk_overlap=15,   # 15-token overlap
    length_function=token_length,
)
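This is more accurate and the right approach when your embedding model has a token-based context limit. Count with the tokenizer that matches your embedding model: cl100k_base is the encoding used by OpenAI’s embedding models, whereas a model like MiniLM (512-token limit) uses its own WordPiece tokenizer, so tiktoken counts are only an approximation there.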
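Whichever measure you pick, it helps to verify that no chunk exceeds the model's limit before indexing. The sketch below shows why the choice of length function matters. `oversized_chunks` is a hypothetical helper, and `word_count` is only a dependency-free stand-in for a real tokenizer such as the `token_length` function above.

```python
from typing import Callable, List


def oversized_chunks(
    chunks: List[str],
    length_function: Callable[[str], int],
    limit: int,
) -> List[int]:
    """Return the indexes of chunks whose measured length exceeds limit."""
    return [i for i, chunk in enumerate(chunks) if length_function(chunk) > limit]


def word_count(text: str) -> int:
    # Stand-in for a tokenizer: counts whitespace-separated words.
    return len(text.split())


chunks = ["supercalifragilisticexpialidocious", "a b c d e f g h"]

# Measured in characters, only the long single word is over the limit.
print(oversized_chunks(chunks, len, 20))        # [0]
# Measured in word-level units, only the many-short-words chunk is.
print(oversized_chunks(chunks, word_count, 5))  # [1]
```

The two measures flag different chunks, which is exactly the mismatch that a token-based length_function removes.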
6. Integrating with ChromaDB
import chromadb
from langchain_text_splitters import RecursiveCharacterTextSplitter
# Chunk the document (`document` is the sample policy text from section 4)
splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=40)
chunks = splitter.split_text(document.strip())

# Index in ChromaDB; no embedding function is passed, so the collection uses Chroma's default
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(name="recursive_chunks")
collection.upsert(
    documents=chunks,
    ids=[f"rc_{i}" for i in range(len(chunks))],
)

# Query: results come back as one list per query text
results = collection.query(
    query_texts=["What is the remote work policy?"],
    n_results=2,
)

# A lower distance means a closer match
for doc, dist in zip(results["documents"][0], results["distances"][0]):
    print(f"[{dist:.4f}] {doc}")
7. Choosing chunk size and overlap
There’s no universally correct chunk size. It depends on:
- Your embedding model’s context window — chunks must fit within it
- Document type — dense technical text may need smaller chunks; narrative prose can handle larger
- Query type — short factual queries benefit from small focused chunks; broad summary queries benefit from larger chunks
Practical starting points:
| Document type | chunk_size | chunk_overlap |
|---|---|---|
| Short FAQ / policies | 200–300 | 30–50 |
| Articles / blog posts | 400–600 | 50–80 |
| Technical documentation | 300–500 | 50–100 |
| Research papers | 500–800 | 100–150 |
Run a small evaluation: take 10–20 representative questions, retrieve with different settings, and compare which retrieves the correct chunk most often. Tune from there.
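One way to structure that evaluation is a hit-rate loop over (question, expected text) pairs. The sketch below is an assumption about how you might set it up, not a standard tool: `hit_rate` is a hypothetical helper, and `keyword_retrieve` is a toy word-overlap retriever standing in for your real vector store query.

```python
from typing import Callable, List, Tuple


def hit_rate(
    questions: List[Tuple[str, str]],
    chunks: List[str],
    retrieve: Callable[[str, List[str], int], List[str]],
    k: int = 2,
) -> float:
    """Fraction of questions whose expected text appears in a top-k chunk."""
    hits = 0
    for question, expected in questions:
        top = retrieve(question, chunks, k)
        if any(expected in chunk for chunk in top):
            hits += 1
    return hits / len(questions)


def keyword_retrieve(question: str, chunks: List[str], k: int) -> List[str]:
    # Toy retriever: rank chunks by the number of lowercase words shared with the question.
    q_words = set(question.lower().split())
    ranked = sorted(
        chunks,
        key=lambda c: len(q_words & set(c.lower().split())),
        reverse=True,
    )
    return ranked[:k]


chunks = [
    "Remote work is allowed Monday, Thursday, and Friday.",
    "Unused leaves are forfeited at year end.",
    "Work hours are 9 AM to 6 PM Monday through Friday.",
]
questions = [
    ("Which days is remote work allowed?", "Monday, Thursday, and Friday"),
    ("Are unused leaves forfeited?", "forfeited at year end"),
]
print(hit_rate(questions, chunks, keyword_retrieve, k=1))  # 1.0
```

Re-chunk the same documents with each candidate chunk_size and chunk_overlap, swap in your real retriever, and compare the resulting scores.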
What’s next
Part 16 covers sliding window chunking — a strategy that ignores natural boundaries entirely and instead slides a fixed-size window forward by a configurable stride, creating dense overlapping chunks that are especially useful for tasks like semantic search over narrative text.