RAG Chunking: Recursive Character Splitting

This is Part 15 of the AI Agents series. Parts 13–14 covered fixed-size and sentence-based chunking. This post covers recursive character splitting — the most widely used chunking strategy in production RAG systems.

1. Two reasons chunking is non-negotiable

Embedding quality. A 10,000-word document covering politics, sports, and finance produces one embedding that tries to represent all three topics simultaneously. That vector is diluted — it doesn’t focus on anything well. Retrieval against it will be imprecise. Smaller, focused chunks produce sharper vectors.

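Context window limits. Every embedding model has a maximum input length (in tokens). What happens past that limit depends on the model and the client: some APIs reject the request with an error, while others, such as sentence-transformers models, silently truncate the input, so the tail of the text never reaches the embedding.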

Model	Max tokens
`all-MiniLM-L6-v2`	512
`Cohere embed v3`	512
`MPNet`	512
`text-embedding-3-small` (OpenAI)	8,192

If your document has 50,000 tokens and your model handles 512, you have no choice but to chunk. The question is how.

2. Recursive character splitting: the strategy

Recursive character splitting uses a priority-ordered list of separators. It tries each one in sequence and only falls back to the next if the resulting chunks are still too large.

Default priority order:

"\n\n": paragraph breaks (highest priority)
"\n": line breaks
" ": word boundaries (a single space)
"": individual characters (last resort)

The algorithm works like this:

Given text T and target chunk_size:

1. Try splitting T on "\n\n"
   └─ If all pieces fit within chunk_size → done
   └─ If some pieces are still too large → recursively apply to those pieces using "\n"

2. Try splitting oversized pieces on "\n"
   └─ If all pieces fit → done
   └─ Still too large → recurse with " "

3. Try splitting on " "
   └─ Still too large → split on characters (guarantees termination)

4. Merge small adjacent pieces back together up to chunk_size

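The key insight: it tries to keep semantically coherent units (paragraphs, then lines, then words) intact. It only forces a hard character split when there’s no other choice. Note that the default separators include no sentence boundary, so a long single-line paragraph falls straight through to word-level splitting.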
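The steps above can be sketched in plain Python. This is a minimal illustration, not LangChain’s implementation: the names `recursive_split` and `merge_pieces` are made up for this post, lengths are counted in characters, and overlap is left out to keep the recursion visible.

```python
from typing import List, Sequence

DEFAULT_SEPARATORS = ("\n\n", "\n", " ", "")


def merge_pieces(pieces: List[str], sep: str, chunk_size: int) -> List[str]:
    """Greedily rejoin adjacent pieces with `sep` while the result fits chunk_size."""
    merged: List[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            merged.append(current)
            current = piece
    if current:
        merged.append(current)
    return merged


def recursive_split(
    text: str,
    chunk_size: int,
    separators: Sequence[str] = DEFAULT_SEPARATORS,
) -> List[str]:
    """Split text into chunks of at most chunk_size characters (no overlap)."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Last resort: hard cut at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks: List[str] = []
    buffer: List[str] = []  # pieces that already fit, waiting to be merged
    for piece in text.split(sep):
        if not piece:
            continue
        if len(piece) <= chunk_size:
            buffer.append(piece)
        else:
            # Flush what fits so far, then recurse with the finer separators.
            chunks.extend(merge_pieces(buffer, sep, chunk_size))
            buffer = []
            chunks.extend(recursive_split(piece, chunk_size, finer))
    chunks.extend(merge_pieces(buffer, sep, chunk_size))
    return chunks


text = "aaaa bbbb\n\ncccc dddd eeee ffff\n\ngg"
print(recursive_split(text, chunk_size=10))
# ['aaaa bbbb', 'cccc dddd', 'eeee ffff', 'gg']
```

The middle paragraph is too long for 10 characters, so only that paragraph is re-split on spaces; the other two survive as whole paragraphs.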

3. Chunk overlap

Splitting at any boundary — even a good one like a paragraph — can separate a reference from what it refers to. A paragraph might start with “This approach…” where “this” refers to something in the previous paragraph.

Overlap carries the tail of each chunk into the start of the next, preserving cross-boundary context.

Without overlap:
  chunk 1: "...Sun rises in the east."
  chunk 2: "It provides energy for all life on Earth."

With overlap (8 chars from end of chunk 1):
  chunk 1: "...Sun rises in the east."
  chunk 2: "he east. It provides energy for all life on Earth."

The overlap means that even if a sentence is cut, the following chunk has enough prior context to produce a coherent embedding.
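The example above can be reproduced with a few lines of Python. This is only a sketch with a hypothetical helper, `add_overlap`, which prepends the last `overlap` characters of each chunk to the next one. It is a simplification: it cuts at raw character positions, and the overlapped chunks grow past their original size.

```python
from typing import List


def add_overlap(chunks: List[str], overlap: int, joiner: str = " ") -> List[str]:
    """Prefix every chunk after the first with the tail of its predecessor."""
    if overlap <= 0:
        return list(chunks)
    result = chunks[:1]  # the first chunk has nothing before it
    for previous, current in zip(chunks, chunks[1:]):
        result.append(previous[-overlap:] + joiner + current)
    return result


chunks = [
    "The Sun rises in the east.",
    "It provides energy for all life on Earth.",
]
for chunk in add_overlap(chunks, overlap=8):
    print(repr(chunk))
# 'The Sun rises in the east.'
# 'he east. It provides energy for all life on Earth.'
```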

How much overlap? Experiment with your data. A common starting point is 10–20% of chunk size. For chunk_size=500, try chunk_overlap=50 to 100.
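Overlap is not free: every extra character of overlap is indexed twice. A rough way to see the cost is to estimate how many chunks a document produces at different overlap settings. The helper below, `estimate_chunks`, is a hypothetical name and assumes an idealized case where every chunk is full and the window advances by a fixed stride of chunk_size minus chunk_overlap. Recursive splitting produces uneven chunks, so treat the result as a lower bound.

```python
def estimate_chunks(doc_len: int, chunk_size: int, chunk_overlap: int) -> int:
    """Idealized chunk count for a fixed stride of chunk_size - chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    if doc_len <= 0:
        return 0
    if doc_len <= chunk_size:
        return 1
    stride = chunk_size - chunk_overlap
    remaining = doc_len - chunk_size
    return 1 + -(-remaining // stride)  # ceiling division without floats


# Units only need to be consistent (all characters or all tokens).
for overlap in (0, 50, 100):
    print(overlap, estimate_chunks(50_000, chunk_size=500, chunk_overlap=overlap))
# 0 100
# 50 111
# 100 125
```

Going from no overlap to 20% overlap adds roughly a quarter more chunks to embed and store in this idealized case.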

4. Implementation with LangChain

Writing this from scratch is possible but tedious: the merge and overlap bookkeeping is easy to get subtly wrong, for example by emitting chunks that exceed chunk_size or by dropping separators when pieces are rejoined. LangChain’s RecursiveCharacterTextSplitter handles these edge cases for you.

pip install langchain-text-splitters

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,
    chunk_overlap=30,
    length_function=len,  # character count; use tiktoken for token count
)

document = """
Nerchuko was founded in 2024 by Ravi Kumar and Priya Singh.
The company operates in the AI education space.

Employees are entitled to 12 casual leaves and 6 sick leaves per year.
Leaves do not carry forward to the next calendar year.
Unused leaves are forfeited at year end.

Work hours are 9 AM to 6 PM Monday through Friday.
Flexible start times between 8 and 10 AM are permitted with manager approval.
Remote work is allowed Monday, Thursday, and Friday.
"""

chunks = splitter.split_text(document.strip())

for i, chunk in enumerate(chunks):
    print(f"[{i}] ({len(chunk)} chars)\n{chunk}\n")

LangChain’s splitter will:

First try to split on \n\n (paragraph boundaries)
If a paragraph is still over 200 chars, split on \n within it
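Carry up to 30 characters of overlap into the next chunk, built from whole pieces at the current separator level (a chunk that ends in a piece longer than 30 characters carries no overlap)
Merge adjacent small pieces back together up to the 200-character limit. The merge is purely size-based: the splitter has no notion of whether two paragraphs are related, so a larger chunk_size can place neighboring paragraphs in the same chunk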

5. Using token count instead of character count

Character count is a rough proxy. What the embedding model actually cares about is token count. A single Chinese character is one character but potentially multiple tokens. “tokenization” is one word but is typically split into two or more subword tokens, depending on the tokenizer.

For precise control, use a tokenizer as the length_function:

import tiktoken
from langchain_text_splitters import RecursiveCharacterTextSplitter

tokenizer = tiktoken.get_encoding("cl100k_base")

def token_length(text: str) -> int:
    return len(tokenizer.encode(text))

splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,       # 100 tokens per chunk
    chunk_overlap=15,     # 15-token overlap
    length_function=token_length,
)

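This is more accurate and the right approach when your embedding model has a token-based context limit. Count with the tokenizer your embedding model actually uses: cl100k_base matches OpenAI’s text-embedding-3 models, while MiniLM (512 tokens) uses its own WordPiece tokenizer, so tiktoken counts are only an approximation there.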
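Whichever length function you choose, it is cheap to verify the result before embedding. The sketch below uses a hypothetical helper, `find_oversized`, that reports every chunk whose token count exceeds the model limit. The `count_tokens` argument is any callable that maps text to a token count, such as the `token_length` function above; the whitespace counter in the usage lines is only a stand-in so the example runs without extra dependencies.

```python
from typing import Callable, List, Tuple


def find_oversized(
    chunks: List[str],
    count_tokens: Callable[[str], int],
    max_tokens: int,
) -> List[Tuple[int, int]]:
    """Return (chunk index, token count) for every chunk above max_tokens."""
    oversized = []
    for index, chunk in enumerate(chunks):
        tokens = count_tokens(chunk)
        if tokens > max_tokens:
            oversized.append((index, tokens))
    return oversized


def whitespace_count(text: str) -> int:
    # Stand-in counter for illustration; a real tokenizer gives different numbers.
    return len(text.split())


sample = ["one two three", "one two three four five six"]
print(find_oversized(sample, whitespace_count, max_tokens=4))
# [(1, 6)]
```

An empty result means every chunk fits; anything else points at chunks that would be rejected or truncated at embedding time.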

6. Integrating with ChromaDB

import chromadb
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Chunk the document (`document` is the policy text defined in section 4)
splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=40)
chunks = splitter.split_text(document.strip())

# Index in ChromaDB
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(name="recursive_chunks")

collection.upsert(
    documents=chunks,
    ids=[f"rc_{i}" for i in range(len(chunks))]
)

# Query
results = collection.query(
    query_texts=["What is the remote work policy?"],
    n_results=2
)

for doc, dist in zip(results["documents"][0], results["distances"][0]):
    print(f"[{dist:.4f}] {doc}")

7. Choosing chunk size and overlap

There’s no universally correct chunk size. It depends on:

Your embedding model’s context window — chunks must fit within it
Document type — dense technical text may need smaller chunks; narrative prose can handle larger
Query type — short factual queries benefit from small focused chunks; broad summary queries benefit from larger chunks

Practical starting points:

Document type	chunk_size	chunk_overlap
Short FAQ / policies	200–300	30–50
Articles / blog posts	400–600	50–80
Technical documentation	300–500	50–100
Research papers	500–800	100–150

Run a small evaluation: take 10–20 representative questions, retrieve with different settings, and compare which retrieves the correct chunk most often. Tune from there.
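That evaluation loop can be sketched as a hit-rate function. The names here (`hit_rate`, `eval_set`, `retrieve`) are illustrative assumptions: `eval_set` pairs each question with a snippet that the correct chunk must contain, and `retrieve` is any callable that returns the top chunks for a question.

```python
from typing import Callable, Iterable, List, Tuple


def hit_rate(
    eval_set: Iterable[Tuple[str, str]],
    retrieve: Callable[[str], List[str]],
) -> float:
    """Fraction of questions whose expected snippet appears in a retrieved chunk."""
    hits = 0
    total = 0
    for question, expected_snippet in eval_set:
        total += 1
        if any(expected_snippet in chunk for chunk in retrieve(question)):
            hits += 1
    return hits / total if total else 0.0


# With the ChromaDB collection from section 6, `retrieve` could be:
#   lambda q: collection.query(query_texts=[q], n_results=2)["documents"][0]
```

Rebuild the index once per (chunk_size, chunk_overlap) pair, run `hit_rate` against the same questions, and keep the setting with the highest score.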

What’s next

Part 16 covers sliding window chunking — a strategy that ignores natural boundaries entirely and instead slides a fixed-size window forward by a configurable stride, creating dense overlapping chunks that are especially useful for tasks like semantic search over narrative text.
