LLMs · AI Agents · RAG · 2026-05-28

RAG Chunking: Recursive Character Splitting

Recursive character splitting is among the most practical chunking strategies for real documents: it respects natural boundaries such as paragraphs and line breaks, falls back gracefully to finer splits, and uses overlap to preserve cross-boundary context.

This is Part 15 of the AI Agents series. Parts 13–14 covered fixed-size and sentence-based chunking. This post covers recursive character splitting, a common default in production RAG systems.


1. Two reasons chunking is non-negotiable

Embedding quality. A 10,000-word document covering politics, sports, and finance produces one embedding that tries to represent all three topics simultaneously. That vector is diluted — it doesn’t focus on anything well. Retrieval against it will be imprecise. Smaller, focused chunks produce sharper vectors.

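Context window limits. Every embedding model has a maximum input length (in tokens). What happens when you exceed it depends on the stack: hosted APIs such as OpenAI’s reject the request with an error, while libraries such as sentence-transformers silently truncate the input, so everything past the limit is never embedded.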

Model                            Max tokens
all-MiniLM-L6-v2                 512
Cohere embed v3                  512
MPNet                            512
text-embedding-3-small (OpenAI)  8,192

If your document has 50,000 tokens and your model handles 512, you have no choice but to chunk. The question is how.
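As a rough sizing sketch for the example above: with windows of `size` tokens and `overlap` tokens shared between neighbours, each chunk after the first advances by `size - overlap` tokens. The helper name `estimate_chunks` and the 50-token overlap are illustrative assumptions, and a splitter that respects natural boundaries will usually produce somewhat more chunks than this lower bound.

```python
import math


def estimate_chunks(total_tokens: int, size: int, overlap: int = 0) -> int:
    """Lower-bound chunk count for fixed windows of `size` tokens with `overlap`."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    if total_tokens <= size:
        return 1
    stride = size - overlap  # how far each new chunk advances
    return 1 + math.ceil((total_tokens - size) / stride)


print(estimate_chunks(50_000, 512))      # 98 chunks without overlap
print(estimate_chunks(50_000, 512, 50))  # 109 chunks with a 50-token overlap
```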


2. Recursive character splitting: the strategy

Recursive character splitting uses a priority-ordered list of separators. It tries each one in sequence and only falls back to the next if the resulting chunks are still too large.

Default priority order:

  1. "\n\n": paragraph breaks (highest priority)
  2. "\n": line breaks
  3. " ": word boundaries
  4. "": individual characters (last resort)

The algorithm works like this:

Given text T and target chunk_size:

1. Try splitting T on "\n\n"
   └─ If all pieces fit within chunk_size → done
   └─ If some pieces are still too large → recursively apply to those pieces using "\n"

2. Try splitting oversized pieces on "\n"
   └─ If all pieces fit → done
   └─ Still too large → recurse with " "

3. Try splitting on " "
   └─ Still too large → split on characters (guarantees termination)

4. Merge small adjacent pieces back together up to chunk_size

The key insight: it tries to keep semantically coherent units (paragraphs, then lines, then words) intact. It only forces a hard character split when there’s no other choice.
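The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not LangChain's implementation: the names `split_recursive` and `merge_small` are hypothetical, overlap is left out, and separators at cut points are simply dropped.

```python
from typing import List, Sequence

SEPARATORS = ("\n\n", "\n", " ", "")  # coarse to fine; "" means hard character cut


def merge_small(parts: List[str], chunk_size: int, sep: str) -> List[str]:
    """Greedily join adjacent parts with `sep` while the result fits chunk_size."""
    merged, current = [], ""
    for part in parts:
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            merged.append(current)
            current = part
    if current:
        merged.append(current)
    return merged


def split_recursive(text: str, chunk_size: int,
                    separators: Sequence[str] = SEPARATORS) -> List[str]:
    """Split text into chunks of at most chunk_size characters (no overlap)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    sep, finer = separators[0], separators[1:]
    if sep == "":
        # Last resort: cut every chunk_size characters, which always terminates.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks, small = [], []
    for part in text.split(sep):
        if not part:
            continue
        if len(part) <= chunk_size:
            small.append(part)
        else:
            # Flush the small pieces collected so far, then recurse on the
            # oversized piece with the next, finer separator.
            chunks.extend(merge_small(small, chunk_size, sep))
            small = []
            chunks.extend(split_recursive(part, chunk_size, finer))
    chunks.extend(merge_small(small, chunk_size, sep))
    return chunks


print(split_recursive("aaa bbb\n\nccc ddd eee fff\n\nggg", 8))
```

Note that the merge step is purely size-based: in this sketch two short neighbouring paragraphs are joined whenever they fit, regardless of topic.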


3. Chunk overlap

Splitting at any boundary — even a good one like a paragraph — can separate a reference from what it refers to. A paragraph might start with “This approach…” where “this” refers to something in the previous paragraph.

Overlap carries the tail of each chunk into the start of the next, preserving cross-boundary context.

Without overlap:
  chunk 1: "...Sun rises in the east."
  chunk 2: "It provides energy for all life on Earth."

With overlap (8 chars from end of chunk 1):
  chunk 1: "...Sun rises in the east."
  chunk 2: "he east. It provides energy for all life on Earth."

The overlap means that even if a sentence is cut, the following chunk has enough prior context to produce a coherent embedding.

How much overlap? Experiment with your data. A common starting point is 10–20% of chunk size. For chunk_size=500, try chunk_overlap=50 to 100.
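The Sun example can be reproduced with a few lines of Python. This is a sketch, not how any particular library does it: `add_overlap` and `overlap_range` are hypothetical helpers, and the single-space joiner is an assumption chosen to match the example.

```python
from typing import List, Tuple


def add_overlap(chunks: List[str], overlap: int, joiner: str = " ") -> List[str]:
    """Prefix every chunk except the first with the tail of the previous chunk."""
    if overlap <= 0 or not chunks:
        return list(chunks)
    out = [chunks[0]]
    for prev, cur in zip(chunks, chunks[1:]):
        out.append(prev[-overlap:] + joiner + cur)
    return out


def overlap_range(chunk_size: int) -> Tuple[int, int]:
    """Starting range for chunk_overlap: 10% to 20% of chunk_size."""
    return chunk_size // 10, chunk_size // 5


chunks = ["...Sun rises in the east.", "It provides energy for all life on Earth."]
for chunk in add_overlap(chunks, overlap=8):
    print(repr(chunk))

print(overlap_range(500))  # (50, 100)
```

Prefixing like this makes each chunk longer than the original by the overlap plus the joiner, so the size budget has to leave room for it.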


4. Implementation with LangChain

Writing this from scratch is possible but tedious: the merge step has to pack adjacent pieces up to chunk_size while tracking separators and overlap, and off-by-one mistakes are easy to make. LangChain’s RecursiveCharacterTextSplitter handles this bookkeeping for you.

pip install langchain-text-splitters

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,
    chunk_overlap=30,
    length_function=len,  # character count; use tiktoken for token count
)

document = """
Nerchuko was founded in 2024 by Ravi Kumar and Priya Singh.
The company operates in the AI education space.

Employees are entitled to 12 casual leaves and 6 sick leaves per year.
Leaves do not carry forward to the next calendar year.
Unused leaves are forfeited at year end.

Work hours are 9 AM to 6 PM Monday through Friday.
Flexible start times between 8 and 10 AM are permitted with manager approval.
Remote work is allowed Monday, Thursday, and Friday.
"""

chunks = splitter.split_text(document.strip())

for i, chunk in enumerate(chunks):
    print(f"[{i}] ({len(chunk)} chars)\n{chunk}\n")

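LangChain’s splitter will:

  • First try to split on \n\n (paragraph boundaries)
  • If a paragraph is still over 200 chars, split on \n within it
  • Merge adjacent small pieces back together only while the result stays within chunk_size. The merge is purely size-based: the splitter has no notion of topic, so two short neighboring paragraphs can end up in one chunk
  • Carry up to 30 characters of overlap between consecutive chunks. Overlap is built from whole pieces, so chunks that end exactly on a paragraph boundary may share no text

In this example every paragraph is under 200 characters and no two of them fit together, so the output is three chunks, one per paragraph.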

5. Using token count instead of character count

Character count is a rough proxy. What the embedding model actually cares about is token count. A single Chinese character is one character but potentially multiple tokens, and a long English word such as “tokenization” is one word but is usually split into two or more subword tokens.

For precise control, use a tokenizer as the length_function:

import tiktoken
from langchain_text_splitters import RecursiveCharacterTextSplitter

tokenizer = tiktoken.get_encoding("cl100k_base")

def token_length(text: str) -> int:
    return len(tokenizer.encode(text))

splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,       # 100 tokens per chunk
    chunk_overlap=15,     # 15-token overlap
    length_function=token_length,
)

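This is more accurate and the right approach when your embedding model has a token-based context limit. Count with the tokenizer your embedding model actually uses: cl100k_base matches OpenAI’s text-embedding-3 models, while MiniLM (512-token limit) uses its own WordPiece tokenizer, so tiktoken counts are only an approximation there.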
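One way to see why the choice of length_function matters, using only the standard library: the sketch below compares character count with UTF-8 byte count. The helper names are hypothetical. The byte count is offered as a conservative stand-in, on the assumption that your tokenizer is a byte-level BPE such as tiktoken's encodings, where each token covers at least one byte, so the token count cannot exceed the byte count.

```python
def char_length(text: str) -> int:
    """Length in characters, which is what length_function=len measures."""
    return len(text)


def utf8_length(text: str) -> int:
    """Length in UTF-8 bytes: an upper bound on tokens for byte-level BPE."""
    return len(text.encode("utf-8"))


for sample in ["chunking", "分块"]:
    print(sample, char_length(sample), utf8_length(sample))
```

Either function can be passed as length_function. utf8_length over-counts, so chunks come out smaller than necessary, but under that assumption they cannot exceed a token limit.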


6. Integrating with ChromaDB

import chromadb
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Chunk the document (`document` is the policy text defined in section 4)
splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=40)
chunks = splitter.split_text(document.strip())

# Index in ChromaDB
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(name="recursive_chunks")

collection.upsert(
    documents=chunks,
    ids=[f"rc_{i}" for i in range(len(chunks))]
)

# Query
results = collection.query(
    query_texts=["What is the remote work policy?"],
    n_results=2
)

for doc, dist in zip(results["documents"][0], results["distances"][0]):
    print(f"[{dist:.4f}] {doc}")

7. Choosing chunk size and overlap

There’s no universally correct chunk size. It depends on:

  • Your embedding model’s context window — chunks must fit within it
  • Document type — dense technical text may need smaller chunks; narrative prose can handle larger
  • Query type — short factual queries benefit from small focused chunks; broad summary queries benefit from larger chunks

Practical starting points:

Document type            chunk_size   chunk_overlap
Short FAQ / policies     200–300      30–50
Articles / blog posts    400–600      50–80
Technical documentation  300–500      50–100
Research papers          500–800      100–150

Run a small evaluation: take 10–20 representative questions, retrieve with different settings, and compare which retrieves the correct chunk most often. Tune from there.
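A small evaluation like this can be sketched as a hit-rate loop. Everything here is illustrative: `hit_rate` and `keyword_retrieve` are hypothetical helpers, the questions are made up for the policy document used earlier, and the keyword-overlap retriever is only a stand-in so the example runs without a vector store. In practice you would pass a function that queries your real index.

```python
from typing import Callable, List, Tuple

Retriever = Callable[[str, List[str], int], List[str]]


def hit_rate(questions: List[Tuple[str, str]], chunks: List[str],
             retrieve: Retriever, k: int = 2) -> float:
    """Fraction of questions whose expected text appears in the top-k chunks."""
    if not questions:
        raise ValueError("need at least one question")
    hits = 0
    for question, expected in questions:
        top = retrieve(question, chunks, k)
        if any(expected in chunk for chunk in top):
            hits += 1
    return hits / len(questions)


def keyword_retrieve(question: str, chunks: List[str], k: int) -> List[str]:
    """Toy stand-in for vector search: rank chunks by shared lowercase words."""
    q_words = set(question.lower().split())
    ranked = sorted(chunks,
                    key=lambda c: len(q_words & set(c.lower().split())),
                    reverse=True)
    return ranked[:k]


chunks = [
    "Employees are entitled to 12 casual leaves and 6 sick leaves per year.",
    "Work hours are 9 AM to 6 PM Monday through Friday.",
    "Remote work is allowed Monday, Thursday, and Friday.",
]
questions = [
    ("How many sick leaves per year?", "6 sick leaves"),
    ("Which days is remote work allowed?", "Remote work is allowed"),
]
print(hit_rate(questions, chunks, keyword_retrieve, k=1))
```

To compare settings, re-chunk the document with each chunk_size and chunk_overlap pair, rebuild `chunks`, and keep the configuration with the highest hit rate on the same question set.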


What’s next

Part 16 covers sliding window chunking — a strategy that ignores natural boundaries entirely and instead slides a fixed-size window forward by a configurable stride, creating dense overlapping chunks that are especially useful for tasks like semantic search over narrative text.
