RAG Chunking: Sentence-Based Splitting
Fixed-size chunking breaks sentences mid-word. Sentence-based chunking fixes that by treating each complete sentence as its own chunk — better context, better vectors, better retrieval.
This is Part 14 of the AI Agents series. Part 13 covered fixed-size chunking and why it produces bad vectors. This post covers the next step up: sentence-based chunking, which preserves semantic completeness in every chunk.
1. The core idea
In sentence-based chunking, each sentence becomes one chunk. Instead of splitting every N characters, you split at sentence boundaries — full stops, question marks, exclamation marks.
Input:
The sun is a star. Earth orbits the sun. The moon orbits earth. Solar energy powers our planet.
Output chunks:
[0] "The sun is a star."
[1] "Earth orbits the sun."
[2] "The moon orbits earth."
[3] "Solar energy powers our planet."
Every chunk carries a complete, self-contained thought. The embedding model can represent each one accurately. The result is sharper vectors and more precise retrieval.
2. Why this is better than fixed-size chunking
The embedding model needs complete semantic units to produce meaningful vectors. A sentence is the natural unit of meaning in written language — it has a subject, a predicate, and a complete thought.
Fixed-size chunking cuts across that boundary arbitrarily. Sentence-based chunking respects it.
| Property | Fixed-size | Sentence-based |
|---|---|---|
| Context preserved | No | Yes |
| Grammar intact | No | Yes |
| Vector quality | Low | High |
| Implementation complexity | Trivial | Low–Medium |
| Handles abbreviations | N/A | Needs care |
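To make the first rows of the table concrete, here is a minimal sketch that runs both strategies on a shortened version of the sample text from section 1. The function names and the chunk size of 20 are illustrative assumptions, and the sentence splitter is deliberately simplistic: it only works on clean text like this sample.

```python
import re

text = "The sun is a star. Earth orbits the sun. The moon orbits earth."

def fixed_size_chunks(text: str, size: int) -> list[str]:
    # Cut every `size` characters, ignoring word and sentence boundaries.
    return [text[i:i + size] for i in range(0, len(text), size)]

def simple_sentence_chunks(text: str) -> list[str]:
    # Split on whitespace that follows ., ! or ? (adequate for this clean sample only).
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

print("Fixed-size (20 characters):")
for chunk in fixed_size_chunks(text, 20):
    print(repr(chunk))

print("Sentence-based:")
for chunk in simple_sentence_chunks(text):
    print(repr(chunk))
```

With a size of 20, the first fixed-size chunk ends in the middle of "Earth", while the sentence-based chunks line up with the three sentences.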
3. The naive approach and why it breaks
The obvious implementation — split on ., ?, ! — fails immediately on real text:
text = "Dr. Smith earned $9.5M from U.S.A. operations. He works at Stanford."
chunks = text.split(".")
# ['Dr', ' Smith earned $9', '5M from U', 'S', 'A', ' operations', ' He works at Stanford', '']
Three problems:
- Abbreviations: Dr., Mr., Mrs., Prof. contain dots that don’t end sentences
- Acronyms: U.S.A. splits into individual letters
- Decimal numbers: 9.5 splits at the decimal point
All of these are dots that should not trigger a sentence split.
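A tempting refinement is to split only where the punctuation is followed by whitespace. The sketch below tries that on the same sample text. It fixes the decimal and the inner dots of the acronym, but it still breaks after "Dr." and after the final dot of "U.S.A.", which is why some form of protection step is needed.

```python
import re

text = "Dr. Smith earned $9.5M from U.S.A. operations. He works at Stanford."

# Split only where ., ! or ? is followed by whitespace.
chunks = re.split(r"(?<=[.!?])\s+", text)
print(chunks)
# ['Dr.', 'Smith earned $9.5M from U.S.A.', 'operations.', 'He works at Stanford.']
```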
4. A smart sentence chunker
import re

def sentence_chunks(text: str) -> list[str]:
    # Protect abbreviations: Dr. Mr. Mrs. Ms. Prof.
    protected = re.sub(r'\b(Dr|Mr|Mrs|Ms|Prof)\.\s', r'\1<PERIOD> ', text)
    # Protect acronyms: sequences of single uppercase letters separated by dots (U.S.A.)
    protected = re.sub(r'\b([A-Z]\.){2,}', lambda m: m.group().replace('.', '<PERIOD>'), protected)
    # Protect decimal numbers: digits.digits
    protected = re.sub(r'(\d)\.(\d)', r'\1<PERIOD>\2', protected)
    # Split on whitespace that follows sentence-ending punctuation
    raw_chunks = re.split(r'(?<=[.!?])\s+', protected)
    # Restore protected periods and drop empty chunks
    chunks = [
        chunk.replace('<PERIOD>', '.').strip()
        for chunk in raw_chunks
        if chunk.strip()
    ]
    return chunks
# Test
text = (
    "Dr. Smith earned $9.5M from U.S.A. operations. "
    "He joined the company in 2019! "
    "Did Prof. Jones approve the budget? "
    "The growth rate was 12.3% last quarter."
)
chunks = sentence_chunks(text)
for i, chunk in enumerate(chunks):
    print(f"[{i}] {chunk}")
Output:
[0] Dr. Smith earned $9.5M from U.S.A. operations.
[1] He joined the company in 2019!
[2] Did Prof. Jones approve the budget?
[3] The growth rate was 12.3% last quarter.
The abbreviations, the acronym, and the decimals are all preserved. Each split produces a complete sentence.
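Real documents tend to use abbreviations beyond a handful of titles, for example "Inc." or "vs.". One possible variation, sketched below, takes the abbreviation list as a parameter instead of hard-coding it. The function name sentence_chunks_with and the sample sentence are assumptions made for illustration; the protect, split, and restore steps follow the same approach as the chunker above.

```python
import re

PLACEHOLDER = "<PERIOD>"  # same marker the chunker above uses

def sentence_chunks_with(text: str, abbreviations: list[str]) -> list[str]:
    """Hypothetical variant of sentence_chunks with a caller-supplied abbreviation list."""
    protected = text
    if abbreviations:
        # Longest first so "Mrs" is tried before "Mr"; re.escape keeps entries literal.
        alternation = "|".join(
            re.escape(a) for a in sorted(abbreviations, key=len, reverse=True)
        )
        # Protect the dot of a listed abbreviation when whitespace follows it.
        protected = re.sub(rf"\b({alternation})\.(?=\s)", rf"\1{PLACEHOLDER}", protected)
    # Protect acronyms such as U.S.A.
    protected = re.sub(
        r"\b(?:[A-Z]\.){2,}",
        lambda m: m.group().replace(".", PLACEHOLDER),
        protected,
    )
    # Protect decimal numbers such as 4.5
    protected = re.sub(r"(\d)\.(\d)", rf"\1{PLACEHOLDER}\2", protected)
    # Split on whitespace that follows sentence-ending punctuation, then restore dots.
    parts = re.split(r"(?<=[.!?])\s+", protected)
    return [p.replace(PLACEHOLDER, ".").strip() for p in parts if p.strip()]

sample = "Acme Inc. hired Dr. Lee in 2021. Revenue grew 4.5% vs. 2020."
for chunk in sentence_chunks_with(sample, ["Dr", "Inc", "vs"]):
    print(chunk)
```

Keeping the list outside the function makes it easier to tune per corpus, since legal, medical, and financial text each bring their own abbreviations.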
5. Integrating with ChromaDB
Once you have clean sentence chunks, indexing them is identical to Part 12:
import chromadb
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(name="sentence_chunks")
document = """
Dr. Smith leads the AI research team at Nerchuko. The company was founded in 2024.
Nerchuko offers 12 casual leaves and 6 sick leaves per year. Work hours are 9 AM to 6 PM.
Employees in the U.S.A. office follow Eastern Time. The average team size is 8.5 members per pod.
"""
chunks = sentence_chunks(document.strip())
collection.upsert(
    documents=chunks,
    ids=[f"s_{i}" for i in range(len(chunks))]
)
# Query
results = collection.query(
    query_texts=["How many sick leaves do employees get?"],
    n_results=2
)
for doc, dist in zip(results["documents"][0], results["distances"][0]):
    print(f"[{dist:.4f}] {doc}")
Each sentence is now an independent, semantically complete vector in the database. For the query above, the closest match should be the sentence that states the leave policy, rather than a fragment cut off mid-sentence.
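Because every sentence is stored on its own, it can help to record where each one came from. The sketch below builds ids, documents, and per-chunk metadata in one step. The helper name build_records, the source label "handbook", and the metadata keys are assumptions for illustration; the resulting dictionary matches the ids, documents, and metadatas keyword arguments that ChromaDB's upsert accepts.

```python
def build_records(chunks: list[str], source: str) -> dict[str, list]:
    """Package sentence chunks with ids and position metadata (hypothetical helper)."""
    return {
        "ids": [f"{source}_s_{i}" for i in range(len(chunks))],
        "documents": list(chunks),
        # Position lets you look up neighbouring sentences after retrieval.
        "metadatas": [{"source": source, "position": i} for i in range(len(chunks))],
    }

records = build_records(
    [
        "Work hours are 9 AM to 6 PM.",
        "Employees in the U.S.A. office follow Eastern Time.",
    ],
    source="handbook",
)
print(records["ids"])
print(records["metadatas"])

# With the collection created earlier, the records could then be stored with:
# collection.upsert(**records)
```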
6. Limitations of sentence-based chunking
Sentence-based chunking is a significant improvement over fixed-size, but it has its own edge cases:
- Very short sentences lack context. "He did it." as a standalone chunk produces a poor vector: who is “he” and what did he do? Adjacent context matters.
- Very long sentences can carry multiple distinct facts, diluting the embedding. A 200-word sentence is compressed into the same fixed-size vector as a 10-word one, so granularity is lost.
- Lists and bullet points often don’t end in sentence punctuation, so a punctuation-based splitter won’t separate the items.
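One simple way to soften the short-sentence problem is to attach any very short chunk to the sentence before it. The sketch below shows that idea; the function name merge_short_chunks, the 25-character threshold, and the sample sentences are illustrative assumptions rather than tuned values.

```python
def merge_short_chunks(chunks: list[str], min_chars: int = 25) -> list[str]:
    """Append any chunk shorter than min_chars to the chunk before it."""
    merged: list[str] = []
    for chunk in chunks:
        if merged and len(chunk) < min_chars:
            # Too short to stand alone: join it to the previous chunk.
            merged[-1] = f"{merged[-1]} {chunk}"
        else:
            # Long enough, or there is no previous chunk to attach to.
            merged.append(chunk)
    return merged

sentences = [
    "Dr. Smith fixed the outage in under an hour.",
    "He did it.",
    "The team documented the root cause afterwards.",
]
for chunk in merge_short_chunks(sentences):
    print(chunk)
```

Note that a short chunk at the very start of the list has nothing before it, so this sketch leaves it unchanged.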
For documents with consistent prose — articles, handbooks, reports — sentence chunking works well out of the box. For mixed-format documents, you may need to combine strategies.
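As a rough illustration of combining strategies, the sketch below treats each list item as its own chunk and applies sentence splitting only to prose lines. It assumes one paragraph per line (hard-wrapped prose would need to be rejoined first) and uses a bare punctuation split for brevity; in practice the prose branch could call the sentence_chunks function from section 4 instead. The name mixed_format_chunks and the bullet pattern are assumptions.

```python
import re

# Matches a leading bullet marker (-, *, •) or a numbered marker such as "1." or "2)".
BULLET = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+")

def mixed_format_chunks(text: str) -> list[str]:
    chunks: list[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if BULLET.match(line):
            # Keep each list item whole, minus its marker.
            chunks.append(BULLET.sub("", line))
        else:
            # Prose line: split on whitespace that follows ., ! or ?
            chunks.extend(s for s in re.split(r"(?<=[.!?])\s+", line) if s)
    return chunks

sample = (
    "Leave policy. Two kinds of leave exist.\n"
    "- 12 casual leaves\n"
    "- 6 sick leaves\n"
    "1. Apply online"
)
for chunk in mixed_format_chunks(sample):
    print(chunk)
```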
What’s next
Part 15 covers recursive split chunking — a smarter strategy that tries to split on paragraph boundaries first, then sentences, then words, only falling back to character-level splits as a last resort. This handles mixed-format documents better than either fixed-size or sentence-based chunking alone.