RAG Chunking: Semantic-Based Splitting

This is Part 17 of the AI Agents series. Parts 13–16 covered fixed-size, sentence-based, recursive character, and sliding window chunking. All four split text based on size. This post covers semantic chunking — splitting based on meaning.

1. The problem with size-based chunking

Every strategy we’ve covered so far shares the same fundamental limitation: it treats text as a sequence of characters or words, not as a sequence of ideas.

Consider these two sentences:

“Apple released iPhone 15.”
“I bought an apple at the vegetable market.”

A size-based chunker that has space for both will group them into the same chunk. They fit. But one is about a technology company and the other is about fruit. Embedding them together produces a vector that represents neither concept accurately — you’ve mixed two unrelated topics into one unit.

Semantic chunking looks at what sentences mean before deciding whether they belong together.

2. How semantic chunking works

The algorithm has five steps:

Split the full text into individual sentences
Embed each sentence separately
Compare consecutive sentence embeddings using similarity scores
Evaluate whether the score exceeds a threshold
Merge similar sentences into one chunk, or start a new chunk when similarity drops

The key decision happens in steps 3 and 4: when the similarity between sentence N and sentence N+1 drops below your threshold, that’s a topic boundary. Start a new chunk.

[S1] Enable two-factor authentication to protect your account.   ─┐
[S2] Choose strong passwords.                                      ├─ similarity ≥ threshold → same chunk (Security)
[S3] Never share your login credentials.                          ─┘

[S4] Items can be returned within 30 days.   ─┐
[S5] Refunds take 5 to 7 business days.      ─┘ → same chunk (Returns & Refunds)

The two chunks are not the same size — that’s intentional. Chunk size is a byproduct of meaning, not a constraint.
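To make the decision rule concrete, here is a minimal sketch that applies it to similarity scores that are already computed. The helper name `split_on_boundaries` and the scores are assumptions made up for illustration; they are not output from any model.

```python
def split_on_boundaries(sentences, similarities, threshold):
    """Group consecutive sentences; similarities[i] compares sentence i and i+1."""
    if not sentences:
        return []
    if len(similarities) != len(sentences) - 1:
        raise ValueError("need exactly one similarity per adjacent pair")

    chunks = [[sentences[0]]]
    for sentence, sim in zip(sentences[1:], similarities):
        if sim >= threshold:
            chunks[-1].append(sentence)   # same topic: extend the current chunk
        else:
            chunks.append([sentence])     # similarity dropped: start a new chunk
    return chunks


sentences = [
    "Enable two-factor authentication to protect your account.",
    "Choose strong passwords.",
    "Never share your login credentials.",
    "Items can be returned within 30 days.",
    "Refunds take 5 to 7 business days.",
]
# Invented scores for illustration only, not model output.
similarities = [0.82, 0.79, 0.21, 0.77]

chunks = split_on_boundaries(sentences, similarities, threshold=0.75)
for chunk in chunks:
    print(len(chunk), "sentences:", " ".join(chunk))
```

The only boundary falls where the score drops to 0.21, so the three security sentences and the two returns sentences end up in separate chunks of different sizes.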

3. Similarity score

The similarity between two sentence embeddings is calculated using the dot product of their normalized vectors, which is equivalent to cosine similarity:

$$\text{similarity}(A, B) = \frac{A \cdot B}{|A| |B|}$$

Score 1.0: same direction, highly similar meaning
Score 0.0: orthogonal, unrelated topics
Score -1.0: opposite direction (rare for sentence embeddings in practice, and not a reliable signal of opposite meaning)
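A small worked example may help here. The sketch below uses hand-picked 2-D vectors (an assumption; real sentence embeddings have hundreds of dimensions) to show the three reference scores, and to check that the dot product of normalized vectors matches the cosine formula.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


print(cosine([1.0, 0.0], [3.0, 0.0]))    # same direction
print(cosine([1.0, 0.0], [0.0, 2.0]))    # orthogonal
print(cosine([1.0, 0.0], [-2.0, 0.0]))   # opposite direction

# After normalization, a plain dot product gives the same number as cosine.
a, b = [3.0, 4.0], [4.0, 3.0]
dot_of_normalized = sum(x * y for x, y in zip(normalize(a), normalize(b)))
print(dot_of_normalized, cosine(a, b))
```

Vector length does not affect the score: `[1.0, 0.0]` and `[3.0, 0.0]` point the same way, so they score as identical.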

The threshold is a cutoff you define. If similarity(S_i, S_{i+1}) >= threshold, merge into the current chunk. If it falls below, close the current chunk and start a new one.

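Typical starting threshold: 0.7–0.85. Higher thresholds produce more, smaller chunks (stricter grouping). Lower thresholds merge more aggressively. The useful range depends on the embedding model, because different models spread their similarity scores differently.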

4. Python implementation from scratch

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def semantic_chunks(sentences: list[str], threshold: float = 0.75) -> list[str]:
    if not sentences:
        return []

    embeddings = model.encode(sentences)

    chunks = []
    current_chunk = [sentences[0]]

    for i in range(1, len(sentences)):
        sim = cosine_similarity(embeddings[i - 1], embeddings[i])
        if sim >= threshold:
            # Same topic — continue current chunk
            current_chunk.append(sentences[i])
        else:
            # Topic boundary — close current chunk, start new one
            chunks.append(" ".join(current_chunk))
            current_chunk = [sentences[i]]

    # Don't forget the last chunk
    chunks.append(" ".join(current_chunk))
    return chunks


# Test
sentences = [
    "Python is a high-level programming language.",
    "It is widely used in data science and machine learning.",
    "Python has a simple and readable syntax.",
    "Football is the most popular sport in the world.",
    "The FIFA World Cup is held every four years.",
    "Brazil has won the World Cup five times.",
]

chunks = semantic_chunks(sentences, threshold=0.75)
for i, chunk in enumerate(chunks):
    print(f"\n[Chunk {i}]\n{chunk}")

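Intended output (the exact grouping depends on the scores your model produces; if every sentence lands in its own chunk, the threshold is too high for that model):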

[Chunk 0]
Python is a high-level programming language. It is widely used in data science and machine learning. Python has a simple and readable syntax.

[Chunk 1]
Football is the most popular sport in the world. The FIFA World Cup is held every four years. Brazil has won the World Cup five times.

Python sentences group together. Football sentences group together. No size limit was specified — the chunks are as long as the topic runs.
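If your own run groups the sentences differently, looking at the raw scores is usually the quickest way to see why. The sketch below is one way to do that; `consecutive_similarities` is a hypothetical helper, and the toy 2-D vectors stand in for real embeddings.

```python
import math


def consecutive_similarities(embeddings):
    """Return cosine similarity for each adjacent pair of vectors."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.hypot(*a) * math.hypot(*b))

    return [cosine(embeddings[i], embeddings[i + 1])
            for i in range(len(embeddings) - 1)]


# Toy 2-D vectors standing in for real sentence embeddings (assumption).
toy_embeddings = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]

for i, sim in enumerate(consecutive_similarities(toy_embeddings)):
    print(f"sentence {i} -> {i + 1}: {sim:.2f}")

# With the real model from the post, the call would look like:
#   scores = consecutive_similarities(model.encode(sentences).tolist())
```

Printing the scores for your own sentences shows where the natural drops are, which gives you a starting point for the threshold.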

5. Consecutive comparison vs all-pairs comparison

The implementation above compares each sentence only to its immediate neighbor. This is the recommended approach for most documents:

Preserves the natural reading order
Computationally cheap — O(n) comparisons for n sentences
Reduces noise from spurious long-range similarities

The alternative — comparing every sentence against every other sentence — is O(n²) and can produce non-sequential chunks (sentence 1 grouped with sentence 47 because they happen to be similar). For documents read linearly, this is usually wrong.

Only fall back to all-pairs if your documents have no natural reading order (e.g. a bag of product reviews).
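To show how position-independent grouping produces non-sequential chunks, here is a simplified greedy stand-in. It is not a full all-pairs method, which would compute the whole similarity matrix. The function name `all_pairs_groups` and the toy vectors are assumptions for illustration.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def all_pairs_groups(vectors, threshold):
    """Greedy grouping: each vector joins the first group whose first member
    is similar enough, regardless of position in the document."""
    groups = []  # each group is a list of sentence indices
    for i, vec in enumerate(vectors):
        for group in groups:
            if cosine(vectors[group[0]], vec) >= threshold:
                group.append(i)
                break
        else:
            groups.append([i])  # no similar group found: start a new one
    return groups


# Toy vectors (assumption): topics alternate A, B, A, B.
vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]

print(all_pairs_groups(vectors, threshold=0.8))

n = len(vectors)
print("consecutive comparisons:", n - 1)
print("all-pairs comparisons:  ", n * (n - 1) // 2)
```

Here sentences 0 and 2 share a group even though sentence 1 sits between them. Consecutive comparison on the same vectors would instead start a new chunk at every step, keeping reading order intact.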

6. The threshold tradeoff

for threshold in [0.6, 0.75, 0.85, 0.95]:
    chunks = semantic_chunks(sentences, threshold=threshold)
    print(f"threshold={threshold}: {len(chunks)} chunks")

Low threshold (0.6): more sentences merged → fewer, larger chunks → may mix related but distinct topics
High threshold (0.9+): almost every sentence is its own chunk → very granular, but loses grouping benefit

There’s no universally correct value. Test with your actual documents and 10–20 representative queries to find where retrieval quality peaks.
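The loop above re-embeds the sentences on every iteration, because `semantic_chunks` calls `model.encode` each time. The similarity scores do not depend on the threshold, so one option is to compute them once and sweep thresholds over the stored scores. A minimal sketch, with a hypothetical helper `chunk_count` and invented scores:

```python
def chunk_count(similarities, threshold):
    """Chunks produced for one document: one, plus one per score below the threshold."""
    return 1 + sum(1 for s in similarities if s < threshold)


# Invented consecutive similarities for a six-sentence document (assumption).
similarities = [0.81, 0.78, 0.22, 0.88, 0.64]

for threshold in [0.6, 0.75, 0.85, 0.95]:
    print(f"threshold={threshold}: {chunk_count(similarities, threshold)} chunks")
```

With these scores the chunk count rises from 2 to 6 as the threshold goes from 0.6 to 0.95, which is the tradeoff described above.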

7. Embedding model quality matters more here than anywhere else

Semantic chunking depends entirely on the embedding model producing accurate similarity scores. A weak model will fail to group sentences that share a topic if they use different vocabulary.

Example of a failure with a small model:

“The Eiffel Tower is a landmark in Paris.” → embedding A
“It was built in 1889.” → embedding B

A small model may not connect “It” back to “Eiffel Tower” — the similarity drops below threshold and they end up in separate chunks, even though they’re clearly related.

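The fix: use a larger, more capable embedding model. Models trained on diverse corpora (often with 768 or more dimensions) tend to score topically related sentences more consistently than smaller models. Keep in mind that each sentence is still embedded on its own, so a pronoun like “It” carries no explicit link to its antecedent; a stronger model reduces this problem but cannot fully remove it.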

# Stronger model — better semantic grouping
model = SentenceTransformer("BAAI/bge-base-en-v1.5")  # 768-dim, significantly better

# Or for highest quality (at higher compute cost).
# Note: the model card may ask for trust_remote_code=True when loading; check it before use.
model = SentenceTransformer("Alibaba-NLP/gte-Qwen2-1.5B-instruct")

If semantic chunking is producing poor groups, the threshold is rarely the problem — the embedding model usually is.
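One possible mitigation for the Eiffel Tower case, which is not part of the pipeline in this post, is to embed each sentence together with its neighbors so that a pronoun sits next to its antecedent. The helper `with_context` and its `buffer` parameter are hypothetical names for this sketch.

```python
def with_context(sentences, buffer=1):
    """For each sentence, build the text to embed: the sentence plus up to
    `buffer` neighbors on each side."""
    windows = []
    for i in range(len(sentences)):
        start = max(0, i - buffer)
        end = min(len(sentences), i + buffer + 1)
        windows.append(" ".join(sentences[start:end]))
    return windows


sentences = [
    "The Eiffel Tower is a landmark in Paris.",
    "It was built in 1889.",
    "Refunds take 5 to 7 business days.",
]

for window in with_context(sentences, buffer=1):
    print(window)

# The windows, not the bare sentences, would then be embedded, e.g.:
#   embeddings = model.encode(with_context(sentences))
```

The tradeoff is that windows overlap, so topic boundaries become softer: the window for the second sentence also contains the unrelated refund sentence. Treat the buffer size as one more setting to test against your own documents.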

Note on model versions: Model names above are accurate as of May 2026. Embedding model quality improves rapidly. Check the MTEB Leaderboard for the current best models — filter by the task type closest to semantic similarity for the most relevant ranking.

8. Integrating with ChromaDB

import chromadb
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction
from sentence_transformers import SentenceTransformer

MODEL_NAME = "BAAI/bge-base-en-v1.5"

# semantic_chunks() from section 4 reads the module-level `model`
model = SentenceTransformer(MODEL_NAME)

client = chromadb.PersistentClient(path="./chroma_db")

# Index with the same model used for chunking.
# Create the collection once, with the embedding function attached.
ef = SentenceTransformerEmbeddingFunction(model_name=MODEL_NAME)
collection = client.get_or_create_collection(name="semantic_chunks", embedding_function=ef)

# Your document sentences (from a sentence splitter or pre-split list)
sentences = [...]  # use sentence_chunks() from Part 14 to split a document

chunks = semantic_chunks(sentences, threshold=0.75)

collection.upsert(
    documents=chunks,
    ids=[f"sem_{i}" for i in range(len(chunks))],
)
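One thing to watch with the ids above: `sem_{i}` restarts at zero for every document, so upserting a second document would overwrite the chunks of the first. A minimal sketch of one way around that, assuming you have some per-document identifier; `build_records` and the `doc_id` values are hypothetical.

```python
def build_records(chunks, doc_id):
    """Return ids, documents, and metadatas lists aligned by position."""
    ids = [f"{doc_id}_sem_{i}" for i in range(len(chunks))]
    metadatas = [
        {"doc_id": doc_id, "chunk_index": i, "char_count": len(chunk)}
        for i, chunk in enumerate(chunks)
    ]
    return ids, list(chunks), metadatas


ids, documents, metadatas = build_records(
    ["Enable two-factor authentication. Choose strong passwords.",
     "Items can be returned within 30 days."],
    doc_id="help_center",
)
print(ids)
print(metadatas[0])

# With the collection from the post:
#   collection.upsert(ids=ids, documents=documents, metadatas=metadatas)
```

Storing the chunk index and source document as metadata also makes it easier to trace a retrieved chunk back to where it came from.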

9. All five chunking strategies: when to use what

Strategy	Split by	Preserves context	Best for
Fixed-size	Character count	Poor	Structured/log data
Sentence-based	Sentence boundaries	Good	Prose documents
Recursive character	Paragraph → sentence → word	Very good	Mixed-format documents
Sliding window	Fixed window + stride	Good	Continuous narrative
Semantic	Meaning similarity	Best	Any document where topic grouping matters

Semantic chunking is the most accurate but also the most compute-intensive — you embed every sentence before indexing begins. For large document sets, that cost adds up. For smaller corpora where retrieval quality is critical, it’s the right choice.

What’s next

Part 18 covers Advanced RAG — the techniques that take a basic pipeline to production scale: query expansion, hybrid search, re-ranking, metadata filtering, multi-stage retrieval, and feedback loops.
