Architecting Semantic Chunking Pipelines for High-Performance RAG
Master critical chunking strategies for RAG to enhance retrieval accuracy and context retention.
By Kuriko IWAI

Table of Contents
Introduction
What is Chunking

Introduction
While Retrieval-Augmented Generation (RAG) has become the gold standard for grounding AI in private data, the quality of its output is only as good as the information it retrieves.
To ensure the model receives the most relevant context, one must move beyond simple data ingestion and focus on carefully crafted retrieval pipelines.
At the heart of this process lies chunking, a critical technique for breaking down large datasets into digestible, semantically meaningful segments (chunks).
In this article, I'll explore the underlying mechanisms of chunking and evaluate the major strategies used to optimize performance in modern AI applications.
Building production-grade AI systems?
I help teams design and deploy scalable RAG pipelines, LLM systems, and MLOps infrastructure.
Or explore:
- Dive deeper 👉 Research Archive
- Learn by building 👉 AI Engineering Masterclass
- Try it live 👉 Playground
What is Chunking
Chunking is the process of breaking down large bodies of text into smaller, manageable pieces (chunks) before they are converted into mathematical representations called embeddings.
The diagram below illustrates the data ingestion pipeline and the key role chunking plays in it:

Figure A. Technical diagram of the RAG data ingestion pipeline illustrating the flow from raw data to chunking, embedding, and high-dimensional vector storage (Created by Kuriko IWAI)
The process begins with raw, unstructured data in various formats like text (PDFs, docs), images, audio, and video.
Chunking happens in the second stage of the pipeline (second box, Figure A), right after the raw data is loaded, where the data is broken down into smaller pieces called chunks.
By splitting a long document into smaller paragraphs or segments, the system can later retrieve only the specific part that answers a user's question, rather than the entire file.
Then each chunk is passed through an embedding model (an AI algorithm) that converts the content into a vector embedding, a long list of numbers (coordinates) that represent the semantic meaning of the chunk.
Lastly, these embeddings are stored in a vector store (database).
The vector space in Figure A has three dimensions for demonstration purposes, but in practice the space is high-dimensional.
Related data points are stored close together in this space.
For example, if we have a chunk about "Golden Retrievers" and another about "Labradors," they will be mathematically near each other in the database, allowing the LLM to find relevant information almost instantly.
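To make the geometry concrete, here is a minimal sketch using toy, hand-made 3-D vectors in place of real embeddings (real embedding models produce hundreds of dimensions, and the values here are invented purely for illustration):

```python
import numpy as np

def cosine_similarity(a, b):
    # cosine of the angle between two vectors: 1.0 = identical direction
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# toy 3-D "embeddings" for three chunks
golden_retriever = np.array([0.9, 0.8, 0.1])
labrador = np.array([0.85, 0.75, 0.15])
inverter = np.array([0.1, 0.2, 0.9])

# the two dog-related chunks sit far closer together than the unrelated one
print(cosine_similarity(golden_retriever, labrador))   # close to 1.0
print(cosine_similarity(golden_retriever, inverter))   # much lower
```

A vector store performs this comparison at scale, returning the stored chunks whose vectors lie nearest to the query's vector.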
◼ Why Chunking Matters
Chunking matters for the following reasons:
Relevance: Small chunks ensure that RAG can retrieve the exact piece from the large document.
Cost efficiency: Processing smaller, targeted snippets saves on tokens and computation time.
Context retention: Well-chunked data retains enough surrounding information for the LLM to understand the why and how, not just the what.
Overall, chunking helps the system retrieve the context most relevant to the user query while saving input tokens and keeping the retrieved context within the LLM's context window.
Comparative Analysis: 5 Industry-Standard Chunking Strategies
Selecting a chunking strategy is not a one-size-fits-all decision.
This section explores five major chunking strategies:
Fixed-size chunking.
Recursive character chunking.
Document-specific chunking.
Semantic chunking.
Parent–Child (Hierarchical) chunking.
To understand how these methods diverge in practice, we will apply each to a sample text regarding Solar Energy Infrastructure.
▫ Sample Text
I'll use the following sample text to see how each chunking strategy splits it:
text = "Solar panels, or photovoltaic cells, convert sunlight into electricity. This process happens at the atomic level. Some materials exhibit a property known as the photoelectric effect. This causes them to absorb photons and release electrons. Beyond the cells, an inverter is required to convert DC to AC. Large-scale solar farms also require battery storage systems to manage peak load during non-sunny hours."
◼ Fixed-Size Chunking
Fixed-size chunking is the most straightforward approach: define a specific number of characters or tokens per chunk.
The below diagram illustrates how it works:

Figure B. Visualization of fixed-size chunking showing the sliding window mechanism with defined chunk size and yellow-highlighted overlap section (Created by Kuriko IWAI)
For example, the chunk size (green cells, Figure B) and chunk overlap (also called a sliding window; yellow cells, Figure B) are set to 500 and 50 tokens respectively.
The overlap ensures that context isn't lost if a key sentence is split in half.
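The sliding-window mechanism described above can be sketched in a few lines of plain Python (this is a simplified illustration that counts characters rather than tokens; the function name is my own):

```python
def fixed_size_chunks(text, chunk_size=500, overlap=50):
    """Slide a window of chunk_size characters, advancing by chunk_size - overlap."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# a 1,200-character sample yields windows starting at 0, 450, and 900
sample = "".join(str(i % 10) for i in range(1200))
chunks = fixed_size_chunks(sample, chunk_size=500, overlap=50)
# the last 50 characters of each chunk reappear at the start of the next
```

Because each window steps forward by only `chunk_size - overlap` characters, a sentence cut off at the end of one chunk is repeated at the start of the next.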
▫ Pros
Computationally affordable and easy to implement.
▫ Best For
Quick prototyping.
Handling simple text.
General use cases where speed is prioritized over granular semantic accuracy.
▫ Practical Implementation
The CharacterTextSplitter class from the langchain_text_splitters library can split the sample text:
from langchain_text_splitters import CharacterTextSplitter

# fixed-size chunking
fixed_splitter = CharacterTextSplitter(
    separator="",
    chunk_size=100,
    chunk_overlap=20
)

fixed_chunks = fixed_splitter.split_text(text)
The configuration sets each chunk to 100 characters with a 20-character overlap.
▫ Resulting Chunks:
Each chunk contains 100 characters:
Chunk 1: "Solar panels, or photovoltaic cells, convert sunlight into electricity. This process happens at the a"
Chunk 2: "ns at the atomic level. Some materials exhibit a property known as the photoelectric effect. This ca"
In this method, a word like "atomic" is sliced.
Although it is computationally the fastest, this method can create semantic noise: the LLM receives partial words, which can degrade the quality of the generated response.
◼ Recursive Character Chunking
Recursive character splitting uses a hierarchy of separators to find a natural breaking point, instead of cutting text at a hard character limit.
The below diagram illustrates how the method works:

Figure C. Diagram of recursive character splitting logic demonstrating the hierarchical priority of separators like newlines and periods to preserve sentence integrity (Created by Kuriko IWAI)
The method first attempts to split by the most significant separator (a period), then moves down the list to commas and spaces, until the chunk size requirement (pink box, Figure C) is met.
▫ Pros
- Keeps related ideas together better than fixed-size splitting.
▫ Best For
Maintaining the integrity of paragraphs and sentences.
Articles and blogs.
▫ Practical Implementation
The RecursiveCharacterTextSplitter class from the langchain_text_splitters library can split the sample text:
from langchain_text_splitters import RecursiveCharacterTextSplitter

recursive_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20,
    separators=["\n\n", "\n", " ", ""]
)

recursive_chunks = recursive_splitter.split_text(text)
The configuration sets a maximum of 100 characters per chunk with a 20-character overlap.
The separators are tried in priority order: double newlines, single newlines, spaces, and finally individual characters.
▫ Resulting Chunks
Each chunk contains at most 100 characters:
Chunk 1: 'Solar panels, or photovoltaic cells, convert sunlight into electricity.'
Chunk 2: 'This process happens at the atomic level. Some materials exhibit a property'
Compared to the fixed-size chunking method, the recursive splitter identifies the period at the end of the first sentence and stops there, preserving grammatical integrity.
This makes the method more readable for the LLM.
◼ Document-Specific Chunking
Document-specific chunking respects the inherent format of file types such as Markdown, HTML, LaTeX, or code.
For example:
Markdown: Splits by headers (#, ##, ###).
HTML: Splits by HTML tags (e.g., headers and sections).
Code: Splits by function or class definitions.
▫ Pros
- Preserves the logical hierarchy of the document.
▫ Best For
Highly structured technical documentation like PDF reports, manuals.
Codebases.
▫ Practical Implementation
The MarkdownHeaderTextSplitter class from the langchain_text_splitters library can define the splitters and split the document:
from langchain_text_splitters import MarkdownHeaderTextSplitter

markdown_document = """
# Solar Energy Guide

## The Physics
Solar panels, or photovoltaic cells, convert sunlight into electricity.
This process happens at the atomic level.
Some materials exhibit the photoelectric effect.

## The Hardware
Beyond the cells, an inverter is required to convert DC to AC.
Large-scale solar farms also require battery storage systems.

### Maintenance
Regular cleaning of panels ensures maximum photon absorption.
"""

# define the headers
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

# initialize the splitter
markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

# split
md_header_splits = markdown_splitter.split_text(markdown_document)
▫ Resulting Chunks:
Each chunk has been split at the headers defined in the code snippet, with the header hierarchy preserved as metadata:
Chunk 1.
Content: Solar panels, or photovoltaic cells, convert sunlight into electricity. This pro...
Metadata: {'Header 1': 'Solar Energy Guide', 'Header 2': 'The Physics'}
Chunk 2.
Content: Beyond the cells, an inverter is required to convert DC to AC. Large-scale solar...
Metadata: {'Header 1': 'Solar Energy Guide', 'Header 2': 'The Hardware'}
Chunk 3.
Content: Regular cleaning of panels ensures maximum photon absorption....
Metadata: {'Header 1': 'Solar Energy Guide', 'Header 2': 'The Hardware', 'Header 3': 'Maintenance'}
The method can leverage the author's intent, ensuring that "The Physics" and "The Hardware" are never accidentally blended into the same chunk, which is vital for technical manuals.
◼ Semantic Chunking
Semantic chunking is a more advanced technique that measures semantic distance to determine where a topic changes.
The method looks at the cosine similarity between the embeddings of consecutive sentences.
When the similarity drops below a certain threshold, it assumes a new topic has started and creates a break to move onto a new chunk.
▫ Pros
- High retrieval accuracy because chunks represent complete ideas.
▫ Best For
Academic papers.
Narrative-heavy documents.
Long-form essays (where paragraph breaks don't align with topic shifts).
▫ Practical Implementation
I'll first create vector embeddings using the SentenceTransformer class from the sentence_transformers library.
Then, I'll split the text into chunks by calculating cosine similarity scores between consecutive sentence embeddings.
import re

from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

MODEL = "all-MiniLM-L6-v2"

def split_into_sentences(text):
    # simple sentence splitter: break on punctuation followed by whitespace
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def semantic_chunking(text, similarity_threshold=0.6):
    # split the text into sentences
    sentences = split_into_sentences(text)

    # create vector embeddings
    model = SentenceTransformer(MODEL)
    embeddings = model.encode(sentences)

    # compute cosine similarity to split into chunks
    chunks = []
    current_chunk = [sentences[0]]
    for i in range(1, len(sentences)):
        prev_emb = embeddings[i - 1].reshape(1, -1)
        curr_emb = embeddings[i].reshape(1, -1)

        similarity = cosine_similarity(prev_emb, curr_emb)[0][0]
        print(f"Similarity between sentence {i-1} and {i}: {similarity:.3f}")

        # if similarity drops → meaning shift → split
        if similarity < similarity_threshold:
            chunks.append(" ".join(current_chunk))
            current_chunk = [sentences[i]]
        else:
            current_chunk.append(sentences[i])

    # add the last chunk
    if current_chunk:
        chunks.append(" ".join(current_chunk))

    return chunks

semantic_chunks = semantic_chunking(text)
▫ Resulting Chunks
In this method, the system notices a shift in meaning between "release electrons" and "Beyond the cells."
Chunk 1. Solar panels, or photovoltaic cells, convert sunlight into electricity.
- Cosine similarity to the next sentence: 0.196
Chunk 2. This process happens at the atomic level.
- Cosine similarity to the next sentence: 0.313
Chunk 3. Some materials exhibit a property known as the photoelectric effect.
- Cosine similarity to the next sentence: 0.546
Chunk 4. This causes them to absorb photons and release electrons.
- Cosine similarity to the next sentence: 0.266
Chunk 5. Beyond the cells, an inverter is required to convert DC to AC.
- Cosine similarity to the next sentence: 0.336
Chunk 6. Large-scale solar farms also require battery storage systems to manage peak load during non-sunny hours.
Semantic chunking recognizes that the topic changed from physics to engineering and forces a split.
◼ Parent–Child (Hierarchical) Chunking
Parent-child (hierarchical) chunking involves storing two versions of the same data: a small child chunk for searching and a larger parent chunk for context.
The system first searches against small, highly specific chunks (e.g., 100 tokens).
Then, once a match is found, it retrieves the larger surrounding parent document (e.g., 1000 tokens) to provide to the LLM.
▫ Pros
- Avoids the "lost in the middle" problem by giving the LLM plenty of background information.
▫ Best For
Enterprise-grade RAG systems.
Balancing high-precision search with comprehensive context.
▫ Practical Implementation
I'll first create a parent chunk containing the entire sample text, then create child chunks, each carrying its own vector embedding:
import re
import uuid

from sentence_transformers import SentenceTransformer

def split_into_sentences(text):
    # simple sentence splitter: break on punctuation followed by whitespace
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

# create the parent chunk
parent_chunk = {
    "id": str(uuid.uuid4()),
    "text": text.strip()  # the entire sample text
}

# load the embedding model
model = SentenceTransformer("all-MiniLM-L6-v2")

def create_child_chunks(parent_chunk):
    sentences = split_into_sentences(parent_chunk["text"])

    children = []
    for sentence in sentences:
        child = {
            "id": str(uuid.uuid4()),
            "parent_id": parent_chunk["id"],
            "text": sentence,
            "embedding": model.encode(sentence)
        }
        children.append(child)
    return children

child_chunks = create_child_chunks(parent_chunk)
▫ Results
Parent context:
Solar panels, or photovoltaic cells, convert sunlight into electricity.
This process happens at the atomic level.
Some materials exhibit a property known as the photoelectric effect.
This causes them to absorb photons and release electrons.
Beyond the cells, an inverter is required to convert DC to AC.
Large-scale solar farms also require battery storage systems to manage peak load during non-sunny hours.
Matched child sentence:
Large-scale solar farms also require battery storage systems to manage peak load during non-sunny hours.
During retrieval, when the query matches a child embedding, the retriever pulls the parent chunk instead of returning just the matched sentence.
This enables the LLM to receive the full paragraph and assess the relationship between solar panels, inverters, and battery storage. In other words, it explains the "Why" (infrastructure) rather than just the "What" (batteries).
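The search-small, return-big step can be sketched as follows, using toy 2-D vectors and hypothetical names (`retrieve_parent` and the data layout are my own for illustration, not a library API):

```python
import numpy as np

def retrieve_parent(query_embedding, child_chunks, parents):
    """Search the small child chunks by cosine similarity, then return the parent."""
    best = max(
        child_chunks,
        key=lambda c: float(
            np.dot(query_embedding, c["embedding"])
            / (np.linalg.norm(query_embedding) * np.linalg.norm(c["embedding"]))
        ),
    )
    return parents[best["parent_id"]]

# toy data: two child sentences pointing at one parent paragraph
parents = {"p1": "Full paragraph about inverters and battery storage."}
child_chunks = [
    {"parent_id": "p1", "text": "inverter converts DC to AC",
     "embedding": np.array([1.0, 0.0])},
    {"parent_id": "p1", "text": "battery storage for peak load",
     "embedding": np.array([0.0, 1.0])},
]

# a query about batteries matches the second child, but the LLM gets the parent
context = retrieve_parent(np.array([0.1, 0.9]), child_chunks, parents)
```

The precision of the match comes from the small child embedding; the comprehension comes from handing the LLM the full parent paragraph.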
Wrapping Up
Chunking is not just slicing data.
With a proper strategy, it can work as semantic optimization.
Proper grouping ensures that when a user asks a question, the vector database returns a coherent piece of information rather than a fragmented snippet that leaves the AI guessing.
Here are the key steps to consider when choosing the optimal chunking strategy.
◼ Implementation Roadmap: Choosing Optimal Strategy
▫ 1. Identify Patterns & Logical Structures
Look for repetitions, sequences, or inherent connections in the data.
In a technical context, this means identifying document headers, Markdown tags, or paragraph breaks to ensure a chunk doesn't cut off in the middle of a vital sentence or thought.
- The key question:
Does my data have a predictable layout or specific syntax that carries meaning?
- Strategy to choose: Document-specific (structure-aware) chunking.
Use parsers that respect the document's native format (e.g., Markdown, HTML, or LaTeX) to split data at logical boundaries rather than arbitrary character counts.
▫ 2. Maintain Semantic Context (The Mnemonic for AI)
Just as humans use mnemonics to link ideas, AI systems use overlapping chunks.
By including a small portion of the previous chunk at the start of the next one, you create a narrative bridge that prevents the model from losing the broader context of the data.
- The key question to ask:
If I read this chunk in isolation, would I still understand the subject of the sentence?
- Strategy to choose: Fixed-size chunking with a sliding window (overlap).
Implementing a context window of 10–20% overlap between chunks ensures that the end of one chunk and the beginning of the next share enough connective tissue to maintain semantic flow.
▫ 3. Prioritize Categorical Grouping
Organize information by category or hierarchy (e.g., grouping a grocery list by "produce" or "dairy").
- The key question to ask:
How granular is the information my users are looking for: specific facts or broad overviews?
- Strategy to choose: Recursive Character Splitting.
Start with a large separator (like a double newline) and progressively move to smaller separators (space, character) until the desired chunk size is reached. This keeps neighboring ideas in the same bucket.
◼ Optimize for Retrieval & Relevance
Lastly, whichever chunking strategy we choose, it is best to regularly test chunk sizes against real-world queries: if chunks are too small, they lack context; if they are too large, they introduce noise that can confuse the LLM.
The key question one can ask is:
Am I retrieving irrelevant fluff that wastes my model's context window, or am I missing the answer entirely?
If this is the case, experimental benchmarking works best.
Experimental benchmarking runs the chunking pipeline with different chunk sizes (e.g., 256, 512, and 1024 tokens) and evaluates each using metrics like Hit Rate or MRR (Mean Reciprocal Rank).
This lets you determine which size consistently yields the most accurate answers for the task at hand.
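As a sketch of such a benchmark harness, the toy functions below (names and data are my own for illustration) compute Hit Rate and MRR from ranked retrieval results:

```python
def hit_rate(ranked_results, relevant, k=5):
    """Fraction of queries where a relevant chunk appears in the top k results."""
    hits = sum(
        any(doc in relevant[q] for doc in docs[:k])
        for q, docs in ranked_results.items()
    )
    return hits / len(ranked_results)

def mrr(ranked_results, relevant):
    """Mean Reciprocal Rank: average of 1 / rank of the first relevant chunk."""
    total = 0.0
    for q, docs in ranked_results.items():
        for rank, doc in enumerate(docs, start=1):
            if doc in relevant[q]:
                total += 1.0 / rank
                break
    return total / len(ranked_results)

# toy benchmark: retrieved chunk IDs ranked best-first, per query
ranked = {"q1": ["c3", "c1", "c7"], "q2": ["c9", "c2", "c4"]}
gold = {"q1": {"c1"}, "q2": {"c9"}}
print(hit_rate(ranked, gold, k=3))  # 1.0 — both queries hit in the top 3
print(mrr(ranked, gold))            # (1/2 + 1/1) / 2 = 0.75
```

Running the same harness once per candidate chunk size (256, 512, 1024 tokens) makes the winner obvious from the scores rather than from intuition.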
Written by Kuriko IWAI. All images, unless otherwise noted, are by the author. All experimentations on this blog utilize synthetic or licensed data.
FAQ
1) Why is chunk overlap necessary in fixed-size chunking?
👉 Overlap creates a 'sliding window' that ensures semantic context isn't lost if a key sentence or concept is split across two chunks. It provides a narrative bridge for the LLM.
2) When should I use Semantic Chunking over Recursive Splitting?
👉 Use Semantic Chunking for narrative-heavy or long-form documents where topic shifts don't align with paragraph breaks. It is more computationally expensive but offers higher retrieval precision.
3) What is the 'Lost in the Middle' problem in RAG?
👉 It is a phenomenon where LLMs struggle to extract information from the middle of a very long context window. Parent-Child chunking helps mitigate this by providing targeted, relevant context.
4) Which metric should I use to evaluate my chunking strategy?
👉 Hit Rate (the frequency of the correct chunk being retrieved) and Mean Reciprocal Rank (MRR) are the industry standards for benchmarking retrieval relevance.
5) Can I use different chunking strategies for the same vector store?
👉 Yes, but it is generally discouraged unless you are using a hybrid retrieval system. Consistent chunking ensures a uniform embedding space and predictable retrieval behavior.
Share What You Learned
Kuriko IWAI, "Architecting Semantic Chunking Pipelines for High-Performance RAG" in Kernel Labs
https://kuriko-iwai.com/rag-chunking-strategies-technical-guide
Continue Your Learning
If you enjoyed this blog, these related entries will complete the picture:
How to Design a Production-Ready RAG System (Architecture + Tradeoffs) (2026 Edition)
Understanding Vector Databases and Embedding Pipelines
How to Build Reliable RAG: A Deep Dive into 7 Failure Points and Evaluation Frameworks
Related Books for Further Understanding
These books cover a wide range of theory and practice, from fundamentals to PhD level.

Linear Algebra Done Right
Foundations of Machine Learning, second edition (Adaptive Computation and Machine Learning series)
Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications
Machine Learning Design Patterns: Solutions to Common Challenges in Data Preparation, Model Building, and MLOps
Hands-On Large Language Models: Language Understanding and Generation
