
Imagine you're building a customer service chatbot for a software company. Your knowledge base contains hundreds of pages of documentation: installation guides, troubleshooting steps, API references, and user manuals. When a customer asks, "How do I reset my password?", your AI system needs to find the exact paragraph that explains the password reset process — not the entire 50-page user manual.
This is the fundamental challenge of document retrieval: how do you break down large documents into pieces that are just the right size for your AI to understand and retrieve effectively? Too big, and you'll get irrelevant information mixed with what you need. Too small, and you'll lose important context that makes the information meaningful.
By the end of this lesson, you'll understand how to strategically split documents into optimal chunks that dramatically improve your retrieval system's accuracy and relevance.
What you'll learn: The four main chunking strategies (fixed-size, sentence-based, semantic, and structure-aware), when to use each, and how to evaluate and troubleshoot your chunking in practice.
You should have basic familiarity with text processing concepts and understand what a retrieval system does at a high level. No specific programming experience required, though our examples will use Python for clarity.
Before diving into strategies, let's understand why chunking matters. When you feed a document to a retrieval system, the system creates mathematical representations (called embeddings) of the text that capture its meaning. When someone asks a question, the system finds the chunks with embeddings most similar to the question.
Here's the challenge: embeddings work best on coherent pieces of text that focus on a single topic or concept. A 10,000-word research paper about machine learning contains dozens of different concepts. If you treat the entire paper as one chunk, its embedding becomes a fuzzy average of all those concepts — making it hard to retrieve for specific questions.
But if you split that paper into individual sentences, you lose crucial context. A sentence like "This approach reduced error rates by 23%" is meaningless without knowing what "this approach" refers to.
The art of chunking is finding the sweet spot: pieces large enough to maintain context, but focused enough to be precisely retrievable.
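To see the mechanics concretely, here's a minimal sketch of retrieval over a few toy chunks. It uses TF-IDF vectors as a stand-in for learned embeddings, and the chunks and question are invented purely for illustration:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy chunks standing in for pieces of a knowledge base (illustration only)
chunks = [
    "To reset your password, click 'Forgot Password' on the login page.",
    "Our pricing starts at $29 per month for the basic plan.",
    "Install the agent by running the setup wizard as an administrator.",
]
question = "How do I reset my password?"

# Put the chunks and the question into the same vector space
# (a production system would use learned embeddings instead of TF-IDF)
vectorizer = TfidfVectorizer()
chunk_vectors = vectorizer.fit_transform(chunks)
question_vector = vectorizer.transform([question])

# Retrieve the chunk whose vector is most similar to the question
scores = cosine_similarity(question_vector, chunk_vectors)[0]
best = scores.argmax()
print(f"Best chunk (score {scores[best]:.2f}): {chunks[best]}")
Notice that the focused password chunk wins because its vector is dominated by password-related terms; a chunk that mixed all three topics would score lower for the same question.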
Every chunking decision revolves around three core principles:
Semantic Coherence: Each chunk should focus on a single topic or concept. A chunk mixing installation instructions with pricing information will confuse your retrieval system.
Contextual Completeness: Each chunk should contain enough information to be understood on its own. If someone reads just that chunk, they should be able to grasp the main point without needing additional context.
Optimal Granularity: The chunk should be detailed enough to answer specific questions, but not so broad that it includes irrelevant information.
Let's see how different chunk sizes affect retrieval quality with a concrete example.
Consider this excerpt from a piece of software documentation:
User Account Management
Creating New Accounts
To create a new user account, navigate to the Admin Panel and select "User Management."
Click the "Add New User" button and fill in the required fields: username, email, and
initial password. Users will receive an email invitation to activate their account.
Resetting Passwords
If a user forgets their password, they can reset it using the "Forgot Password" link
on the login page. Alternatively, administrators can manually reset passwords through
the Admin Panel by selecting the user and clicking "Reset Password."
Deactivating Accounts
To deactivate a user account, go to User Management, find the user, and click
"Deactivate." Deactivated accounts retain their data but cannot log in.
Poor Chunking (Too Large): If this entire section becomes one chunk, a question about "password reset" might retrieve the whole section, including irrelevant information about creating and deactivating accounts.
Poor Chunking (Too Small): If each sentence becomes a separate chunk, the sentence "Click the 'Add New User' button and fill in the required fields" loses the context that this is about creating accounts.
Good Chunking: Each subsection (Creating New Accounts, Resetting Passwords, Deactivating Accounts) becomes its own chunk, maintaining both focus and completeness.
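A quick sketch of that third option: split the excerpt on its subsection headings so each topic travels with its own instructions. The heading list is hard-coded here just for illustration; the strategies later in this lesson detect structure automatically.
account_docs = """Creating New Accounts
To create a new user account, navigate to the Admin Panel and select "User Management."
Resetting Passwords
If a user forgets their password, they can reset it using the "Forgot Password" link.
Deactivating Accounts
To deactivate a user account, go to User Management, find the user, and click "Deactivate."
"""

# Hard-coded subsection headings for this one excerpt (illustration only)
headings = {"Creating New Accounts", "Resetting Passwords", "Deactivating Accounts"}

chunks, current = [], []
for line in account_docs.splitlines():
    line = line.strip()
    if not line:
        continue
    if line in headings:
        if current:
            chunks.append("\n".join(current))
        current = [line]
    else:
        current.append(line)
if current:
    chunks.append("\n".join(current))

print(f"{len(chunks)} chunks, one per subsection")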
Fixed-size chunking splits documents into pieces of predetermined length, typically measured in characters or tokens. This is the simplest approach and works well for homogeneous content.
Here's how to implement basic fixed-size chunking:
def fixed_size_chunking(text, chunk_size=1000, overlap=200):
"""
Split text into fixed-size chunks with overlap
Args:
text (str): Input text to chunk
chunk_size (int): Maximum characters per chunk
overlap (int): Characters to overlap between chunks
Returns:
list: List of text chunks
"""
chunks = []
start = 0
while start < len(text):
# Calculate end position
end = start + chunk_size
# Don't split in the middle of a word
if end < len(text):
# Find the last space before the end position
last_space = text.rfind(' ', start, end)
if last_space > start:
end = last_space
chunk = text[start:end].strip()
if chunk: # Only add non-empty chunks
chunks.append(chunk)
        # Move start position, accounting for overlap, and make sure the
        # window always advances so the loop can't get stuck
        next_start = end - overlap
        start = next_start if next_start > start else end
return chunks
# Example usage
document = """
Artificial Intelligence has transformed how businesses operate across industries.
Machine learning algorithms can now predict customer behavior with remarkable
accuracy, enabling personalized marketing campaigns that drive higher conversion
rates. Natural language processing powers chatbots that handle customer inquiries
24/7, reducing support costs while improving response times. Computer vision
systems automate quality control in manufacturing, detecting defects that human
inspectors might miss.
"""
chunks = fixed_size_chunking(document, chunk_size=200, overlap=50)
for i, chunk in enumerate(chunks):
print(f"Chunk {i+1}: {chunk}\n")
When to use fixed-size chunking: Homogeneous, unstructured text where you need a quick, predictable baseline and there are no reliable structural cues to exploit.
Advantages: Simple to implement, produces chunks of predictable size, and needs no language-specific tooling.
Disadvantages: Ignores sentence and topic boundaries, so related information can be split across chunks; the overlap only partially compensates for this.
Pro Tip: Always include overlap between chunks to prevent important information from being split across boundaries. A good starting point is 10-20% of your chunk size.
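For example, with the 200-character chunks from the snippet above, a 15% overlap works out to 30 characters:
chunk_size = 200
overlap = int(chunk_size * 0.15)  # 15% of the chunk size, inside the 10-20% range

chunks = fixed_size_chunking(document, chunk_size=chunk_size, overlap=overlap)
print(f"{len(chunks)} chunks with {overlap} characters of overlap")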
Sentence-based chunking respects natural language boundaries by splitting text at sentence endings, then grouping sentences until reaching a target size.
import re
def sentence_chunking(text, target_size=1000, max_sentences=10):
"""
Chunk text by grouping sentences up to target size
Args:
text (str): Input text to chunk
target_size (int): Target characters per chunk
max_sentences (int): Maximum sentences per chunk
Returns:
list: List of text chunks
"""
# Split into sentences using regex
sentences = re.split(r'(?<=[.!?])\s+', text.strip())
chunks = []
current_chunk = []
current_size = 0
for sentence in sentences:
sentence_length = len(sentence)
# Check if adding this sentence would exceed limits
if (current_size + sentence_length > target_size and current_chunk) or \
(len(current_chunk) >= max_sentences and current_chunk):
# Finalize current chunk
chunks.append(' '.join(current_chunk))
current_chunk = [sentence]
current_size = sentence_length
else:
current_chunk.append(sentence)
current_size += sentence_length
# Add the last chunk if it exists
if current_chunk:
chunks.append(' '.join(current_chunk))
return chunks
# Example with complex sentences
document = """
Data preprocessing is crucial for machine learning success. It involves cleaning
data by removing duplicates, handling missing values, and correcting inconsistencies.
Feature engineering follows preprocessing. This step creates new variables from
existing data to improve model performance. Common techniques include normalization,
scaling, and encoding categorical variables. Model selection comes next.
Different algorithms work better for different types of problems.
Linear regression suits continuous predictions. Decision trees handle both
categorical and numerical data well.
"""
chunks = sentence_chunking(document, target_size=300, max_sentences=4)
for i, chunk in enumerate(chunks):
print(f"Chunk {i+1}: {chunk}\n")
When to use sentence-based chunking: Prose-heavy content such as articles, reports, and documentation paragraphs where sentence boundaries carry meaning.
Advantages: Never cuts a sentence in half, so every chunk stays readable and grammatically complete.
Disadvantages: Ignores topic shifts and document structure, and the regex-based splitter can stumble on abbreviations, decimals, and other tricky punctuation.
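If the regex splitter trips over your content (abbreviations like "Dr." or "Inc." are classic failure cases), a library tokenizer is a drop-in upgrade. Here's a sketch using NLTK's sentence tokenizer; it assumes you've installed nltk and can download its punkt sentence model:
import nltk
from nltk.tokenize import sent_tokenize

# One-time model download (newer NLTK releases name it "punkt_tab")
nltk.download("punkt", quiet=True)

text = "Dr. Smith joined Acme Inc. in 2021. She now leads the research team."
# punkt handles common abbreviations far better than the bare regex above
print(sent_tokenize(text))
You could swap sent_tokenize in for the re.split call inside sentence_chunking without changing anything else.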
Semantic chunking groups text based on meaning and topic shifts rather than just size. This approach uses techniques like topic modeling or embedding similarity to identify natural breakpoints.
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
def semantic_chunking(text, similarity_threshold=0.5, min_chunk_size=200):
"""
Chunk text based on semantic similarity between sentences
Args:
text (str): Input text to chunk
similarity_threshold (float): Minimum similarity to keep sentences together
min_chunk_size (int): Minimum characters per chunk
Returns:
list: List of semantically coherent chunks
"""
# Split into sentences
sentences = re.split(r'(?<=[.!?])\s+', text.strip())
if len(sentences) <= 2:
return [text]
# Create TF-IDF vectors for each sentence
vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)
sentence_vectors = vectorizer.fit_transform(sentences)
# Calculate similarity between adjacent sentences
similarities = []
for i in range(len(sentences) - 1):
sim = cosine_similarity(sentence_vectors[i:i+1], sentence_vectors[i+1:i+2])[0][0]
similarities.append(sim)
# Find break points where similarity drops below threshold
break_points = [0] # Always start with first sentence
for i, sim in enumerate(similarities):
if sim < similarity_threshold:
break_points.append(i + 1)
break_points.append(len(sentences)) # Always end with last sentence
# Create chunks based on break points
chunks = []
for i in range(len(break_points) - 1):
start_idx = break_points[i]
end_idx = break_points[i + 1]
chunk_sentences = sentences[start_idx:end_idx]
chunk_text = ' '.join(chunk_sentences)
# Ensure minimum chunk size
if len(chunk_text) >= min_chunk_size or len(chunks) == 0:
chunks.append(chunk_text)
else:
# Merge with previous chunk if too small
chunks[-1] += ' ' + chunk_text
return chunks
# Example with topic shifts
document = """
Python is a versatile programming language used in data science. Its simple syntax
makes it accessible to beginners. Libraries like pandas and numpy provide powerful
data manipulation capabilities. Machine learning frameworks such as scikit-learn
and TensorFlow integrate seamlessly with Python.
Database management is another critical skill for data professionals. SQL enables
efficient querying of relational databases. NoSQL databases like MongoDB offer
flexibility for unstructured data. Understanding database design principles helps
optimize query performance.
Data visualization transforms complex datasets into understandable insights.
Matplotlib and seaborn create publication-ready charts in Python. Tableau and
Power BI offer drag-and-drop interfaces for business users. Good visualizations
tell a story and guide decision-making.
"""
chunks = semantic_chunking(document, similarity_threshold=0.3)
for i, chunk in enumerate(chunks):
print(f"Chunk {i+1}: {chunk}\n")
When to use semantic chunking: Long documents that drift between topics without explicit headings, where you want boundaries to follow meaning rather than length.
Advantages: Chunks tend to be topically coherent, which is exactly what embedding-based retrieval rewards.
Disadvantages: More complex to implement, slower on large documents, and sensitive to the similarity threshold you choose.
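One caveat on the implementation above: TF-IDF only measures word overlap between adjacent sentences, so paraphrased transitions can look like topic breaks. If you can afford a model download, real sentence embeddings usually give more reliable breakpoints. Here's a minimal sketch using the sentence-transformers library; the model name is one common lightweight choice, not a requirement:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose embedding model

def embedding_breakpoints(sentences, similarity_threshold=0.5):
    """Return indices where adjacent sentences fall below the similarity threshold."""
    vectors = model.encode(sentences)
    breaks = []
    for i in range(len(sentences) - 1):
        sim = cosine_similarity([vectors[i]], [vectors[i + 1]])[0][0]
        if sim < similarity_threshold:
            breaks.append(i + 1)
    return breaks
The breakpoints plug into the same chunk-assembly logic as semantic_chunking; only the similarity signal changes.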
Structure-aware chunking uses the document's natural structure (headings, paragraphs, lists) to create semantically meaningful chunks.
import re
def structure_aware_chunking(text, max_chunk_size=1500):
"""
Chunk text based on document structure (headings, paragraphs)
Args:
text (str): Input text to chunk
max_chunk_size (int): Maximum characters per chunk
Returns:
list: List of structurally coherent chunks
"""
# Identify different structural elements
lines = text.split('\n')
elements = []
for line in lines:
line = line.strip()
if not line:
continue
        # Identify headings: Markdown-style ('#') lines, or short all-caps
        # lines that don't end with sentence punctuation
        if line.startswith('#') or (
            len(line) < 100 and not re.search(r'[.!?]$', line) and line.isupper()
        ):
elements.append(('heading', line))
# Identify list items
elif re.match(r'^\s*[-*•]\s+', line) or re.match(r'^\s*\d+\.\s+', line):
elements.append(('list_item', line))
# Regular paragraph text
else:
elements.append(('paragraph', line))
# Group elements into chunks
chunks = []
current_chunk = []
current_size = 0
current_heading = None
for element_type, content in elements:
if element_type == 'heading':
# Start new chunk with heading
if current_chunk and current_size > 0:
chunks.append('\n'.join(current_chunk))
current_chunk = [content]
current_size = len(content)
current_heading = content
else:
# Check if adding this element would exceed size limit
if current_size + len(content) > max_chunk_size and current_chunk:
chunks.append('\n'.join(current_chunk))
# Start new chunk, include heading for context
current_chunk = [current_heading] if current_heading else []
current_size = len(current_heading) if current_heading else 0
current_chunk.append(content)
current_size += len(content)
# Add the last chunk
if current_chunk:
chunks.append('\n'.join(current_chunk))
return chunks
# Example with structured document
structured_doc = """
DATA PREPROCESSING
Data preprocessing is the first critical step in any machine learning project.
It involves cleaning and transforming raw data into a format suitable for analysis.
Common preprocessing steps include:
- Removing duplicates and irrelevant data
- Handling missing values through imputation or removal
- Normalizing numerical features to similar scales
- Encoding categorical variables into numerical format
FEATURE ENGINEERING
Feature engineering creates new variables from existing data to improve model performance.
This creative process requires domain knowledge and experimentation.
Effective techniques include:
- Creating interaction terms between variables
- Aggregating data at different time periods
- Extracting information from text or image data
- Dimensionality reduction using PCA or similar methods
MODEL SELECTION
Choosing the right algorithm depends on your problem type and data characteristics.
Different algorithms have different strengths and limitations.
"""
chunks = structure_aware_chunking(structured_doc)
for i, chunk in enumerate(chunks):
print(f"Chunk {i+1}:\n{chunk}\n" + "="*50 + "\n")
When to use structure-aware chunking: Formatted documents with headings, sections, and lists, such as manuals, contracts, and help articles.
Advantages: Preserves the author's own organization and carries the relevant heading into each chunk for context.
Disadvantages: Depends on reliably detecting structure, so it degrades to ordinary paragraph splitting on plain, unformatted text.
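Because of the startswith('#') check, the same function also handles Markdown-style headings. Here's a quick sketch with an invented Markdown document:
markdown_doc = """# Installation
Download the installer and run it as an administrator.
Follow the prompts to choose an install directory.

# Configuration
Open the settings file and add your credentials.
Restart the service to apply the changes.
"""

for i, chunk in enumerate(structure_aware_chunking(markdown_doc, max_chunk_size=500)):
    print(f"Chunk {i+1}:\n{chunk}\n")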
Let's apply what you've learned by implementing a chunking system for a real-world scenario. You're building a knowledge base for a SaaS company's help documentation.
# Sample help documentation
help_doc = """
GETTING STARTED
Welcome to DataFlow Pro, the comprehensive data analytics platform.
This guide will help you get up and running quickly.
Account Setup
Create your account at dataflow.com/signup. You'll need a valid email address
and company information. After registration, check your email for a verification
link. Click the link to activate your account.
First Login
Use your email and password to log in at dataflow.com/login. You'll be
prompted to complete your profile and set up two-factor authentication
for security.
DASHBOARD OVERVIEW
The main dashboard provides an overview of your data projects and recent activity.
Key sections include project tiles, recent reports, and system notifications.
Navigation Menu
The left sidebar contains the main navigation:
- Projects: View and manage your data projects
- Data Sources: Connect to databases and file uploads
- Reports: Access generated reports and visualizations
- Settings: Account and preference management
Quick Actions
The top toolbar offers quick access to common tasks:
- New Project button creates a blank project
- Upload Data allows direct file imports
- Help icon opens contextual assistance
DATA CONNECTIONS
DataFlow Pro supports multiple data source types including databases,
cloud storage, and direct file uploads.
Supported Databases
- MySQL and PostgreSQL
- Microsoft SQL Server
- Oracle Database
- Amazon Redshift
- Google BigQuery
File Format Support
Upload data in these formats:
- CSV and TSV files
- Excel spreadsheets (.xlsx, .xls)
- JSON files
- Parquet format
"""
# Exercise: Implement and compare different chunking strategies
def compare_chunking_strategies(text):
"""Compare different chunking approaches on the same text"""
print("=== FIXED-SIZE CHUNKING ===")
fixed_chunks = fixed_size_chunking(text, chunk_size=400, overlap=50)
for i, chunk in enumerate(fixed_chunks[:3]): # Show first 3
print(f"Chunk {i+1} ({len(chunk)} chars):\n{chunk}\n")
print(f"Total chunks: {len(fixed_chunks)}\n")
print("=== SENTENCE-BASED CHUNKING ===")
sentence_chunks = sentence_chunking(text, target_size=400, max_sentences=5)
for i, chunk in enumerate(sentence_chunks[:3]):
print(f"Chunk {i+1} ({len(chunk)} chars):\n{chunk}\n")
print(f"Total chunks: {len(sentence_chunks)}\n")
print("=== STRUCTURE-AWARE CHUNKING ===")
structure_chunks = structure_aware_chunking(text, max_chunk_size=500)
for i, chunk in enumerate(structure_chunks[:3]):
print(f"Chunk {i+1} ({len(chunk)} chars):\n{chunk}\n")
print(f"Total chunks: {len(structure_chunks)}\n")
# Run the comparison
compare_chunking_strategies(help_doc)
Your Task: Run this comparison and analyze the results. Which strategy creates the most coherent chunks for this help documentation? Why?
Questions to Consider: Which strategy keeps each topic (account setup, navigation, data connections) inside a single chunk? Where does fixed-size chunking cut mid-topic? How do chunk counts and size distributions compare across the three strategies?
How do you know if your chunking strategy is working well? Here are key metrics to track:
Retrieval Precision: When users ask questions, what percentage of retrieved chunks actually contain relevant information? Good chunking should minimize irrelevant results.
Context Preservation: Can humans understand each chunk without additional context? Test this by randomly sampling chunks and seeing if they make sense in isolation.
Coverage Completeness: Do your chunks capture all important information from the original documents? Look for concepts that might fall between chunk boundaries.
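A simple way to run the context-preservation check is to sample a few chunks at random and read them cold:
import random

def sample_chunks_for_review(chunks, n=5, seed=42):
    """Print a random sample of chunks so a human can judge them in isolation."""
    random.seed(seed)
    for chunk in random.sample(chunks, min(n, len(chunks))):
        print(chunk[:300])  # preview the first 300 characters
        print("-" * 40)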
Here's a simple evaluation framework:
def evaluate_chunks(chunks, sample_questions):
"""
Simple evaluation of chunk quality
Args:
chunks (list): List of text chunks
        sample_questions (list): Questions your system should answer
            (not scored automatically here; use them for manual spot checks)
Returns:
dict: Evaluation metrics
"""
metrics = {
'total_chunks': len(chunks),
'avg_chunk_length': np.mean([len(chunk) for chunk in chunks]),
'chunk_length_std': np.std([len(chunk) for chunk in chunks]),
'empty_chunks': sum(1 for chunk in chunks if len(chunk.strip()) < 50)
}
# Check for very short or very long chunks
very_short = sum(1 for chunk in chunks if len(chunk) < 100)
very_long = sum(1 for chunk in chunks if len(chunk) > 2000)
metrics['very_short_chunks'] = very_short
metrics['very_long_chunks'] = very_long
print("Chunking Evaluation Results:")
print(f"Total chunks: {metrics['total_chunks']}")
print(f"Average length: {metrics['avg_chunk_length']:.0f} characters")
print(f"Length std dev: {metrics['chunk_length_std']:.0f}")
print(f"Very short chunks (<100 chars): {very_short}")
print(f"Very long chunks (>2000 chars): {very_long}")
print(f"Nearly empty chunks: {metrics['empty_chunks']}")
return metrics
# Example evaluation
sample_questions = [
"How do I create an account?",
"What file formats are supported?",
"How do I connect to PostgreSQL?",
"Where is the navigation menu?"
]
# Evaluate different chunking results
print("Evaluating Structure-Aware Chunks:")
structure_chunks = structure_aware_chunking(help_doc)
evaluate_chunks(structure_chunks, sample_questions)
Mistake 1: Ignoring Document Type. Using fixed-size chunking for structured documents like legal contracts or technical manuals. These documents have natural hierarchies that should be preserved.
Solution: Start with structure-aware chunking for formatted documents, fall back to sentence-based for unstructured text.
Mistake 2: No Overlap Between Chunks. Creating hard boundaries between chunks can split important information.
Solution: Always include 10-20% overlap, especially with fixed-size chunking.
Mistake 3: One-Size-Fits-All Approach. Using the same chunking strategy for all content types in your system.
Solution: Develop a content-type classifier that selects the appropriate chunking strategy based on document characteristics.
Mistake 4: Ignoring Chunk Size Distribution. Creating chunks with wildly different sizes, making retrieval inconsistent.
Solution: Monitor chunk size distribution and adjust parameters to maintain reasonable consistency.
Mistake 5: Not Testing with Real Queries. Optimizing chunks without testing how well they answer actual user questions.
Solution: Create a test set of common queries and measure retrieval quality regularly.
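As a starting point for the content-type routing described in Mistake 3, here's a minimal heuristic sketch. The signals and thresholds are illustrative guesses, not tuned values:
def choose_chunking_strategy(text):
    """Pick a chunking function based on rough structural signals in the document."""
    lines = [line.strip() for line in text.split("\n") if line.strip()]
    # Count lines that look like headings or list items
    structured = sum(
        1 for line in lines
        if line.startswith("#") or line.isupper() or line.startswith(("-", "*"))
    )
    if lines and structured / len(lines) > 0.1:
        return structure_aware_chunking   # formatted docs: preserve the structure
    if len(text) > 5000:
        return semantic_chunking          # long unstructured docs: follow the topics
    return sentence_chunking              # shorter prose: respect sentence boundaries

chunker = choose_chunking_strategy(help_doc)
print(f"Selected strategy: {chunker.__name__}")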
Troubleshooting Tip: If your retrieval system returns too much irrelevant information, your chunks are probably too large. If it misses important context, they're likely too small.
Performance Issues. Semantic chunking can be slow on large documents. For production systems, consider precomputing and caching sentence vectors, chunking documents offline rather than at query time, and reserving semantic chunking for content where simpler strategies fall short.
Inconsistent Results. If chunking produces wildly different results for similar documents, check that your preprocessing is consistent (whitespace, encoding, heading detection) and that you're applying the same parameters to every document.
As you develop more sophisticated systems, consider these advanced techniques:
Hierarchical Chunking: Create chunks at multiple levels — paragraphs, sections, and full documents — allowing retrieval at different granularities.
Query-Aware Chunking: Adjust chunk boundaries based on common query patterns. If users frequently ask about "installation steps," ensure those steps stay together in chunks.
Dynamic Chunking: Modify chunk size based on content density. Dense technical sections might need smaller chunks, while narrative sections can be larger.
Cross-Reference Preservation: For documents with internal references ("see Section 3.2"), maintain metadata linking chunks to preserve these relationships.
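To illustrate hierarchical chunking, here's a sketch that builds chunks at three granularities from the same document and tags each with its level; the metadata layout is just one possible choice:
def hierarchical_chunks(text):
    """Build chunks at three granularities, tagging each with its level."""
    levels = {
        "section": structure_aware_chunking(text, max_chunk_size=1500),
        "paragraph": sentence_chunking(text, target_size=400, max_sentences=4),
        "document": [text],
    }
    indexed = []
    for level, level_chunks in levels.items():
        for position, chunk in enumerate(level_chunks):
            indexed.append({"level": level, "position": position, "text": chunk})
    return indexed

multi_level = hierarchical_chunks(help_doc)
print(f"{len(multi_level)} chunks across {len(set(c['level'] for c in multi_level))} levels")
At query time you can retrieve at the paragraph level for precise answers and fall back to the section or document level when a question needs broader context.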
Effective document chunking is both an art and a science. The four strategies we've covered each have their place:
- Fixed-size chunking for simple, homogeneous text where a predictable baseline is enough
- Sentence-based chunking for prose where natural language boundaries matter
- Semantic chunking for long documents that shift between topics
- Structure-aware chunking for formatted documents with headings, sections, and lists
The key insight is that chunking strategy should match your content type and use case. Start with structure-aware chunking for formatted documents, then experiment with semantic or sentence-based approaches if needed.
Remember that chunking is not a set-it-and-forget-it process. Monitor your retrieval system's performance, gather user feedback, and continuously refine your approach. The best chunking strategy is the one that helps your users find the information they need quickly and accurately.
Next Steps:
As you advance in building retrieval systems, you'll learn to combine chunking with other techniques like query expansion, re-ranking, and context augmentation to create even more powerful information retrieval experiences.