Comprehensive Guide to Chunking in LLM and RAG Systems
With the increasing use of Large Language Models (LLMs) such as GPT-3 and GPT-4, and of Retrieval-Augmented Generation (RAG) systems built on top of them, one major challenge is handling long-form content such as articles, documents, or code efficiently. LLMs have token limits (e.g., roughly 4096 tokens for GPT-3), restricting the amount of text that can be processed at one time. To work within these limits, we break large inputs into smaller, manageable pieces using a technique called chunking.
This blog provides a comprehensive guide to chunking in the context of LLMs and RAG, covering:
- What chunking is in the context of LLM and RAG.
- Why chunking is essential in LLM and RAG systems.
- Chunking methods provided by LangChain.
- Best practices for implementing chunking.
- Detailed examples and code snippets.
1. What is Chunking in LLM and RAG Context?
Chunking in the context of LLMs refers to dividing large documents or texts into smaller pieces called chunks. This is necessary because LLMs have token limits, which restrict how much text they can process at once. For instance, GPT-3 has a token limit of 4096, including both input and output tokens.
In RAG systems, which combine retrieval models (to fetch relevant text) and generation models (to generate responses), chunking is used to ensure that large documents are divided into smaller chunks that can be efficiently retrieved and processed.
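Because these limits are measured in tokens rather than characters, it is worth counting tokens before sending a chunk to a model. Below is a minimal sketch using the tiktoken package (an assumption: tiktoken is installed, and the cl100k_base encoding is used purely for illustration):
import tiktoken
# Load a tokenizer; cl100k_base is the encoding used by several OpenAI models
encoding = tiktoken.get_encoding("cl100k_base")
def count_tokens(text: str) -> int:
    # Encode the text and count the resulting tokens
    return len(encoding.encode(text))
print(count_tokens("Chunking keeps each piece within the model's token limit."))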
2. Why is Chunking Essential?
Chunking is vital for several reasons:
- Token Management: It ensures that input text doesn’t exceed the token limit of LLMs.
- Retrieval Accuracy in RAG: Smaller chunks allow the retrieval model to fetch relevant segments more accurately, leading to better results.
- Context Preservation: Chunking with overlap helps maintain the context between chunks, making it easier for LLMs to generate coherent responses (see the sketch after this list).
- Efficiency: Smaller, structured chunks allow for faster processing and retrieval.
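To make the role of overlap concrete, here is a minimal, library-free sketch of a sliding-window chunker (a hypothetical helper, character-based only for simplicity; real splitters work on separators or tokens):
def chunk_with_overlap(text: str, chunk_size: int = 20, overlap: int = 5):
    # Advance by (chunk_size - overlap) so consecutive chunks share characters
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
for chunk in chunk_with_overlap("The quick brown fox jumps over the lazy dog."):
    print(repr(chunk))
Each chunk repeats the last few characters of its predecessor, so a sentence cut at a chunk boundary is still visible at the start of the next chunk.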
3. Chunking Methods in LangChain
LangChain provides multiple text-splitting utilities tailored for various types of content, from simple text to complex code. Below are the chunking methods you can use with LangChain:
3.1. CharacterTextSplitter
The CharacterTextSplitter is the simplest splitter: it splits text on a separator (a blank line by default) and packs the pieces into chunks of roughly a fixed number of characters. This method is ideal for structured content where preserving specific semantic meaning is not critical.
Code Example:
from langchain.text_splitter import CharacterTextSplitter

# Define the splitter
text_splitter = CharacterTextSplitter(
    chunk_size=1000,    # Each chunk will be at most 1000 characters
    chunk_overlap=100,  # Consecutive chunks overlap by 100 characters
)

# Sample text to split
text = "This is a very long document that needs to be split into smaller chunks."

# Perform the split
chunks = text_splitter.split_text(text)

# Print the chunks
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}:\n{chunk}\n")
Use Case: Best for structured documents where you want to split based on character count, such as articles or plain text without heavy emphasis on preserving paragraphs or sentences.
3.2. RecursiveCharacterTextSplitter (Semantic Chunking)
The RecursiveCharacterTextSplitter is more advanced: it recursively tries a prioritized list of separators (paragraph breaks, then line breaks, then spaces), so chunks tend to respect semantic units like paragraphs and sentences. It is well suited to documents like articles, reports, and blogs where preserving semantic meaning is crucial.
Code Example:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Define the splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # Each chunk will be at most 1000 characters
    chunk_overlap=100,  # Overlap of 100 characters between chunks
)

# Sample text
text = "This is a long document divided into paragraphs and sentences."

# Perform the split
chunks = text_splitter.split_text(text)

# Print the chunks
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}:\n{chunk}\n")
Use Case: Ideal for articles, blogs, and other natural language content where you want to maintain sentence and paragraph structures.
3.3. PythonCodeTextSplitter (Code Chunking)
For handling code, LangChain provides a specialized splitter called PythonCodeTextSplitter. It prefers to split at function and class boundaries, keeping logical units intact wherever they fit within the chunk size.
Code Example:
from langchain.text_splitter import PythonCodeTextSplitter

# Define the Python code splitter
splitter = PythonCodeTextSplitter(
    chunk_size=500,   # Each chunk will be at most 500 characters
    chunk_overlap=50, # Overlap of 50 characters between chunks
)

# Sample Python code to split
code = """
def function_1():
    print("Hello from function 1")

def function_2():
    print("Hello from function 2")
"""

# Perform the split
chunks = splitter.split_text(code)

# Print the chunks
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}:\n{chunk}\n")
Use Case: Suitable for processing code where preserving the logical flow of functions and classes is critical.
3.4. RecursiveCharacterTextSplitter for JavaScript and Other Languages
LangChain also supports chunking for other programming languages such as JavaScript, HTML, and Markdown. This is done using the RecursiveCharacterTextSplitter.from_language() method, where you specify the language so that syntax-aware separators are used.
JavaScript Code Example:
from langchain.text_splitter import Language, RecursiveCharacterTextSplitter

# Define the JavaScript splitter
splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.JS, # Choose JavaScript separators
    chunk_size=60,        # Each chunk will be at most 60 characters
    chunk_overlap=10,     # Overlap of 10 characters
)

# Sample JavaScript code
js_code = """
function greet() {
    console.log("Hello, World!");
}

greet();
"""

# Perform the split
chunks = splitter.split_text(js_code)

# Print the chunks
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}:\n{chunk}\n")
Use Case: This method is useful when handling non-Python code like JavaScript or HTML.
3.5. Markdown Text Splitting
Markdown content often follows a hierarchical structure (headings, paragraphs, lists, etc.), making it necessary to chunk based on these semantic units.
Code Example:
from langchain.text_splitter import Language, RecursiveCharacterTextSplitter

# Define the Markdown splitter
splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.MARKDOWN,
    chunk_size=500,
    chunk_overlap=100,
)

# Sample Markdown document
markdown_text = """
# Heading 1

This is a paragraph under heading 1.

## Subheading

This is some more text under the subheading.
"""

# Perform the split
chunks = splitter.split_text(markdown_text)

# Print the chunks
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}:\n{chunk}\n")
4. Best Practices for Effective Chunking
Here are some best practices to consider when chunking text for LLM and RAG systems:
- Chunk Size and Overlap: Experiment with different chunk sizes and overlaps to find the right balance between context preservation and processing efficiency. A chunk size of 500–1000 characters with an overlap of 100–200 characters often works well.
- Semantic Chunking: Use recursive splitters for natural language or unstructured text to maintain sentence and paragraph integrity.
- Code Handling: When dealing with code, always use language-specific chunking (e.g., PythonCodeTextSplitter) to avoid breaking functions or logical blocks.
- Document Structure Awareness: For documents with distinct sections (e.g., Markdown or HTML), ensure that headings and section boundaries are preserved by using specialized splitters like MarkdownTextSplitter.
- Token Considerations: If using an LLM like GPT, ensure that the chunks do not exceed the model's token limit, which includes both input and output tokens (see the token-based sketch after this list).
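Since the splitters above measure chunk_size in characters, a chunk can still exceed a token budget. LangChain offers a tiktoken-backed constructor for token-based sizing; here is a minimal sketch (assuming the tiktoken package is installed; the encoding name and sizes are illustrative):
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Measure chunk_size in tokens (via tiktoken) instead of characters
token_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",  # tokenizer used by several OpenAI models
    chunk_size=512,   # at most 512 tokens per chunk
    chunk_overlap=64, # 64 tokens of overlap between chunks
)

chunks = token_splitter.split_text("A long document would go here...")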
Conclusion
Chunking is a critical technique in LLM and RAG systems, enabling models to process long texts efficiently while preserving context and meaning. Whether you’re dealing with natural language content or complex code, LangChain offers a wide range of splitters that cater to different use cases.
By choosing the appropriate chunking method, be it CharacterTextSplitter, RecursiveCharacterTextSplitter, or PythonCodeTextSplitter, you can ensure that your content is split in a way that maximizes both performance and accuracy.
For discussion on AI-related work:
kshitijkutumbe@gmail.com