text_splitter#
Splitting texts is commonly used as a preprocessing step before embedding and retrieving texts.
We encourage you to process your data here and define your own embedding and retrieval methods. These methods can highly depend on the product environment and may extend beyond the scope of this library.
However, the following approaches are commonly shared:
Document Storage: Define how to store the documents, both raw and chunked. For example, LlamaIndex uses Document Stores to manage ingested document chunks.
Document Chunking: Segment documents into manageable chunks suitable for further processing.
Vectorization: Embed each chunk and store the resulting vectors in Vector Stores. For example, LlamaIndex uses Vector Stores.
Retrieval: Leverage vectors for context retrieval.
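As a rough illustration of these four steps, here is a minimal, self-contained sketch using a toy bag-of-words "embedding" and cosine similarity. All function names and the embedding method are hypothetical stand-ins for illustration only, not part of this library:

```python
from collections import Counter
import math

def chunk(text: str, size: int = 5, overlap: int = 1) -> list[str]:
    # Toy word-level chunker: a sliding window over words.
    words = text.split(" ")
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

def embed(text: str) -> Counter:
    # Hypothetical embedding: a simple bag-of-words count vector.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# 1) store raw documents, 2) chunk, 3) vectorize, 4) retrieve
docs = ["Paris is the capital of France.", "The mitochondria is the powerhouse of the cell."]
chunks = [c for d in docs for c in chunk(d)]
vectors = [embed(c) for c in chunks]

query = embed("capital of France")
best = max(range(len(chunks)), key=lambda i: cosine(query, vectors[i]))
print(chunks[best])  # the chunk most similar to the query
```

In a real product, the embedding would come from a model and the vectors would live in a Vector Store; the control flow stays the same.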
Classes

TextSplitter: Text Splitter for Chunking Documents
- class TextSplitter(split_by: Literal['word', 'sentence', 'page', 'passage', 'token'] = 'word', chunk_size: int = 800, chunk_overlap: int = 200, batch_size: int = 1000)[source]#
Bases:
Component
Text Splitter for Chunking Documents
TextSplitter first uses split_by to specify the text-splitting criterion and breaks the long text into smaller texts. Then it creates a sliding window of length chunk_size that moves at step chunk_size - chunk_overlap. The texts inside each window are merged into a smaller chunk, and the chunks generated from the split text are returned.

Splitting Types
TextSplitter supports 2 types of splitting.

Type 1: Specify the exact text-splitting point, such as the space (" ") or the period ("."). It is intuitive; for example, split_by "word":

"Hello, world!" -> ["Hello, ", "world!"]
Type 2: Use a tokenizer. It works as:

"Hello, world!" -> ['Hello', ',', ' world', '!']
This aligns with how models see text in the form of tokens (Reference). The tokenizer reflects the real number of tokens the model takes in and helps developers control token budgets.
Definitions
split_by specifies the split rule, i.e. the smallest unit during splitting. We support "word", "sentence", "page", "passage", and "token". The splitter utilizes the corresponding separator from the SEPARATORS dictionary. For Type 1 splitting, we apply Python str.split() to break the text.

SEPARATORS: Maps split_by criteria to their exact text separators, e.g., the space (" ") for "word" or the period (".") for "sentence".
Note
For option token, its separator is "" because we split directly with a tokenizer rather than at a text point.

chunk_size is the maximum number of units in each chunk.

chunk_overlap is the number of units that adjacent chunks share. Keeping context at the borders prevents sudden shifts in meaning between sentences/contexts, which is especially useful in tasks such as sentiment analysis.
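The sliding-window interaction of chunk_size and chunk_overlap can be sketched in plain Python (a simplified illustration of the described behavior, not the library's implementation):

```python
def window_chunks(units: list[str], chunk_size: int, chunk_overlap: int) -> list[list[str]]:
    # The window moves at step = chunk_size - chunk_overlap, so each chunk
    # repeats the last `chunk_overlap` units of the previous one.
    step = chunk_size - chunk_overlap
    return [units[i:i + chunk_size] for i in range(0, len(units), step)]

units = ["u1", "u2", "u3", "u4", "u5", "u6"]
print(window_chunks(units, chunk_size=3, chunk_overlap=1))
# [['u1', 'u2', 'u3'], ['u3', 'u4', 'u5'], ['u5', 'u6']]
```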
Splitting Details
Type 1: The TextSplitter utilizes Python's str.split(separator) method. Developers can refer to

{ "page": "\f", "passage": "\n", "word": " ", "sentence": "." }

for the exact points of text division.
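For example, splitting by "sentence" uses the period from this mapping as the separator (a sketch assuming the SEPARATORS mapping above, not the library's internal code):

```python
SEPARATORS = {"page": "\f", "passage": "\n", "word": " ", "sentence": "."}

def type1_split(text: str, split_by: str) -> list[str]:
    # Look up the separator for the chosen criterion, split on it,
    # and reattach it to all pieces but the last.
    sep = SEPARATORS[split_by]
    parts = text.split(sep)
    return [p + sep for p in parts[:-1]] + [parts[-1]]

print(type1_split("First sentence. Second sentence.", "sentence"))
# ['First sentence.', ' Second sentence.', '']
```

Note the trailing empty string when the text ends with the separator; downstream merging decides how such fragments are handled.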
Note
Developers need to determine how to assign text to each data chunk for the embedding and retrieval tasks.
Type 2: We implement a tokenizer using the cl100k_base encoding, which aligns with how models see text in the form of tokens. E.g., "tiktoken is great!" -> ["t", "ik", "token", " is", " great", "!"]. This helps developers better control token usage and budget.

Merge Details
Type 1 and Type 2 both create a list of split texts. TextSplitter then reattaches the specified separator to each piece of the split text, except for the last segment. This maintains the original spacing and punctuation, which is critical in contexts like natural language processing where text formatting can affect interpretation and outcomes. E.g., "hello world!" split by "word" is kept as "hello " and "world!".

Customization
You can also customize the SEPARATORS dictionary. For example, by defining SEPARATORS = {"question": "?"} and setting split_by = "question", the document will be split at each "?", which is ideal for processing text structured as a series of questions. If you need to customize the tokenizer, please check the Reference.

Integration with Other Document Types
This functionality is ideal for segmenting texts into sentences, words, pages, or passages, which can then be processed further for NLP applications. For PDFs, developers will need to extract the text before using the splitter; libraries like PyPDF2 or PDFMiner can be utilized for this purpose. LightRAG's future implementations will introduce splitters for JSON, HTML, markdown, and code.

Example:
from adalflow.components.data_process.text_splitter import TextSplitter
from adalflow.core.types import Document

# Configure the splitter settings
text_splitter = TextSplitter(
    split_by="word",
    chunk_size=5,
    chunk_overlap=1
)

# Example document
doc = Document(
    text="Example text. More example text. Even more text to illustrate.",
    id="doc1"
)

# Execute the splitting
splitted_docs = text_splitter.call(documents=[doc])

for doc in splitted_docs:
    print(doc)

# Output:
# Document(id=44a8aa37-0d16-40f0-9ca4-2e25ae5336c8, text='Example text. More example text. ', meta_data=None, vector=[], parent_doc_id=doc1, order=0, score=None)
# Document(id=ca0af45b-4f88-49b5-97db-163da9868ea4, text='text. Even more text to ', meta_data=None, vector=[], parent_doc_id=doc1, order=1, score=None)
# Document(id=e7b617b2-3927-4248-afce-ec0fc247ac8b, text='to illustrate.', meta_data=None, vector=[], parent_doc_id=doc1, order=2, score=None)
- split_text(text: str) → List[str][source]#
Splits the provided text into chunks.
Splits based on the specified split_by, chunk size, and chunk overlap settings.
- Parameters:
text (str) – The text to split.
- Returns:
A list of text chunks.
- Return type:
List[str]
- call(documents: List[Document]) → List[Document][source]#
Process the splitting task on a list of documents in batch.
Batch processes a list of documents, splitting each document’s text according to the configured split_by, chunk size, and chunk overlap.
- Parameters:
documents (List[Document]) – A list of Document objects to process.
- Returns:
A list of new Document objects, each containing a chunk of text from the original documents.
- Return type:
List[Document]
- Raises:
TypeError – If ‘documents’ is not a list or contains non-Document objects.
ValueError – If any document’s text is None.
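The validation contract above can be sketched as follows. This is a hypothetical reimplementation of the documented Raises behavior, with a stand-in Doc class, not the library's actual code:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Doc:
    # Hypothetical stand-in for adalflow's Document, for illustration only.
    text: Optional[str]
    id: str

def validate(documents) -> None:
    # TypeError if 'documents' is not a list or contains non-Document objects.
    if not isinstance(documents, list) or not all(isinstance(d, Doc) for d in documents):
        raise TypeError("documents must be a list of Document objects")
    # ValueError if any document's text is None.
    for d in documents:
        if d.text is None:
            raise ValueError(f"Document {d.id} has None text")
```

Callers can rely on these exceptions to catch malformed inputs before embedding downstream.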