text_splitter#
Splitting texts is commonly used as a preprocessing step before embedding and retrieving texts.
We encourage you to process your data here and define your own embedding and retrieval methods. These methods can highly depend on the product environment and may extend beyond the scope of this library.
However, the following approaches are commonly shared:
Document Storage: Define how to store the documents, both raw and chunked. For example, LlamaIndex uses Document Stores to manage ingested document chunks.
Document Chunking: Segment documents into manageable chunks suitable for further processing.
Vectorization: Embed each chunk and store the resulting vectors in Vector Stores. For example, LlamaIndex uses Vector Stores.
Retrieval: Leverage vectors for context retrieval.
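As a rough illustration of these four steps, here is a minimal, self-contained sketch using a toy bag-of-words "embedding" and cosine similarity. All function names and the embedding method are hypothetical stand-ins for illustration only, not part of this library:

```python
from collections import Counter
import math

def chunk(text: str, size: int = 5, overlap: int = 1) -> list[str]:
    # Toy word-level chunker: a sliding window over words.
    words = text.split(" ")
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

def embed(text: str) -> Counter:
    # Hypothetical embedding: a simple bag-of-words count vector.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# 1) store raw documents, 2) chunk, 3) vectorize, 4) retrieve
docs = ["Paris is the capital of France.", "The mitochondria is the powerhouse of the cell."]
chunks = [c for d in docs for c in chunk(d)]
vectors = [embed(c) for c in chunks]

query = embed("capital of France")
best = max(range(len(chunks)), key=lambda i: cosine(query, vectors[i]))
print(chunks[best])  # the chunk most similar to the query
```

In a real product, the embedding would come from a model and the vectors would live in a Vector Store; the control flow stays the same.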
Classes

TextSplitter: Text Splitter for Chunking Documents
- class TextSplitter(split_by: Literal['word', 'sentence', 'page', 'passage', 'token'] = 'word', chunk_size: int = 800, chunk_overlap: int = 200, batch_size: int = 1000)[source]#
Bases:
Component
Text Splitter for Chunking Documents
TextSplitter first uses split_by to specify the text-splitting criterion and breaks the long text into smaller texts. Then it creates a sliding window of length chunk_size that moves at step chunk_size - chunk_overlap. The texts inside each window are merged into a smaller chunk, and the chunks generated from the split text are returned.

Splitting Types
TextSplitter supports 2 types of splitting.

Type 1: Specify the exact text-splitting point, such as the space (" ") or the period ("."). It is intuitive; for example, split_by "word":

"Hello, world!" -> ["Hello, ", "world!"]
Type 2: Use a tokenizer. It works as:

"Hello, world!" -> ['Hello', ',', ' world', '!']
This aligns with how models see text in the form of tokens (Reference). The tokenizer reflects the real number of tokens the model takes in and helps developers control token budgets.
Definitions
split_by specifies the split rule, i.e. the smallest unit during splitting. We support "word", "sentence", "page", "passage", and "token". The splitter utilizes the corresponding separator from the SEPARATORS dictionary. For Type 1 splitting, we apply Python str.split() to break the text.

SEPARATORS: Maps split_by criteria to their exact text separators, e.g., the space (" ") for "word" or the period (".") for "sentence".
Note
For option token, its separator is "" because we split directly with a tokenizer rather than at a text point.

chunk_size is the maximum number of units in each chunk.

chunk_overlap is the number of units that adjacent chunks share. Keeping context at the borders prevents sudden shifts in meaning between sentences/contexts, which is especially useful in tasks such as sentiment analysis.
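The sliding-window interaction of chunk_size and chunk_overlap can be sketched in plain Python (a simplified illustration of the described behavior, not the library's implementation):

```python
def window_chunks(units: list[str], chunk_size: int, chunk_overlap: int) -> list[list[str]]:
    # The window moves at step = chunk_size - chunk_overlap, so each chunk
    # repeats the last `chunk_overlap` units of the previous one.
    step = chunk_size - chunk_overlap
    return [units[i:i + chunk_size] for i in range(0, len(units), step)]

units = ["u1", "u2", "u3", "u4", "u5", "u6"]
print(window_chunks(units, chunk_size=3, chunk_overlap=1))
# [['u1', 'u2', 'u3'], ['u3', 'u4', 'u5'], ['u5', 'u6']]
```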
Splitting Details
Type 1: The TextSplitter utilizes Python's str.split(separator) method. Developers can refer to

{ "page": "\f", "passage": "\n", "word": " ", "sentence": "." }

for the exact points of text division.
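For example, splitting by "sentence" uses the period from this mapping as the separator (a sketch assuming the SEPARATORS mapping above, not the library's internal code):

```python
SEPARATORS = {"page": "\f", "passage": "\n", "word": " ", "sentence": "."}

def type1_split(text: str, split_by: str) -> list[str]:
    # Look up the separator for the chosen criterion, split on it,
    # and reattach it to all pieces but the last.
    sep = SEPARATORS[split_by]
    parts = text.split(sep)
    return [p + sep for p in parts[:-1]] + [parts[-1]]

print(type1_split("First sentence. Second sentence.", "sentence"))
# ['First sentence.', ' Second sentence.', '']
```

Note the trailing empty string when the text ends with the separator; downstream merging decides how such fragments are handled.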
Note
Developers need to determine how to assign text to each data chunk for the embedding and retrieval tasks.
Type 2: We implement a tokenizer using the cl100k_base encoding, which aligns with how models see text in the form of tokens. E.g., "tiktoken is great!" -> ["t", "ik", "token", " is", " great", "!"]. This helps developers better control token usage and budget.

Merge Details
Type 1 and Type 2 both create a list of split texts. TextSplitter then reattaches the specified separator to each piece of the split text, except for the last segment. This maintains the original spacing and punctuation, which is critical in contexts like natural language processing where text formatting can affect interpretation and outcomes. E.g., "hello world!" split by "word" is kept as "hello " and "world!".

Customization
You can also customize the SEPARATORS dictionary. For example, by defining SEPARATORS = {"question": "?"} and setting split_by = "question", the document will be split at each "?", which is ideal for processing text structured as a series of questions. If you need to customize the tokenizer, please check the Reference.

Integration with Other Document Types
This functionality is ideal for segmenting texts into sentences, words, pages, or passages, which can then be processed further for NLP applications. For PDFs, developers will need to extract the text before using the splitter; libraries like PyPDF2 or PDFMiner can be utilized for this purpose. LightRAG's future implementations will introduce splitters for JSON, HTML, markdown, and code.

Example:
from adalflow.components.data_process.text_splitter import TextSplitter
from adalflow.core.types import Document

# Configure the splitter settings
text_splitter = TextSplitter(
    split_by="word",
    chunk_size=5,
    chunk_overlap=1
)

# Example document
doc = Document(
    text="Example text. More example text. Even more text to illustrate.",
    id="doc1"
)

# Execute the splitting
splitted_docs = text_splitter.call(documents=[doc])

for doc in splitted_docs:
    print(doc)

# Output:
# Document(id=44a8aa37-0d16-40f0-9ca4-2e25ae5336c8, text='Example text. More example text. ', meta_data=None, vector=[], parent_doc_id=doc1, order=0, score=None)
# Document(id=ca0af45b-4f88-49b5-97db-163da9868ea4, text='text. Even more text to ', meta_data=None, vector=[], parent_doc_id=doc1, order=1, score=None)
# Document(id=e7b617b2-3927-4248-afce-ec0fc247ac8b, text='to illustrate.', meta_data=None, vector=[], parent_doc_id=doc1, order=2, score=None)
- split_text(text: str) → List[str][source]#
Splits the provided text into chunks.
Splits based on the specified split_by, chunk size, and chunk overlap settings.
- Parameters:
text (str) – The text to split.
- Returns:
A list of text chunks.
- Return type:
List[str]
- call(documents: List[Document]) → List[Document][source]#
Process the splitting task on a list of documents in batch.
Batch processes a list of documents, splitting each document’s text according to the configured split_by, chunk size, and chunk overlap.
- Parameters:
documents (List[Document]) – A list of Document objects to process.
- Returns:
A list of new Document objects, each containing a chunk of text from the original documents.
- Return type:
List[Document]
- Raises:
TypeError – If ‘documents’ is not a list or contains non-Document objects.
ValueError – If any document’s text is None.
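The validation contract above can be sketched as follows. This is a hypothetical reimplementation of the documented Raises behavior, with a stand-in Doc class, not the library's actual code:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Doc:
    # Hypothetical stand-in for adalflow's Document, for illustration only.
    text: Optional[str]
    id: str

def validate(documents) -> None:
    # TypeError if 'documents' is not a list or contains non-Document objects.
    if not isinstance(documents, list) or not all(isinstance(d, Doc) for d in documents):
        raise TypeError("documents must be a list of Document objects")
    # ValueError if any document's text is None.
    for d in documents:
        if d.text is None:
            raise ValueError(f"Document {d.id} has None text")
```

Callers can rely on these exceptions to catch malformed inputs before embedding downstream.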