bm25_retriever

BM25 retriever implementation.

Functions

`split_text_by_word_fn`(x)
`split_text_by_word_fn_then_lower_tokenized`(x)
`split_text_tokenized`(x)

Classes

BM25Retriever([top_k, k1, b, epsilon, ...])

Fast Implementation of Best Matching 25 ranking function.

split_text_by_word_fn(x: str) → List[str]

split_text_by_word_fn_then_lower_tokenized(x: str) → List[str]

split_text_tokenized(x: str) → List[str]

class BM25Retriever(top_k: int = 5, k1: float = 1.5, b: float = 0.75, epsilon: float = 0.25, documents: Sequence[Any] | None = None, document_map_func: Callable[[Any], str] | None = None, use_tokenizer: bool = True)

Bases: Retriever[str, str]

Fast Implementation of Best Matching 25 ranking function.

Each document must be a str once document_map_func has been applied; the function can be omitted when the documents are already strings. The retriever expects Union[str, Sequence[str]] as the input when it is called.

\[ \begin{align}\begin{aligned}\text{idf}(q_i) = \log\left(\frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}\right)\\\text{score}(q, d) = \sum_{i=1}^{n} \text{idf}(q_i) \cdot \frac{f(q_i, d) \cdot (k1 + 1)}{f(q_i, d) + k1 \cdot \left(1 - b + b \cdot \frac{|d|}{\text{avgdl}}\right)}\end{aligned}\end{align} \]

Explanation:

IDF(q_i) is the inverse document frequency of term q_i, which measures how informative the term is. Adding 0.5 to both the numerator and the denominator avoids division by zero and smooths the estimate. The ratio lowers the weight of terms that occur in many documents and raises the weight of terms that occur in few. The value becomes negative when a term occurs in more than half of the documents.
f(q_i, d) is the term frequency of term q_i in document d, which measures how often the term occurs in the document. In the score, its effect is normalized by the document length relative to avgdl, with b controlling the strength of that normalization.
|d| is the length of the document d in words or tokens.
avgdl is the average document length in the corpus.
N is the total number of documents in the corpus.
n(q_i) is the number of documents containing term q_i.
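The two formulas above can be sketched in plain Python as follows. This is a minimal illustration, not the retriever's actual implementation: compute_idf and bm25_score are hypothetical helper names, and the documents are assumed to be tokenized by whitespace.

```python
import math
from collections import Counter
from typing import Dict, List


def compute_idf(corpus: List[List[str]]) -> Dict[str, float]:
    """Return idf(q_i) = log((N - n(q_i) + 0.5) / (n(q_i) + 0.5)) per term."""
    n_docs = len(corpus)  # N
    # n(q_i): number of documents that contain the term at least once
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    return {
        term: math.log((n_docs - n + 0.5) / (n + 0.5))
        for term, n in doc_freq.items()
    }


def bm25_score(
    query: List[str],
    doc: List[str],
    idf: Dict[str, float],
    avgdl: float,
    k1: float = 1.5,
    b: float = 0.75,
) -> float:
    """Sum the per-term BM25 contributions of the query terms for one document."""
    tf = Counter(doc)  # f(q_i, d)
    length_norm = 1 - b + b * len(doc) / avgdl
    score = 0.0
    for term in query:
        f = tf.get(term, 0)
        if f == 0:
            continue  # terms absent from the document contribute nothing
        score += idf.get(term, 0.0) * f * (k1 + 1) / (f + k1 * length_norm)
    return score


# Illustrative corpus, tokenized by whitespace (an assumption of this sketch).
texts = ["hello world", "world is beautiful", "today is a good day"]
corpus = [text.split() for text in texts]
idf = compute_idf(corpus)
avgdl = sum(len(doc) for doc in corpus) / len(corpus)
print(bm25_score(["hello"], corpus[0], idf, avgdl))  # roughly 0.623
print(idf["world"])  # negative: "world" occurs in two of the three documents
```

With this corpus the score of "hello" against the first document is about 0.623, and idf("world") is negative because the term occurs in more than half of the documents, which is the case the epsilon parameter addresses.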

References

[1] https://en.wikipedia.org/wiki/Okapi_BM25
[2] https://github.com/dorianbrown/rank_bm25
[3] Improvements to BM25 and Language Models Examined: https://www.cs.otago.ac.nz/homepages/andrew/papers/2014-2.pdf

Parameters:

top_k – (int): The number of documents to return
k1 – (float, optional): Constant that controls term frequency saturation. Once saturation is reached, further occurrences of the term add significantly less to the score. According to [1], experiments suggest that 1.2 < k1 < 2 yields reasonably good results, although the optimal value depends on factors such as the type of documents or queries. Default is 1.5.
b – (float, optional): Constant that controls how strongly document length, relative to the average document length, affects the score. The larger b is, the more documents that are longer than average are penalized. According to [1], experiments suggest that 0.5 < b < 0.8 yields reasonably good results, although the optimal value depends on factors such as the type of documents or queries. Default is 0.75.
epsilon – (float, optional): Used to replace negative idf scores with epsilon * average_idf. Default is 0.25.
documents – (Sequence[Any], optional): The documents to build the index from. Default is None.
document_map_func – (Callable[[Any], str], optional): The function that maps each document to a str. You don’t need it if your documents are already strings.
use_tokenizer – (bool, optional): Whether to use the default tokenizer to split the text into words. Default is True.
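One way to read the epsilon description is sketched below: every negative idf value is replaced by epsilon times the average idf over all terms. The function name floor_negative_idf is hypothetical, and the sketch follows the parameter description rather than the library's source.

```python
from typing import Dict


def floor_negative_idf(idf: Dict[str, float], epsilon: float = 0.25) -> Dict[str, float]:
    """Replace each negative idf value with epsilon * average_idf."""
    if not idf:
        return {}
    # The average is taken over the unmodified idf values of all terms.
    average_idf = sum(idf.values()) / len(idf)
    floor = epsilon * average_idf
    return {term: (floor if value < 0 else value) for term, value in idf.items()}


# Illustrative values only: the average is 0.5, so the floor is 0.125.
raw_idf = {"rare": 2.0, "common": -1.0, "mid": 0.5}
print(floor_negative_idf(raw_idf))
```

Non-negative values pass through unchanged, so only very common terms are affected.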

Examples:

from adalflow.components.retriever.bm25_retriever import BM25Retriever

documents = ["hello world", "world is beautiful", "today is a good day"]

Pass the documents to the __init__ method:

retriever = BM25Retriever(top_k=1, documents=documents)
output = retriever("hello")
print(output)
# Output:
# [RetrieverOutput(doc_indices=[0], doc_scores=[0.6229580777634034], query=None, documents=None)]

Alternatively, create the retriever without documents and pass them later to the build_index_from_documents() method.

Save the index to file and load it back:

retriever.save_to_file("bm25_index.json")
retriever2 = BM25Retriever.load_from_file("bm25_index.json")
output = retriever2("hello")
print(output)

Note: The retriever only fills in doc_indices and doc_scores. The documents field needs to be filled in by the user.
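Filling in the documents from the returned indices can be sketched as follows. FakeRetrieverOutput is a hypothetical stand-in that only mirrors the doc_indices, doc_scores and documents fields shown in the output above, and fill_documents is a hypothetical helper, not part of the library.

```python
from dataclasses import dataclass
from typing import Any, List, Optional, Sequence


@dataclass
class FakeRetrieverOutput:
    """Hypothetical stand-in with the fields shown in the example output."""
    doc_indices: List[int]
    doc_scores: Optional[List[float]] = None
    documents: Optional[List[Any]] = None


def fill_documents(output: FakeRetrieverOutput, documents: Sequence[Any]) -> FakeRetrieverOutput:
    """Look up each returned index in the original document list."""
    output.documents = [documents[i] for i in output.doc_indices]
    return output


documents = ["hello world", "world is beautiful", "today is a good day"]
output = FakeRetrieverOutput(doc_indices=[0], doc_scores=[0.62])
print(fill_documents(output, documents).documents)  # ['hello world']
```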

reset_index(): Used for both initializing and resetting the index.

build_index_from_documents(documents: Sequence[RetrieverDocumentType], document_map_func: Callable[[Any], str] | None = None, **kwargs): Build the index from the text field of each document in the list of documents.

call(input: str | Sequence[str], top_k: int | None = None, **kwargs) → List[RetrieverOutput] | RetrieverOutput

Retrieve the top_k documents for each query and return only their indices and scores, not the documents themselves.

Parameters:

input – Union[str, List[str]]: The query or list of queries
top_k – Optional[int]: The number of documents to return
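How a single query or a list of queries might be normalized, and how the top_k entries might be picked from per-document scores, can be sketched as follows. The helper names as_query_list and top_k_indices are hypothetical, and the library may select and order results differently.

```python
import heapq
from typing import List, Sequence, Tuple, Union


def as_query_list(input: Union[str, Sequence[str]]) -> List[str]:
    """Wrap a single query string in a list; copy a sequence of queries."""
    return [input] if isinstance(input, str) else list(input)


def top_k_indices(scores: Sequence[float], top_k: int) -> Tuple[List[int], List[float]]:
    """Return the indices and scores of the top_k highest-scoring documents."""
    best = heapq.nlargest(top_k, range(len(scores)), key=lambda i: scores[i])
    return best, [scores[i] for i in best]


# Illustrative scores for four documents.
indices, top_scores = top_k_indices([0.1, 0.9, 0.0, 0.5], top_k=2)
print(indices, top_scores)  # [1, 3] [0.9, 0.5]
print(as_query_list("hello"))  # ['hello']
```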

save_to_file(path: str)

Save the state, including the index to a file.

Subclasses may optionally implement a default persistence method. A subclass can leverage the component’s to_dict method to get the state and save it in any file format.

classmethod load_from_file(path: str)

Load the state, including index from a file to restore the retriever.

A subclass can leverage the component’s from_dict method to restore the state from the file.
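The to_dict / from_dict persistence pattern described above can be sketched with a small stand-alone class. TinyIndex and its fields are hypothetical and only illustrate the round trip through a JSON file; they do not reproduce the retriever's real state.

```python
import json
import os
import tempfile
from typing import Any, Dict, List, Optional


class TinyIndex:
    """Hypothetical index holding only JSON-serializable state."""

    def __init__(self, idf: Optional[Dict[str, float]] = None,
                 doc_lengths: Optional[List[int]] = None) -> None:
        self.idf = idf or {}
        self.doc_lengths = doc_lengths or []

    def to_dict(self) -> Dict[str, Any]:
        return {"idf": self.idf, "doc_lengths": self.doc_lengths}

    @classmethod
    def from_dict(cls, state: Dict[str, Any]) -> "TinyIndex":
        return cls(idf=state["idf"], doc_lengths=state["doc_lengths"])

    def save_to_file(self, path: str) -> None:
        # Serialize the state dictionary as JSON.
        with open(path, "w", encoding="utf-8") as f:
            json.dump(self.to_dict(), f)

    @classmethod
    def load_from_file(cls, path: str) -> "TinyIndex":
        # Read the JSON state back and rebuild the object from it.
        with open(path, "r", encoding="utf-8") as f:
            return cls.from_dict(json.load(f))


# Round trip through a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tiny_index.json")
    original = TinyIndex(idf={"hello": 0.51}, doc_lengths=[2, 3, 5])
    original.save_to_file(path)
    restored = TinyIndex.load_from_file(path)
print(restored.to_dict() == original.to_dict())  # True
```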