bm25_retriever
BM25 retriever implementation.
Classes
BM25Retriever: Fast implementation of the Best Matching 25 ranking function.
- class BM25Retriever(top_k: int = 5, k1: float = 1.5, b: float = 0.75, epsilon: float = 0.25, documents: Sequence[Any] | None = None, document_map_func: Callable[[Any], str] | None = None, use_tokenizer: bool = True)
Bases: Retriever[str, str]
Fast implementation of the Best Matching 25 ranking function.
It expects str as the final document type after document_map_func is applied, unless the documents are already given as List[str]. The retrieve() method expects Union[str, Sequence[str]] as its input.
\[ \begin{aligned} \text{idf}(q_i) &= \log\left(\frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}\right) \\ \text{score}(q, d) &= \sum_{i=1}^{n} \text{idf}(q_i) \cdot \frac{f(q_i, d) \cdot (k1 + 1)}{f(q_i, d) + k1 \cdot \left(1 - b + b \cdot \frac{|d|}{\text{avgdl}}\right)} \end{aligned} \]
- Explanation:
idf(q_i) is the inverse document frequency of term q_i, which measures how informative the term is. Adding 0.5 to both the numerator and the denominator avoids division by zero; the ratio itself lowers the weight of terms that occur in many documents and raises the weight of terms that occur in few.
f(q_i, d) is the term frequency of term q_i in document d, that is, how often the term occurs in the document. Its contribution is normalized for document length through the |d| / avgdl term in the denominator.
|d| is the length of the document d in words or tokens.
avgdl is the average document length in the corpus.
N is the total number of documents in the corpus.
n(q_i) is the number of documents containing term q_i.
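The two formulas above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the function name `bm25_scores` is hypothetical, documents are assumed to be tokenized by whitespace, and no adjustment is made for negative idf values.

```python
import math
from collections import Counter


def bm25_scores(query, corpus, k1=1.5, b=0.75):
    """Score every tokenized document in `corpus` against the tokenized `query`.

    Hypothetical helper that follows the idf and score formulas term by term.
    """
    n_docs = len(corpus)                                   # N
    avgdl = sum(len(doc) for doc in corpus) / n_docs       # average document length
    # n(q_i): number of documents that contain each term at least once
    doc_freq = Counter(term for doc in corpus for term in set(doc))

    scores = []
    for doc in corpus:
        term_freq = Counter(doc)                           # f(q_i, d)
        length_norm = 1 - b + b * len(doc) / avgdl
        score = 0.0
        for term in query:
            f = term_freq[term]
            if f == 0:
                continue  # a term absent from the document contributes nothing
            n_q = doc_freq[term]
            idf = math.log((n_docs - n_q + 0.5) / (n_q + 0.5))
            score += idf * f * (k1 + 1) / (f + k1 * length_norm)
        scores.append(score)
    return scores


corpus = [text.split() for text in ["hello world", "world is beautiful", "today is a good day"]]
scores = bm25_scores(["hello"], corpus)
best = max(range(len(scores)), key=scores.__getitem__)
print(best, scores)
```

With these three documents and the query "hello", only the first document contains the term, so it is the only one with a non-zero score (about 0.623 under the default k1 and b).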
References
[1] https://en.wikipedia.org/wiki/Okapi_BM25
[2] dorianbrown/rank_bm25
[3] Improvements to BM25 and Language Models Examined: https://www.cs.otago.ac.nz/homepages/andrew/papers/2014-2.pdf
- Parameters:
top_k – (int): The number of documents to return
k1 – (float, optional): Constant that controls term frequency saturation. Once saturation is reached, additional occurrences of the term add significantly less to the score. According to [1], experiments suggest that 1.2 < k1 < 2 yields reasonably good results, although the optimal value depends on factors such as the type of documents or queries. Default is 1.5.
b – (float, optional): Constant that controls how strongly document length, relative to the average document length, affects the score. The larger b is, the more a document that is longer than average is penalized. According to [1], experiments suggest that 0.5 < b < 0.8 yields reasonably good results, although the optimal value depends on factors such as the type of documents or queries. Default is 0.75.
epsilon – (float, optional): Used to replace a negative idf score with epsilon * average_idf. Default is 0.25.
documents – (List[Any], optional): The list of documents to build the index from. Default is None.
document_map_func – (Callable, optional): The function that maps each document to a str. You don’t need it if your documents are already in the format List[str]. Default is None.
use_tokenizer – (bool, optional): Whether to use the default tokenizer to split the text into words. Default is True.
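The effect of k1 and epsilon can be made concrete with a small sketch. Both helpers below are hypothetical names written only from the parameter descriptions above; in particular, the assumption that average_idf is the mean idf over all indexed terms follows the approach of the rank_bm25 project cited in [2] and may differ from what this class does internally.

```python
import math


def tf_component(f, k1=1.5, length_norm=1.0):
    """Term-frequency factor of the score; it approaches k1 + 1 as f grows."""
    return f * (k1 + 1) / (f + k1 * length_norm)


def idf_with_epsilon_floor(doc_freq, n_docs, epsilon=0.25):
    """Compute idf per term and replace negative values with epsilon * average_idf.

    `doc_freq` maps each term to the number of documents that contain it.
    Assumption: average_idf is the mean of the raw idf values over all terms.
    """
    idf = {
        term: math.log((n_docs - n + 0.5) / (n + 0.5))
        for term, n in doc_freq.items()
    }
    average_idf = sum(idf.values()) / len(idf)
    floor = epsilon * average_idf
    return {term: (value if value >= 0 else floor) for term, value in idf.items()}


# Saturation: each extra occurrence adds less than the previous one.
gains = [tf_component(f + 1) - tf_component(f) for f in range(1, 5)]

# A term found in 9 of 10 documents has a negative raw idf and gets the floor value.
idf = idf_with_epsilon_floor({"the": 9, "rare": 1, "odd": 2}, n_docs=10)
print(gains, idf)
```

A term that appears in more than half of the documents has a negative raw idf, so without the floor it would lower the score of every document that contains it.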
Examples:
from adalflow.components.retriever.bm25_retriever import BM25Retriever

documents = ["hello world", "world is beautiful", "today is a good day"]
Pass the documents from the __init__ method:
retriever = BM25Retriever(top_k=1, documents=documents)
output = retriever("hello")
print(output)
# Output:
# [RetrieverOutput(doc_indices=[0], doc_scores=[0.6229580777634034], query=None, documents=None)]
Pass the documents from the build_index_from_documents() method:
Save the index to file and load it back:
retriever.save_to_file("bm25_index.json")
retriever2 = BM25Retriever.load_from_file("bm25_index.json")
output = retriever2("hello")
print(output)
Note: The retriever only fills in doc_indices and doc_scores. The documents field needs to be filled in by the user.
- build_index_from_documents(documents: Sequence[RetrieverDocumentType], document_map_func: Callable[[Any], str] | None = None, **kwargs)
Build the index from the text field of each document in the list of documents.
- call(input: str | Sequence[str], top_k: int | None = None, **kwargs) → List[RetrieverOutput]
Retrieve the top_k documents for each query and return only their indices and scores, not the documents themselves.
- Parameters:
input – Union[str, List[str]]: The query or list of queries
top_k – Optional[int]: The number of documents to return
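Because only indices and scores are returned, the caller has to map the indices back to the original documents. The sketch below shows one way to do that. `RetrieverOutputStub` is a hypothetical stand-in that mirrors the four fields visible in the printed RetrieverOutput, and `fill_documents` is an illustrative helper, not part of the library.

```python
from dataclasses import dataclass
from typing import Any, List, Optional, Sequence


@dataclass
class RetrieverOutputStub:
    """Hypothetical stand-in with the fields shown in the printed RetrieverOutput."""
    doc_indices: List[int]
    doc_scores: Optional[List[float]] = None
    query: Optional[str] = None
    documents: Optional[List[Any]] = None


def fill_documents(outputs: List[RetrieverOutputStub], documents: Sequence[Any]) -> List[RetrieverOutputStub]:
    """Attach the source documents to each output by looking up its doc_indices."""
    for output in outputs:
        output.documents = [documents[i] for i in output.doc_indices]
    return outputs


documents = ["hello world", "world is beautiful", "today is a good day"]
# Values copied from the example output on this page, one entry per query.
outputs = [RetrieverOutputStub(doc_indices=[0], doc_scores=[0.6229580777634034])]
filled = fill_documents(outputs, documents)
print(filled[0].documents)
```

The lookup relies on the indices referring to positions in the same document list that was used to build the index, so that list should be kept unchanged and in the same order.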