retriever#
Submodules#
- class BM25Retriever(top_k: int = 5, k1: float = 1.5, b: float = 0.75, epsilon: float = 0.25, documents: Sequence[Any] | None = None, document_map_func: Callable[[Any], str] | None = None, use_tokenizer: bool = True)[source]#
Bases: Retriever[str, str]

Fast implementation of the Best Matching 25 (BM25) ranking function.

It expects str as the final document type after document_map_func if the given document is not already in the format of List[str]. It expects Union[str, Sequence[str]] as the input in the retrieve() method.

\[\text{idf}(q_i) = \log\left(\frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}\right)\]

\[\text{score}(q, d) = \sum_{i=1}^{n} \text{idf}(q_i) \cdot \frac{f(q_i, d) \cdot (k1 + 1)}{f(q_i, d) + k1 \cdot \left(1 - b + b \cdot \frac{|d|}{\text{avgdl}}\right)}\]

- Explanation (a worked numeric sketch follows this list):
IDF(q_i) is the inverse document frequency of term q_i, which measures how important the term is. Adding 0.5 to the denominator avoids division by zero; it also diminishes the weight of terms that occur very frequently in the document set and increases the weight of terms that occur rarely.
f(q_i, d) is the term frequency of term q_i in document d, which measures how often the term occurs in the document. The term frequency is normalized by dividing the raw term frequency by the document length.
|d| is the length of the document d in words or tokens.
avgdl is the average document length in the corpus.
N is the total number of documents in the corpus.
n(q_i) is the number of documents containing term q_i.
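To make the formula concrete, here is a small, self-contained sketch (independent of the BM25Retriever class) that computes these quantities for a toy corpus; the flooring of negative IDF values at epsilon * average_idf used by the real implementation is omitted for brevity:

    import math

    # Toy corpus, already tokenized into lists of words.
    corpus = [["hello", "world"], ["world", "is", "beautiful"], ["today", "is", "a", "good", "day"]]
    N = len(corpus)                                # total number of documents
    avgdl = sum(len(d) for d in corpus) / N        # average document length
    k1, b = 1.5, 0.75                              # default constants

    def idf(term: str) -> float:
        n_q = sum(1 for d in corpus if term in d)  # number of documents containing the term
        return math.log((N - n_q + 0.5) / (n_q + 0.5))

    def score(query: list, doc: list) -> float:
        s = 0.0
        for q in query:
            f = doc.count(q)                       # term frequency f(q_i, d)
            s += idf(q) * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
        return s

    print(score(["hello"], corpus[0]))             # ~0.62296, which should match the doc_scores value in the usage example below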
References
[1] https://en.wikipedia.org/wiki/Okapi_BM25
[2] dorianbrown/rank_bm25
[3] Improvements to BM25 and Language Models Examined: https://www.cs.otago.ac.nz/homepages/andrew/papers/2014-2.pdf
- Parameters:
top_k – (int): The number of documents to return
k1 – (float, optional): Constant used for influencing the term frequency saturation. After saturation is reached, additional occurrences of the term add significantly less to the score. According to [1], experiments suggest that 1.2 < k1 < 2 yields reasonably good results, although the optimal value depends on factors such as the type of documents or queries.
b – (float, optional): Constant used for influencing the effect of different document lengths relative to the average document length. The larger b is, the stronger the effect of document length (relative to the average) on the score. According to [1], experiments suggest that 0.5 < b < 0.8 yields reasonably good results, although the optimal value depends on factors such as the type of documents or queries.
epsilon – (float, optional): Used to replace negative idf scores with epsilon * average_idf. Default is 0.25.
documents – (List[Any], optional): The list of documents to build the index from. Default is None.
document_map_func – (Callable, optional): The function to transform the document into List[str]. You don’t need it if your documents are already in format List[str].
use_tokenizer – (bool, optional): Whether to use the default tokenizer to split the text into words. Default is True.
Examples:
    from adalflow.components.retriever.bm25_retriever import BM25Retriever

    documents = ["hello world", "world is beautiful", "today is a good day"]
Pass the documents in the __init__ method:
    retriever = BM25Retriever(top_k=1, documents=documents)
    output = retriever("hello")
    print(output)
    # Output:
    # [RetrieverOutput(doc_indices=[0], doc_scores=[0.6229580777634034], query=None, documents=None)]
Pass the documents via the build_index_from_documents() method:
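A minimal sketch of this pattern, reusing the import and documents list from the example above:

    retriever = BM25Retriever(top_k=1)
    retriever.build_index_from_documents(documents=documents)
    output = retriever("hello")
    print(output)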
Save the index to file and load it back:
    retriever.save_to_file("bm25_index.json")
    retriever2 = BM25Retriever.load_from_file("bm25_index.json")
    output = retriever2("hello")
    print(output)
Note: The retriever only fills in doc_indices and doc_scores. The documents field needs to be filled in by the user.

- build_index_from_documents(documents: Sequence[RetrieverDocumentType], document_map_func: Callable[[Any], str] | None = None, **kwargs)[source]#
Build the index from the text field of each document in the list of documents.
- call(input: str | Sequence[str], top_k: int | None = None, **kwargs) List[RetrieverOutput] [source]#
Retrieve the top k documents for the query and return only the indices of the documents.
- Parameters:
input – Union[str, List[str]]: The query or list of queries
top_k – Optional[int]: The number of documents to return
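For reference, a short sketch of a batched call with an explicit top_k, continuing the BM25 example above:

    outputs = retriever.call(input=["hello", "beautiful"], top_k=2)
    for out in outputs:
        print(out.doc_indices, out.doc_scores)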
- class LLMRetriever(*, top_k: int | None = 1, model_client: ModelClient, model_kwargs: Dict[str, Any] = {}, documents: Sequence[RetrieverDocumentType] | None = None, document_map_func: Callable[[Any], str] | None = None)[source]#
Bases: Retriever[str, str]

Uses an LLM to assess the query and the documents and retrieve the top k relevant indices of the documents.

Users can follow this example to customize the prompt or additionally ask it to output scores along with the indices.
- Parameters:
top_k (Optional[int], optional) – top k documents to fetch. Defaults to 1.
model_client (ModelClient) – the model client to use.
model_kwargs (Dict[str, Any], optional) – the model kwargs. Defaults to {}.
Note
There is a chance that some queries might fail, which will lead to an empty response (None) for that query in the list of RetrieverOutput. Users should handle this case.
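A minimal usage sketch, assuming an OpenAI-backed ModelClient; the client import path and model name are illustrative, and the post-processing of each output depends on the prompt and output_processors (see the Returns note on call() below):

    from adalflow.components.model_client import OpenAIClient  # assumed import path; any ModelClient works

    documents = ["hello world", "world is beautiful", "today is a good day"]
    retriever = LLMRetriever(
        top_k=1,
        model_client=OpenAIClient(),
        model_kwargs={"model": "gpt-4o-mini"},  # illustrative model name
        documents=documents,
    )
    outputs = retriever("which day is good?")
    for output in outputs:
        # Each element is a GeneratorOutput; a failed query yields an empty response.
        if output is None or output.data is None:
            continue
        print(output.data)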
- build_index_from_documents(documents: Sequence[RetrieverDocumentType], document_map_func: Callable[[Any], str] | None = None)[source]#
Prepare the user query input for the retriever.
- call(input: str | Sequence[str], top_k: int | None = None, model_kwargs: Dict[str, Any] = {}) List[RetrieverOutput] [source]#
Retrieve the k relevant documents.
- Parameters:
input (RetrieverStrQueriesType) – a string or a list of strings.
top_k (Optional[int], optional) – top k documents to fetch. Defaults to None.
model_kwargs (Dict[str, Any], optional) – the model kwargs. You can switch to another model provided by the same model client without reinitializing the retriever. Defaults to {}.
- Returns:
Developers should be aware that the returned LLMRetrieverOutputType is actually a list of GeneratorOutput. Post-processing is required, and it depends on how you instruct the model to output in the prompt and what output_processors you set up. For example, if the prompt asks for a list of indices and the output_processors is ListParser(), then it returns GeneratorOutput(data=[indices], error=None, raw_response='[indices]').
- Return type:
RetrieverOutputType
- class RerankerRetriever(model_client: ModelClient, model_kwargs: Dict = {}, top_k: int = 5, documents: Sequence[RetrieverDocumentType] | None = None, document_map_func: Callable[[Any], str] | None = None)[source]#
Bases: Retriever[str, str | Sequence[str]]

A retriever that uses a reranker model to rank the documents and retrieve the top-k documents.
- Parameters:
top_k (int, optional) – The number of top documents to retrieve. Defaults to 5.
model_client (ModelClient) – The model client that has a reranker model, such as CohereAPIClient or TransformersClient.
model_kwargs (Dict) – The model kwargs to pass to the model client.
documents (Optional[RetrieverDocumentsType], optional) – The documents to build the index from. Defaults to None.
document_map_func (Optional[Callable[[Any], str]], optional) – The function to map the document of Any type to the specific type RetrieverDocumentType that the retriever expects. Defaults to None.
Examples:
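A minimal sketch, assuming a local transformers-based reranker; the client import path and model name are illustrative:

    from adalflow.components.model_client import TransformersClient  # assumed import path

    documents = ["hello world", "world is beautiful", "today is a good day"]
    retriever = RerankerRetriever(
        model_client=TransformersClient(),
        model_kwargs={"model": "BAAI/bge-reranker-base"},  # illustrative reranker model
        top_k=2,
        documents=documents,
    )
    output = retriever("which day is good?")
    print(output)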