retriever#
Submodules#
- class BM25Retriever(top_k: int = 5, k1: float = 1.5, b: float = 0.75, epsilon: float = 0.25, documents: Sequence[Any] | None = None, document_map_func: Callable[[Any], str] | None = None, use_tokenizer: bool = True)[source]#
Bases: Retriever[str, str]

Fast implementation of the Best Matching 25 (BM25) ranking function.

It expects str as the final document type after document_map_func if the given document is not already in the format of List[str]. It expects Union[str, Sequence[str]] as the input in the retrieve() method.

\[\text{idf}(q_i) = \log\left(\frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}\right)\]

\[\text{score}(q, d) = \sum_{i=1}^{n} \text{idf}(q_i) \cdot \frac{f(q_i, d) \cdot (k1 + 1)}{f(q_i, d) + k1 \cdot \left(1 - b + b \cdot \frac{|d|}{\text{avgdl}}\right)}\]

- Explanation (a worked numeric sketch follows this list):
IDF(q_i) is the inverse document frequency of term q_i, which measures how important the term is. Adding 0.5 to the denominator avoids division by zero; it also diminishes the weight of terms that occur very frequently in the document set and increases the weight of terms that occur rarely.
f(q_i, d) is the term frequency of term q_i in document d, which measures how often the term occurs in the document. The term frequency is normalized by dividing the raw term frequency by the document length.
|d| is the length of the document d in words or tokens.
avgdl is the average document length in the corpus.
N is the total number of documents in the corpus.
n(q_i) is the number of documents containing term q_i.
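To make the formula concrete, here is a small, self-contained sketch (independent of the BM25Retriever class) that computes these quantities for a toy corpus; the flooring of negative IDF values at epsilon * average_idf used by the real implementation is omitted for brevity:

    import math

    # Toy corpus, already tokenized into lists of words.
    corpus = [["hello", "world"], ["world", "is", "beautiful"], ["today", "is", "a", "good", "day"]]
    N = len(corpus)                                # total number of documents
    avgdl = sum(len(d) for d in corpus) / N        # average document length
    k1, b = 1.5, 0.75                              # default constants

    def idf(term: str) -> float:
        n_q = sum(1 for d in corpus if term in d)  # number of documents containing the term
        return math.log((N - n_q + 0.5) / (n_q + 0.5))

    def score(query: list, doc: list) -> float:
        s = 0.0
        for q in query:
            f = doc.count(q)                       # term frequency f(q_i, d)
            s += idf(q) * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
        return s

    print(score(["hello"], corpus[0]))             # ~0.62296, which should match the doc_scores value in the usage example below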
References
[1] https://en.wikipedia.org/wiki/Okapi_BM25
[2] dorianbrown/rank_bm25
[3] Improvements to BM25 and Language Models Examined: https://www.cs.otago.ac.nz/homepages/andrew/papers/2014-2.pdf
- Parameters:
top_k – (int): The number of documents to return
k1 – (float, optional): Constant used for influencing the term frequency saturation. After saturation is reached, additional occurrences of the term add significantly less to the score. According to [1], experiments suggest that 1.2 < k1 < 2 yields reasonably good results, although the optimal value depends on factors such as the type of documents or queries.
b – (float, optional): Constant used for influencing the effect of different document lengths relative to the average document length. The larger b is, the stronger the effect of document length (relative to the average) on the score. According to [1], experiments suggest that 0.5 < b < 0.8 yields reasonably good results, although the optimal value depends on factors such as the type of documents or queries.
epsilon – (float, optional): Used to replace negative idf scores with epsilon * average_idf. Default is 0.25.
documents – (List[Any], optional): The list of documents to build the index from. Default is None.
document_map_func – (Callable, optional): The function to transform the document into List[str]. You don’t need it if your documents are already in format List[str].
use_tokenizer – (bool, optional): Whether to use the default tokenizer to split the text into words. Default is True.
Examples:
    from adalflow.components.retriever.bm25_retriever import BM25Retriever

    documents = ["hello world", "world is beautiful", "today is a good day"]
Pass the documents in the __init__ method:
    retriever = BM25Retriever(top_k=1, documents=documents)
    output = retriever("hello")
    print(output)
    # Output:
    # [RetrieverOutput(doc_indices=[0], doc_scores=[0.6229580777634034], query=None, documents=None)]
Pass the documents via the build_index_from_documents() method:
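A minimal sketch of this pattern, reusing the import and documents list from the example above:

    retriever = BM25Retriever(top_k=1)
    retriever.build_index_from_documents(documents=documents)
    output = retriever("hello")
    print(output)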
Save the index to file and load it back:
    retriever.save_to_file("bm25_index.json")
    retriever2 = BM25Retriever.load_from_file("bm25_index.json")
    output = retriever2("hello")
    print(output)
Note: The retriever only fills in doc_indices and doc_scores. The documents field needs to be filled in by the user.

- build_index_from_documents(documents: Sequence[RetrieverDocumentType], document_map_func: Callable[[Any], str] | None = None, **kwargs)[source]#
Build the index from the text field of each document in the list of documents.
- call(input: str | Sequence[str], top_k: int | None = None, **kwargs) List[RetrieverOutput] [source]#
Retrieve the top k documents for the query and return only the indices of the documents.
- Parameters:
input – Union[str, List[str]]: The query or list of queries
top_k – Optional[int]: The number of documents to return
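For reference, a short sketch of a batched call with an explicit top_k, continuing the BM25 example above:

    outputs = retriever.call(input=["hello", "beautiful"], top_k=2)
    for out in outputs:
        print(out.doc_indices, out.doc_scores)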
- class LLMRetriever(*, top_k: int | None = 1, model_client: ModelClient, model_kwargs: Dict[str, Any] = {}, documents: Sequence[RetrieverDocumentType] | None = None, document_map_func: Callable[[Any], str] | None = None)[source]#
Bases: Retriever[str, str]

Uses an LLM to assess the query and the documents and retrieve the top k relevant indices of the documents.

Users can follow this example to customize the prompt or additionally ask it to output scores along with the indices.
- Parameters:
top_k (Optional[int], optional) – top k documents to fetch. Defaults to 1.
model_client (ModelClient) – the model client to use.
model_kwargs (Dict[str, Any], optional) – the model kwargs. Defaults to {}.
Note
There is a chance that some queries might fail, which will lead to an empty response (None) for that query in the list of RetrieverOutput. Users should handle this case.
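A minimal usage sketch, assuming an OpenAI-backed ModelClient; the client import path and model name are illustrative, and the post-processing of each output depends on the prompt and output_processors (see the Returns note on call() below):

    from adalflow.components.model_client import OpenAIClient  # assumed import path; any ModelClient works

    documents = ["hello world", "world is beautiful", "today is a good day"]
    retriever = LLMRetriever(
        top_k=1,
        model_client=OpenAIClient(),
        model_kwargs={"model": "gpt-4o-mini"},  # illustrative model name
        documents=documents,
    )
    outputs = retriever("which day is good?")
    for output in outputs:
        # Each element is a GeneratorOutput; a failed query yields an empty response.
        if output is None or output.data is None:
            continue
        print(output.data)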
- build_index_from_documents(documents: Sequence[RetrieverDocumentType], document_map_func: Callable[[Any], str] | None = None)[source]#
Prepare the user query input for the retriever.
- call(input: str | Sequence[str], top_k: int | None = None, model_kwargs: Dict[str, Any] = {}) List[RetrieverOutput] [source]#
Retrieve the k relevant documents.
- Parameters:
input (RetrieverStrQueriesType) – a string or a list of strings.
top_k (Optional[int], optional) – top k documents to fetch. Defaults to None.
model_kwargs (Dict[str, Any], optional) – the model kwargs. You can switch to another model provided by the same model client without reinitializing the retriever. Defaults to {}.
- Returns:
Developers should be aware that the returned LLMRetrieverOutputType is actually a list of GeneratorOutput. Post-processing is required, and it depends on how you instruct the model to output in the prompt and what output_processors you set up. For example, if the prompt asks for a list of indices and the output_processors is ListParser(), then it returns GeneratorOutput(data=[indices], error=None, raw_response='[indices]').
- Return type:
RetrieverOutputType
- class RerankerRetriever(model_client: ModelClient, model_kwargs: Dict = {}, top_k: int = 5, documents: Sequence[RetrieverDocumentType] | None = None, document_map_func: Callable[[Any], str] | None = None)[source]#
Bases: Retriever[str, str | Sequence[str]]

A retriever that uses a reranker model to rank the documents and retrieve the top-k documents.
- Parameters:
top_k (int, optional) – The number of top documents to retrieve. Defaults to 5.
model_client (ModelClient) – The model client that has a reranker model, such as CohereAPIClient or TransformersClient.
model_kwargs (Dict) – The model kwargs to pass to the model client.
documents (Optional[RetrieverDocumentsType], optional) – The documents to build the index from. Defaults to None.
document_map_func (Optional[Callable[[Any], str]], optional) – The function to map the document of Any type to the specific type RetrieverDocumentType that the retriever expects. Defaults to None.
Examples:
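A minimal sketch, assuming a local transformers-based reranker; the client import path and model name are illustrative:

    from adalflow.components.model_client import TransformersClient  # assumed import path

    documents = ["hello world", "world is beautiful", "today is a good day"]
    retriever = RerankerRetriever(
        model_client=TransformersClient(),
        model_kwargs={"model": "BAAI/bge-reranker-base"},  # illustrative reranker model
        top_k=2,
        documents=documents,
    )
    output = retriever("which day is good?")
    print(output)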