Evaluation#
Overview#
- BaseEvaluator: Abstract base class for evaluation metrics.
- AnswerMatchAcc: Metric for QA generation; compares predicted answers against ground truth answers.
- RetrieverRecall: Retriever Recall@k metric.
- LLMasJudge: Uses an LLM as a judge to evaluate the quality of predicted answers.
- GEvalJudgeEvaluator / GEvalLLMJudge: Implementation of G-Eval (https://arxiv.org/abs/2303.16634, nlpyang/geval). Instead of returning 1/5 as the score, AdalFlow uses 0.2, so that all metrics yield scores in the range [0, 1].
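To make the normalization concrete, here is a minimal sketch (illustrative only, not the library's code) of mapping a 1-5 G-Eval rating onto [0, 1]:

# Illustrative only -- not AdalFlow's implementation.
def normalize_geval_score(raw_score: int, max_score: int = 5) -> float:
    # e.g., a raw rating of 1 out of 5 becomes 1 / 5 = 0.2
    return raw_score / max_score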
- class AnswerMatchAcc(type: Literal['exact_match', 'fuzzy_match'] = 'exact_match')[source]#
Bases:
BaseEvaluator
Metric for answer matching. It compares the predicted answer with the ground truth answer.
- Parameters:
type (str) – Type of matching evaluation. Can be “exact_match” or “fuzzy_match”. “exact_match” requires the predicted answer to be exactly the same as the ground truth answer. “fuzzy_match” requires the predicted answer to contain the ground truth answer.
Examples
>>> pred_answers = ["positive", "negative", "this is neutral"]
>>> gt_answers = ["positive", "negative", "neutral"]
>>> answer_match_acc = AnswerMatchAcc(type="exact_match")
>>> avg_acc, acc_list = answer_match_acc.compute(pred_answers, gt_answers)
>>> avg_acc
2 / 3
>>> acc_list
[1.0, 1.0, 0.0]
>>> answer_match_acc = AnswerMatchAcc(type="fuzzy_match")
>>> avg_acc, acc_list = answer_match_acc.compute(pred_answers, gt_answers)
>>> avg_acc
1.0
>>> acc_list
[1.0, 1.0, 1.0]
- compute_single_item(y: object, y_gt: object) float [source]#
Compute the match accuracy of the predicted answer for a single query.
Any input type is accepted for the predicted answer (y) and the ground truth answer (y_gt); both are converted to strings before comparison.
- Parameters:
y (object) – Predicted answer.
y_gt (object) – Ground truth answer.
- Returns:
Match accuracy.
- Return type:
float
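For example (a hypothetical single-item call; the score shown assumes the containment-based fuzzy matching described above):

>>> AnswerMatchAcc(type="fuzzy_match").compute_single_item("this is neutral", "neutral")
1.0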
- compute(pred_answers: List[str], gt_answers: List[str]) EvaluationResult [source]#
Compute the match accuracy of the predicted answer for a list of queries.
- Parameters:
pred_answers (List[str]) – List of predicted answer strings.
gt_answers (List[str]) – List of ground truth answer strings.
- Returns:
float: Average match accuracy.
List[float]: Match accuracy values for each query.
- Return type:
tuple
- class RetrieverRecall[source]#
Bases:
BaseEvaluator
Recall@k measures the ratio of the number of relevant context strings in the top-k retrieved context to the total number of ground truth relevant context strings.
In our implementation, we use exact string matching between each ground truth context and the joined retrieved context string. You can substitute the longest common subsequence (LCS) or other similarity metrics (including embedding-based ones) to decide whether a ground truth context counts as a match; see the sketch below.
If you do not have ground truth contexts but only ground truth answers, consider using the RAGAS framework for now. It computes recall as:
Recall = [GT statements that can be attributed to the retrieved context] / [GT statements]
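If you want a looser notion of a match than exact substring containment, a standalone sketch like the following (illustrative, not part of the library) computes per-query recall with a pluggable match predicate:

from typing import Callable, List

def recall_at_k(
    retrieved_context: str,
    gt_contexts: List[str],
    is_match: Callable[[str, str], bool] = lambda gt, retrieved: gt in retrieved,
) -> float:
    """Fraction of ground truth context strings judged present in the retrieved text."""
    if not gt_contexts:
        return 0.0
    matched = sum(1 for gt in gt_contexts if is_match(gt, retrieved_context))
    return matched / len(gt_contexts)

Swapping the default predicate for an LCS-ratio or embedding-similarity threshold gives the softer matching mentioned above.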
Examples
>>> all_retrieved_context = [
    "Apple is founded before Google.",
    "February has 28 days in common years. February has 29 days in leap years. February is the second month of the year.",
]
>>> all_gt_context = [
    [
        "Apple is founded in 1976.",
        "Google is founded in 1998.",
        "Apple is founded before Google.",
    ],
    ["February has 28 days in common years", "February has 29 days in leap years"],
]
>>> retriever_recall = RetrieverRecall()
>>> avg_recall, recall_list = retriever_recall.compute(all_retrieved_context, all_gt_context)
>>> avg_recall
2 / 3
>>> recall_list
[1 / 3, 1.0]
- compute(retrieved_contexts: List[str] | List[List[str]], gt_contexts: List[List[str]]) EvaluationResult [source]#
Compute the recall of the retrieved context for a list of queries.
- Parameters:
retrieved_contexts (Union[List[str], List[List[str]]]) – List of retrieved context strings. Using List[str], we assume you have joined all retrieved context sentences into one string per query.
gt_contexts (List[List[str]]) – List of ground truth context strings.
- Returns:
float: Average recall value.
List[float]: Recall values for each query.
- Return type:
tuple
- class LLMasJudge(llm_judge: Component | None = None)[source]#
Bases:
BaseEvaluator
LLM as judge for evaluating the performance of an LLM.
- Parameters:
llm_judge (Component, optional) – The LLM judge to use. Defaults to DefaultLLMJudge.
Examples
>>> questions = [
    "Is Beijing in China?",
    "Is Apple founded before Google?",
    "Is earth flat?",
]
>>> pred_answers = ["Yes", "Yes, Apple is founded before Google", "Yes"]
>>> gt_answers = ["Yes", "Yes", "No"]
>>> judgement_query = "For the question, does the predicted answer contain the ground truth answer?"
>>> llm_judge = LLMasJudge()
>>> avg_judgement, judgement_list = llm_judge.compute(
    questions, gt_answers, pred_answers, judgement_query
)
>>> avg_judgement
2 / 3
>>> judgement_list
[True, True, False]
Customize the LLMJudge
custom_judge = DefaultLLMJudge()
llm_judge = LLMasJudge(llm_judge=custom_judge)
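A fuller sketch, assuming DefaultLLMJudge accepts model_client and model_kwargs like GEvalLLMJudge below (the import paths, the OpenAIClient choice, and the constructor arguments are assumptions; verify against your AdalFlow version):

from adalflow.eval.llm_as_judge import LLMasJudge, DefaultLLMJudge  # assumed import path
from adalflow.components.model_client import OpenAIClient  # assumed: any ModelClient should work

# model_client/model_kwargs are assumed constructor arguments -- check your version.
custom_judge = DefaultLLMJudge(
    model_client=OpenAIClient(),
    model_kwargs={"model": "gpt-4o", "temperature": 0.0},
)
llm_judge = LLMasJudge(llm_judge=custom_judge)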
- compute(*, pred_answers: List[str], questions: List[str] | None = None, gt_answers: List[str] | None = None) LLMJudgeEvalResult [source]#
Get the judgement of the predicted answer for a list of questions.
- Parameters:
questions (List[str]) – List of question strings.
gt_answers (List[str]) – List of ground truth answer strings.
pred_answers (List[str]) – List of predicted answer strings.
judgement_query (str) – Judgement query string.
- Returns:
The evaluation result.
- Return type:
LLMJudgeEvalResult
- class GEvalJudgeEvaluator(llm_judge: Component | None = None)[source]#
Bases:
BaseEvaluator
LLM as judge for evaluating the performance of an LLM, in the form of G-Eval, with four main metrics:
Relevance, Fluency, Consistency, and Coherence.
- Parameters:
llm_judge (Component, optional) – The LLM evaluator to use. Defaults to GEvalLLMJudge().
- class GEvalLLMJudge(model_client: ModelClient | None = None, model_kwargs: Dict[str, Any] | None = None, template: str | None = None, use_cache: bool = True, default_task: NLGTask | None = None)[source]#
Bases:
Component
Demonstrates how to use an LLM/Generator to output True or False for a judgement query.
You can adapt the template to more tasks, and sometimes you can directly ask the LLM to output a score in the range [0, 1] instead of only True or False.
A call to the LLM judge is equivalent to the _compute_single_item method.
- Parameters:
model_client (ModelClient) – The model client to use for the generator.
model_kwargs (Dict[str, Any], optional) – The model kwargs to pass to the model client. Defaults to {}. Please refer to ModelClient for the details on how to set the model_kwargs for your specific model if it is from our library.
template (str, optional) – The template to use for the LLM evaluator. Defaults to None.
use_cache (bool, optional) – Whether to use cache for the LLM evaluator. Defaults to True.
default_task (NLGTask, optional) – The default task to use for the judgement query. Defaults to None.
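As a construction sketch for the two G-Eval classes (the import path, the OpenAIClient choice, and the NLGTask.SUMMARIZATION member are assumptions; verify against your AdalFlow version):

from adalflow.components.model_client import OpenAIClient  # assumed: any ModelClient works
from adalflow.eval.g_eval import GEvalJudgeEvaluator, GEvalLLMJudge, NLGTask  # assumed path

# Configure the judge with a default NLG task; GEvalJudgeEvaluator wraps it and
# reports the four G-Eval metrics (Relevance, Fluency, Consistency, Coherence).
judge = GEvalLLMJudge(
    model_client=OpenAIClient(),
    model_kwargs={"model": "gpt-4o"},
    default_task=NLGTask.SUMMARIZATION,  # assumed enum member
)
evaluator = GEvalJudgeEvaluator(llm_judge=judge)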