Evaluation#

Overview#

eval.base

Abstract base class for evaluation metrics.

eval.answer_match_acc

This is the metric for QA generation.

eval.retriever_recall

Retriever Recall @k metric.

eval.llm_as_judge

Metric that uses an LLM as a judge to evaluate the quality of predicted answers.

eval.g_eval

Implementation of G-Eval (https://arxiv.org/abs/2303.08774, nlpyang/geval). Instead of reporting scores on a 1-5 scale, AdalFlow normalizes them (e.g. 1/5 becomes 0.2) so that every metric falls in the range [0, 1].


class AnswerMatchAcc(type: Literal['exact_match', 'fuzzy_match'] = 'exact_match')[source]#

Bases: BaseEvaluator

Metric for answer matching. It compares the predicted answer with the ground truth answer.

Parameters:

type (str) – Type of matching evaluation. Can be “exact_match” or “fuzzy_match”. “exact_match” requires the predicted answer to be exactly the same as the ground truth answer. “fuzzy_match” requires the predicted answer to contain the ground truth answer.

Examples

>>> pred_answers = ["positive", "negative", "this is neutral"]
>>> gt_answers = ["positive", "negative", "neutral"]
>>> answer_match_acc = AnswerMatchAcc(type="exact_match")
>>> avg_acc, acc_list = answer_match_acc.compute(pred_answers, gt_answers)
>>> avg_acc
2 / 3
>>> acc_list
[1.0, 1.0, 0.0]
>>> answer_match_acc = AnswerMatchAcc(type="fuzzy_match")
>>> avg_acc, acc_list = answer_match_acc.compute(pred_answers, gt_answers)
>>> avg_acc
1.0
>>> acc_list
[1.0, 1.0, 1.0]
compute_single_item(y: object, y_gt: object) float[source]#

Compute the match accuracy of the predicted answer for a single query.

Any type of input is accepted for both arguments; values are converted to strings before comparison.

Parameters:
  • y (object) – Predicted answer.

  • y_gt (object) – Ground truth answer.

Returns:

Match accuracy.

Return type:

float
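
A minimal usage sketch of single-item matching. The import path is taken from the module index above and the expected outputs assume a plain str() conversion before exact comparison, as described; treat both as assumptions.

>>> from adalflow.eval.answer_match_acc import AnswerMatchAcc  # path assumed from the module index
>>> acc = AnswerMatchAcc(type="exact_match")
>>> acc.compute_single_item(42, "42")  # both sides compared after string conversion
1.0
>>> acc.compute_single_item("yes", "no")
0.0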

compute(pred_answers: List[str], gt_answers: List[str]) EvaluationResult[source]#

Compute the match accuracy of the predicted answer for a list of queries.

Parameters:
  • pred_answers (List[str]) – List of predicted answer strings.

  • gt_answers (List[str]) – List of ground truth answer strings.

Returns:

  • float: Average match accuracy.

  • List[float]: Match accuracy values for each query.

Return type:

tuple

class RetrieverRecall[source]#

Bases: BaseEvaluator

Recall@k measures the ratio of the number of relevant context strings in the top-k retrieved context to the total number of ground truth relevant context strings.

In our implementation, we use exact string matching between each ground truth context and the joined retrieved context string. You can instead use the longest common subsequence (LCS) or other similarity metrics (including embedding-based ones) to decide whether a ground truth string counts as retrieved.

If you do not have the ground truth context, but only ground truth answers, consider using the RAGAS framework for now. It computes recall as:

Recall = [GT statements that can be attributed to the retrieved context] / [GT statements]
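
The exact-match variant described above can be sketched as plain substring checking. The helper below is illustrative only (not the library's internal code) and assumes each retrieved context is already joined into a single string per query.

from typing import List, Tuple

def recall_at_k(retrieved_contexts: List[str], gt_contexts: List[List[str]]) -> Tuple[float, List[float]]:
    # For each query, count the ground truth strings that appear verbatim
    # in the joined retrieved context, then average the recall over queries.
    recall_list = []
    for retrieved_str, gt_list in zip(retrieved_contexts, gt_contexts):
        hits = sum(1 for gt in gt_list if gt in retrieved_str)
        recall_list.append(hits / len(gt_list))
    return sum(recall_list) / len(recall_list), recall_list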

Examples

>>> all_retrieved_context = [
    "Apple is founded before Google.",
    "February has 28 days in common years. February has 29 days in leap years. February is the second month of the year.",
]
>>> all_gt_context = [
    [
        "Apple is founded in 1976.",
        "Google is founded in 1998.",
        "Apple is founded before Google.",
    ],
    ["February has 28 days in common years", "February has 29 days in leap years"],
]
>>> retriever_recall = RetrieverRecall()
>>> avg_recall, recall_list = retriever_recall.compute(all_retrieved_context, all_gt_context)
>>> avg_recall
2 / 3
>>> recall_list
[1 / 3, 1.0]

compute(retrieved_contexts: List[str] | List[List[str]], gt_contexts: List[List[str]]) EvaluationResult[source]#

Compute the recall of the retrieved context for a list of queries.

Parameters:
  • retrieved_contexts (Union[List[str], List[List[str]]]) – List of retrieved context strings. With List[str], we assume you have already joined all the context sentences for each query into one string.

  • gt_contexts (List[List[str]]) – List of ground truth context strings.

Returns:

  • float: Average recall value.

  • List[float]: Recall values for each query.

Return type:

tuple
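
If your retriever returns separate chunks per query, one option consistent with the List[str] form shown above is to join the chunks yourself before calling compute. The chunk lists below are hypothetical, and the gt contexts are the ones from the class example.

>>> per_query_chunks = [
    ["Apple is founded before Google.", "Google is a search engine."],
    ["February has 28 days in common years.", "February has 29 days in leap years."],
]
>>> joined_contexts = ["\n".join(chunks) for chunks in per_query_chunks]
>>> avg_recall, recall_list = RetrieverRecall().compute(joined_contexts, all_gt_context)
>>> # recall_list would again be [1 / 3, 1.0] under exact substring matching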

class LLMasJudge(llm_judge: Component | None = None)[source]#

Bases: BaseEvaluator

LLM as judge for evaluating the performance of an LLM.

Parameters:

llm_judge (Component, optional) – The LLM judge component to use. Defaults to DefaultLLMJudge.

Examples

>>> questions = [
"Is Beijing in China?",
"Is Apple founded before Google?",
"Is earth flat?",
]
>>> pred_answers = ["Yes", "Yes, Apple is founded before Google", "Yes"]
>>> gt_answers = ["Yes", "Yes", "No"]
>>> judgement_query = "For the question, does the predicted answer contain the ground truth answer?"
>>> llm_judge = LLMasJudge()
>>> avg_judgement, judgement_list = llm_judge.compute(
questions=questions, gt_answers=gt_answers, pred_answers=pred_answers
)
>>> avg_judgement
2 / 3
>>> judgement_list
[True, True, False]

Customize the LLMJudge

To plug in your own judge, pass a configured judge component via the llm_judge argument, as sketched below.
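
A minimal sketch, assuming DefaultLLMJudge accepts a model client and model kwargs like the other judge components documented here; verify the constructor arguments and import paths against your installed version.

>>> from adalflow.eval.llm_as_judge import DefaultLLMJudge, LLMasJudge
>>> from adalflow.components.model_client import OpenAIClient  # assumed client; swap in your own
>>> custom_judge = DefaultLLMJudge(
    model_client=OpenAIClient(),
    model_kwargs={"model": "gpt-4o", "temperature": 0},  # hypothetical model settings
)
>>> llm_judge = LLMasJudge(llm_judge=custom_judge)
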
compute(*, pred_answers: List[str], questions: List[str] | None = None, gt_answers: List[str] | None = None) LLMJudgeEvalResult[source]#

Get the judgement of the predicted answer for a list of questions.

Parameters:
  • questions (List[str], optional) – List of question strings.

  • gt_answers (List[str], optional) – List of ground truth answer strings.

  • pred_answers (List[str]) – List of predicted answer strings.

Returns:

The evaluation result.

Return type:

LLMJudgeEvalResult

class GEvalJudgeEvaluator(llm_judge: Component | None = None)[source]#

Bases: BaseEvaluator

LLM as judge for evaluating the performance of an LLM, following G-Eval, with four main metrics:

Relevance, Fluency, Consistency, Coherence.

Parameters:

llm_judge (Component, optional) – The LLM evaluator to use. Defaults to GEvalLLMJudge().

compute_single_item(input_str: str) Dict[str, Any][source]#

Compute the score for a single item.

Parameters:

input_str (str) – The input string with all information.

Returns:

The judgement result.

Return type:

Dict[str, Any]

compute(input_strs: List[str]) Tuple[Dict, List[Dict[str, Any]]][source]#

Get the judgement of the predicted answer for a list of questions.

Parameters:

input_strs (List[str]) – List of input strings.

Returns:

The judgement results for all inputs.

Return type:

Tuple[Dict, List[Dict[str, Any]]]

class GEvalLLMJudge(model_client: ModelClient | None = None, model_kwargs: Dict[str, Any] | None = None, template: str | None = None, use_cache: bool = True, default_task: NLGTask | None = None)[source]#

Bases: Component

Demonstrate how to use an LLM/Generator to output True or False for a judgement query.

You can use your own template to adapt it to more tasks, and sometimes you can directly ask the LLM to output a score in the range [0, 1] instead of only True or False.

A call to the LLM judge is equivalent to the _compute_single_item method.

Parameters:
  • model_client (ModelClient) – The model client to use for the generator.

  • model_kwargs (Dict[str, Any], optional) – The model kwargs to pass to the model client. Defaults to {}. Please refer to ModelClient for the details on how to set the model_kwargs for your specific model if it is from our library.

  • template (str, optional) – The template to use for the LLM evaluator. Defaults to None.

  • use_cache (bool, optional) – Whether to use cache for the LLM evaluator. Defaults to True.

  • default_task (NLGTask, optional) – The default task to use for the judgement query. Defaults to None.

call(input_str: str) Dict[str, Any][source]#

Pass the input string with all information to the LLM evaluator and get the judgement.

Parameters:

input_str (str) – The input string with all information.

Returns:

The judgement result.

Return type:

Dict[str, Any]
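
A hedged end-to-end sketch that combines GEvalLLMJudge and GEvalJudgeEvaluator. The model choice, model kwargs, NLGTask variant, and the exact fields expected in input_str are assumptions; adapt them to your setup and to the judge's template.

>>> from adalflow.eval.g_eval import GEvalJudgeEvaluator, GEvalLLMJudge, NLGTask
>>> from adalflow.components.model_client import OpenAIClient  # assumed client; swap in your own
>>> g_eval_judge = GEvalLLMJudge(
    model_client=OpenAIClient(),
    model_kwargs={"model": "gpt-4o", "temperature": 1},  # hypothetical model settings
    default_task=NLGTask.SUMMARIZATION,  # assumes a summarization task variant
)
>>> evaluator = GEvalJudgeEvaluator(llm_judge=g_eval_judge)
>>> input_str = "Source Document: <source text>\nSummary: <generated summary>"  # assumed input layout
>>> avg_scores, per_item_scores = evaluator.compute(input_strs=[input_str])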

class GEvalMetric(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]#

Bases: Enum

RELEVANCE = 'Relevance'#
FLUENCY = 'Fluency'#
CONSISTENCY = 'Consistency'#
COHERENCE = 'Coherence'#