g_eval

Implementation of G-Eval: G-eval <https://arxiv.org/abs/2303.08774>, nlpyang/geval. Instead of reporting a raw score such as 1/5, AdalFlow reports 0.2, so that every metric has a score in the range [0, 1].
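The normalization described above can be sketched as dividing a raw rating by the top of its scale. This is an illustrative sketch only, not AdalFlow's code: the helper name `normalize_score` and its `max_score` parameter are assumptions introduced here.

```python
def normalize_score(raw_score: float, max_score: float) -> float:
    """Map a raw rating into [0, 1] by dividing by the scale maximum (assumed scheme)."""
    if max_score <= 0:
        raise ValueError("max_score must be positive")
    if not 0 <= raw_score <= max_score:
        raise ValueError("raw_score must lie between 0 and max_score")
    return raw_score / max_score


print(normalize_score(1, 5))  # 0.2
print(normalize_score(3, 3))  # 1.0
```

Under this assumption a top Fluency rating (3 on a 1-3 scale) and a top Relevance rating (5 on a 1-5 scale) both map to 1.0, which is what makes the metrics comparable.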

Classes

GEvalJudgeEvaluator([llm_judge])

LLM as judge for evaluating the performance of an LLM in the form of G-Eval with 4 main metrics: Relevance, Fluency, Consistency, Coherence.

GEvalLLMJudge([model_client, model_kwargs, ...])

Demonstrate how to use an LLM/Generator to output True or False for a judgement query.

GEvalMetric(value[, names, module, ...])

NLGTask(value[, names, module, qualname, ...])

class GEvalMetric(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)

Bases: Enum

RELEVANCE = 'Relevance'
FLUENCY = 'Fluency'
CONSISTENCY = 'Consistency'
COHERENCE = 'Coherence'
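
The members above behave like any standard Python Enum. The sketch below re-declares a local mirror of the enum so that it runs without AdalFlow installed; with the library available, the real GEvalMetric would be imported instead. It shows lookup by value and iteration in definition order.

```python
from enum import Enum


class GEvalMetric(Enum):
    # Local mirror of the documented members, for illustration only.
    RELEVANCE = "Relevance"
    FLUENCY = "Fluency"
    CONSISTENCY = "Consistency"
    COHERENCE = "Coherence"


metric = GEvalMetric("Fluency")  # look up a member by its value
print(metric.name)  # FLUENCY
print([m.value for m in GEvalMetric])  # values in definition order
```
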
class NLGTask(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)

Bases: Enum

SUMMARIZATION = {
    'criteria_coherence': 'Coherence (1-5) - the collective quality of all sentences.\n        We align this dimension with the DUC quality question of structure and coherence whereby "the summary should be well-structured and well-organized.\n        The summary should not just be a heap of related information, but should build from sentence to a coherent body of information about a topic.',
    'criteria_consistency': 'Consistency (1-5) - the factual alignment between the summary and the summarized source.\n        A factually consistent summary contains only statements that are entailed by the source document.\n        Annotators were also asked to penalize summaries that contained hallucinated facts. ',
    'criteria_fluency': 'Fluency (1-3): the quality of the summary in terms of grammar, spelling, punctuation, word choice, and sentence structure.\n        - 1: Poor. The summary has many errors that make it hard to understand or sound unnatural.\n        - 2: Fair. The summary has some errors that affect the clarity or smoothness of the text, but the main points are still comprehensible.\n        - 3: Good. The summary has few or no errors and is easy to read and follow.\n        ',
    'criteria_relevance': 'Relevance (1-5) - selection of important content from the source.\n        The summary should include only important information from the source document.\n        Annotators were instructed to penalize summaries which contained redundancies and excess information.',
    'steps_coherence': '1. Read the input text carefully and identify the main topic and key points.\n        2. Read the summary and assess how well it captures the main topic and key points. And if it presents them in a clear and logical order.\n        3. Assign a score for coherence on a scale of 1 to 5, where 1 is the lowest and 5 is the highest based on the Evaluation Criteria.',
    'steps_consistency': '1. Read the summary and the source document carefully.\n        2. Identify the main facts and details it presents.\n        3. Read the summary and compare it to the source document to identify any inconsistencies or factual errors that are not supported by the source.\n        4. Assign a score for consistency based on the Evaluation Criteria.',
    'steps_fluency': None,
    'steps_relevance': '1. Read the summary and the source document carefully.\n        2. Compare the summary to the source document and identify the main points of the article.\n        3. Assess how well the summary covers the main points of the article, and how much irrelevant or redundant information it contains.\n        4. Assign a relevance score from 1 to 5.',
    'task_desc_str': 'You will be given a summary of a text.  Please evaluate the summary based on the following criteria:'
}
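
The SUMMARIZATION value pairs each metric with a `criteria_<metric>` entry and a `steps_<metric>` entry (None for fluency), plus a shared `task_desc_str`. One way such a dictionary could be turned into a per-metric prompt is sketched below. The helper `build_metric_prompt`, the "Evaluation steps:" label, and the abbreviated `task` dictionary are hypothetical; AdalFlow may assemble its prompt differently.

```python
from typing import Dict, Optional

# Abbreviated stand-in for NLGTask.SUMMARIZATION.value (texts shortened for the example).
task: Dict[str, Optional[str]] = {
    "task_desc_str": "You will be given a summary of a text. Please evaluate the summary based on the following criteria:",
    "criteria_relevance": "Relevance (1-5) - selection of important content from the source.",
    "steps_relevance": "1. Read the summary and the source document carefully.",
    "criteria_fluency": "Fluency (1-3): the quality of the summary in terms of grammar.",
    "steps_fluency": None,
}


def build_metric_prompt(task: Dict[str, Optional[str]], metric: str) -> str:
    """Join the task description, the metric's criteria, and its steps (if any)."""
    key = metric.lower()
    parts = [task["task_desc_str"], task[f"criteria_{key}"]]
    steps = task.get(f"steps_{key}")
    if steps is not None:  # fluency defines no evaluation steps
        parts.append("Evaluation steps:\n" + steps)
    return "\n\n".join(parts)


print(build_metric_prompt(task, "Fluency"))
```
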
class GEvalLLMJudge(model_client: ModelClient | None = None, model_kwargs: Dict[str, Any] | None = None, template: str | None = None, use_cache: bool = True, default_task: NLGTask | None = None)

Bases: Component

Demonstrate how to use an LLM/Generator to output True or False for a judgement query.

You can supply your own template to adapt the judge to other tasks, and in some cases you can ask the LLM directly for a score in the range [0, 1] instead of only True or False.

Calling the LLM judge is equivalent to calling the _compute_single_item method.

Parameters:
  • model_client (ModelClient, optional) – The model client to use for the generator. Defaults to None.

  • model_kwargs (Dict[str, Any], optional) – The model kwargs to pass to the model client. Defaults to {}. Please refer to ModelClient for the details on how to set the model_kwargs for your specific model if it is from our library.

  • template (str, optional) – The template to use for the LLM evaluator. Defaults to None.

  • use_cache (bool, optional) – Whether to use cache for the LLM evaluator. Defaults to True.

  • default_task (NLGTask, optional) – The default task to use for the judgement query. Defaults to None.

call(input_str: str) → Dict[str, Any]

Pass the input string with all information to the LLM evaluator and get the judgement.

Parameters:

input_str (str) – The input string with all information.

Returns:

The judgement result.

Return type:

Dict[str, Any]
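
The shape of `call` (one string in, one dictionary out) can be illustrated offline with a stand-in. Everything in this sketch is an assumption made for illustration: the `StubJudge` class, the "Source Document / Summary" layout of the input string, and the metric names used as dictionary keys are not confirmed by the reference above.

```python
from typing import Any, Dict


class StubJudge:
    """Hypothetical stand-in exposing the same call(input_str) -> Dict[str, Any] shape."""

    def call(self, input_str: str) -> Dict[str, Any]:
        # A real judge would send input_str to an LLM; fixed scores keep this runnable offline.
        return {"Relevance": 0.8, "Fluency": 1.0, "Consistency": 0.6, "Coherence": 0.8}


# Assumed layout: everything the judge needs is packed into one string.
input_str = "Source Document: {source}\nSummary: {summary}".format(
    source="The cat sat on the mat all afternoon.",
    summary="A cat sat on a mat.",
)
result = StubJudge().call(input_str)
print(result["Fluency"])  # 1.0
```
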

class GEvalJudgeEvaluator(llm_judge: Component | None = None)

Bases: BaseEvaluator

LLM as judge for evaluating the performance of an LLM in the form of G-Eval with 4 main metrics: Relevance, Fluency, Consistency, Coherence.

Parameters:

llm_judge (Component, optional) – The LLM evaluator to use. Defaults to GEvalLLMJudge().

compute_single_item(input_str: str) → Dict[str, Any]

Compute the score for a single item.

Parameters:

input_str (str) – The input string with all information.

Returns:

The judgement result.

Return type:

Dict[str, Any]

compute(input_strs: List[str]) → Tuple[Dict, List[Dict[str, Any]]]

Get the judgement for each input string in a list.

Parameters:

input_strs (List[str]) – List of input strings.

Returns:

A tuple of an overall result dictionary and the list of per-item judgement results.

Return type:

Tuple[Dict, List[Dict[str, Any]]]
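
A batch method with this signature could score each input and then aggregate per metric. The sketch below assumes the aggregate is a plain per-metric average, which the reference above does not state; `fake_compute_single_item` is a hypothetical offline stand-in for the LLM-backed scorer.

```python
from typing import Any, Dict, List, Tuple


def fake_compute_single_item(input_str: str) -> Dict[str, Any]:
    """Hypothetical per-item scorer: fixed scores that depend only on input length."""
    score = 1.0 if len(input_str) > 10 else 0.5
    return {"Relevance": score, "Fluency": 1.0}


def compute(input_strs: List[str]) -> Tuple[Dict[str, float], List[Dict[str, Any]]]:
    """Score every input, then average each metric across items (assumed aggregation)."""
    per_item = [fake_compute_single_item(s) for s in input_strs]
    totals: Dict[str, float] = {}
    for item in per_item:
        for metric, score in item.items():
            totals[metric] = totals.get(metric, 0.0) + score
    # Guard against an empty batch to avoid dividing by zero.
    averages = {m: t / len(per_item) for m, t in totals.items()} if per_item else {}
    return averages, per_item


averages, per_item = compute(["short", "a noticeably longer input"])
print(averages)  # {'Relevance': 0.75, 'Fluency': 1.0}
```

Returning both the averages and the per-item list lets a caller report one headline number per metric while still inspecting individual judgements.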