llm_as_judge#

This metric uses an LLM as a judge to evaluate the quality of predicted answers against ground truth answers.

Classes

DefaultLLMJudge([model_client, ...])

Demonstrate how to use an LLM/Generator to output True or False for a judgement query.

LLMJudgeEvalResult(avg_score, ...)

Holds the average judgement score, the per-item judgement list, and a confidence interval.

LLMasJudge([llm_judge])

LLM as judge for evaluating the performance of an LLM.

class DefaultLLMJudge(model_client: ModelClient | None = None, model_kwargs: Dict[str, Any] | None = None, template: str | None = None, jugement_query: str | None = None, example_str: str | None = None, output_type: Literal['bool', 'float'] = 'bool', use_cache: bool = True)[source]#

Bases: Component

Demonstrate how to use an LLM/Generator to output True or False for a judgement query.

You can supply your own template to adapt the judge to other tasks, and you can ask the LLM to output a score in the range [0, 1] instead of only True or False by setting output_type to "float".

Calling the LLM judge is equivalent to calling its _compute_single_item method.

Parameters:
  • model_client (ModelClient) – The model client to use for the generator.

  • model_kwargs (Dict[str, Any], optional) – The model kwargs to pass to the model client. Defaults to {}. Please refer to ModelClient for the details on how to set the model_kwargs for your specific model if it is from our library.

  • template (str, optional) – The template to use for the LLM evaluator. Defaults to None.

  • jugement_query (str, optional) – The judgement query string. Defaults to DEFAULT_JUDGEMENT_QUERY.

  • example_str (str, optional) – An optional example string to include in the judge's prompt. Defaults to None.

  • output_type (Literal["bool", "float"], optional) – The output type of the judgement. Defaults to “bool”.

  • use_cache (bool, optional) – Whether to use cache for the LLM evaluator. Defaults to True.

Note

The judgement query must ask for True/False rather than Yes/No, so that the judge's response can be parsed.
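
For example, a judge that outputs a float score in [0, 1] instead of a boolean judgement could be configured as follows. This is a minimal sketch: the OpenAIClient import path, model kwargs, and judgement query here are illustrative assumptions, and any ModelClient works.

>>> from adalflow.components.model_client import OpenAIClient
>>> float_judge = DefaultLLMJudge(
...     model_client=OpenAIClient(),
...     model_kwargs={"model": "gpt-4o", "temperature": 0.0},
...     output_type="float",
...     jugement_query="Rate how well the predicted answer matches the ground truth answer, as a score in [0, 1].",
... )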

call(question: str, gt_answer: str, pred_answer: str) → bool | float[source]#

Get the judgement of the predicted answer for a single question.

Parameters:
  • question (str) – Question string.

  • gt_answer (str) – Ground truth answer string.

  • pred_answer (str) – Predicted answer string.

Returns:

Judgement result.

Return type:

bool | float
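
A single-item judgement call might then look like this (a sketch reusing the float_judge configured above; the output value is illustrative):

>>> score = float_judge.call(
...     question="Is Beijing in China?",
...     gt_answer="Yes",
...     pred_answer="Yes, Beijing is the capital of China.",
... )
>>> score
1.0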

class LLMasJudge(llm_judge: Component | None = None)[source]#

Bases: BaseEvaluator

LLM as judge for evaluating the performance of an LLM.

Parameters:

llm_judge (Component, optional) – The LLM judge to use. Defaults to DefaultLLMJudge.

Examples

>>> questions = [
...     "Is Beijing in China?",
...     "Is Apple founded before Google?",
...     "Is earth flat?",
... ]
>>> pred_answers = ["Yes", "Yes, Apple is founded before Google", "Yes"]
>>> gt_answers = ["Yes", "Yes", "No"]
>>> judgement_query = "For the question, does the predicted answer contain the ground truth answer?"
>>> llm_judge = LLMasJudge(llm_judge=DefaultLLMJudge(jugement_query=judgement_query))
>>> eval_rslt = llm_judge.compute(
...     questions=questions, gt_answers=gt_answers, pred_answers=pred_answers
... )
>>> eval_rslt.avg_score
0.67
>>> eval_rslt.judgement_score_list
[True, True, False]

Customize the LLMJudge

To customize the judge, construct a DefaultLLMJudge yourself (or any compatible Component) and pass it to LLMasJudge:

>>> llm_judge = DefaultLLMJudge(jugement_query=judgement_query, output_type="bool")
>>> evaluator = LLMasJudge(llm_judge=llm_judge)
compute(*, pred_answers: List[str], questions: List[str] | None = None, gt_answers: List[str] | None = None) → LLMJudgeEvalResult[source]#

Get the judgement of the predicted answer for a list of questions.

Parameters:
  • pred_answers (List[str]) – List of predicted answer strings.

  • questions (List[str], optional) – List of question strings. Defaults to None.

  • gt_answers (List[str], optional) – List of ground truth answer strings. Defaults to None.

Returns:

The evaluation result.

Return type:

LLMJudgeEvalResult

class LLMJudgeEvalResult(avg_score: float, judgement_score_list: List[bool], confidence_interval: Tuple[float, float])[source]#

Bases: object

The evaluation result of an LLM judge: the average judgement score, the per-item list of judgements, and a confidence interval for the average score.

avg_score: float#
judgement_score_list: List[bool]#
confidence_interval: Tuple[float, float]#
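
The fields can be read directly from the result of LLMasJudge.compute() (a sketch; the numeric outputs shown are illustrative, not real model results):

>>> eval_rslt = llm_judge.compute(pred_answers=pred_answers, questions=questions, gt_answers=gt_answers)
>>> eval_rslt.avg_score
0.67
>>> eval_rslt.confidence_interval
(0.41, 0.93)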