llm_as_judge#
Metric that uses an LLM as a judge to evaluate the quality of predicted answers.
Classes

DefaultLLMJudge: Demonstrate how to use an LLM/Generator to output True or False for a judgement query.

LLMasJudge: LLM as judge for evaluating the performance of an LLM.
- class DefaultLLMJudge(model_client: ModelClient | None = None, model_kwargs: Dict[str, Any] | None = None, template: str | None = None, jugement_query: str | None = None, example_str: str | None = None, output_type: Literal['bool', 'float'] = 'bool', use_cache: bool = True)[source]#
Bases: Component
Demonstrate how to use an LLM/Generator to output True or False for a judgement query.
You can supply your own template to adapt the judge to other tasks, and you can ask the LLM to output a score in the range [0, 1] instead of only True or False.
Calling the LLM judge is equivalent to calling its _compute_single_item method.
- Parameters:
model_client (ModelClient) – The model client to use for the generator.
model_kwargs (Dict[str, Any], optional) – The model kwargs to pass to the model client. Defaults to {}. Please refer to ModelClient for details on how to set the model_kwargs for your specific model if it is from our library.
template (str, optional) – The template to use for the LLM evaluator. Defaults to None.
jugement_query (str, optional) – The judgement query string. Defaults to DEFAULT_JUDGEMENT_QUERY.
example_str (str, optional) – Example string to include in the judge prompt. Defaults to None.
output_type (Literal["bool", "float"], optional) – The output type of the judgement. Defaults to "bool".
use_cache (bool, optional) – Whether to use cache for the LLM evaluator. Defaults to True.
Note
The judgement query must ask the model to answer with True/False rather than Yes/No.
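For a concrete picture, here is a minimal sketch of constructing the judge and scoring a single prediction. The import paths, the OpenAIClient, and the model choice are assumptions for illustration; substitute any supported ModelClient:

from adalflow.components.model_client import OpenAIClient  # assumed import path
from adalflow.eval.llm_as_judge import DefaultLLMJudge  # assumed import path

# Build a judge that answers True/False with the default judgement query.
judge = DefaultLLMJudge(
    model_client=OpenAIClient(),
    model_kwargs={"model": "gpt-4o", "temperature": 0.0},  # hypothetical model choice
    output_type="bool",
)
# One judgement call, mirroring the call() signature documented below.
correct = judge.call(
    question="Is Beijing in China?",
    gt_answer="Yes",
    pred_answer="Yes, Beijing is in China.",
)  # True if the judge deems the prediction correct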
- call(question: str, gt_answer: str, pred_answer: str) bool | float [source]#
Get the judgement of the predicted answer for a single question.
- Parameters:
question (str) – Question string.
gt_answer (str) – Ground truth answer string.
pred_answer (str) – Predicted answer string.
- Returns:
Judgement result.
- Return type:
bool | float
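With output_type="float", the same call returns a score in the range [0, 1] rather than a boolean. A brief sketch, reusing the assumed client setup from above:

# A judge configured to emit a graded score instead of True/False.
scorer = DefaultLLMJudge(
    model_client=OpenAIClient(),  # assumed client, as above
    model_kwargs={"model": "gpt-4o"},
    output_type="float",
)
score = scorer.call(
    question="Is earth flat?",
    gt_answer="No",
    pred_answer="Yes",
)  # expected to be near 0.0 for a contradicting answer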
- class LLMasJudge(llm_judge: Component | None = None)[source]#
Bases: BaseEvaluator
LLM as judge for evaluating the performance of an LLM.
- Parameters:
llm_judge (Component, optional) – The LLM judge to use. Defaults to DefaultLLMJudge.
Examples
>>> questions = [
...     "Is Beijing in China?",
...     "Is Apple founded before Google?",
...     "Is earth flat?",
... ]
>>> pred_answers = ["Yes", "Yes, Apple is founded before Google", "Yes"]
>>> gt_answers = ["Yes", "Yes", "No"]
>>> judgement_query = (
...     "For the question, does the predicted answer contain the ground truth answer?"
... )
>>> llm_judge = LLMasJudge()
>>> eval_result = llm_judge.compute(
...     questions=questions, gt_answers=gt_answers, pred_answers=pred_answers
... )
The returned LLMJudgeEvalResult holds the average judgement (2/3 for this batch) and the per-question judgement list ([True, True, False] here). The judgement_query defined above is used when customizing the judge, shown next.
Customize the LLMJudge

llm_judge = DefaultLLMJudge(jugement_query=judgement_query)
evaluator = LLMasJudge(llm_judge=llm_judge)
- compute(*, pred_answers: List[str], questions: List[str] | None = None, gt_answers: List[str] | None = None) LLMJudgeEvalResult [source]#
Get the judgement of the predicted answer for a list of questions.
- Parameters:
pred_answers (List[str]) – List of predicted answer strings.
questions (List[str], optional) – List of question strings. Defaults to None.
gt_answers (List[str], optional) – List of ground truth answer strings. Defaults to None.
- Returns:
The evaluation result.
- Return type:
LLMJudgeEvalResult
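End to end, a sketch that wires a customized judge into the evaluator and runs a batch evaluation. Only the signatures documented above are used; the model client setup is assumed as in the earlier sketches:

# Customized judge: your own judgement query, boolean output.
judge = DefaultLLMJudge(
    model_client=OpenAIClient(),  # assumed client
    model_kwargs={"model": "gpt-4o"},
    jugement_query="Does the predicted answer contain the ground truth answer? Answer True or False.",
)
evaluator = LLMasJudge(llm_judge=judge)
# compute() takes keyword-only lists and returns an LLMJudgeEvalResult.
result = evaluator.compute(
    questions=["Is earth flat?"],
    gt_answers=["No"],
    pred_answers=["Yes"],
)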