Evaluator Module¶
The evaluator module provides the evaluator classes and helper functions used to score model predictions on different multimodal tasks.
Base Evaluator¶
- class flagevalmm.evaluator.base_evaluator.QuestionMapping(original_question_id: str, is_multi_inference: bool, inference_index: int, total_inferences: int)[source]¶
Bases:
object
- class flagevalmm.evaluator.base_evaluator.BaseEvaluator(is_clean: bool = True, use_llm_evaluator: bool = False, eval_func: Callable | str | None = None, base_dir: str = '', detailed_keys: List[str] | None = None, aggregation_fields: List[str] | None = ['raw_answer'], **kwargs)[source]¶
Bases:
object
- __init__(is_clean: bool = True, use_llm_evaluator: bool = False, eval_func: Callable | str | None = None, base_dir: str = '', detailed_keys: List[str] | None = None, aggregation_fields: List[str] | None = ['raw_answer'], **kwargs) → None[source]¶
- expand_multi_inference_predictions(predictions: List[Dict]) → Tuple[List[Dict], Dict[int, QuestionMapping]][source]¶
Expand multiple inference predictions into individual predictions.
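- Returns:
A tuple (expanded_predictions, question_mapping): expanded_predictions is the list of individual predictions, and question_mapping maps each expanded prediction index to its original question info.
- Return type:
Tuple[List[Dict], Dict[int, QuestionMapping]]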
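The expansion step can be illustrated with a minimal standalone sketch. It is not the library's implementation: the prediction keys `question_id` and `answer`, and the convention that a list-valued `answer` marks multiple inferences, are assumptions made for illustration. Only the `QuestionMapping` field names come from the signature documented above.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class QuestionMapping:
    # Field names copied from the documented QuestionMapping signature.
    original_question_id: str
    is_multi_inference: bool
    inference_index: int
    total_inferences: int


def expand_predictions(
    predictions: List[Dict],
) -> Tuple[List[Dict], Dict[int, QuestionMapping]]:
    """Flatten list-valued answers into one prediction per inference.

    Assumption: "question_id" and "answer" are hypothetical keys.
    """
    expanded: List[Dict] = []
    mapping: Dict[int, QuestionMapping] = {}
    for pred in predictions:
        answers = pred["answer"]
        is_multi = isinstance(answers, list)
        if not is_multi:
            answers = [answers]
        for i, answer in enumerate(answers):
            # Key the mapping by the position the new entry will occupy.
            mapping[len(expanded)] = QuestionMapping(
                original_question_id=str(pred["question_id"]),
                is_multi_inference=is_multi,
                inference_index=i,
                total_inferences=len(answers),
            )
            expanded.append({**pred, "answer": answer})
    return expanded, mapping


preds = [
    {"question_id": "q1", "answer": ["A", "B", "A"]},
    {"question_id": "q2", "answer": "C"},
]
expanded, mapping = expand_predictions(preds)
```

Keying the mapping by expanded index keeps the flattened list free of bookkeeping fields, so a per-prediction scoring function can run on it unchanged.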
- aggregate_multi_inference_results(expanded_predictions: List[Dict], question_mapping: Dict[int, QuestionMapping]) → Tuple[List[Dict], Dict][source]¶
Aggregate results from expanded predictions back to original questions.
- Returns:
A tuple (aggregated_predictions, stats): aggregated_predictions is the list of predictions with aggregated results, and stats holds statistics about the evaluation.
- Return type:
Tuple[List[Dict], Dict]
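Aggregation is the reverse of expansion: scored per-inference entries are grouped by their original question. The sketch below is hypothetical rather than the library's code. It assumes each expanded prediction carries a boolean `correct` field and a `raw_answer` field (the latter matching the default `aggregation_fields`), and the contents of `stats` are invented for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class QuestionMapping:
    # Field names copied from the documented QuestionMapping signature.
    original_question_id: str
    is_multi_inference: bool
    inference_index: int
    total_inferences: int


def aggregate_results(
    expanded: List[Dict], question_mapping: Dict[int, QuestionMapping]
) -> Tuple[List[Dict], Dict]:
    """Group scored inferences by original question (illustrative only)."""
    grouped: Dict[str, List[Dict]] = defaultdict(list)
    for idx, pred in enumerate(expanded):
        grouped[question_mapping[idx].original_question_id].append(pred)

    aggregated = []
    for qid, group in grouped.items():
        scores = [1.0 if p["correct"] else 0.0 for p in group]
        aggregated.append(
            {
                "question_id": qid,
                # Collect the aggregated field across all inferences.
                "raw_answer": [p["raw_answer"] for p in group],
                "accuracy": sum(scores) / len(scores),
            }
        )
    # Hypothetical statistics; the real stats dictionary may differ.
    stats = {"num_questions": len(aggregated), "num_inferences": len(expanded)}
    return aggregated, stats


scored = [
    {"raw_answer": "A", "correct": True},
    {"raw_answer": "B", "correct": False},
    {"raw_answer": "C", "correct": True},
]
mapping = {
    0: QuestionMapping("q1", True, 0, 2),
    1: QuestionMapping("q1", True, 1, 2),
    2: QuestionMapping("q2", False, 0, 1),
}
aggregated, stats = aggregate_results(scored, mapping)
```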
- has_multi_inference(predictions: List[Dict]) → bool[source]¶
Check if any prediction contains multiple inference results.
- evaluate_fill_blank_by_rule(gt: Dict, pred: Dict, simality_threshold: float = 0.7) → Tuple[bool, str][source]¶
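This method is undocumented, but its signature suggests a rule-based comparison that accepts a prediction when its similarity to the ground truth reaches a threshold (0.7 by default). The sketch below shows one way such a rule could work. The similarity metric (`difflib.SequenceMatcher`), the `answer` key, and the meaning of the returned string are all assumptions, not the library's behavior.

```python
from difflib import SequenceMatcher
from typing import Dict, Tuple


def fill_blank_by_rule(gt: Dict, pred: Dict, threshold: float = 0.7) -> Tuple[bool, str]:
    """Accept the prediction if it is similar enough to the ground truth.

    Assumption: both dictionaries store their text under a hypothetical
    "answer" key, and similarity is a character-level ratio in [0, 1].
    """
    expected = str(gt["answer"]).strip().lower()
    given = str(pred["answer"]).strip().lower()
    ratio = SequenceMatcher(None, expected, given).ratio()
    return ratio >= threshold, given


exact = fill_blank_by_rule({"answer": "photosynthesis"}, {"answer": " Photosynthesis "})
typo = fill_blank_by_rule({"answer": "photosynthesis"}, {"answer": "photosynthesys"})
wrong = fill_blank_by_rule({"answer": "photosynthesis"}, {"answer": "42"})
```

A threshold below 1.0 tolerates small spelling differences, at the cost of occasionally accepting a near-miss that is actually a different word.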
- extract_judgement_result(response_text: str) → Tuple[bool, str][source]¶
Extract the judgement result from an LLM response using a regular expression. The format is validated and the result extracted in one step.
- Returns:
(is_correct, extracted_response)
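The response format the judge model is expected to follow is not documented here. As an illustration of validating and extracting in a single regex pass, the following sketch assumes a hypothetical verdict marker of the form `[[CORRECT]]` or `[[INCORRECT]]`; the real pattern may differ.

```python
import re
from typing import Tuple

# Hypothetical verdict format; the actual judge prompt may use another one.
_VERDICT = re.compile(r"\[\[(CORRECT|INCORRECT)\]\]", re.IGNORECASE)


def extract_verdict(response_text: str) -> Tuple[bool, str]:
    """Return (is_correct, extracted_response) from a judge response."""
    match = _VERDICT.search(response_text)
    if match is None:
        # A malformed response is treated as incorrect in this sketch.
        return False, response_text.strip()
    verdict = match.group(1).upper()
    return verdict == "CORRECT", verdict


ok = extract_verdict("The answer matches the reference. [[correct]]")
bad = extract_verdict("[[INCORRECT]]")
malformed = extract_verdict("no verdict given")
```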
MMMU Evaluator¶
- flagevalmm.evaluator.mmmu_dataset_evaluator.check_is_number(string)[source]¶
Check if the given string is a number.
- flagevalmm.evaluator.mmmu_dataset_evaluator.normalize_str(string)[source]¶
Normalize the string to lower case, or convert it to a float if possible.
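A minimal sketch of these two helpers, modeled on the utilities in the upstream MMMU evaluation code, is shown below. The flagevalmm versions may differ in detail (the class description notes modifications), so treat the rounding precision and the single-character padding as assumptions.

```python
from typing import List, Union


def check_is_number(string: str) -> bool:
    """Return True if the string parses as a float once commas are removed."""
    try:
        float(string.replace(",", ""))
        return True
    except ValueError:
        return False


def normalize_str(string: str) -> List[Union[str, float]]:
    """Return a list of normalized candidates for matching."""
    string = string.strip()
    if check_is_number(string):
        # Numbers are compared as floats rounded to two decimals (assumed).
        return [round(float(string.replace(",", "")), 2)]
    string = string.lower()
    if len(string) == 1:
        # Pad single characters so they do not match inside longer words.
        return [" " + string, string + " "]
    return [string]
```

For example, `normalize_str(" 1,234.567 ")` yields a one-element list holding the float 1234.57, while `normalize_str("Paris")` yields `["paris"]`.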
- flagevalmm.evaluator.mmmu_dataset_evaluator.extract_numbers(string)[source]¶
Extract all forms of numbers from a string with regex.
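To show what "all forms of numbers" can mean in practice, here is a single-pattern sketch covering scientific notation, comma-grouped numbers, decimals, and plain integers. It is an illustrative pattern written for this page, not the regex the module actually uses.

```python
import re
from typing import List

# Alternatives are ordered from most to least specific so that, for example,
# "5.6e3" is captured whole rather than as "5.6".
_NUMBER = re.compile(
    r"-?(?:"
    r"\d+(?:\.\d+)?[eE][+-]?\d+"      # scientific notation: 5.6e3
    r"|\d{1,3}(?:,\d{3})+(?:\.\d+)?"  # comma-grouped: 1,234
    r"|\d+\.\d+"                      # decimal: 0.75
    r"|\.\d+"                         # leading-dot decimal: .5
    r"|\d+"                           # integer: 42
    r")"
)


def extract_numbers_sketch(text: str) -> List[str]:
    """Return every number-like substring, in order of appearance."""
    return _NUMBER.findall(text)


found = extract_numbers_sketch(
    "Revenue rose from 1,234 to 5.6e3, a gain of -0.75 or about 42 units."
)
# found == ['1,234', '5.6e3', '-0.75', '42']
```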
- flagevalmm.evaluator.mmmu_dataset_evaluator.parse_open_response(response)[source]¶
Parse the prediction from the generated response. Return a list of predicted strings or numbers.
- flagevalmm.evaluator.mmmu_dataset_evaluator.eval_open(gold_i, pred_i)[source]¶
Evaluate an open question instance.
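The matching rule for open questions can be sketched as follows, loosely following the upstream MMMU logic: string predictions count as correct when they contain a normalized gold string, and numeric predictions when they equal a gold number. The function name and the assumption that both inputs are already normalized are illustrative choices, not the module's interface.

```python
from typing import List, Union

Answer = Union[str, float]


def open_answer_matches(norm_gold: List[Answer], preds: List[Answer]) -> bool:
    """Return True if any parsed prediction matches any normalized gold answer."""
    for pred in preds:
        if isinstance(pred, str):
            # Substring match for text answers.
            if any(isinstance(g, str) and g in pred for g in norm_gold):
                return True
        elif pred in norm_gold:
            # Exact equality for numeric answers.
            return True
    return False


text_hit = open_answer_matches(["paris"], ["the capital is paris"])
number_hit = open_answer_matches([3.14], ["pi", 3.14])
miss = open_answer_matches(["paris"], ["london", 2.0])
```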
- class flagevalmm.evaluator.mmmu_dataset_evaluator.MmmuEvaluator(is_clean: bool = True, use_llm_evaluator: bool = False, eval_func: Callable | str | None = None, base_dir: str = '', detailed_keys: List[str] | None = None, aggregation_fields: List[str] | None = ['raw_answer'], **kwargs)[source]¶
Bases:
BaseEvaluator
The evaluation method is adapted from the official MMMU benchmark evaluation code (https://github.com/MMMU-Benchmark/MMMU/tree/main/mmmu) with modifications to improve robustness and adapt to the flagevalmm framework.
Extract Evaluator¶
- class flagevalmm.evaluator.extract_evaluator.ExtractEvaluator(eval_model_name: str, use_llm_evaluator: bool = True, backend: str = 'vllm', port: int = 8001, eval_func: Callable | str | None = None, num_threads: int = 8, eval_method: str = 'extract_compare', **kwargs)[source]¶
Bases:
BaseEvaluator
This evaluator uses an LLM to extract the answer from the model response. Two evaluation methods are supported:
1. Extract + Compare: first extract the answer from the model response, then compare it with the ground truth.
2. SimpleQA: directly grade the model response using the SimpleQA grading template.
- __init__(eval_model_name: str, use_llm_evaluator: bool = True, backend: str = 'vllm', port: int = 8001, eval_func: Callable | str | None = None, num_threads: int = 8, eval_method: str = 'extract_compare', **kwargs) → None[source]¶
- grade_by_simpleqa(gt: Dict, pred: Dict) → Tuple[str, int][source]¶
Grade the prediction using the SimpleQA grading template.
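The SimpleQA grading template asks the grader model to reply with a single letter: A for CORRECT, B for INCORRECT, or C for NOT_ATTEMPTED. The sketch below shows how such a reply might be turned into a `(str, int)` pair. Interpreting the pair as a grade label plus a 1/0 score is an assumption, since the return value is not documented here, and the helper name is hypothetical.

```python
import re
from typing import Tuple

GRADE_LABELS = {"A": "CORRECT", "B": "INCORRECT", "C": "NOT_ATTEMPTED"}


def parse_simpleqa_grade(grader_response: str) -> Tuple[str, int]:
    """Map the grader's letter reply to (label, score). Illustrative only."""
    match = re.search(r"\b([ABC])\b", grader_response.strip())
    # An unparseable reply falls back to NOT_ATTEMPTED in this sketch.
    letter = match.group(1) if match else "C"
    label = GRADE_LABELS[letter]
    return label, int(label == "CORRECT")


correct = parse_simpleqa_grade("A")
incorrect = parse_simpleqa_grade("B")
unparsed = parse_simpleqa_grade("???")
```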