Evaluator Module¶

The evaluator module contains various evaluation methods for different multimodal tasks.

Base Evaluator¶

class flagevalmm.evaluator.base_evaluator.QuestionMapping(original_question_id: str, is_multi_inference: bool, inference_index: int, total_inferences: int)[source]¶

Bases: object

original_question_id: str¶

is_multi_inference: bool¶

inference_index: int¶

total_inferences: int¶

__init__(original_question_id: str, is_multi_inference: bool, inference_index: int, total_inferences: int) → None¶

class flagevalmm.evaluator.base_evaluator.BaseEvaluator(is_clean: bool = True, use_llm_evaluator: bool = False, eval_func: Callable | str | None = None, base_dir: str = '', detailed_keys: List[str] | None = None, aggregation_fields: List[str] | None = ['raw_answer'], **kwargs)[source]¶

Bases: object

__init__(is_clean: bool = True, use_llm_evaluator: bool = False, eval_func: Callable | str | None = None, base_dir: str = '', detailed_keys: List[str] | None = None, aggregation_fields: List[str] | None = ['raw_answer'], **kwargs) → None[source]¶

get_eval_func(eval_func: Callable | str | None)[source]¶

statistics_tokens(predictions: List[Dict]) → Dict[source]¶

expand_multi_inference_predictions(predictions: List[Dict]) → Tuple[List[Dict], Dict[int, QuestionMapping]][source]¶

Expand multiple inference predictions into individual predictions.

Returns:

expanded_predictions – list of individual predictions
question_mapping – mapping from expanded prediction index to original question info

aggregate_multi_inference_results(expanded_predictions: List[Dict], question_mapping: Dict[int, QuestionMapping]) → Tuple[List[Dict], Dict][source]¶

Aggregate results from expanded predictions back to original questions.

Returns:

aggregated_predictions – list of predictions with aggregated results
stats – statistics about the evaluation
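The expand/aggregate round trip can be illustrated with a minimal sketch. It is not the flagevalmm implementation: the `answers`, `answer`, `question_id`, and `correct` keys, the per-question `accuracy` field, and the contents of `stats` are all assumptions made for the example. Only the four `QuestionMapping` fields come from the documented class.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class QuestionMapping:
    # Same four fields as the documented class.
    original_question_id: str
    is_multi_inference: bool
    inference_index: int
    total_inferences: int


def expand(predictions: List[Dict]) -> Tuple[List[Dict], Dict[int, QuestionMapping]]:
    """Flatten each prediction into one entry per inference."""
    expanded: List[Dict] = []
    mapping: Dict[int, QuestionMapping] = {}
    for pred in predictions:
        # "answers" is a hypothetical key holding several inference outputs.
        answers = pred.get("answers")
        is_multi = isinstance(answers, list)
        items = answers if is_multi else [pred["answer"]]
        for i, answer in enumerate(items):
            # The key is the position the new entry will take in `expanded`.
            mapping[len(expanded)] = QuestionMapping(
                pred["question_id"], is_multi, i, len(items)
            )
            expanded.append({"question_id": pred["question_id"], "answer": answer})
    return expanded, mapping


def aggregate(
    expanded: List[Dict], mapping: Dict[int, QuestionMapping]
) -> Tuple[List[Dict], Dict]:
    """Group per-inference correctness back under the original question id."""
    grouped: Dict[str, List[bool]] = {}
    for idx, pred in enumerate(expanded):
        qid = mapping[idx].original_question_id
        grouped.setdefault(qid, []).append(pred["correct"])
    aggregated = [
        {"question_id": qid, "accuracy": sum(flags) / len(flags)}
        for qid, flags in grouped.items()
    ]
    stats = {"num_questions": len(aggregated), "num_inferences": len(expanded)}
    return aggregated, stats
```

In this sketch each expanded entry is scored independently (setting `correct`) between the two calls, so the scoring code never needs to know whether a question was answered once or several times.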

has_multi_inference(predictions: List[Dict]) → bool[source]¶: Check if any prediction contains multiple inference results.

evaluate_multiple_choice(gt: Dict, pred: Dict) → bool[source]¶

evaluate_fill_blank_by_rule(gt: Dict, pred: Dict, simality_threshold: float = 0.7) → Tuple[bool, str][source]¶

evaluate_multiple_response(gt: Dict, pred: Dict) → Tuple[bool, str][source]¶

extract_judgement_result(response_text: str) → Tuple[bool, str][source]¶: Extract the judgement result from an LLM response using a regex, validating the format and extracting the result in one step. Returns (is_correct, extracted_response).
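As a rough illustration of regex-based judgement extraction, the sketch below validates and extracts in a single `search` call. The `[[correct]]` / `[[incorrect]]` verdict format and the function name `extract_judgement` are hypothetical; the format that flagevalmm actually expects from the judge model is not documented here.

```python
import re
from typing import Tuple

# Hypothetical format: the judge writes "[[correct]]" or "[[incorrect]]".
_JUDGEMENT = re.compile(r"\[\[\s*(correct|incorrect)\s*\]\]", re.IGNORECASE)


def extract_judgement(response_text: str) -> Tuple[bool, str]:
    """Return (is_correct, extracted_response) from a judge response."""
    match = _JUDGEMENT.search(response_text)
    if match is None:
        # Malformed output: count as incorrect and keep the text for inspection.
        return False, response_text.strip()
    verdict = match.group(1).lower()
    return verdict == "correct", verdict
```

Treating an unparseable response as incorrect is one possible policy; a caller could instead retry the judge or flag the sample for manual review.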

evaluate_by_llm(gt: Dict, pred: Dict) → Tuple[bool, str][source]¶

cal_accuracy(annotations: Dict, predictions: List[Dict], *args, **kwargs) → Dict[source]¶

maybe_clean_answer(answer: str) → str[source]¶

filter_rejected(predictions: List[Dict], results: Dict) → Tuple[List[Dict], List[Dict]][source]¶

process(dataset: Dataset, output_dir: str, **kwargs) → Dict[source]¶

Parameters:

dataset (Dataset) – dataset instance
output_dir (str) – output directory

save(results: Dict, answers: List[Dict], dataset_name: str, output_dir: str)[source]¶

MMMU Evaluator¶

flagevalmm.evaluator.mmmu_dataset_evaluator.check_is_number(string)[source]¶: Check if the given string is a number.

flagevalmm.evaluator.mmmu_dataset_evaluator.normalize_str(string)[source]¶: Normalize the string to lower case, or convert it to a float if possible.

flagevalmm.evaluator.mmmu_dataset_evaluator.extract_numbers(string)[source]¶: Extract all forms of numbers from a string using regex.

flagevalmm.evaluator.mmmu_dataset_evaluator.parse_open_response(response)[source]¶: Parse the prediction from the generated response. Return a list of predicted strings or numbers.

flagevalmm.evaluator.mmmu_dataset_evaluator.eval_open(gold_i, pred_i)[source]¶: Evaluate an open question instance.
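The three string helpers can be approximated as follows. This is a simplified sketch written from the one-line descriptions above, not a copy of the flagevalmm source: the return types, the treatment of thousands separators, and the exact regex are assumptions, and the real functions may handle more cases.

```python
import re
from typing import List, Union


def check_is_number(string: str) -> bool:
    """Return True if the string parses as a float once commas are removed."""
    try:
        float(string.replace(",", ""))
        return True
    except ValueError:
        return False


def normalize_str(string: str) -> Union[float, str]:
    """Return a float for numeric strings, otherwise the lower-cased text."""
    string = string.strip()
    if check_is_number(string):
        return float(string.replace(",", ""))
    return string.lower()


def extract_numbers(string: str) -> List[str]:
    """Find comma-grouped, scientific-notation, and plain numbers, in that order of preference."""
    pattern = (
        r"-?\d{1,3}(?:,\d{3})+(?:\.\d+)?"   # 1,234 or 1,234.5
        r"|-?\d+(?:\.\d+)?[eE][+-]?\d+"     # 1.5e3
        r"|-?\d+(?:\.\d+)?"                 # 42 or -7.25
    )
    return re.findall(pattern, string)
```

For example, `extract_numbers("It costs 1,234 dollars, about 1.5e3 or -7.25.")` returns `['1,234', '1.5e3', '-7.25']`, and `normalize_str(" 1,234 ")` returns `1234.0`.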

class flagevalmm.evaluator.mmmu_dataset_evaluator.MmmuEvaluator(is_clean: bool = True, use_llm_evaluator: bool = False, eval_func: Callable | str | None = None, base_dir: str = '', detailed_keys: List[str] | None = None, aggregation_fields: List[str] | None = ['raw_answer'], **kwargs)[source]¶

Bases: BaseEvaluator

The evaluation method is adapted from the official MMMU benchmark evaluation code (https://github.com/MMMU-Benchmark/MMMU/tree/main/mmmu) with modifications to improve robustness and adapt to the flagevalmm framework.

cal_accuracy(annotation, answers)[source]¶

Extract Evaluator¶

class flagevalmm.evaluator.extract_evaluator.ExtractEvaluator(eval_model_name: str, use_llm_evaluator: bool = True, backend: str = 'vllm', port: int = 8001, eval_func: Callable | str | None = None, num_threads: int = 8, eval_method: str = 'extract_compare', **kwargs)[source]¶

Bases: BaseEvaluator

This evaluator uses an LLM to extract the answer from the model response. Two evaluation methods are supported:

1. Extract + Compare: first extract the answer from the model response, then compare it with the ground truth.
2. SimpleQA: directly grade the model response using the SimpleQA grading template.
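The Extract + Compare method can be sketched as two separate steps. Everything below is illustrative: `extract_then_compare` and `fake_llm_extract` are hypothetical names, the `question` and `answer` keys are assumed, and the string-matching stand-in replaces what would be a real LLM call in flagevalmm.

```python
from typing import Callable, Dict, Tuple


def extract_then_compare(
    gt: Dict, pred: Dict, extract: Callable[[str, str], str]
) -> Tuple[str, int]:
    """Step 1: extract a short answer. Step 2: compare it with the ground truth."""
    extracted = extract(gt["question"], pred["answer"])
    is_correct = extracted.strip().lower() == str(gt["answer"]).strip().lower()
    return extracted, int(is_correct)


def fake_llm_extract(question: str, response: str) -> str:
    # Stand-in for an LLM call: keep the text after the last "answer is".
    return response.rsplit("answer is", 1)[-1].strip(" .")


gt = {"question": "Capital of France?", "answer": "Paris"}
pred = {"answer": "Thinking it over, the answer is Paris."}
result = extract_then_compare(gt, pred, fake_llm_extract)
```

Here `result` is `('Paris', 1)`. Passing the extractor in as a callable keeps the comparison logic testable without a running model server.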

__init__(eval_model_name: str, use_llm_evaluator: bool = True, backend: str = 'vllm', port: int = 8001, eval_func: Callable | str | None = None, num_threads: int = 8, eval_method: str = 'extract_compare', **kwargs) → None[source]¶

extract_answer_by_llm(gt: Dict, pred: Dict)[source]¶

compare_answer(gt: Dict, extracted_answer: str)[source]¶

grade_by_simpleqa(gt: Dict, pred: Dict) → Tuple[str, int][source]¶: Grade the prediction using the SimpleQA grading template.

process_single_prediction(pred: Dict, gt: Dict) → Tuple[Dict, int][source]¶: Process a single prediction in a thread-safe manner.
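One way to score predictions concurrently with a `num_threads` setting is a thread pool in which each worker reads only its own inputs. The sketch below is an assumption about the general pattern, not the flagevalmm code: `score_all`, the `score_one` callback, and the `question_id` key are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple


def score_all(
    predictions: List[Dict],
    annotations: Dict[str, Dict],
    score_one: Callable[[Dict, Dict], Tuple[Dict, int]],
    num_threads: int = 8,
) -> Tuple[List[Dict], float]:
    """Score predictions concurrently and return (results, accuracy)."""

    def work(pred: Dict) -> Tuple[Dict, int]:
        # Pass a copy so a worker never mutates the caller's prediction dict.
        return score_one(dict(pred), annotations[str(pred["question_id"])])

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # map() yields results in input order, regardless of completion order.
        outcomes = list(pool.map(work, predictions))

    results = [result for result, _ in outcomes]
    correct = sum(flag for _, flag in outcomes)
    accuracy = correct / len(predictions) if predictions else 0.0
    return results, accuracy
```

Threads suit this workload when each call mostly waits on a remote LLM backend; the totals are computed only after all workers finish, so no lock is needed.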

cal_accuracy(annotations: Dict, predictions: List[Dict], *args, **kwargs) → Dict[source]¶

process(dataset: Dataset, output_dir: str, **kwargs) → Dict[source]¶

Parameters:

dataset (Dataset) – dataset instance
output_dir (str) – output directory

Retrieval Evaluator¶

flagevalmm.evaluator.retrieval_evaluator.i2t(probs: ndarray, return_ranks: bool = False)[source]¶

flagevalmm.evaluator.retrieval_evaluator.t2i(probs: ndarray, return_ranks: bool = False)[source]¶

flagevalmm.evaluator.retrieval_evaluator.json_save(content: Dict[str, Any], jf_nm: str) → None[source]¶

class flagevalmm.evaluator.retrieval_evaluator.RetrievalEvaluator(**kwargs)[source]¶

Bases: object

__init__(**kwargs)[source]¶

process(dataset, output_dir, **kwargs)[source]¶

Common Types¶

Pre-processing Utilities¶

flagevalmm.evaluator.pre_process.strip_answer(answer)[source]¶

flagevalmm.evaluator.pre_process.remove_special_characters(text)[source]¶

flagevalmm.evaluator.pre_process.process_multiple_choice(answer)[source]¶

flagevalmm.evaluator.pre_process.remove_unit(value)[source]¶

flagevalmm.evaluator.pre_process.convert_circled_numbers(text)[source]¶

flagevalmm.evaluator.pre_process.normalize_string(raw_answer)[source]¶