FlagEvalMM Documentation

A Flexible Framework for Comprehensive Multimodal Model Evaluation

FlagEvalMM is an open-source evaluation framework designed to comprehensively assess multimodal models. It provides a standardized way to evaluate models that work with multiple modalities (text, images, video) across various tasks and metrics.

Key Features

Flexible Architecture: Supports multiple multimodal models and evaluation tasks, including VQA, image retrieval, and text-to-image generation.
Comprehensive Benchmarks and Metrics: Support for new and commonly used benchmarks and metrics.
Extensive Model Support: The model_zoo provides inference support for a wide range of popular multimodal models, including QwenVL and LLaVA. It also integrates with API-based models such as GPT, Claude, and Hunyuan.
Extensible Design: Easily extendable to incorporate new models, benchmarks, and evaluation metrics.

Quick Start

Install FlagEvalMM:

git clone https://github.com/flageval-baai/FlagEvalMM.git
cd FlagEvalMM
pip install -e .

Run a basic evaluation:

flagevalmm --tasks tasks/mmmu/mmmu_val.py \
        --exec model_zoo/vlm/api_model/model_adapter.py \
        --model llava-hf/llava-onevision-qwen2-7b-ov-chat-hf \
        --num-workers 8 \
        --output-dir ./results/llava-onevision-qwen2-7b-ov-chat-hf \
        --backend vllm \
        --extra-args "--limit-mm-per-prompt image=10 --max-model-len 32768"
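To evaluate several models with the same settings, the command above can be wrapped in a small shell function. The sketch below is only an illustration: `output_dir_for` and `run_eval` are hypothetical helper names that are not part of FlagEvalMM, and every flag is copied unchanged from the command above. The only assumption is that the output directory should follow the same `./results/<model name>` pattern used there.

```shell
# Hypothetical helper (not part of FlagEvalMM): derive the results directory
# from a model identifier by dropping everything up to the last "/".
output_dir_for() {
    printf './results/%s\n' "${1##*/}"
}

# Hypothetical wrapper around the documented command. Only the model and the
# output directory vary; all flags are taken verbatim from the example above.
run_eval() {
    flagevalmm --tasks tasks/mmmu/mmmu_val.py \
        --exec model_zoo/vlm/api_model/model_adapter.py \
        --model "$1" \
        --num-workers 8 \
        --output-dir "$(output_dir_for "$1")" \
        --backend vllm \
        --extra-args "--limit-mm-per-prompt image=10 --max-model-len 32768"
}

# Usage (requires FlagEvalMM to be installed):
# run_eval llava-hf/llava-onevision-qwen2-7b-ov-chat-hf
```

Because the task and adapter paths are relative, the function presumably needs to be called from the root of the cloned FlagEvalMM repository, just like the original command.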