Usage Guide¶

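This guide covers how to run FlagEvalMM for multimodal model evaluation: the command-line interface, configuration files, evaluation modes, inference backends, result files, and troubleshooting.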

Command Line Interface¶

FlagEvalMM provides a command-line interface through the flagevalmm command.

Basic Syntax¶

flagevalmm [OPTIONS] --tasks TASK_FILES --exec MODEL_ADAPTER --model MODEL_NAME

Required Arguments¶

--tasks: Path(s) to task configuration files
--exec: Path to model adapter script
--model: Model name or path

Optional Arguments¶

--num-workers: Number of parallel workers (default: 1)
--output-dir: Output directory for results (default: ./results)
--backend: Inference backend (vllm, transformers, sglang, lmdeploy)
--extra-args: Additional arguments for the backend
--cfg: Configuration file path
--api-key: API key for API-based models
--url: API endpoint URL
--use-cache: Enable response caching
--without-infer: Skip inference and only run evaluation
--try-run: Run in debug mode with limited samples

Configuration Files¶

JSON Configuration¶

You can use JSON configuration files to simplify complex commands:

{
    "model_name": "Qwen/Qwen2-VL-7B-Instruct",
    "api_key": "EMPTY",
    "output_dir": "./results/qwen2-vl-7b",
    "num_workers": 8,
    "backend": "vllm",
    "extra_args": "--limit-mm-per-prompt image=10 --max-model-len 32768"
}
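As a minimal sketch of how such a file might be produced and used, the Python snippet below writes the configuration above to disk, checks that it parses as JSON, and assembles a command line that passes it through --cfg. The helper names (write_config, build_command) and the file name are hypothetical, and the assumption that --tasks and --exec are still given on the command line alongside --cfg should be checked against your installed version.

```python
import json
import shlex
import tempfile
from pathlib import Path

# Same settings as the JSON example above.
config = {
    "model_name": "Qwen/Qwen2-VL-7B-Instruct",
    "api_key": "EMPTY",
    "output_dir": "./results/qwen2-vl-7b",
    "num_workers": 8,
    "backend": "vllm",
    "extra_args": "--limit-mm-per-prompt image=10 --max-model-len 32768",
}


def write_config(cfg, path):
    """Write cfg as JSON and read it back to confirm it is valid."""
    path.write_text(json.dumps(cfg, indent=4), encoding="utf-8")
    json.loads(path.read_text(encoding="utf-8"))
    return path


def build_command(cfg_path, tasks, adapter):
    """Assemble the argument list (hypothetical helper, not part of FlagEvalMM)."""
    return ["flagevalmm", "--tasks", *tasks, "--exec", adapter, "--cfg", str(cfg_path)]


cfg_path = write_config(config, Path(tempfile.mkdtemp()) / "qwen2-vl-7b.json")
command = build_command(
    cfg_path,
    ["tasks/mmmu/mmmu_val.py"],
    "model_zoo/vlm/api_model/model_adapter.py",
)
# Print the command instead of running it, so the sketch has no side effects.
print(shlex.join(command))
```

The printed line can be pasted into a shell, or the list can be handed to subprocess.run once the paths are confirmed.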

Task Configuration¶

Task configuration files define the dataset and evaluation settings:

# Example task configuration
dataset = dict(
    type='MMBenchDataset',
    data_file='path/to/dataset.json',
    name='mmbench_dev_en',
    debug=False
)

evaluator = dict(
    type='MultipleChoiceEvaluator'
)
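Because a task file is ordinary Python, it can be inspected before a long run. The sketch below loads a task file with the standard-library runpy module and checks that the dataset and evaluator sections exist and carry a type key. This is only an illustration: FlagEvalMM's own loader may work differently, and the helper names and the set of "required" keys here are assumptions.

```python
import runpy
import tempfile
from pathlib import Path

# The example task configuration from above, written to a temporary file.
TASK_SOURCE = """
dataset = dict(
    type='MMBenchDataset',
    data_file='path/to/dataset.json',
    name='mmbench_dev_en',
    debug=False
)

evaluator = dict(
    type='MultipleChoiceEvaluator'
)
"""


def load_task_config(path):
    """Execute the task file and return its public top-level names."""
    namespace = runpy.run_path(str(path))
    return {k: v for k, v in namespace.items() if not k.startswith("_")}


def check_task_config(cfg):
    """Return a list of problems; an empty list means the basic shape is fine."""
    problems = []
    for section in ("dataset", "evaluator"):
        value = cfg.get(section)
        if not isinstance(value, dict):
            problems.append(f"missing section: {section}")
        elif "type" not in value:
            problems.append(f"{section} has no 'type' key")
    return problems


task_path = Path(tempfile.mkdtemp()) / "example_task.py"
task_path.write_text(TASK_SOURCE, encoding="utf-8")
task_cfg = load_task_config(task_path)
print(check_task_config(task_cfg))
```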

Evaluation Modes¶

Single Task Evaluation¶

Evaluate a single task:

flagevalmm --tasks tasks/mmmu/mmmu_val.py \
        --exec model_zoo/vlm/api_model/model_adapter.py \
        --model llava-hf/llava-onevision-qwen2-7b-ov-chat-hf \
        --output-dir ./results/single-task

Multi-Task Evaluation¶

Evaluate multiple tasks in one run:

flagevalmm --tasks tasks/mmmu/mmmu_val.py tasks/mmvet/mmvet_v2.py \
        --exec model_zoo/vlm/api_model/model_adapter.py \
        --model llava-hf/llava-onevision-qwen2-7b-ov-chat-hf \
        --output-dir ./results/multi-task

Batch Model Evaluation¶

Use the batch evaluation tool for multiple models:

python tools/run_models.py --config tools/configs/batch_config.py --models-base-dir /path/to/models

Backend-Specific Usage¶

vLLM Backend¶

flagevalmm --tasks tasks/mmmu/mmmu_val.py \
        --exec model_zoo/vlm/api_model/model_adapter.py \
        --model llava-hf/llava-onevision-qwen2-7b-ov-chat-hf \
        --backend vllm \
        --extra-args "--limit-mm-per-prompt image=10 --max-model-len 32768"

Multi-GPU with vLLM:

flagevalmm --tasks tasks/mmmu/mmmu_val.py \
        --exec model_zoo/vlm/api_model/model_adapter.py \
        --model Qwen/Qwen2-VL-72B-Instruct \
        --backend vllm \
        --extra-args "--tensor-parallel-size 4 --max-model-len 32768"

Transformers Backend¶

flagevalmm --tasks tasks/mmmu/mmmu_val.py \
        --exec model_zoo/vlm/llama-vision/model_adapter.py \
        --model meta-llama/Llama-3.2-11B-Vision-Instruct \
        --output-dir ./results/llama-vision

SGLang Backend¶

flagevalmm --tasks tasks/mmmu/mmmu_val.py \
        --exec model_zoo/vlm/api_model/model_adapter.py \
        --model llava-hf/llava-onevision-qwen2-7b-ov-chat-hf \
        --backend sglang \
        --extra-args "--mem-fraction-static 0.8"

API-Based Models¶

OpenAI GPT models:

flagevalmm --tasks tasks/mmmu/mmmu_val.py \
        --exec model_zoo/vlm/api_model/model_adapter.py \
        --model gpt-4o-mini \
        --url https://api.openai.com/v1/chat/completions \
        --api-key "$OPENAI_API_KEY" \
        --use-cache

Output and Results¶

Result Structure¶

After evaluation, results are organized as follows:

output_dir/
├── model_name/
│   ├── task_name/
│   │   ├── results.json          # Main results
│   │   ├── detailed_results.json # Per-sample results
│   │   ├── predictions.json      # Model predictions
│   │   └── logs/                 # Evaluation logs
│   └── summary.json              # Cross-task summary
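Given that layout, per-task results can be gathered with a short script. The sketch below assumes the tree is exactly as drawn (output_dir/model_name/task_name/results.json, with the model name occupying a single directory level); collect_results is a hypothetical helper, and the demo directory, names, and scores are placeholders rather than real results.

```python
import json
import tempfile
from pathlib import Path


def collect_results(output_dir):
    """Map (model_name, task_name) to the parsed contents of results.json."""
    collected = {}
    for results_file in sorted(Path(output_dir).glob("*/*/results.json")):
        task_dir = results_file.parent
        key = (task_dir.parent.name, task_dir.name)
        collected[key] = json.loads(results_file.read_text(encoding="utf-8"))
    return collected


# Demo on a throwaway directory with placeholder names and scores.
demo_root = Path(tempfile.mkdtemp())
for task, accuracy in [("task_a", 50.0), ("task_b", 75.0)]:
    task_dir = demo_root / "demo_model" / task
    task_dir.mkdir(parents=True)
    (task_dir / "results.json").write_text(
        json.dumps({"accuracy": accuracy}), encoding="utf-8"
    )

collected = collect_results(demo_root)
for (model, task), result in collected.items():
    print(f"{model:12s} {task:10s} {result['accuracy']:.1f}")
```

If your model names contain a slash, the directory nesting may be deeper than two levels and the glob pattern would need adjusting.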

Result Formats¶

The main results file contains:

{
    "accuracy": 85.2,
    "total_samples": 1000,
    "correct_samples": 852,
    "subject_scores": {
        "math": 78.5,
        "science": 89.3,
        "history": 87.1
    },
    "metadata": {
        "model": "llava-hf/llava-onevision-qwen2-7b-ov-chat-hf",
        "task": "mmmu_val",
        "timestamp": "2024-01-01T12:00:00"
    }
}
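A results file in this shape is easy to sanity-check. The sketch below parses the sample above (metadata omitted), confirms that accuracy matches correct_samples / total_samples when read as a percentage, and ranks the subjects from weakest to strongest. Treating accuracy as a percentage is an assumption based on the sample values.

```python
import json

# The sample results from above, without the metadata block.
RESULTS_JSON = """
{
    "accuracy": 85.2,
    "total_samples": 1000,
    "correct_samples": 852,
    "subject_scores": {"math": 78.5, "science": 89.3, "history": 87.1}
}
"""

results = json.loads(RESULTS_JSON)

# Recompute accuracy from the raw counts and allow for rounding to one decimal.
recomputed = 100.0 * results["correct_samples"] / results["total_samples"]
is_consistent = abs(recomputed - results["accuracy"]) < 0.05

# Sort subjects by score, lowest first, to see where the model is weakest.
ranked = sorted(results["subject_scores"].items(), key=lambda item: item[1])

print("accuracy consistent with counts:", is_consistent)
for subject, score in ranked:
    print(f"{subject:10s} {score:5.1f}")
```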

Advanced Usage¶

Custom Model Adapters¶

Create custom model adapters for new models by extending the base adapter:

from model_zoo.base_adapter import BaseAdapter


class CustomModelAdapter(BaseAdapter):
    def __init__(self, model_path):
        super().__init__(model_path)
        # Custom initialization, e.g. load the model and its processor

    def predict(self, inputs):
        # Custom prediction logic: append one prediction per input item
        predictions = []
        return predictions

Custom Evaluation Metrics¶

Define custom evaluators for specific tasks:

from flagevalmm.evaluator import BaseEvaluator
from flagevalmm.registry import EVALUATORS


@EVALUATORS.register_module()
class CustomEvaluator(BaseEvaluator):
    def evaluate(self, predictions, annotations):
        # Custom evaluation logic: compare predictions with annotations
        # and fill in the metrics to report
        results = {}
        return results
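The body of evaluate depends on the task. As one possible starting point, the standalone sketch below computes multiple-choice accuracy and returns keys that mirror the results format shown earlier. It deliberately does not import FlagEvalMM: the function name and the data shapes (a list of prediction dicts with question_id and answer, and an annotations dict keyed by question id) are assumptions for illustration, not the library's actual interface.

```python
def multiple_choice_accuracy(predictions, annotations):
    """Score predictions against annotations, ignoring case and whitespace.

    predictions: list of {"question_id": str, "answer": str}   (assumed shape)
    annotations: {question_id: {"answer": str}}                (assumed shape)
    """
    correct = 0
    for prediction in predictions:
        expected = annotations[prediction["question_id"]]["answer"]
        given = prediction["answer"].strip().upper()
        if given == expected.strip().upper():
            correct += 1
    total = len(predictions)
    accuracy = round(100.0 * correct / total, 2) if total else 0.0
    return {
        "accuracy": accuracy,
        "total_samples": total,
        "correct_samples": correct,
    }


# Tiny made-up example: q1 and q3 match, q2 does not.
annotations = {"q1": {"answer": "A"}, "q2": {"answer": "C"}, "q3": {"answer": "B"}}
predictions = [
    {"question_id": "q1", "answer": "a"},
    {"question_id": "q2", "answer": "D"},
    {"question_id": "q3", "answer": " B "},
]
scores = multiple_choice_accuracy(predictions, annotations)
print(scores)
```

Logic like this could be called from a custom evaluator's evaluate method once the real input structures are known.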

Performance Optimization¶

Memory Management¶

Use --num-workers to control parallel processing
Adjust batch sizes in model adapters
Use gradient checkpointing for large models

GPU Utilization¶

Use --tensor-parallel-size for multi-GPU inference
Monitor GPU memory usage
Consider model quantization for memory efficiency

Caching¶

Enable --use-cache to avoid re-computation
Cache is stored in ~/.cache/flagevalmm by default
Clear cache periodically to save disk space
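To decide when clearing is worthwhile, it helps to know how much space the cache occupies. The sketch below sums file sizes under a directory using only the standard library; directory_size_bytes is a hypothetical helper, and the only FlagEvalMM-specific detail is the default cache path stated above.

```python
import tempfile
from pathlib import Path


def directory_size_bytes(root):
    """Total size of all regular files under root; 0 if root does not exist."""
    root = Path(root).expanduser()
    if not root.is_dir():
        return 0
    return sum(path.stat().st_size for path in root.rglob("*") if path.is_file())


# Demo on a throwaway directory so the sketch does not depend on a real cache.
demo_cache = Path(tempfile.mkdtemp())
(demo_cache / "nested").mkdir()
(demo_cache / "a.bin").write_bytes(b"x" * 10)
(demo_cache / "nested" / "b.bin").write_bytes(b"y" * 5)
demo_size = directory_size_bytes(demo_cache)
print("demo cache size in bytes:", demo_size)

# The same helper applied to the default cache location from this guide.
default_cache = "~/.cache/flagevalmm"
print(f"{default_cache}: {directory_size_bytes(default_cache) / 1e6:.1f} MB")
```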

Troubleshooting¶

Common Issues¶

Out of Memory: Reduce batch size or use model sharding
Slow Inference: Check GPU utilization and consider using the vLLM backend
Model Loading Issues: Verify model path and access permissions
Task Configuration Errors: Check task file syntax and required fields

Debug Mode¶

Use --try-run for quick debugging with limited samples:

flagevalmm --tasks tasks/mmmu/mmmu_val.py \
        --exec model_zoo/vlm/api_model/model_adapter.py \
        --model llava-hf/llava-onevision-qwen2-7b-ov-chat-hf \
        --try-run