ChartMind

ChartMind: Benchmark and Reasoning Insights of Multimodal Chart Question Answering

Tong Li, Guodao Sun, Shunkai Wang, Zuoyu Tang, Yang Shu, Xueqian Zheng, Haixia Wang, Ronghua Liang

Zhejiang University of Technology, Zhejiang University of Science and Technology

📑 Paper ⭐ Code 🧱 Dataset
ChartMind Overview

Abstract

Existing ChartQA evaluations for multimodal large language models (MLLMs) focus on visual-only input and rely solely on “black-box” accuracy metrics, offering limited insight into reasoning traces. To fill these gaps, we introduce Mega60k, a benchmark covering 21 chart types and 11 QA tasks; collect ChartQA reasoning traces from MLLMs; and propose a reasoning deconstruction framework that parses multimodal activation patterns and reasoning evidence usage. Evaluating 12 representative MLLMs (7 open-source and 5 closed-source) under three conditions (visual-only, multimodal fusion, and multimodal compensation) reveals key findings: high-level tasks (multi-step logic, visual pattern recognition, layout optimization) serve as a gold standard for distinguishing MLLMs; mere modality stacking struggles to extend reasoning boundaries but shows compensatory potential in quantitative visual understanding tasks; and Gemini 2.5 Flash and GPT-4o show positive signals in leveraging structured-modality reasoning to mitigate visual degradation such as omission, occlusion, blurring, and rotation.

Mega60k Overview

21 Chart Types

21 Chart Types in ChartMind

11 Question Tasks

ChartMind Evaluation Space

Evaluation Metrics

We employ six key metrics to comprehensively assess model performance; the results table below reports per-task scores under three of them in abbreviated form (Tacc, Racc, and Macc).
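As a point of reference, here is a minimal sketch (not the paper's evaluation code) of how per-(task, metric) accuracies such as those in the results table could be aggregated from item-level judgments; the record schema ("task", "metric", "correct") is a hypothetical format of our own:

    # Minimal sketch: aggregate item-level correctness into per-(task, metric)
    # accuracies like the Tacc/Racc/Macc columns reported below.
    from collections import defaultdict

    def aggregate_accuracy(records):
        """records: dicts like {"task": "CTR", "metric": "Tacc", "correct": True}."""
        hits, totals = defaultdict(int), defaultdict(int)
        for r in records:
            key = (r["task"], r["metric"])
            totals[key] += 1
            hits[key] += int(r["correct"])
        return {key: hits[key] / totals[key] for key in totals}

    demo = [
        {"task": "CTR", "metric": "Tacc", "correct": True},
        {"task": "CTR", "metric": "Tacc", "correct": False},
        {"task": "VE", "metric": "Racc", "correct": True},
    ]
    for (task, metric), acc in sorted(aggregate_accuracy(demo).items()):
        print(f"{task} {metric}: {acc:.2%}")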

Multimodal Evaluation

To evaluate the multimodal reasoning capabilities of MLLMs, we design three experimental configurations: visual-only, multimodal fusion, and multimodal compensation.
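A minimal sketch, under our own assumptions, of how the three conditions might differ at the input level; build_inputs and degrade are hypothetical helpers, not the paper's code:

    # Sketch of the three input conditions for one chart QA item (assumptions,
    # not the paper's protocol): visual-only sends the chart image alone;
    # fusion adds a structured view (e.g. the underlying data table); and
    # compensation pairs the structured view with a degraded image.
    def degrade(image):
        """Placeholder for a degradation transform (omission, occlusion,
        blurring, or rotation)."""
        return image

    def build_inputs(question, chart_image, chart_table=None, condition="visual"):
        if condition == "visual":
            return {"text": question, "images": [chart_image]}
        if condition == "fusion":
            return {"text": f"{question}\n\nData table:\n{chart_table}",
                    "images": [chart_image]}
        if condition == "compensation":
            return {"text": f"{question}\n\nData table:\n{chart_table}",
                    "images": [degrade(chart_image)]}
        raise ValueError(f"unknown condition: {condition}")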

MLLMs Experiment CTR VEC SRP VPR VE EVJ SC NF NC MSR VA Average
Tacc Racc Tacc Tacc/Racc Tacc/Racc Tacc/Racc Racc Macc Tacc/Racc Tacc/Racc Tacc/Racc
open-source models
CogVLM2 visual 5.80% 15.30% 16.20% 7.30% 7.00% 5.20% 2.90% 4.50% 35.90% 5.80% 11.40% 10.66%
LLaVA 1.5 visual 9.70% 49.70% 13.60% 2.30% 2.00% 1.80% 2.10% 2.70% 37.70% 3.90% 7.40% 12.08%
DeepSeek-VL2 visual 85.80% 74.80% 52.90% 31.00% 40.50% 32.70% 11.90% 16.00% 49.30% 19.20% 28.30% 40.22%
InternVL3 visual 75.90% 84.10% 57.10% 29.59% 58.10% 45.70% 35.40% 21.50% 64.60% 22.60% 27.80% 47.49%
fusion 76.50% 75.60% 22.80% 4.00% 10.10% 5.10% 40.70% 18.50% 49.50% 4.10% 1.50% 28.04%
compensation 88.10% 88.00% 50.00% 23.30% 75.20% 48.00% 55.80% 17.10% 63.00% 8.10% 45.50% 51.10%
LLaMA4 Maverick visual 100.00% 84.90% 73.60% 46.30% 56.80% 49.70% 47.50% 28.40% 71.60% 39.10% 38.80% 57.88%
fusion 100.00% 84.40% 48.80% 23.90% 25.50% 19.50% 22.40% 10.70% 56.80% 19.70% 19.90% 39.24%
compensation 99.90% 86.30% 63.30% 36.70% 41.70% 38.90% 32.60% 18.80% 62.30% 30.40% 22.40% 48.48%
Qwen2.5-VL-32B visual 99.10% 84.90% 69.60% 40.60% 53.90% 45.10% 37.20% 24.40% 64.60% 36.10% 36.40% 53.81%
fusion 99.10% 84.90% 68.30% 38.40% 48.80% 43.50% 33.80% 21.70% 63.10% 32.90% 34.30% 51.71%
compensation 99.10% 81.90% 57.60% 27.70% 37.30% 33.60% 25.70% 14.20% 56.70% 25.70% 29.40% 44.45%
Qwen2.5-VL-72B visual 99.80% 85.00% 69.50% 39.80% 58.50% 47.00% 43.30% 23.80% 67.80% 37.30% 34.30% 55.10%
fusion 99.40% 86.70% 66.30% 36.10% 52.30% 44.50% 37.80% 20.80% 64.60% 32.80% 34.60% 52.35%
compensation 99.00% 84.20% 57.80% 26.90% 38.00% 35.70% 27.90% 13.40% 58.00% 25.10% 28.50% 44.95%
closed-source models
Claude 3.5 Haiku visual 100.00% 85.20% 67.10% 40.00% 53.40% 46.50% 33.90% 23.00% 65.60% 32.30% 32.50% 52.68%
fusion 100.00% 87.80% 65.10% 38.50% 51.30% 45.30% 31.40% 22.50% 62.80% 33.00% 29.90% 51.60%
compensation 99.60% 91.00% 57.90% 27.20% 37.60% 40.10% 22.60% 18.20% 56.40% 28.90% 24.00% 45.77%
GLM-4V-Plus visual 99.90% 85.50% 73.70% 47.00% 63.80% 53.30% 45.00% 27.30% 76.20% 42.30% 40.20% 59.47%
fusion 99.70% 81.40% 62.20% 38.10% 46.50% 42.80% 33.90% 18.70% 68.30% 32.40% 35.90% 50.90%
compensation 99.30% 80.60% 54.50% 31.10% 38.40% 35.00% 28.30% 16.30% 63.10% 28.10% 27.30% 45.64%
Doubao 1.5 Vision-Pro visual 100.00% 86.00% 75.60% 44.20% 63.10% 48.80% 41.30% 35.20% 72.60% 42.20% 46.30% 59.57%
fusion 100.00% 87.20% 73.10% 44.90% 60.30% 51.30% 38.70% 34.60% 71.50% 36.90% 43.90% 58.40%
compensation 99.70% 85.80% 61.50% 32.10% 42.50% 38.70% 23.70% 21.50% 58.80% 23.70% 22.50% 46.41%
GPT-4o visual 100.00% 89.40% 80.90% 55.10% 77.80% 59.20% 61.00% 41.70% 77.10% 52.40% 44.90% 67.23%
fusion 100.00% 90.50% 80.10% 57.49% 70.70% 55.00% 56.60% 39.10% 74.80% 54.50% 35.20% 64.91%
compensation 100.00% 87.60% 70.60% 45.50% 52.90% 47.40% 44.90% 31.90% 67.20% 38.10% 22.30% 55.31%
Gemini 2.5 Flash visual 99.50% 88.60% 81.40% 53.90% 67.90% 56.90% 56.50% 44.20% 82.10% 58.10% 45.80% 66.81%
fusion 99.60% 95.20% 84.00% 58.40% 73.00% 61.40% 64.90% 52.60% 84.30% 65.80% 46.70% 71.45%
compensation 99.90% 94.90% 75.10% 51.50% 61.80% 58.30% 55.20% 43.20% 75.60% 55.40% 32.60% 63.95%

Statistical analysis based on the coefficient of variation:

CV Overview
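The coefficient of variation is the standard deviation divided by the mean (CV = σ/μ); computed over per-model accuracies for one task, a higher CV indicates a task that separates models more sharply. Below is a minimal sketch using the open-source VPR (visual) column from the table above; the paper's exact aggregation may differ:

    # Coefficient of variation across models for one task: CV = std / mean.
    import statistics

    def coefficient_of_variation(scores):
        mean = statistics.mean(scores)
        return statistics.stdev(scores) / mean if mean else float("nan")

    # Open-source VPR accuracies under the visual condition (from the table).
    vpr_visual = [0.073, 0.023, 0.310, 0.2959, 0.463, 0.406, 0.398]
    print(f"CV of VPR (visual, open-source): {coefficient_of_variation(vpr_visual):.2f}")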

Degradation and Compensation Evaluation

This section presents fine-grained degradation and compensation question-answering statistics. For each model, the first row reports accuracy under degradation and the second under compensation; performance changes exceeding 5 percentage points are marked with ↑/↓.

MLLMs Setting Structural-level Pixel-level Average
data mark omission occlusion label omission axis omission legend omission blurring rotation
Qwen2.5-VL-72B degradation 42.53% 53.65% 46.24% 46.65% 50.80% 44.12% 46.13% 47.09%
compensation 43.66% 56.73% 58.95%↑12.71% 41.45%↓5.20% 48.97% 45.39% 45.29% 47.73%
GPT-4o degradation 49.37% 59.21% 53.11% 54.62% 60.07% 55.23% 48.40% 54.33%
compensation 52.87% 68.10%↑8.89% 71.72%↑18.61% 55.51% 65.23%↑5.16% 59.59% 52.51% 60.13%↑5.80%
Gemini 2.5 Flash degradation 44.92% 59.11% 43.77% 52.61% 57.62% 58.48% 55.41% 53.07%
compensation 65.50%↑20.58% 81.44%↑22.34% 83.62%↑39.85% 56.43% 64.71%↑7.09% 68.68%↑10.20% 64.90%↑9.48% 67.71%↑14.65%
Average degradation 45.61% 57.32% 47.71% 51.29% 56.17% 52.61% 49.98%
compensation 54.01%↑8.40% 68.76%↑11.44% 71.43%↑23.72% 51.13% 59.64% 57.88%↑5.28% 54.23%
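As a small worked example, the marking rule used in the table above can be expressed as below; the function and its input format are our own illustration, assuming scores in percent:

    # Mark a compensation score when it differs from the degradation score
    # by more than five percentage points.
    def flag_delta(degraded, compensated, threshold=5.0):
        delta = compensated - degraded
        if abs(delta) <= threshold:
            return f"{compensated:.2f}%"
        arrow = "↑" if delta > 0 else "↓"
        return f"{compensated:.2f}%{arrow}{abs(delta):.2f}%"

    # Gemini 2.5 Flash, label omission: prints "83.62%↑39.85%"
    print(flag_delta(43.77, 83.62))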

Degradation degree and question-answering accuracy statistics:

Degradation Degree and Accuracy Overview

Reasoning Instances and Deconstruction

We employ a Test-Time Scaling-based framework for fine-grained deconstruction of reasoning traces, analyzing both the modalities each reasoning step activates and the chart components it draws on as evidence.
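To make the idea concrete, here is a deliberately simplified sketch of trace deconstruction by keyword matching; the cue lists and tagging scheme are illustrative assumptions only, not the paper's Test-Time Scaling-based framework:

    # Tag each step of a reasoning trace with the modality it appears to
    # activate and the chart components it cites. Cue lists are illustrative.
    import re

    MODALITY_CUES = {
        "visual": ["the chart shows", "the bar", "the line", "color"],
        "structured": ["the table", "row", "column", "cell"],
    }
    COMPONENT_CUES = ["axis", "legend", "label", "title", "data mark"]

    def deconstruct_trace(trace):
        steps = []
        for sentence in re.split(r"(?<=[.!?])\s+", trace.strip()):
            low = sentence.lower()
            modalities = [m for m, cues in MODALITY_CUES.items()
                          if any(c in low for c in cues)]
            components = [c for c in COMPONENT_CUES if c in low]
            steps.append({"text": sentence, "modalities": modalities,
                          "components": components})
        return steps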

Positive Reasoning Instances:

Positive Reasoning Instances

Negative Reasoning Instances:

Negative Reasoning Instances

Modality Activation Frequency and Temporal Patterns:

Modality Activation Frequency and Temporal Patterns

Chart Component Dependency Patterns:

Chart Component Dependency Patterns

Citation

If you find this work useful for your research, please cite our paper:

@article{ChartMind2025li,
    title={ChartMind: Benchmark and Reasoning Insights of Multimodal Chart Question Answering},
    author={Li, Tong and Sun, Guodao and Wang, Shunkai and Tang, Zuoyu and Shu, Yang and Zheng, Xueqian and Wang, Haixia and Liang, Ronghua},
    year={2025}
}

For questions about this work, please contact: Tong Li (litong@zjut.edu.cn, https://tongli97.github.io/)