Tong Li†, Guodao Sun†, Shunkai Wang†, Zuoyu Tang†, Yang Shu†, Xueqian Zheng†, Haixia Wang†, Ronghua Liang‡
†Zhejiang University of Technology ‡Zhejiang University of Science and Technology
Existing ChartQA evaluations for multimodal large language models (MLLMs) focus on visual-only input and rely solely on "black-box" accuracy metrics, offering limited insight into reasoning traces. To fill these gaps, we introduce Mega60k, a benchmark covering 21 chart types and 11 QA tasks; collect ChartQA reasoning traces from MLLMs; and propose a reasoning deconstruction framework to parse multimodal activation patterns and reasoning evidence usage. Evaluating 12 representative MLLMs (7 open-source and 5 closed-source) under three conditions (visual-only, multimodal fusion, and multimodal compensation) reveals key findings: (1) high-level tasks (multi-step logic, visual pattern recognition, layout optimization) serve as a gold standard for distinguishing MLLMs; (2) mere modality stacking struggles to extend reasoning boundaries, but shows compensatory potential in quantitative visual understanding tasks; (3) Gemini 2.5 Flash and GPT-4o show positive signals in leveraging structured-modality reasoning to mitigate visual degradation such as omissions, occlusion, blurring, and rotation.
We employ six key metrics to comprehensively assess model performance:
To evaluate the multimodal reasoning capabilities of MLLMs, we design three experimental configurations: visual-only, multimodal fusion, and multimodal compensation.
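As a rough illustration of how the three configurations differ in model input (a minimal sketch under assumptions, not the exact evaluation pipeline; the `build_request` helper and the JSON serialization of the data table are hypothetical):

```python
import json

def build_request(question, chart_image, data_table=None, degraded_image=None):
    """Assemble one ChartQA request for the three evaluation conditions.

    visual-only:   clean chart image + question
    fusion:        clean chart image + serialized data table + question
    compensation:  degraded chart image + serialized data table + question
    Illustrative only; the actual prompts and serialization may differ.
    """
    if degraded_image is not None:      # multimodal compensation
        image, table = degraded_image, data_table
    elif data_table is not None:        # multimodal fusion
        image, table = chart_image, data_table
    else:                               # visual-only
        image, table = chart_image, None

    prompt = question
    if table is not None:
        prompt = f"Underlying data (JSON):\n{json.dumps(table)}\n\nQuestion: {question}"
    return {"images": [image], "text": prompt}
```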
MLLMs | Experiment | CTR (Tacc) | VEC (Racc) | SRP (Tacc) | VPR (Tacc/Racc) | VE (Tacc/Racc) | EVJ (Tacc/Racc) | SC (Racc) | NF (Macc) | NC (Tacc/Racc) | MSR (Tacc/Racc) | VA (Tacc/Racc) | Average |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
open-source models | |||||||||||||
CogVLM2 | visual | 5.80% | 15.30% | 16.20% | 7.30% | 7.00% | 5.20% | 2.90% | 4.50% | 35.90% | 5.80% | 11.40% | 10.66% |
LLaVA 1.5 | visual | 9.70% | 49.70% | 13.60% | 2.30% | 2.00% | 1.80% | 2.10% | 2.70% | 37.70% | 3.90% | 7.40% | 12.08% |
DeepSeek-VL2 | visual | 85.80% | 74.80% | 52.90% | 31.00% | 40.50% | 32.70% | 11.90% | 16.00% | 49.30% | 19.20% | 28.30% | 40.22% |
InternVL3 | visual | 75.90% | 84.10% | 57.10% | 29.59% | 58.10% | 45.70% | 35.40% | 21.50% | 64.60% | 22.60% | 27.80% | 47.49% |
InternVL3 | fusion | 76.50%↑ | 75.60%↓ | 22.80%↓ | 4.00%↓ | 10.10%↓ | 5.10%↓ | 40.70%↑ | 18.50%↓ | 49.50%↓ | 4.10%↓ | 1.50%↓ | 28.04%↓
InternVL3 | compensation | 88.10%↑ | 88.00%↑ | 50.00%↓ | 23.30%↓ | 75.20%↑ | 48.00%↑ | 55.80%↑ | 17.10%↓ | 63.00%↓ | 8.10%↓ | 45.50%↑ | 51.10%↑
LLaMA4 Maverick | visual | 100.00% | 84.90% | 73.60% | 46.30% | 56.80% | 49.70% | 47.50% | 28.40% | 71.60% | 39.10% | 38.80% | 57.88% |
LLaMA4 Maverick | fusion | 100.00% | 84.40%↓ | 48.80%↓ | 23.90%↓ | 25.50%↓ | 19.50%↓ | 22.40%↓ | 10.70%↓ | 56.80%↓ | 19.70%↓ | 19.90%↓ | 39.24%↓
LLaMA4 Maverick | compensation | 99.90%↓ | 86.30%↑ | 63.30%↓ | 36.70%↓ | 41.70%↓ | 38.90%↓ | 32.60%↓ | 18.80%↓ | 62.30%↓ | 30.40%↓ | 22.40%↓ | 48.48%↓
Qwen2.5-VL-32B | visual | 99.10% | 84.90% | 69.60% | 40.60% | 53.90% | 45.10% | 37.20% | 24.40% | 64.60% | 36.10% | 36.40% | 53.81% |
Qwen2.5-VL-32B | fusion | 99.10% | 84.90% | 68.30%↓ | 38.40%↓ | 48.80%↓ | 43.50%↓ | 33.80%↓ | 21.70%↓ | 63.10%↓ | 32.90%↓ | 34.30%↓ | 51.71%↓
Qwen2.5-VL-32B | compensation | 99.10% | 81.90%↓ | 57.60%↓ | 27.70%↓ | 37.30%↓ | 33.60%↓ | 25.70%↓ | 14.20%↓ | 56.70%↓ | 25.70%↓ | 29.40%↓ | 44.45%↓
Qwen2.5-VL-72B | visual | 99.80% | 85.00% | 69.50% | 39.80% | 58.50% | 47.00% | 43.30% | 23.80% | 67.80% | 37.30% | 34.30% | 55.10% |
Qwen2.5-VL-72B | fusion | 99.40%↓ | 86.70%↑ | 66.30%↓ | 36.10%↓ | 52.30%↓ | 44.50%↓ | 37.80%↓ | 20.80%↓ | 64.60%↓ | 32.80%↓ | 34.60%↑ | 52.35%↓
Qwen2.5-VL-72B | compensation | 99.00%↓ | 84.20%↓ | 57.80%↓ | 26.90%↓ | 38.00%↓ | 35.70%↓ | 27.90%↓ | 13.40%↓ | 58.00%↓ | 25.10%↓ | 28.50%↓ | 44.95%↓
closed-source models | |||||||||||||
Claude 3.5 Haiku | visual | 100.00% | 85.20% | 67.10% | 40.00% | 53.40% | 46.50% | 33.90% | 23.00% | 65.60% | 32.30% | 32.50% | 52.68% |
Claude 3.5 Haiku | fusion | 100.00% | 87.80%↑ | 65.10%↓ | 38.50%↓ | 51.30%↓ | 45.30%↓ | 31.40%↓ | 22.50%↓ | 62.80%↓ | 33.00%↑ | 29.90%↓ | 51.60%↓
Claude 3.5 Haiku | compensation | 99.60%↓ | 91.00%↑ | 57.90%↓ | 27.20%↓ | 37.60%↓ | 40.10%↓ | 22.60%↓ | 18.20%↓ | 56.40%↓ | 28.90%↓ | 24.00%↓ | 45.77%↓
GLM-4V-Plus | visual | 99.90% | 85.50% | 73.70% | 47.00% | 63.80% | 53.30% | 45.00% | 27.30% | 76.20% | 42.30% | 40.20% | 59.47% |
GLM-4V-Plus | fusion | 99.70%↓ | 81.40%↓ | 62.20%↓ | 38.10%↓ | 46.50%↓ | 42.80%↓ | 33.90%↓ | 18.70%↓ | 68.30%↓ | 32.40%↓ | 35.90%↓ | 50.90%↓
GLM-4V-Plus | compensation | 99.30%↓ | 80.60%↓ | 54.50%↓ | 31.10%↓ | 38.40%↓ | 35.00%↓ | 28.30%↓ | 16.30%↓ | 63.10%↓ | 28.10%↓ | 27.30%↓ | 45.64%↓
Doubao 1.5 Vision-Pro | visual | 100.00% | 86.00% | 75.60% | 44.20% | 63.10% | 48.80% | 41.30% | 35.20% | 72.60% | 42.20% | 46.30% | 59.57% |
Doubao 1.5 Vision-Pro | fusion | 100.00% | 87.20%↑ | 73.10%↓ | 44.90%↑ | 60.30%↓ | 51.30%↑ | 38.70%↓ | 34.60%↓ | 71.50%↓ | 36.90%↓ | 43.90%↓ | 58.40%↓
Doubao 1.5 Vision-Pro | compensation | 99.70%↓ | 85.80%↓ | 61.50%↓ | 32.10%↓ | 42.50%↓ | 38.70%↓ | 23.70%↓ | 21.50%↓ | 58.80%↓ | 23.70%↓ | 22.50%↓ | 46.41%↓
GPT-4o | visual | 100.00% | 89.40% | 80.90% | 55.10% | 77.80% | 59.20% | 61.00% | 41.70% | 77.10% | 52.40% | 44.90% | 67.23% |
GPT-4o | fusion | 100.00% | 90.50%↑ | 80.10%↓ | 57.49%↑ | 70.70%↓ | 55.00%↓ | 56.60%↓ | 39.10%↓ | 74.80%↓ | 54.50%↑ | 35.20%↓ | 64.91%↓
GPT-4o | compensation | 100.00% | 87.60%↓ | 70.60%↓ | 45.50%↓ | 52.90%↓ | 47.40%↓ | 44.90%↓ | 31.90%↓ | 67.20%↓ | 38.10%↓ | 22.30%↓ | 55.31%↓
Gemini 2.5 Flash | visual | 99.50% | 88.60% | 81.40% | 53.90% | 67.90% | 56.90% | 56.50% | 44.20% | 82.10% | 58.10% | 45.80% | 66.81% |
Gemini 2.5 Flash | fusion | 99.60%↑ | 95.20%↑ | 84.00%↑ | 58.40%↑ | 73.00%↑ | 61.40%↑ | 64.90%↑ | 52.60%↑ | 84.30%↑ | 65.80%↑ | 46.70%↑ | 71.45%↑
Gemini 2.5 Flash | compensation | 99.90%↑ | 94.90%↑ | 75.10%↓ | 51.50%↓ | 61.80%↓ | 58.30%↓ | 55.20%↓ | 43.20%↓ | 75.60%↓ | 55.40%↓ | 32.60%↓ | 63.95%↓
Statistical analysis based on the coefficient of variation:
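The coefficient of variation (CV = standard deviation / mean) summarizes how strongly a task spreads the models apart; tasks with a higher CV are more discriminative. A minimal sketch of the computation, assuming (as one plausible grouping) that the CV is taken per task across the models' visual-only accuracies; the MSR values below are copied from the table above:

```python
from statistics import mean, pstdev

def coefficient_of_variation(values):
    """CV = population standard deviation divided by the mean."""
    m = mean(values)
    return pstdev(values) / m if m else float("nan")

# MSR accuracies (visual-only) of the 12 evaluated models, from the table above.
msr_visual = [5.8, 3.9, 19.2, 22.6, 39.1, 36.1, 37.3, 32.3, 42.3, 42.2, 52.4, 58.1]
print(f"CV of MSR (visual-only): {coefficient_of_variation(msr_visual):.2f}")
```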
This section presents fine-grained question-answering statistics under visual degradation and under the compensation setting; each model has one row per setting. Data mark omission, occlusion, label omission, axis omission, and legend omission are structural-level degradations; blurring and rotation are pixel-level. Performance changes exceeding 5% are annotated with the absolute difference.
MLLMs | Experiment | data mark omission | occlusion | label omission | axis omission | legend omission | blurring | rotation | Average |
---|---|---|---|---|---|---|---|---|---|
Qwen2.5-VL-72B | degradation | 42.53% | 53.65% | 46.24% | 46.65% | 50.80% | 44.12% | 46.13% | 47.09%
Qwen2.5-VL-72B | compensation | 43.66% | 56.73% | 58.95%↑12.71% | 41.45%↓5.20% | 48.97% | 45.39% | 45.29% | 47.73%
GPT-4o | degradation | 49.37% | 59.21% | 53.11% | 54.62% | 60.07% | 55.23% | 48.40% | 54.33%
GPT-4o | compensation | 52.87% | 68.10%↑8.89% | 71.72%↑18.61% | 55.51% | 65.23%↑5.16% | 59.59% | 52.51% | 60.13%↑5.80%
Gemini 2.5 Flash | degradation | 44.92% | 59.11% | 43.77% | 52.61% | 57.62% | 58.48% | 55.41% | 53.07%
Gemini 2.5 Flash | compensation | 65.50%↑20.58% | 81.44%↑22.34% | 83.62%↑39.85% | 56.43% | 64.71%↑7.09% | 68.68%↑10.20% | 64.90%↑9.48% | 67.71%↑14.65%
Average | degradation | 45.61% | 57.32% | 47.71% | 51.29% | 56.17% | 52.61% | 49.98% |
Average | compensation | 54.01%↑8.40% | 68.76%↑11.44% | 71.43%↑23.72% | 51.13% | 59.64% | 57.88%↑5.28% | 54.23% |
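The annotations above follow a simple rule: a compensation-row value is flagged, with the absolute difference, only when it changes by more than 5 percentage points relative to the corresponding degradation-row value. A small check that reproduces two annotated cells from the table:

```python
def annotate(before, after, threshold=5.0):
    """Format `after`, appending an arrow and the absolute change when it exceeds the threshold (percentage points)."""
    delta = after - before
    if abs(delta) <= threshold:
        return f"{after:.2f}%"
    arrow = "↑" if delta > 0 else "↓"
    return f"{after:.2f}%{arrow}{abs(delta):.2f}%"

print(annotate(43.77, 83.62))  # Gemini 2.5 Flash, label omission -> "83.62%↑39.85%"
print(annotate(46.65, 41.45))  # Qwen2.5-VL-72B, axis omission    -> "41.45%↓5.20%"
```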
Degradation degree and question-answering accuracy statistics:
We employ a Test-Time Scaling-based framework to deconstruct reasoning traces at a fine granularity, analyzing both which modalities each reasoning step activates and which chart components it relies on as evidence.
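The full deconstruction procedure is not reproduced here. As a rough approximation of the idea, each reasoning step can be tagged with the modality it activates (visual, structured data, or textual context) and the chart components it cites as evidence; the keyword lists and the sentence-level step splitting below are hypothetical stand-ins for the actual framework:

```python
import re

# Hypothetical cue lists; the actual framework performs a richer, model-based deconstruction.
MODALITY_CUES = {
    "visual": ["the chart shows", "the bar", "the line", "color", "visually"],
    "structured": ["the table", "row", "column", "the data shows", "json"],
    "textual": ["the question asks", "as stated", "given that"],
}
COMPONENT_CUES = {
    "axis": ["x-axis", "y-axis", "axis"],
    "legend": ["legend"],
    "data mark": ["bar", "point", "slice", "segment"],
    "label": ["label", "tick"],
    "title": ["title"],
}

def deconstruct(trace: str):
    """Split a reasoning trace into steps and tag the modalities and chart components each step relies on."""
    steps = [s.strip() for s in re.split(r"(?<=[.!?])\s+", trace) if s.strip()]
    tagged = []
    for i, step in enumerate(steps):
        low = step.lower()
        tagged.append({
            "step": i,
            "text": step,
            "modalities": [m for m, cues in MODALITY_CUES.items() if any(cue in low for cue in cues)],
            "components": [c for c, cues in COMPONENT_CUES.items() if any(cue in low for cue in cues)],
        })
    return tagged
```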
Positive Reasoning Instances:
Negative Reasoning Instances:
Modality Activation Frequency and Temporal Patterns:
Chart Component Dependency Patterns:
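Given per-step tags such as those produced by the sketch above, both statistics reduce to counting: how often each modality is activated, at what normalized position in the trace those activations occur, and how often each chart component is cited as evidence. A minimal aggregation sketch (same hypothetical data structures as above):

```python
from collections import Counter, defaultdict

def aggregate(tagged_traces):
    """Compute modality activation frequency, mean temporal position, and chart-component dependency counts."""
    modality_freq = Counter()
    modality_pos = defaultdict(list)   # normalized positions of each modality's activations
    component_freq = Counter()

    for trace in tagged_traces:        # one tagged trace = list of step dicts from deconstruct()
        n = max(len(trace), 1)
        for step in trace:
            pos = step["step"] / n     # 0.0 = start of the trace, close to 1.0 = end
            for m in step["modalities"]:
                modality_freq[m] += 1
                modality_pos[m].append(pos)
            for c in step["components"]:
                component_freq[c] += 1

    mean_pos = {m: sum(p) / len(p) for m, p in modality_pos.items()}
    return modality_freq, mean_pos, component_freq
```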
If you find this work useful for your research, please cite our paper:
@article{ChartMind2025li,
title={ChartMind: Benchmark and Reasoning Insights of Multimodal Chart Question Answering},
  author={Tong Li and Guodao Sun and Shunkai Wang and Zuoyu Tang and Yang Shu and Xueqian Zheng and Haixia Wang and Ronghua Liang},
year={2025}
}
For questions about this work, please contact: Tong Li (litong@zjut.edu.cn, https://tongli97.github.io/)