Tong Li†, Guodao Sun†, Shunkai Wang†, Zuoyu Tang†, Yang Shu†, Xueqian Zheng†, Zhentao Zheng†,
Qi Jiang‡, Haixia Wang†, Ronghua Liang‡
†Zhejiang University of Technology ‡Zhejiang University of Science and Technology
Existing MLLM chart-reasoning evaluations are limited to single visual inputs and "black-box" metrics, lacking "white-box" analysis of thinking processes and reasoning patterns. To address this, we present Mega60K, a benchmark covering 21 chart types and 11 question-answering tasks, enriched with reasoning traces. We further introduce a reasoning deconstruction framework to quantify multimodal activation strategies, and we conduct a reasoning paradigm transfer experiment to explore logic alignment via reasoning supervision. Evaluations of 12 representative MLLMs under three reasoning settings (visual-only, multimodal fusion, and multimodal compensation) yield the following insights: high-level reasoning tasks (e.g., multi-step logic, pattern recognition, layout optimization) serve as a gold standard for distinguishing MLLMs; naive modality stacking offers limited reasoning gains, while structured modalities yield measurable compensatory effects on quantitative tasks under visual degradation; and reasoning paradigm transfer improves the student model (Qwen2.5-VL-7B) by 18.77% in reasoning accuracy and aligns it with the teacher model's (Gemini 2.5 Flash) reasoning style.
We employ six key metrics to comprehensively assess model performance. To evaluate the multimodal reasoning capabilities of MLLMs, we design three experimental configurations: visual-only, multimodal fusion, and multimodal compensation (an illustrative input-construction sketch follows the results table below).
| MLLMs | Experiment | CTR | VEC | SRP | VPR | VE | EVJ | SC | NF | NC | MSR | VA | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | | Tacc | Racc | Tacc | Tacc/Racc | Tacc/Racc | Tacc/Racc | Racc | Macc | Tacc/Racc | Tacc/Racc | Tacc/Racc | |
| open-source models | |||||||||||||
| CogVLM2 | visual | 5.80% | 15.30% | 16.20% | 7.30% | 7.00% | 5.20% | 2.90% | 4.50% | 35.90% | 5.80% | 11.40% | 10.66% |
| LLaVA 1.5 | visual | 9.70% | 49.70% | 13.60% | 2.30% | 2.00% | 1.80% | 2.10% | 2.70% | 37.70% | 3.90% | 7.40% | 12.08% |
| DeepSeek-VL2 | visual | 85.80% | 74.80% | 52.90% | 31.00% | 40.50% | 32.70% | 11.90% | 16.00% | 49.30% | 19.20% | 28.30% | 40.22% |
| InternVL3 | visual | 75.90% | 84.10% | 57.10% | 29.59% | 58.10% | 45.70% | 35.40% | 21.50% | 64.60% | 22.60% | 27.80% | 47.49% |
| | fusion | 76.50%↑ | 75.60%↓ | 22.80%↓ | 4.00%↓ | 10.10%↓ | 5.10%↓ | 40.70%↑ | 18.50%↓ | 49.50%↓ | 4.10%↓ | 1.50%↓ | 28.04%↓ |
| | compensation | 88.10%↑ | 88.00%↑ | 50.00%↓ | 23.30%↓ | 75.20%↑ | 48.00%↑ | 55.80%↑ | 17.10%↓ | 63.00%↓ | 8.10%↓ | 45.50%↑ | 51.10%↑ |
| LLaMA4 Maverick | visual | 100.00% | 84.90% | 73.60% | 46.30% | 56.80% | 49.70% | 47.50% | 28.40% | 71.60% | 39.10% | 38.80% | 57.88% |
| | fusion | 100.00% | 84.40%↓ | 48.80%↓ | 23.90%↓ | 25.50%↓ | 19.50%↓ | 22.40%↓ | 10.70%↓ | 56.80%↓ | 19.70%↓ | 19.90%↓ | 39.24%↓ |
| | compensation | 99.90%↓ | 86.30%↑ | 63.30%↓ | 36.70%↓ | 41.70%↓ | 38.90%↓ | 32.60%↓ | 18.80%↓ | 62.30%↓ | 30.40%↓ | 22.40%↓ | 48.48%↓ |
| Qwen2.5-VL-32B | visual | 99.10% | 84.90% | 69.60% | 40.60% | 53.90% | 45.10% | 37.20% | 24.40% | 64.60% | 36.10% | 36.40% | 53.81% |
| | fusion | 99.10% | 84.90% | 68.30%↓ | 38.40%↓ | 48.80%↓ | 43.50%↓ | 33.80%↓ | 21.70%↓ | 63.10%↓ | 32.90%↓ | 34.30%↓ | 51.71%↓ |
| | compensation | 99.10% | 81.90%↓ | 57.60%↓ | 27.70%↓ | 37.30%↓ | 33.60%↓ | 25.70%↓ | 14.20%↓ | 56.70%↓ | 25.70%↓ | 29.40%↓ | 44.45%↓ |
| Qwen2.5-VL-72B | visual | 99.80% | 85.00% | 69.50% | 39.80% | 58.50% | 47.00% | 43.30% | 23.80% | 67.80% | 37.30% | 34.30% | 55.10% |
| | fusion | 99.40%↓ | 86.70%↑ | 66.30%↓ | 36.10%↓ | 52.30%↓ | 44.50%↓ | 37.80%↓ | 20.80%↓ | 64.60%↓ | 32.80%↓ | 34.60%↑ | 52.35%↓ |
| | compensation | 99.00%↓ | 84.20%↓ | 57.80%↓ | 26.90%↓ | 38.00%↓ | 35.70%↓ | 27.90%↓ | 13.40%↓ | 58.00%↓ | 25.10%↓ | 28.50%↓ | 44.95%↓ |
| closed-source models | |||||||||||||
| Claude 3.5 Haiku | visual | 100.00% | 85.20% | 67.10% | 40.00% | 53.40% | 46.50% | 33.90% | 23.00% | 65.60% | 32.30% | 32.50% | 52.68% |
| | fusion | 100.00% | 87.80%↑ | 65.10%↓ | 38.50%↓ | 51.30%↓ | 45.30%↓ | 31.40%↓ | 22.50%↓ | 62.80%↓ | 33.00%↑ | 29.90%↓ | 51.60%↓ |
| | compensation | 99.60%↓ | 91.00%↑ | 57.90%↓ | 27.20%↓ | 37.60%↓ | 40.10%↓ | 22.60%↓ | 18.20%↓ | 56.40%↓ | 28.90%↓ | 24.00%↓ | 45.77%↓ |
| GLM-4V-Plus | visual | 99.90% | 85.50% | 73.70% | 47.00% | 63.80% | 53.30% | 45.00% | 27.30% | 76.20% | 42.30% | 40.20% | 59.47% |
| | fusion | 99.70%↓ | 81.40%↓ | 62.20%↓ | 38.10%↓ | 46.50%↓ | 42.80%↓ | 33.90%↓ | 18.70%↓ | 68.30%↓ | 32.40%↓ | 35.90%↓ | 50.90%↓ |
| | compensation | 99.30%↓ | 80.60%↓ | 54.50%↓ | 31.10%↓ | 38.40%↓ | 35.00%↓ | 28.30%↓ | 16.30%↓ | 63.10%↓ | 28.10%↓ | 27.30%↓ | 45.64%↓ |
| Doubao 1.5 Vision-Pro | visual | 100.00% | 86.00% | 75.60% | 44.20% | 63.10% | 48.80% | 41.30% | 35.20% | 72.60% | 42.20% | 46.30% | 59.57% |
| | fusion | 100.00% | 87.20%↑ | 73.10%↓ | 44.90%↑ | 60.30%↓ | 51.30%↑ | 38.70%↓ | 34.60%↓ | 71.50%↓ | 36.90%↓ | 43.90%↓ | 58.40%↓ |
| | compensation | 99.70%↓ | 85.80%↓ | 61.50%↓ | 32.10%↓ | 42.50%↓ | 38.70%↓ | 23.70%↓ | 21.50%↓ | 58.80%↓ | 23.70%↓ | 22.50%↓ | 46.41%↓ |
| GPT-4o | visual | 100.00% | 89.40% | 80.90% | 55.10% | 77.80% | 59.20% | 61.00% | 41.70% | 77.10% | 52.40% | 44.90% | 67.23% |
| | fusion | 100.00% | 90.50%↑ | 80.10%↓ | 57.49%↑ | 70.70%↓ | 55.00%↓ | 56.60%↓ | 39.10%↓ | 74.80%↓ | 54.50%↑ | 35.20%↓ | 64.91%↓ |
| | compensation | 100.00% | 87.60%↓ | 70.60%↓ | 45.50%↓ | 52.90%↓ | 47.40%↓ | 44.90%↓ | 31.90%↓ | 67.20%↓ | 38.10%↓ | 22.30%↓ | 55.31%↓ |
| Gemini 2.5 Flash | visual | 99.50% | 88.60% | 81.40% | 53.90% | 67.90% | 56.90% | 56.50% | 44.20% | 82.10% | 58.10% | 45.80% | 66.81% |
| | fusion | 99.60%↑ | 95.20%↑ | 84.00%↑ | 58.40%↑ | 73.00%↑ | 61.40%↑ | 64.90%↑ | 52.60%↑ | 84.30%↑ | 65.80%↑ | 46.70%↑ | 71.45%↑ |
| | compensation | 99.90%↑ | 94.90%↑ | 75.10%↓ | 51.50%↓ | 61.80%↓ | 58.30%↓ | 55.20%↓ | 43.20%↓ | 75.60%↓ | 55.40%↓ | 32.60%↓ | 63.95%↓ |
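To make the three settings concrete, the following is a minimal sketch of how a benchmark sample could be assembled into model inputs. It assumes each sample provides a rendered chart, a perturbed chart, and a structured-text serialization of the underlying data; the field names and the `build_inputs` helper are illustrative placeholders, not the benchmark's actual interface.

```python
# Illustrative sketch of the three evaluation settings. The record fields
# (chart_image, degraded_image, data_table) and build_inputs() are placeholders.
from dataclasses import dataclass

@dataclass
class ChartSample:
    chart_image: str      # path to the rendered chart
    degraded_image: str   # path to the perturbed chart (e.g., label omission)
    data_table: str       # structured-text serialization of the underlying data
    question: str

def build_inputs(sample: ChartSample, setting: str) -> dict:
    """Assemble the model input for one of the three reasoning settings."""
    if setting == "visual":        # visual-only: chart image + question
        return {"images": [sample.chart_image], "text": sample.question}
    if setting == "fusion":        # multimodal fusion: image + structured data
        return {"images": [sample.chart_image],
                "text": f"{sample.data_table}\n\n{sample.question}"}
    if setting == "compensation":  # compensation: degraded image + structured data
        return {"images": [sample.degraded_image],
                "text": f"{sample.data_table}\n\n{sample.question}"}
    raise ValueError(f"unknown setting: {setting}")
```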
This section presents fine-grained question-answering statistics under visual degradation and compensation. For each model, the first row reports accuracy on degraded charts and the second row reports accuracy under compensation; performance changes exceeding 5% are highlighted, and a short sketch after the table shows how these deltas are computed and flagged.
| MLLMs | Structural-level | | | | | Pixel-level | | Average |
|---|---|---|---|---|---|---|---|---|
| | data mark omission | occlusion | label omission | axis omission | legend omission | blurring | rotation | |
| Qwen2.5-VL-72B | 42.53% | 53.65% | 46.24% | 46.65% | 50.80% | 44.12% | 46.13% | 47.09% |
| | 43.66% | 56.73% | 58.95%↑12.71% | 41.45%↓5.20% | 48.97% | 45.39% | 45.29% | 47.73% |
| GPT-4o | 49.37% | 59.21% | 53.11% | 54.62% | 60.07% | 55.23% | 48.40% | 54.33% |
| | 52.87% | 68.10%↑8.89% | 71.72%↑18.61% | 55.51% | 65.23%↑5.16% | 59.59% | 52.51% | 60.13%↑5.80% |
| Gemini 2.5 Flash | 44.92% | 59.11% | 43.77% | 52.61% | 57.62% | 58.48% | 55.41% | 53.07% |
| | 65.50%↑20.58% | 81.44%↑22.34% | 83.62%↑39.85% | 56.43% | 64.71%↑7.09% | 68.68%↑10.20% | 64.90%↑9.48% | 67.71%↑14.65% |
| Average | 45.61% | 57.32% | 47.71% | 51.29% | 56.17% | 52.61% | 49.98% | |
| | 54.01%↑8.40% | 68.76%↑11.44% | 71.43%↑23.72% | 51.13% | 59.64% | 57.88%↑5.28% | 54.23% | |
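The per-perturbation deltas in the compensation rows follow from simple arithmetic against the degradation rows. The sketch below applies the 5% highlighting rule; the function name is illustrative, and the example reuses the Gemini 2.5 Flash label-omission entry from the table above.

```python
# Minimal sketch of how the compensation deltas are derived and flagged:
# differences larger than 5 percentage points are annotated, mirroring the
# highlighting rule used in the table above.
def annotate_compensation(degraded: dict[str, float], compensated: dict[str, float],
                          threshold: float = 5.0) -> dict[str, str]:
    """Return formatted cells like '83.62% (+39.85%)' for flagged entries."""
    cells = {}
    for perturbation, base in degraded.items():
        gain = compensated[perturbation] - base
        cell = f"{compensated[perturbation]:.2f}%"
        if abs(gain) > threshold:          # highlight only large shifts
            cell += f" ({gain:+.2f}%)"
        cells[perturbation] = cell
    return cells

# Example with the Gemini 2.5 Flash label-omission entry:
# annotate_compensation({"label omission": 43.77}, {"label omission": 83.62})
# -> {"label omission": "83.62% (+39.85%)"}
```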
To supplement the quantitative metrics, we conduct a qualitative analysis of explicit reasoning traces. We contrast successful alignment examples with failure cases to illustrate the spectrum of MLLM reasoning capabilities and pinpoint common modes of error, such as semantic drift and hallucination.
For deconstructor selection, we compare five representative models covering two mainstream architectures: traditional neural network models (DistilBERT, RoBERTa) and large language models (Llama-4-Scout-17B-16E-Instruct, Qwen3-32B, DeepSeek-R1-Distill-Qwen-32B).
- Traditional models: deconstruction is modeled as a text classification problem (e.g., 4-class classification for reasoning unit parsing and multi-label classification for chart element parsing).
- LLMs: a prompt-driven deconstruction approach combined with a Test-Time Scaling (TTS) strategy (5 rounds of sampling with Self-Consistency aggregation).
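A minimal sketch of the TTS strategy follows, assuming the deconstructor returns a JSON-serialized parse; `deconstruct_once` stands in for a single prompted LLM call and is a placeholder, not the framework's actual API.

```python
# Sketch of Test-Time Scaling: sample the LLM deconstructor 5 times and keep
# the majority (self-consistent) parse. deconstruct_once() is a placeholder
# for one prompted call that returns a JSON string of reasoning units.
import json
from collections import Counter
from typing import Callable

def self_consistent_deconstruction(deconstruct_once: Callable[[str], str],
                                   reasoning_trace: str,
                                   n_samples: int = 5) -> dict:
    """Run n_samples stochastic parses and return the most frequent one."""
    samples = [deconstruct_once(reasoning_trace) for _ in range(n_samples)]
    # Vote on a canonicalized string form so equivalent parses agree exactly.
    canonical = [json.dumps(json.loads(s), sort_keys=True) for s in samples]
    majority, _count = Counter(canonical).most_common(1)[0]
    return json.loads(majority)
```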
The experimental results (Table 1) show that the traditional models excel in computational efficiency (inference time $< 0.5$ s, no API cost) but exhibit lower deconstruction accuracy ($\mathcal{U}_{\text{Acc}}$ of at most $63.04\%$). In contrast, the three LLMs achieve superior accuracy ($\mathcal{U}_{\text{Acc}} > 72\%$, $\mathcal{E}_{\text{Acc}} \approx 80\%$). DeepSeek-R1-Distill-Qwen-32B is ultimately selected as the formal deconstructor due to its best cost-performance ratio.
| | Methods | Evaluation Metrics | | | | | TTS Metrics | |
|---|---|---|---|---|---|---|---|---|
| | | 𝕌Acc | ℰAcc | time (s) | 𝕀Δ (MB) | tokens (k) | Pass@1 | Cons@5 |
| Traditional | DistilBERT | 16.35% | 66.95% | 0.22 | 0.63 | -- | -- | -- |
| | RoBERTa | 63.04% | 60.40% | 0.45 | 1.18 | -- | -- | -- |
| LLMs | Llama-4-Scout-17B-16E-Instruct | 72.44% | 78.13% | 10.25 | -- | 2.77 | 0.780 | 0.780 |
| | Qwen3-32B | 72.47% | 79.65% | 14.91 | -- | 2.84 | 0.784 | 0.783 |
| | DeepSeek-R1-Distill-Qwen-32B (adopted) | 72.14% | 80.69% | 14.23 | -- | 2.84 | 0.790 | 0.784 |
This section investigates Reasoning Paradigm Transfer by validating the efficacy of high-quality reasoning traces as a supervisory signal.
We designate Gemini 2.5 Flash (based on its superior performance) as the Teacher Model. We then perform supervised fine-tuning on Qwen2.5-VL-7B (the Student Model) using the Low-Rank Adaptation (LoRA) strategy. This aims to align the student model’s logical style with the teacher’s, enhancing its problem-solving capabilities.
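The following is a minimal LoRA setup sketch for the student model, assuming the Hugging Face transformers + peft stack and a recent transformers release that ships the Qwen2.5-VL classes; the hyperparameters (rank, alpha, dropout, target modules) are illustrative defaults, not the values used in the paper.

```python
# Illustrative LoRA configuration for fine-tuning the student model on
# teacher reasoning traces. Hyperparameters are placeholders.
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from peft import LoraConfig, get_peft_model

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"          # student model checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
)

lora_config = LoraConfig(
    r=16,                                         # low-rank dimension (illustrative)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()                # only the LoRA adapters are trained

# Supervision pairs: (chart image, question) -> teacher reasoning trace + answer,
# trained with a standard language-modeling loss on the teacher outputs.
```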
The table summarizes the substantial performance gains achieved by the student model across tasks and chart types after fine-tuning.
| Chart Type | Qwen2.5-VL-7B (Pre-trained) | | | | | Qwen2.5-VL-7B (LoRA Fine-tuned) | | | | |
|---|---|---|---|---|---|---|---|---|---|---|
| | SRP | EVJ | SC | MSR | VA | SRP | EVJ | SC | MSR | VA |
| bar | 90.00% | 100.00% | 85.00% | 70.00% | 90.00% | 90.00% | 100.00% | 85.00% | 65.00% | 100.00% |
| scatter | - | 90.00% | 65.00% | 45.00% | 55.00% | - | 95.00% | 85.00% | 45.00% | 75.00% |
| Average | 68.33% | 83.57% | 47.50% | 53.75% | 73.33% | 88.33% | 89.29% | 67.50%* | 64.38%* | 86.67% |
If you find this work useful for your research, please cite our paper:
```bibtex
@article{ChartMind2025li,
  title={ChartMind: Benchmark and Deconstruction for Multimodal Chart Reasoning},
  author={Li, Tong and Sun, Guodao and Wang, Shunkai and Tang, Zuoyu and Shu, Yang and Zheng, Xueqian and Zheng, Zhentao and Jiang, Qi and Wang, Haixia and Liang, Ronghua},
  year={2025}
}
```
For questions about this work, please contact: Tong Li (litong@zjut.edu.cn, https://tongli97.github.io/)