ChartMind: Benchmark and Deconstruction for Multimodal Chart Reasoning

Tong Li, Guodao Sun, Shunkai Wang, Zuoyu Tang, Yang Shu, Xueqian Zheng, Zhentao Zheng,
Qi Jiang, Haixia Wang, Ronghua Liang

Zhejiang University of Technology · Zhejiang University of Science and Technology

📑 Paper ⭐ Code 🧱 Dataset
Figure: ChartMind overview.

Abstract

Existing evaluations of MLLM chart reasoning are limited to single visual inputs and “black-box” metrics, lacking “white-box” analysis of thinking processes and reasoning patterns. To address this, we present Mega60K, a benchmark covering 21 chart types and 11 question-answering tasks, enriched with reasoning traces. We further introduce a reasoning deconstruction framework to quantify multimodal activation strategies, and we conduct a reasoning paradigm transfer experiment to explore logic alignment via reasoning supervision. Evaluations of 12 representative MLLMs under three reasoning settings (visual-only, multimodal fusion, and multimodal compensation) yield the following insights: high-level reasoning tasks (e.g., multi-step logic, pattern recognition, layout optimization) serve as a gold standard for distinguishing MLLMs; naive modality stacking offers limited reasoning gains, whereas structured modalities yield measurable compensatory effects on quantitative tasks under visual degradation; and reasoning paradigm transfer improves the student model (Qwen2.5-VL-7B) by 18.77% in reasoning accuracy while aligning it with the teacher model’s (Gemini 2.5 Flash) reasoning style.

Mega60K Dataset Overview 📊

21 Chart Types


11 Question Tasks

ChartMind Evaluation Space 📈

Evaluation Metrics

We employ six key metrics to comprehensively assess model performance; the benchmarking table below reports the per-task accuracies (Tacc, Racc, and Macc) together with each model's average.

Benchmarking Configurations

To evaluate the multimodal reasoning capabilities of MLLMs, we design three experimental configurations: visual-only (the chart image alone), multimodal fusion (the intact chart image paired with a structured modality such as the underlying data), and multimodal compensation (a degraded chart image for which the structured modality must compensate). A code-level sketch of these settings follows.
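The exact prompting setup is described in the paper; as a rough, hypothetical illustration, the sketch below shows how the three settings could be assembled as chat-style inputs, assuming each sample carries a chart image, a degraded variant, a serialized data table, and a question (function and field names are illustrative, not the authors' implementation).

```python
# Hypothetical sketch of the three benchmarking configurations.
# Assumes each sample provides a chart image, a degraded variant,
# an optional serialized data table, and a natural-language question.

def build_messages(sample: dict, setting: str) -> list[dict]:
    """Assemble a chat-style request for one of the three settings."""
    if setting == "visual":
        # Visual-only: the model sees just the chart image.
        content = [{"type": "image", "path": sample["chart_image"]},
                   {"type": "text", "text": sample["question"]}]
    elif setting == "fusion":
        # Multimodal fusion: the intact chart image plus a structured
        # modality (e.g., the underlying data table) are given together.
        content = [{"type": "image", "path": sample["chart_image"]},
                   {"type": "text", "text": sample["data_table"]},
                   {"type": "text", "text": sample["question"]}]
    elif setting == "compensation":
        # Multimodal compensation: the image is degraded (e.g., blurred,
        # labels removed) and the structured modality must compensate.
        content = [{"type": "image", "path": sample["degraded_chart_image"]},
                   {"type": "text", "text": sample["data_table"]},
                   {"type": "text", "text": sample["question"]}]
    else:
        raise ValueError(f"unknown setting: {setting}")
    return [{"role": "user", "content": content}]
```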

| MLLMs | Experiment | CTR (Tacc) | VEC (Racc) | SRP (Tacc) | VPR (Tacc/Racc) | VE (Tacc/Racc) | EVJ (Tacc/Racc) | SC (Racc) | NF (Macc) | NC (Tacc/Racc) | MSR (Tacc/Racc) | VA (Tacc/Racc) | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Open-source models | | | | | | | | | | | | | |
| CogVLM2 | visual | 5.80% | 15.30% | 16.20% | 7.30% | 7.00% | 5.20% | 2.90% | 4.50% | 35.90% | 5.80% | 11.40% | 10.66% |
| LLaVA 1.5 | visual | 9.70% | 49.70% | 13.60% | 2.30% | 2.00% | 1.80% | 2.10% | 2.70% | 37.70% | 3.90% | 7.40% | 12.08% |
| DeepSeek-VL2 | visual | 85.80% | 74.80% | 52.90% | 31.00% | 40.50% | 32.70% | 11.90% | 16.00% | 49.30% | 19.20% | 28.30% | 40.22% |
| InternVL3 | visual | 75.90% | 84.10% | 57.10% | 29.59% | 58.10% | 45.70% | 35.40% | 21.50% | 64.60% | 22.60% | 27.80% | 47.49% |
| | fusion | 76.50% | 75.60% | 22.80% | 4.00% | 10.10% | 5.10% | 40.70% | 18.50% | 49.50% | 4.10% | 1.50% | 28.04% |
| | compensation | 88.10% | 88.00% | 50.00% | 23.30% | 75.20% | 48.00% | 55.80% | 17.10% | 63.00% | 8.10% | 45.50% | 51.10% |
| LLaMA4 Maverick | visual | 100.00% | 84.90% | 73.60% | 46.30% | 56.80% | 49.70% | 47.50% | 28.40% | 71.60% | 39.10% | 38.80% | 57.88% |
| | fusion | 100.00% | 84.40% | 48.80% | 23.90% | 25.50% | 19.50% | 22.40% | 10.70% | 56.80% | 19.70% | 19.90% | 39.24% |
| | compensation | 99.90% | 86.30% | 63.30% | 36.70% | 41.70% | 38.90% | 32.60% | 18.80% | 62.30% | 30.40% | 22.40% | 48.48% |
| Qwen2.5-VL-32B | visual | 99.10% | 84.90% | 69.60% | 40.60% | 53.90% | 45.10% | 37.20% | 24.40% | 64.60% | 36.10% | 36.40% | 53.81% |
| | fusion | 99.10% | 84.90% | 68.30% | 38.40% | 48.80% | 43.50% | 33.80% | 21.70% | 63.10% | 32.90% | 34.30% | 51.71% |
| | compensation | 99.10% | 81.90% | 57.60% | 27.70% | 37.30% | 33.60% | 25.70% | 14.20% | 56.70% | 25.70% | 29.40% | 44.45% |
| Qwen2.5-VL-72B | visual | 99.80% | 85.00% | 69.50% | 39.80% | 58.50% | 47.00% | 43.30% | 23.80% | 67.80% | 37.30% | 34.30% | 55.10% |
| | fusion | 99.40% | 86.70% | 66.30% | 36.10% | 52.30% | 44.50% | 37.80% | 20.80% | 64.60% | 32.80% | 34.60% | 52.35% |
| | compensation | 99.00% | 84.20% | 57.80% | 26.90% | 38.00% | 35.70% | 27.90% | 13.40% | 58.00% | 25.10% | 28.50% | 44.95% |
| Closed-source models | | | | | | | | | | | | | |
| Claude 3.5 Haiku | visual | 100.00% | 85.20% | 67.10% | 40.00% | 53.40% | 46.50% | 33.90% | 23.00% | 65.60% | 32.30% | 32.50% | 52.68% |
| | fusion | 100.00% | 87.80% | 65.10% | 38.50% | 51.30% | 45.30% | 31.40% | 22.50% | 62.80% | 33.00% | 29.90% | 51.60% |
| | compensation | 99.60% | 91.00% | 57.90% | 27.20% | 37.60% | 40.10% | 22.60% | 18.20% | 56.40% | 28.90% | 24.00% | 45.77% |
| GLM-4V-Plus | visual | 99.90% | 85.50% | 73.70% | 47.00% | 63.80% | 53.30% | 45.00% | 27.30% | 76.20% | 42.30% | 40.20% | 59.47% |
| | fusion | 99.70% | 81.40% | 62.20% | 38.10% | 46.50% | 42.80% | 33.90% | 18.70% | 68.30% | 32.40% | 35.90% | 50.90% |
| | compensation | 99.30% | 80.60% | 54.50% | 31.10% | 38.40% | 35.00% | 28.30% | 16.30% | 63.10% | 28.10% | 27.30% | 45.64% |
| Doubao 1.5 Vision-Pro | visual | 100.00% | 86.00% | 75.60% | 44.20% | 63.10% | 48.80% | 41.30% | 35.20% | 72.60% | 42.20% | 46.30% | 59.57% |
| | fusion | 100.00% | 87.20% | 73.10% | 44.90% | 60.30% | 51.30% | 38.70% | 34.60% | 71.50% | 36.90% | 43.90% | 58.40% |
| | compensation | 99.70% | 85.80% | 61.50% | 32.10% | 42.50% | 38.70% | 23.70% | 21.50% | 58.80% | 23.70% | 22.50% | 46.41% |
| GPT-4o | visual | 100.00% | 89.40% | 80.90% | 55.10% | 77.80% | 59.20% | 61.00% | 41.70% | 77.10% | 52.40% | 44.90% | 67.23% |
| | fusion | 100.00% | 90.50% | 80.10% | 57.49% | 70.70% | 55.00% | 56.60% | 39.10% | 74.80% | 54.50% | 35.20% | 64.91% |
| | compensation | 100.00% | 87.60% | 70.60% | 45.50% | 52.90% | 47.40% | 44.90% | 31.90% | 67.20% | 38.10% | 22.30% | 55.31% |
| Gemini 2.5 Flash | visual | 99.50% | 88.60% | 81.40% | 53.90% | 67.90% | 56.90% | 56.50% | 44.20% | 82.10% | 58.10% | 45.80% | 66.81% |
| | fusion | 99.60% | 95.20% | 84.00% | 58.40% | 73.00% | 61.40% | 64.90% | 52.60% | 84.30% | 65.80% | 46.70% | 71.45% |
| | compensation | 99.90% | 94.90% | 75.10% | 51.50% | 61.80% | 58.30% | 55.20% | 43.20% | 75.60% | 55.40% | 32.60% | 63.95% |

Degradation and Compensation Reasoning

This section presents fine-grained degradation and compensation question-answering statistics. Performance changes exceeding 5 percentage points are marked with ↑/↓; a short sketch of this marking rule follows the table.

The first five degradation types (data mark omission, occlusion, label omission, axis omission, legend omission) are structural-level; blurring and rotation are pixel-level.

| MLLMs | Setting | Data mark omission | Occlusion | Label omission | Axis omission | Legend omission | Blurring | Rotation | Average |
|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-72B | degradation | 42.53% | 53.65% | 46.24% | 46.65% | 50.80% | 44.12% | 46.13% | 47.09% |
| | compensation | 43.66% | 56.73% | 58.95% ↑12.71% | 41.45% ↓5.20% | 48.97% | 45.39% | 45.29% | 47.73% |
| GPT-4o | degradation | 49.37% | 59.21% | 53.11% | 54.62% | 60.07% | 55.23% | 48.40% | 54.33% |
| | compensation | 52.87% | 68.10% ↑8.89% | 71.72% ↑18.61% | 55.51% | 65.23% ↑5.16% | 59.59% | 52.51% | 60.13% ↑5.80% |
| Gemini 2.5 Flash | degradation | 44.92% | 59.11% | 43.77% | 52.61% | 57.62% | 58.48% | 55.41% | 53.07% |
| | compensation | 65.50% ↑20.58% | 81.44% ↑22.34% | 83.62% ↑39.85% | 56.43% | 64.71% ↑7.09% | 68.68% ↑10.20% | 64.90% ↑9.48% | 67.71% ↑14.65% |
| Average | degradation | 45.61% | 57.32% | 47.71% | 51.29% | 56.17% | 52.61% | 49.98% | |
| | compensation | 54.01% ↑8.40% | 68.76% ↑11.44% | 71.43% ↑23.72% | 51.13% | 59.64% | 57.88% ↑5.28% | 54.23% | |
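As a small illustration of the marking rule, the snippet below recomputes the ↑/↓ flags for the Gemini 2.5 Flash rows of the table above, marking only changes whose magnitude exceeds 5 percentage points (values are copied from the table; the script itself is not part of the released code).

```python
# Flag per-condition changes larger than 5 percentage points,
# mirroring the ↑/↓ annotations in the table above.
# Values are the Gemini 2.5 Flash degradation / compensation rows.
conditions = ["data mark omission", "occlusion", "label omission",
              "axis omission", "legend omission", "blurring", "rotation"]
degraded    = [44.92, 59.11, 43.77, 52.61, 57.62, 58.48, 55.41]
compensated = [65.50, 81.44, 83.62, 56.43, 64.71, 68.68, 64.90]

for name, before, after in zip(conditions, degraded, compensated):
    delta = after - before
    mark = ("↑" if delta > 0 else "↓") if abs(delta) > 5 else ""
    print(f"{name:20s} {before:6.2f}% -> {after:6.2f}%  {mark}{abs(delta):.2f}%")
```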

Reasoning Instances: Qualitative Analysis 🔎

To supplement the quantitative metrics, we conduct a qualitative analysis of explicit reasoning traces. We contrast successful alignment examples with failure cases to illustrate the spectrum of MLLM reasoning capabilities and pinpoint common modes of error, such as semantic drift and hallucination.

Positive Reasoning Instances

Successful reasoning traces demonstrating correct logical alignment and evidence utilization.

Negative Reasoning Instances

Negative reasoning traces illustrating semantic drift and hallucination.

Reasoning Deconstruction, Analysis and Transfer

Reasoning Experiment

For deconstructor selection, we compared five representative models covering two mainstream architectures: traditional neural network models (DistilBERT, RoBERTa) and large language models (Llama-4-Scout-17B-16E-Instruct, Qwen3-32B, DeepSeek-R1-Distill-Qwen-32B).

The experimental results (Table 1) show that the traditional models excel in computational efficiency (time $< 0.5$ s, no API cost) but exhibit lower deconstruction accuracy ($\mathcal{U}_{\text{Acc}}$ of at most $63.04\%$). In contrast, the three LLMs achieve superior accuracy ($\mathcal{U}_{\text{Acc}} > 72\%$, $\mathcal{E}_{\text{Acc}} \approx 80\%$). DeepSeek-R1-Distill-Qwen-32B is ultimately selected as the formal deconstructor for its best cost-performance trade-off.

Table 1. Deconstructor comparison: 𝕌Acc, Acc, time, and 𝕀Δ are evaluation metrics; tokens, Pass@1, and Cons@5 are TTS metrics.

| | Methods | 𝕌Acc | Acc | time (s) | 𝕀Δ (MB) | tokens (k) | Pass@1 | Cons@5 |
|---|---|---|---|---|---|---|---|---|
| Traditional | DistilBERT | 16.35% | 66.95% | 0.22 | 0.63 | -- | -- | -- |
| | RoBERTa | 63.04% | 60.40% | 0.45 | 1.18 | -- | -- | -- |
| LLMs | Llama-4-Scout-17B-16E-Instruct | 72.44% | 78.13% | 10.25 | -- | 2.77 | 0.780 | 0.780 |
| | Qwen3-32B | 72.47% | 79.65% | 14.91 | -- | 2.84 | 0.784 | 0.783 |
| | DeepSeek-R1-Distill-Qwen-32B (adopted) | 72.14% | 80.69% | 14.23 | -- | 2.84 | 0.790 | 0.784 |
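Pass@1 and Cons@5 in the TTS columns can be computed from repeated deconstructor samples. The minimal sketch below assumes Pass@1 is single-sample accuracy and Cons@5 is majority-vote consistency over five samples; the paper's exact definitions may differ.

```python
from collections import Counter

def pass_at_1(samples_per_item, references):
    """Fraction of items whose first sampled deconstruction matches the reference."""
    correct = sum(samples[0] == ref for samples, ref in zip(samples_per_item, references))
    return correct / len(references)

def cons_at_k(samples_per_item, references, k=5):
    """Fraction of items whose majority answer over k samples matches the reference."""
    hits = 0
    for samples, ref in zip(samples_per_item, references):
        majority, _ = Counter(samples[:k]).most_common(1)[0]
        hits += (majority == ref)
    return hits / len(references)

# Toy example with hypothetical deconstruction outputs:
preds = [["A", "A", "B", "A", "A"], ["C", "B", "B", "B", "B"]]
refs = ["A", "B"]
print(pass_at_1(preds, refs), cons_at_k(preds, refs))  # 0.5 1.0
```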

Reasoning Paradigm Transfer

This section investigates Reasoning Paradigm Transfer by validating the efficacy of high-quality reasoning traces as a supervisory signal.

We designate Gemini 2.5 Flash (based on its superior performance) as the Teacher Model. We then perform supervised fine-tuning on Qwen2.5-VL-7B (the Student Model) using the Low-Rank Adaptation (LoRA) strategy. This aims to align the student model’s logical style with the teacher’s, enhancing its problem-solving capabilities.
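The exact training recipe is given in the paper; as a rough, hypothetical outline (library calls and hyperparameters are illustrative, not the authors' configuration), a LoRA adapter can be attached to Qwen2.5-VL-7B with Hugging Face peft before supervised fine-tuning on the teacher's reasoning traces:

```python
# Hedged sketch: attaching a LoRA adapter to Qwen2.5-VL-7B for
# supervised fine-tuning on teacher reasoning traces. Hyperparameters
# and target modules are illustrative, not the paper's configuration.
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from peft import LoraConfig, get_peft_model

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# The wrapped model is then trained with a standard SFT loop on
# (chart, question, teacher reasoning trace, answer) tuples.
```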

The table summarizes the substantial performance gains achieved by the student model across tasks and chart types after fine-tuning.

(pre) = Qwen2.5-VL-7B pre-trained; (LoRA) = Qwen2.5-VL-7B after LoRA fine-tuning.

| Chart Type | SRP (pre) | EVJ (pre) | SC (pre) | MSR (pre) | VA (pre) | SRP (LoRA) | EVJ (LoRA) | SC (LoRA) | MSR (LoRA) | VA (LoRA) |
|---|---|---|---|---|---|---|---|---|---|---|
| bar | 90.00% | 100.00% | 85.00% | 70.00% | 90.00% | 90.00% | 100.00% | 85.00% | 65.00% | 100.00% |
| scatter | - | 90.00% | 65.00% | 45.00% | 55.00% | - | 95.00% | 85.00% | 45.00% | 75.00% |
| Average | 68.33% | 83.57% | 47.50% | 53.75% | 73.33% | 88.33% | 89.29% | 67.50%* | 64.38%* | 86.67% |

Citation

If you find this work useful for your research, please cite our paper:

@article{ChartMind2025li,
    title={ChartMind: Benchmark and Deconstruction for Multimodal Chart Reasoning},
    author={Li, Tong and Sun, Guodao and Wang, Shunkai and Tang, Zuoyu and Shu, Yang and Zheng, Xueqian and Zheng, Zhentao and Jiang, Qi and Wang, Haixia and Liang, Ronghua},
    year={2025}
}

For questions about this work, please contact: Tong Li (litong@zjut.edu.cn, https://tongli97.github.io/)