Tong Li†, Guodao Sun†, Shunkai Wang†, Zuoyu Tang†, Yang Shu†, Xueqian Zheng†, Haixia Wang†, Ronghua Liang‡
†Zhejiang University of Technology ‡Zhejiang University of Science and Technology
Existing ChartQA evaluations for multimodal large language models (MLLMs) focus on visual-only input and rely solely on "black-box" accuracy metrics, offering limited insight into reasoning traces. To fill these gaps, we introduce Mega60k, a benchmark covering 21 chart types and 11 QA tasks; collect ChartQA reasoning traces from MLLMs; and propose a reasoning deconstruction framework to parse multimodal activation patterns and reasoning evidence usage. Evaluating 12 representative MLLMs (7 open-source and 5 closed-source) under three conditions (visual-only, multimodal fusion, and multimodal compensation) reveals key findings: (1) high-level tasks (multi-step logic, visual pattern recognition, layout optimization) serve as a gold standard for distinguishing MLLMs; (2) mere modality stacking struggles to extend reasoning boundaries, but shows compensatory potential in quantitative visual understanding tasks; (3) Gemini 2.5 Flash and GPT-4o show positive signals in leveraging structured-modality reasoning to mitigate visual degradation such as omissions, occlusion, blurring, and rotation.
We employ six key metrics to comprehensively assess model performance:
To evaluate the multimodal reasoning capabilities of MLLMs, we design three experimental configurations: visual-only, multimodal fusion, and multimodal compensation.
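As a rough illustration of how the three configurations differ in model input (a minimal sketch under assumptions, not the exact evaluation pipeline; the `build_request` helper and the JSON serialization of the data table are hypothetical):

```python
import json

def build_request(question, chart_image, data_table=None, degraded_image=None):
    """Assemble one ChartQA request for the three evaluation conditions.

    visual-only:   clean chart image + question
    fusion:        clean chart image + serialized data table + question
    compensation:  degraded chart image + serialized data table + question
    Illustrative only; the actual prompts and serialization may differ.
    """
    if degraded_image is not None:      # multimodal compensation
        image, table = degraded_image, data_table
    elif data_table is not None:        # multimodal fusion
        image, table = chart_image, data_table
    else:                               # visual-only
        image, table = chart_image, None

    prompt = question
    if table is not None:
        prompt = f"Underlying data (JSON):\n{json.dumps(table)}\n\nQuestion: {question}"
    return {"images": [image], "text": prompt}
```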
MLLMs | Experiment | CTR (Tacc) | VEC (Racc) | SRP (Tacc) | VPR (Tacc/Racc) | VE (Tacc/Racc) | EVJ (Tacc/Racc) | SC (Racc) | NF (Macc) | NC (Tacc/Racc) | MSR (Tacc/Racc) | VA (Tacc/Racc) | Average |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
open-source models | |||||||||||||
CogVLM2 | visual | 5.80% | 15.30% | 16.20% | 7.30% | 7.00% | 5.20% | 2.90% | 4.50% | 35.90% | 5.80% | 11.40% | 10.66% |
LLaVA 1.5 | visual | 9.70% | 49.70% | 13.60% | 2.30% | 2.00% | 1.80% | 2.10% | 2.70% | 37.70% | 3.90% | 7.40% | 12.08% |
DeepSeek-VL2 | visual | 85.80% | 74.80% | 52.90% | 31.00% | 40.50% | 32.70% | 11.90% | 16.00% | 49.30% | 19.20% | 28.30% | 40.22% |
InternVL3 | visual | 75.90% | 84.10% | 57.10% | 29.59% | 58.10% | 45.70% | 35.40% | 21.50% | 64.60% | 22.60% | 27.80% | 47.49% |
InternVL3 | fusion | 76.50%↑ | 75.60%↓ | 22.80%↓ | 4.00%↓ | 10.10%↓ | 5.10%↓ | 40.70%↑ | 18.50%↓ | 49.50%↓ | 4.10%↓ | 1.50%↓ | 28.04%↓
InternVL3 | compensation | 88.10%↑ | 88.00%↑ | 50.00%↓ | 23.30%↓ | 75.20%↑ | 48.00%↑ | 55.80%↑ | 17.10%↓ | 63.00%↓ | 8.10%↓ | 45.50%↑ | 51.10%↑
LLaMA4 Maverick | visual | 100.00% | 84.90% | 73.60% | 46.30% | 56.80% | 49.70% | 47.50% | 28.40% | 71.60% | 39.10% | 38.80% | 57.88% |
LLaMA4 Maverick | fusion | 100.00% | 84.40%↓ | 48.80%↓ | 23.90%↓ | 25.50%↓ | 19.50%↓ | 22.40%↓ | 10.70%↓ | 56.80%↓ | 19.70%↓ | 19.90%↓ | 39.24%↓
LLaMA4 Maverick | compensation | 99.90%↓ | 86.30%↑ | 63.30%↓ | 36.70%↓ | 41.70%↓ | 38.90%↓ | 32.60%↓ | 18.80%↓ | 62.30%↓ | 30.40%↓ | 22.40%↓ | 48.48%↓
Qwen2.5-VL-32B | visual | 99.10% | 84.90% | 69.60% | 40.60% | 53.90% | 45.10% | 37.20% | 24.40% | 64.60% | 36.10% | 36.40% | 53.81% |
Qwen2.5-VL-32B | fusion | 99.10% | 84.90% | 68.30%↓ | 38.40%↓ | 48.80%↓ | 43.50%↓ | 33.80%↓ | 21.70%↓ | 63.10%↓ | 32.90%↓ | 34.30%↓ | 51.71%↓
Qwen2.5-VL-32B | compensation | 99.10% | 81.90%↓ | 57.60%↓ | 27.70%↓ | 37.30%↓ | 33.60%↓ | 25.70%↓ | 14.20%↓ | 56.70%↓ | 25.70%↓ | 29.40%↓ | 44.45%↓
Qwen2.5-VL-72B | visual | 99.80% | 85.00% | 69.50% | 39.80% | 58.50% | 47.00% | 43.30% | 23.80% | 67.80% | 37.30% | 34.30% | 55.10% |
Qwen2.5-VL-72B | fusion | 99.40%↓ | 86.70%↑ | 66.30%↓ | 36.10%↓ | 52.30%↓ | 44.50%↓ | 37.80%↓ | 20.80%↓ | 64.60%↓ | 32.80%↓ | 34.60%↑ | 52.35%↓
Qwen2.5-VL-72B | compensation | 99.00%↓ | 84.20%↓ | 57.80%↓ | 26.90%↓ | 38.00%↓ | 35.70%↓ | 27.90%↓ | 13.40%↓ | 58.00%↓ | 25.10%↓ | 28.50%↓ | 44.95%↓
closed-source models | |||||||||||||
Claude 3.5 Haiku | visual | 100.00% | 85.20% | 67.10% | 40.00% | 53.40% | 46.50% | 33.90% | 23.00% | 65.60% | 32.30% | 32.50% | 52.68% |
Claude 3.5 Haiku | fusion | 100.00% | 87.80%↑ | 65.10%↓ | 38.50%↓ | 51.30%↓ | 45.30%↓ | 31.40%↓ | 22.50%↓ | 62.80%↓ | 33.00%↑ | 29.90%↓ | 51.60%↓
Claude 3.5 Haiku | compensation | 99.60%↓ | 91.00%↑ | 57.90%↓ | 27.20%↓ | 37.60%↓ | 40.10%↓ | 22.60%↓ | 18.20%↓ | 56.40%↓ | 28.90%↓ | 24.00%↓ | 45.77%↓
GLM-4V-Plus | visual | 99.90% | 85.50% | 73.70% | 47.00% | 63.80% | 53.30% | 45.00% | 27.30% | 76.20% | 42.30% | 40.20% | 59.47% |
GLM-4V-Plus | fusion | 99.70%↓ | 81.40%↓ | 62.20%↓ | 38.10%↓ | 46.50%↓ | 42.80%↓ | 33.90%↓ | 18.70%↓ | 68.30%↓ | 32.40%↓ | 35.90%↓ | 50.90%↓
GLM-4V-Plus | compensation | 99.30%↓ | 80.60%↓ | 54.50%↓ | 31.10%↓ | 38.40%↓ | 35.00%↓ | 28.30%↓ | 16.30%↓ | 63.10%↓ | 28.10%↓ | 27.30%↓ | 45.64%↓
Doubao 1.5 Vision-Pro | visual | 100.00% | 86.00% | 75.60% | 44.20% | 63.10% | 48.80% | 41.30% | 35.20% | 72.60% | 42.20% | 46.30% | 59.57% |
Doubao 1.5 Vision-Pro | fusion | 100.00% | 87.20%↑ | 73.10%↓ | 44.90%↑ | 60.30%↓ | 51.30%↑ | 38.70%↓ | 34.60%↓ | 71.50%↓ | 36.90%↓ | 43.90%↓ | 58.40%↓
Doubao 1.5 Vision-Pro | compensation | 99.70%↓ | 85.80%↓ | 61.50%↓ | 32.10%↓ | 42.50%↓ | 38.70%↓ | 23.70%↓ | 21.50%↓ | 58.80%↓ | 23.70%↓ | 22.50%↓ | 46.41%↓
GPT-4o | visual | 100.00% | 89.40% | 80.90% | 55.10% | 77.80% | 59.20% | 61.00% | 41.70% | 77.10% | 52.40% | 44.90% | 67.23% |
GPT-4o | fusion | 100.00% | 90.50%↑ | 80.10%↓ | 57.49%↑ | 70.70%↓ | 55.00%↓ | 56.60%↓ | 39.10%↓ | 74.80%↓ | 54.50%↑ | 35.20%↓ | 64.91%↓
GPT-4o | compensation | 100.00% | 87.60%↓ | 70.60%↓ | 45.50%↓ | 52.90%↓ | 47.40%↓ | 44.90%↓ | 31.90%↓ | 67.20%↓ | 38.10%↓ | 22.30%↓ | 55.31%↓
Gemini 2.5 Flash | visual | 99.50% | 88.60% | 81.40% | 53.90% | 67.90% | 56.90% | 56.50% | 44.20% | 82.10% | 58.10% | 45.80% | 66.81% |
Gemini 2.5 Flash | fusion | 99.60%↑ | 95.20%↑ | 84.00%↑ | 58.40%↑ | 73.00%↑ | 61.40%↑ | 64.90%↑ | 52.60%↑ | 84.30%↑ | 65.80%↑ | 46.70%↑ | 71.45%↑
Gemini 2.5 Flash | compensation | 99.90%↑ | 94.90%↑ | 75.10%↓ | 51.50%↓ | 61.80%↓ | 58.30%↓ | 55.20%↓ | 43.20%↓ | 75.60%↓ | 55.40%↓ | 32.60%↓ | 63.95%↓
Statistical analysis based on the coefficient of variation:
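The coefficient of variation (CV = standard deviation / mean) summarizes how strongly a task spreads the models apart; tasks with a higher CV are more discriminative. A minimal sketch of the computation, assuming (as one plausible grouping) that the CV is taken per task across the models' visual-only accuracies; the MSR values below are copied from the table above:

```python
from statistics import mean, pstdev

def coefficient_of_variation(values):
    """CV = population standard deviation divided by the mean."""
    m = mean(values)
    return pstdev(values) / m if m else float("nan")

# MSR accuracies (visual-only) of the 12 evaluated models, from the table above.
msr_visual = [5.8, 3.9, 19.2, 22.6, 39.1, 36.1, 37.3, 32.3, 42.3, 42.2, 52.4, 58.1]
print(f"CV of MSR (visual-only): {coefficient_of_variation(msr_visual):.2f}")
```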
This section presents fine-grained question-answering statistics under visual degradation and under the compensation setting; each model has one row per setting. Data mark omission, occlusion, label omission, axis omission, and legend omission are structural-level degradations; blurring and rotation are pixel-level. Performance changes exceeding 5% are annotated with the absolute difference.
MLLMs | Experiment | data mark omission | occlusion | label omission | axis omission | legend omission | blurring | rotation | Average |
---|---|---|---|---|---|---|---|---|---|
Qwen2.5-VL-72B | degradation | 42.53% | 53.65% | 46.24% | 46.65% | 50.80% | 44.12% | 46.13% | 47.09%
Qwen2.5-VL-72B | compensation | 43.66% | 56.73% | 58.95%↑12.71% | 41.45%↓5.20% | 48.97% | 45.39% | 45.29% | 47.73%
GPT-4o | degradation | 49.37% | 59.21% | 53.11% | 54.62% | 60.07% | 55.23% | 48.40% | 54.33%
GPT-4o | compensation | 52.87% | 68.10%↑8.89% | 71.72%↑18.61% | 55.51% | 65.23%↑5.16% | 59.59% | 52.51% | 60.13%↑5.80%
Gemini 2.5 Flash | degradation | 44.92% | 59.11% | 43.77% | 52.61% | 57.62% | 58.48% | 55.41% | 53.07%
Gemini 2.5 Flash | compensation | 65.50%↑20.58% | 81.44%↑22.34% | 83.62%↑39.85% | 56.43% | 64.71%↑7.09% | 68.68%↑10.20% | 64.90%↑9.48% | 67.71%↑14.65%
Average | degradation | 45.61% | 57.32% | 47.71% | 51.29% | 56.17% | 52.61% | 49.98% |
Average | compensation | 54.01%↑8.40% | 68.76%↑11.44% | 71.43%↑23.72% | 51.13% | 59.64% | 57.88%↑5.28% | 54.23% |
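The annotations above follow a simple rule: a compensation-row value is flagged, with the absolute difference, only when it changes by more than 5 percentage points relative to the corresponding degradation-row value. A small check that reproduces two annotated cells from the table:

```python
def annotate(before, after, threshold=5.0):
    """Format `after`, appending an arrow and the absolute change when it exceeds the threshold (percentage points)."""
    delta = after - before
    if abs(delta) <= threshold:
        return f"{after:.2f}%"
    arrow = "↑" if delta > 0 else "↓"
    return f"{after:.2f}%{arrow}{abs(delta):.2f}%"

print(annotate(43.77, 83.62))  # Gemini 2.5 Flash, label omission -> "83.62%↑39.85%"
print(annotate(46.65, 41.45))  # Qwen2.5-VL-72B, axis omission    -> "41.45%↓5.20%"
```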
Degradation degree and question-answering accuracy statistics:
We employ a Test-Time Scaling-based framework to deconstruct reasoning traces at a fine granularity, analyzing both which modalities each reasoning step activates and which chart components it relies on as evidence.
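The full deconstruction procedure is not reproduced here. As a rough approximation of the idea, each reasoning step can be tagged with the modality it activates (visual, structured data, or textual context) and the chart components it cites as evidence; the keyword lists and the sentence-level step splitting below are hypothetical stand-ins for the actual framework:

```python
import re

# Hypothetical cue lists; the actual framework performs a richer, model-based deconstruction.
MODALITY_CUES = {
    "visual": ["the chart shows", "the bar", "the line", "color", "visually"],
    "structured": ["the table", "row", "column", "the data shows", "json"],
    "textual": ["the question asks", "as stated", "given that"],
}
COMPONENT_CUES = {
    "axis": ["x-axis", "y-axis", "axis"],
    "legend": ["legend"],
    "data mark": ["bar", "point", "slice", "segment"],
    "label": ["label", "tick"],
    "title": ["title"],
}

def deconstruct(trace: str):
    """Split a reasoning trace into steps and tag the modalities and chart components each step relies on."""
    steps = [s.strip() for s in re.split(r"(?<=[.!?])\s+", trace) if s.strip()]
    tagged = []
    for i, step in enumerate(steps):
        low = step.lower()
        tagged.append({
            "step": i,
            "text": step,
            "modalities": [m for m, cues in MODALITY_CUES.items() if any(cue in low for cue in cues)],
            "components": [c for c, cues in COMPONENT_CUES.items() if any(cue in low for cue in cues)],
        })
    return tagged
```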
Positive Reasoning Instances:
Negative Reasoning Instances:
Modality Activation Frequency and Temporal Patterns:
Chart Component Dependency Patterns:
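Given per-step tags such as those produced by the sketch above, both statistics reduce to counting: how often each modality is activated, at what normalized position in the trace those activations occur, and how often each chart component is cited as evidence. A minimal aggregation sketch (same hypothetical data structures as above):

```python
from collections import Counter, defaultdict

def aggregate(tagged_traces):
    """Compute modality activation frequency, mean temporal position, and chart-component dependency counts."""
    modality_freq = Counter()
    modality_pos = defaultdict(list)   # normalized positions of each modality's activations
    component_freq = Counter()

    for trace in tagged_traces:        # one tagged trace = list of step dicts from deconstruct()
        n = max(len(trace), 1)
        for step in trace:
            pos = step["step"] / n     # 0.0 = start of the trace, close to 1.0 = end
            for m in step["modalities"]:
                modality_freq[m] += 1
                modality_pos[m].append(pos)
            for c in step["components"]:
                component_freq[c] += 1

    mean_pos = {m: sum(p) / len(p) for m, p in modality_pos.items()}
    return modality_freq, mean_pos, component_freq
```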
If you find this work useful for your research, please cite our paper:
@article{ChartMind2025li,
title={ChartMind: Benchmark and Reasoning Insights of Multimodal Chart Question Answering},
  author={Tong Li and Guodao Sun and Shunkai Wang and Zuoyu Tang and Yang Shu and Xueqian Zheng and Haixia Wang and Ronghua Liang},
year={2025}
}
For questions about this work, please contact: Tong Li (litong@zjut.edu.cn, https://tongli97.github.io/)