VisualWebInstruct: Scaling up Multimodal Instruction Data
through Web Search
Yiming Jia* (k,t), Jiachen Li (n), Xiang Yue (m), Bo Li (z), Ping Nie (x), Kai Zou (y), Wenhu Chen (k)
(k) University of Waterloo, (t) University of Toronto, (n) UC Santa Barbara,
(m) CMU, (z) NUS, (x) Independent, (y) Netmind.ai
{yiming.jia@mail.utoronto.ca, wenhuchen@uwaterloo.ca }
https://tiger-ai-lab.github.io/VisualWebInstruct
Abstract
Vision-Language Models have made significant progress on
many perception-focused tasks. However, their progress on
reasoning-focused tasks seems to be limited by the lack
of high-quality and diverse training data. In this work,
we aim to address the scarcity of reasoning-focused
multimodal datasets. We propose VisualWebInstruct, a
novel approach that leverages a search engine to create a
diverse, high-quality dataset spanning multiple disciplines
such as math, physics, finance, and chemistry. Starting
with 30,000 meticulously selected seed images, we employ
Google Image Search to identify websites containing similar
images. We collect and process the HTMLs from over
700K unique URL sources. Through a pipeline of content
extraction, filtering and synthesis, we build a dataset of
approximately 900K question-answer pairs, with 40% be-
ing visual QA pairs and the rest as text QA pairs. Mod-
els fine-tuned on VisualWebInstruct demonstrate significant
performance gains: (1) training from Llava-OV-mid yields
absolute gains of 10-20 percentage points across benchmarks, and
(2) training from MAmmoTH-VL yields a 5% absolute gain. Our
best model MAmmoTH-VL2 shows state-of-the-art perfor-
mance within the 10B parameter class on MMMU-Pro-
std (40.7%), MathVerse (42.6%), and DynaMath (55.7%).
These remarkable results highlight the effectiveness of our
dataset in enhancing VLMs’ reasoning capabilities for com-
plex multimodal tasks.
1. Introduction
Vision-Language Models (VLMs), such as Llava [33] and
Gemini [49], are designed to process multimodal inputs,
including images, videos, and text. While VLMs have
recently demonstrated significant progress in straightfor-
ward perceptual tasks such as VQA [3], DocVQA [41],
*Published during an internship at University of Waterloo
[Figure 1 graphic, titled "Scaling the Visual Reasoning Data Through Google Lens". Pipeline labels: Seed images, Google Lens, Web pages, Extract, VisualWebInstruct. Bar chart comparing Llava-CoT, Qwen2-VL, and MAmmoTH-VL2 on MMMU, MMMU-Pro, MathVista, MathVerse, and Dyna-Math.]
Figure 1. Overview of our automated data curation approach and
major experimental results.
and VizWiz [20], they often struggle with more complex
tasks such as MMMU [62], MathVista [38], and MEGA-
Bench [8], which require multi-step, deliberate reason-
ing [15, 54]. One major bottleneck for existing VLMs is
the scarcity of reasoning-focused training datasets. Current
multimodal reasoning datasets exhibit several limitations:
(1) Many datasets, such as FigureQA [25], MapQA [6],
GeoQA [7], and ChartQA [40], focus narrowly on spe-
cific types of scientific images. (2) Some datasets rely on
synthetic images generated through predefined rules, such
as CLEVR [24] and Geo170K [16], which often result in
poor generalization to real-world visual reasoning tasks.
(3) Other training datasets, such as AI2D [26] and Sci-
enceQA [45], are relatively small and simplistic, primarily
covering elementary-level visual knowledge. Due to these
limitations, VLMs fail to acquire diverse reasoning skills,
leading to slower progress on reasoning-intensive bench-
marks compared to language models.
arXiv:2503.10582v1 [cs.CV] 13 Mar 2025
Dataset | Size | Source & Domains | Coverage
ScienceQA [45] | 21K | Elementary and high school science | Science Q&A, diagrams, K-12 Exam
IconQA [37] | 107K | Abstract diagrams and visual reasoning | Visual reasoning, diagrams
Geo170K [16] | 170K | Synthesized from LLMs | Geometry
CLEVR [24] | 700K | Synthesized from rules | Shapes
FigureQA [25] | 1.3M | Synthesized from rules | Bar, Line, Pie
ChartQA [40] | 23K | Charts from Statista, Pew, etc. | Charts
Math360V [47] | 260K | FigureQA [25], CLEVR [24], IconQA [37], etc. | Math reasoning, diagrams
Mulberry [60] | 260K | Geo3K [36], IconQA [37], ChartQA [40], ScienceQA [45], etc. | Geo, Figure, Medical, K-12 Exam
Llava-CoT [57] | 100K | ChartQA [40], AI2D [26], GeoQA [7], CLEVR [24], etc. | Geo, General VQA, K-12 Exam
VisualWebInstruct | 906K | Internet (Homework Website, Forums, etc.) | All Above + College Exams
Table 1. Comparison between our dataset and the existing datasets. VisualWebInstruct is the most diverse dataset with very broad
coverage of disciplines and image types.
Given the difficulty of human annotation, we draw in-
spiration from WebInstruct [64] to mine naturally exist-
ing reasoning-focused instruction data from the internet.
While WebInstruct retrieves reasoning-focused text data
from Common Crawl (footnote 1), their approach is infeasible for the
multimodal domain due to two key challenges: (1) the lack
of a comparable large-scale multimodal dataset, and (2) the
unreliability of current multimodal retrieval models. To ad-
dress these challenges, we leverage commercial web image
search tools, such as Google Image Search, which offer high
coverage and accuracy.
We begin by collecting approximately 30,000 seed im-
ages across multiple disciplines, including Accounting,
Chemistry, Mathematics, and Physics. These images serve
as queries for Google Image Search [67] to identify web-
sites containing similar images. We then download the
HTMLs from these websites and extract their accessibility
trees, which are processed by an LLM to extract QA pairs
(if any) for an initial dataset. However, we found that over
half of the extracted questions lack annotated answers due
to three primary reasons: (1) these websites do not provide
answers, (2) some require membership to access, and (3)
some necessitate user interaction to reveal the answers. To
address this, we use GPT-4o [22] to synthesize multiple can-
didate solutions for each question, filtering for consistency
among responses. Finally, we align the selected answers
with the content of the original webpage to remove potentially
inaccurate ones. Through this process, we
construct VisualWebInstruct, a dataset containing approximately
900K QA pairs, of which 40% are visual QA pairs
associated with 163,743 unique images and the remaining
60% are text-only QA pairs. Most of them are exam-like
problems requiring deliberate reasoning.
Footnote 1: https://commoncrawl.org/
Table 1 compares VisualWebInstruct with other
datasets in terms of source and coverage. Our dataset
comprises highly diverse, human-created scientific images
spanning multiple disciplines and levels of complexity. Its
broad coverage and increased difficulty make it particularly
well-suited for improving VLM performance on real-world
tasks requiring multi-step reasoning. To evaluate the
effectiveness of VisualWebInstruct, we perform supervised
fine-tuning on MAmmoTH-VL [19] and Llava-OV-mid [28].
Comprehensive evaluations across seven
visual reasoning benchmarks, including MMMU [62],
MathVista [38], and Dyna-Math [70], demonstrate substantial
performance gains. When fine-tuning Llava-OV-mid,
we observe an absolute improvement of 10–20
percentage points across these benchmarks. When
fine-tuning MAmmoTH-VL, our model MAmmoTH-VL2
achieves state-of-the-art performance (within the 10B parameter
range) on several benchmarks, including MMMU-Pro-std
(40.7%), MMVet (64.5%), MathVerse (42.6%), and
Dyna-Math (55.7%). MAmmoTH-VL2's average performance
across seven benchmarks surpasses strong competitors
such as InternVL2.5 [9] and Phi-4-Mini [1], underscoring
the effectiveness of VisualWebInstruct in enhancing
VLMs’ reasoning capabilities.
Our contributions can be summarized as follows:
• We propose a scalable pipeline for acquiring high-quality
multimodal reasoning data from the internet, ensuring
both scalability and quality.
• We introduce VisualWebInstruct, a diverse and
comprehensive multimodal instruction dataset, which we
will publicly release to the research community.
• We develop MAmmoTH-VL2, a 7B-parameter vision-language
model fine-tuned on VisualWebInstruct,
achieving state-of-the-art performance among models of
comparable size and excelling in complex reasoning tasks
requiring multi-step deliberation with visual context.
In the following sections, we first describe how we
mine the data from the Internet (Section 2) and then
how we refine it (Section 3). Finally, we present our
experimental results (Section 4).
[Figure 2 diagram; recoverable labels: Seed Image, Webpage, Accessibility Tree, Initial Dataset, Post-Process, Consistency Filtering, Verified Answer, VisualWebInstruct; Stage 1, Stage 2.]
Figure 2. Comprehensive Pipeline for VisualWebInstruct Dataset Generation. The workflow illustrates our multi-stage approach for
creating high-quality multimodal instruction data. Stage 1: starting with seed images, we leverage Google Image Search to identify relevant
webpages, which are processed into accessibility trees. The raw QA pairs are extracted from the trees and refined through a post-processing
step to ensure the validity of the data. Stage 2: we first generate multiple synthesized answers for consistency filtering, then align these with
original web-sourced content to enhance the accuracy of the answers.
2. Stage 1: Mining Data from the Internet
Our data mining pipeline follows a systematic approach
to extract image-rich QA pairs from the internet. We be-
gin with approximately 30K scientific images as seed data
spanning multiple disciplines. We employ Google Im-
age Search to identify visually similar content, gathering
758,490 unique URLs. After filtering out irrelevant do-
mains, we construct accessibility trees for the relevant web-
sites to extract meaningful content, preserving both textual
and visual information while eliminating non-essential ele-
ments. We then leverage the Gemini 1.5 Flash model in a
two-stage process: first to automatically extract QA pairs
from the accessibility trees and then to filter these pairs
based on comprehensive quality criteria, including question
validity and image relevance, ensuring the educational value
and integrity of the final dataset.
2.1. Seed Data Collection
Due to the limited availability of image-rich QA datasets
and the predominant focus on mathematics in existing
datasets, creating a comprehensive QA dataset that incorporates
diverse subjects and abundant visual content is essential.
Our seed dataset consists of approximately 30,000 images,
which were crawled from Stemez (footnote 2) in compliance with
copyright regulations. These images span multiple disciplines,
including mathematics, physics, accounting, chemistry,
engineering, and biology, ensuring both subject diversity
and visual richness.
Footnote 2: https://stemez.com/subjects/science/
2.2. Google Image Searching
Using the seed images, we conducted Google Image
searches to find visually similar content across the web.
Leveraging Google Lens (Figure 3), we collected approximately
60 URLs per image, resulting in a total of 1,747,634
URLs containing visually similar content. Many web-
sites with non-permissive licenses implement anti-crawling
mechanisms, and we ensured compliance by avoiding data
collection from such sources. We applied rigorous dedu-
plication and filtering, removing URLs from domains un-
likely to contain educational content (e.g., video platforms
and image repositories). This refinement yielded 758,490
unique, high-quality URLs for further processing. By using
images as primary search keys, we ensured strong visual
and contextual connections between the collected data and
our seed dataset, effectively preserving the original distri-
bution while significantly expanding its coverage.
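The deduplication and domain-filtering step described above can be illustrated with a minimal sketch. The blocklist entries and function names below are hypothetical; the paper does not list the excluded domains or describe its URL normalization, so exact string matching is assumed here.

```python
from urllib.parse import urlsplit

# Hypothetical blocklist: the paper only says video platforms and
# image repositories were removed, without naming specific domains.
BLOCKED_DOMAINS = {"youtube.com", "pinterest.com"}


def host_of(url: str) -> str:
    """Return the lowercase host of a URL without a leading 'www.'."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def dedupe_and_filter(urls):
    """Drop exact duplicate URLs and URLs hosted on a blocked domain."""
    seen, kept = set(), []
    for url in urls:
        key = url.strip()
        host = host_of(key)
        blocked = any(host == d or host.endswith("." + d) for d in BLOCKED_DOMAINS)
        if not host or blocked or key in seen:
            continue
        seen.add(key)
        kept.append(key)
    return kept
```

Order is preserved so that the first occurrence of each URL is the one retained.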
Figure 3. Example of Google Lens search functionality for circle
geometry problems.
2.3. Accessibility Tree Building
After filtering out irrelevant domains, we processed the
HTML content of each remaining URL to construct acces-
sibility trees that capture essential textual and visual infor-
mation. As illustrated in Figure 4, our implementation fo-
cuses on extracting meaningful text content and image ele-
ments while filtering out non-essential components such as
navigation menus, advertisements, and auxiliary elements.
We developed a tree-based structure where each node rep-
resents either textual content or an image, preserving the hi-
erarchical relationships present in the original HTML while
removing unnecessary markup and styling information. The
resulting accessibility trees provide a clean, hierarchical
representation of each webpage’s content, making subse-
quent QA pair extraction more efficient and reliable.
[Figure 4 diagram; recoverable labels: Root, Text, Image, Text; node contents: "Question: In the following figure...", "Option: A: 75, B: 60, C: 30, D: 45", "Answer: Chord AB subtends..."]
Figure 4. Example of an accessibility tree structure extracted from
an educational website.
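As a rough illustration of the idea in Section 2.3, the sketch below builds a simplified, one-level tree (a root with text and image children) using Python's standard `html.parser`. It is not the authors' implementation: the set of skipped tags, the node schema, and the flattening of the hierarchy are all assumptions made for brevity.

```python
from html.parser import HTMLParser

# Assumed set of non-essential containers; the paper does not enumerate them.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}


class _NodeCollector(HTMLParser):
    """Collect text and image nodes in document order, skipping SKIP_TAGS subtrees."""

    def __init__(self):
        super().__init__()
        self.children = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "img" and self._skip_depth == 0:
            src = dict(attrs).get("src")
            if src:
                self.children.append({"role": "image", "src": src})

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if text and self._skip_depth == 0:
            self.children.append({"role": "text", "content": text})


def build_tree(html: str) -> dict:
    """Return a root node whose children are the page's text and image nodes."""
    collector = _NodeCollector()
    collector.feed(html)
    collector.close()
    return {"role": "root", "children": collector.children}
```

A fuller version would keep nested containers as intermediate nodes so that the original hierarchy is preserved, as the section describes.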
2.4. QA Pairs Extraction
After constructing accessibility trees, we prompt the Gem-
ini 1.5 Flash model to identify and extract high-quality QA
pairs from webpage content. We designed a structured
prompt instructing the model to extract complete question
text, identify relevant question-related images, and extract
comprehensive solution details while preserving mathemat-
ical notations and step-by-step explanations. This approach
maintains the educational integrity of the extracted con-
tent by preserving its original formatting, mathematical ex-
pressions, and logical structure, ensuring technical accuracy
throughout the extraction process. Through this method, we
extracted a total of 421,320 raw QA pairs from the web-
pages, with approximately 60% containing images.
We then implemented a post-processing stage using the
Gemini 1.5 Flash model to ensure dataset quality by evaluating both textual content and images. Our evaluation
framework assessed two key criteria: question validity and
meaningfulness, as well as the relevance and clarity of
question-related images. By prompting Gemini to verify
whether images are properly referenced, clear, visible, and
contribute to understanding the question, we established
strict validation criteria for retaining QA pairs.
This post-processing step significantly improved dataset
quality by removing incomplete, unclear, or irrelevant con-
tent while preserving educational integrity and effective-
ness. Our analysis shows that out of 421,320 processed
pairs, 361,015 (85.7%) were valid, while 60,305 were fil-
tered out as invalid. Similarly, out of 449,859 total images
processed, 331,818 (73.76%) were deemed valid and rele-
vant to their corresponding questions.
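The retention figures quoted above follow directly from the reported counts. The short check below recomputes them; the variable names are illustrative only.

```python
# Counts as reported in the post-processing paragraph.
raw_pairs, valid_pairs = 421_320, 361_015
total_images, valid_images = 449_859, 331_818

filtered_pairs = raw_pairs - valid_pairs          # pairs rejected as invalid
pair_rate = 100 * valid_pairs / raw_pairs         # percent of QA pairs kept
image_rate = 100 * valid_images / total_images    # percent of images kept

print(filtered_pairs, round(pair_rate, 1), round(image_rate, 2))
```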
3. Stage 2: Dataset Refinement
After Stage 1, we obtain a large amount of raw data from
the Internet. However, this data contains a notable level of
noise. For instance, more than half of the questions lack cor-
responding answers due to various issues, such as (1) mem-
bership requirements, (2) interaction requirements, and (3)
the absence of an answer. Thus, a second round of refine-
ment is necessary to further improve the dataset quality.
3.1. Answer Refinement
We implemented a comprehensive refinement process to en-
sure consistency and quality in our dataset. This step was
critical in addressing potential variations or inconsistencies
in the extracted answers, thereby creating a high-fidelity
dataset for model training.
Our refinement methodology leveraged GPT-4o’s capa-
bilities in a two-stage process. First, for each question and
its associated images, we prompted GPT-4o [22] (footnote 3) to generate
four different answer variations. This approach allowed
us to obtain multiple perspectives on each question.
Next, we employed GPT-4o as an LLM judge to determine
whether the synthesized responses aligned with each other.
As illustrated in Figure 5, we evaluated whether the con-
clusions were mutually consistent across these responses.
This evaluation was particularly important for questions in
domains such as mathematics and physics, where precision
and correctness are paramount. Only when more than half
of the synthesized responses demonstrated consistency did
we retain the question along with the consistent responses.
This rigorous consistency check served as an additional
quality filter, ensuring that our dataset contained highly ac-
curate and unambiguous answers that could be reliably used
for model training.
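The majority rule described above can be sketched as follows. This is a simplification under stated assumptions: the paper uses GPT-4o as the judge of mutual consistency, whereas the sketch substitutes a normalized string comparison of each candidate's final conclusion, and the `conclusion` field is a hypothetical schema.

```python
from collections import Counter


def consistency_filter(candidates, normalize=lambda s: s.strip().lower()):
    """Keep the candidates sharing the most common conclusion, but only if
    they form a strict majority (more than half) of all candidates.

    `normalize` stands in for the LLM judge: two candidates are treated as
    consistent when their normalized conclusions are identical.
    """
    if not candidates:
        return []
    keys = [normalize(c["conclusion"]) for c in candidates]
    top_key, count = Counter(keys).most_common(1)[0]
    if count * 2 <= len(candidates):  # no strict majority: drop the question
        return []
    return [c for c, k in zip(candidates, keys) if k == top_key]
```

With four candidates per question, a question is kept only when at least three of them agree; a two-versus-two split is discarded.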
Through this refinement process, we successfully created
a dataset in which all responses were systematically
generated by GPT-4o, ensuring a consistent style and level
of quality throughout the collection. The resulting dataset
comprises 1.04 million QA pairs spanning multiple disciplines,
representing one of the largest collections of
consistency-verified multimodal instruction data available.
Footnote 3: We compared GPT-4o and Gemini-1.5 and found that GPT-4o's outputs
were significantly more reliable. Therefore, we adopted GPT-4o.
3.2. Answer Alignment
The final step in our quality assurance process involved an-
swer alignment to further enhance accuracy. While the pre-
vious refinement step generated consistent answers using
GPT-4o, we recognized the importance of validating these
against authoritative content from the original web sources.
[Figure 5 diagram; recoverable labels: Synthesized Answers, LLM Judge ("Is the majority of the answers consistent?"), Yes, No, Output, Nothing Here.]
Figure 5. Illustration of our consistency checking methodology.
In this step, we used Gemini-2.0-Flash to measure the
alignment between GPT-generated responses and the orig-
inal extracted answers, if available. In cases where the
comparison indicated inconsistency, we preserved the orig-
inal web-sourced answer. Conversely, when the Gemini
model determined strong alignment between the generated
and web-sourced answers, we retained the GPT-generated
version. Through this alignment process, we combined the
consistency of model-generated content with the authority
of original educational materials in a balanced manner.
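The selection rule in this step can be written as a small function. The sketch below is illustrative: `is_aligned` is a hypothetical callable standing in for the Gemini-2.0-Flash judgment, and keeping the generated answer when no web answer exists is an assumption, since the text only covers the case where an original answer is available.

```python
def align_answer(generated, web_answer, is_aligned):
    """Choose between a generated answer and the web-sourced one.

    generated:  answer text produced in the refinement step
    web_answer: answer extracted from the webpage, or None if unavailable
    is_aligned: callable(generated, web_answer) -> bool (stand-in for the judge)
    """
    if web_answer is None:
        # Assumption: with nothing to compare against, keep the generated answer.
        return generated
    if is_aligned(generated, web_answer):
        return generated   # aligned: keep the generated version
    return web_answer      # inconsistent: fall back to the original web answer
```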
3.3. Dataset Statistics
The statistics presented in Table 2 illustrate the distribution
of knowledge domains in our dataset, VisualWebInstruct.
While the major categories are shown in
the table, the "Others" category (6.60%) comprises General
Knowledge (2.45%), Computer Science (2.25%), Biology
(1.40%), and humanities subjects, including
Language/Literature (0.25%), Social Sciences (0.20%), and
Arts (0.05%). This distribution reflects the dataset’s strong
quantitative orientation while ensuring sufficient breadth.
Table 3 summarizes the statistics after each step of the
VisualWebInstruct pipeline, showing the data progression
through two main stages. Our approach effectively
scaled the initial 30,000 seed images into a comprehensive
multimodal instruction dataset containing approximately 900K
QA pairs. The final dataset includes 347,313 image-associated
QA pairs (approximately 38% of the total) supported by
163,743 unique images.
We also conducted thorough decontamination checking
to ensure our training dataset does not contain any data
from the evaluation benchmarks, thereby maintaining the
integrity of our experimental results.
Category | Percentage | Num of QA Pairs
Math | 62.50% | 566K
Physics | 14.50% | 132K
Finance | 7.25% | 66K
Chemistry | 4.80% | 43K
Engineering | 4.35% | 39K
Others | 6.60% | 60K
Table 2. Distribution of Categories in VisualWebInstruct
4. Experiments
We describe the training and evaluation details of our
experiments in this section.
4.1. Experimental Setup
For our experiments, we directly fine-tuned an existing
MAmmoTH-VL checkpoint on our VisualWebInstruct
dataset. We refer to our resulting model as
MAmmoTH-VL2. The architecture consists of a language
tower based on Qwen2.5-7B-Instruct [59], a vision tower
using SigLip [65], and a projector module connecting these
components, following Llava-OneVision [28, 33].
To enhance data diversity, we employed a data mixing
strategy that combined our VisualWebInstruct dataset
with modified LLaVA-CoT data [57] (with CoT prompting
tags removed) in a 9:1 ratio, resulting in approximately
900K samples from VisualWebInstruct and
100K samples from the modified LLaVA-CoT dataset. This
mixing strategy empirically improved our model’s performance
across diverse visual reasoning tasks.
We employed a supervised fine-tuning (SFT) approach
with a batch size of 256. The learning rate was set to
$1 \times 10^{-5}$ for the language model and projector components,
while the vision encoder was fine-tuned with a lower
rate of $2 \times 10^{-6}$ to preserve its pre-trained visual recognition
capabilities. The model was trained for a single epoch,
which proved sufficient given the high quality and diversity
of our dataset. Input images were processed at a resolution
of $384 \times 384$ with appropriate adjustments for varied
aspect ratios. We limited input sequences to a maximum
of 8,192 tokens to accommodate detailed reasoning chains
while maintaining computational efficiency.
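For reference, the hyperparameters above can be collected into a single configuration. The sketch below is only an illustration: the dictionary layout, the module prefixes (`language_model`, `projector`, `vision_tower`), and the helper function are hypothetical and do not reflect the actual training code.

```python
# Values taken from the setup described above; key names are assumptions.
TRAIN_CONFIG = {
    "batch_size": 256,
    "epochs": 1,
    "max_seq_len": 8192,
    "image_resolution": (384, 384),
    "lr": {
        "language_model": 1e-5,
        "projector": 1e-5,
        "vision_tower": 2e-6,  # lower rate to preserve pre-trained features
    },
}


def learning_rate_for(param_name: str, config=TRAIN_CONFIG) -> float:
    """Look up the learning rate from a parameter name's top-level module prefix."""
    prefix = param_name.split(".", 1)[0]
    return config["lr"][prefix]
```

In a typical training loop, such a lookup would be used to assign parameters to optimizer groups with different learning rates.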
This fine-tuning approach enabled MAmmoTH-VL2 to
leverage the strong multimodal reasoning foundation of
MAmmoTH-VL while enhancing its performance on our
targeted visual reasoning tasks that require multi-step de-
liberation with visual context.
Processing Stage | Total QA Pairs | Image-Associated QA | Unique Questions | Total Images | Unique Images
Stage 1: Mining Data from the Internet
Seed Data Collection | - | - | - | 30,000 | 30,000
QA Pairs Extraction | 421,320 | 248,643 | 421,320 | 552,269 | 362,728
Post-Processing | 361,015 | 159,059 | 361,015 | 331,818 | 212,530
Stage 2: Dataset Refinement
Answer Refinement | 1,041,598 | 407,218 | 257,201 | 577,455 | 167,493
Answer Alignment | 906,160 | 347,313 | 257,201 | 475,099 | 163,743
Table 3. Statistics of different milestones in the data processing pipeline of VisualWebInstruct.
Model | Size | MMMU (val) | MMMU-Pro (standard) | MMMU-Pro (vision) | MathVista (testmini) | MMVet (test) | MathVerse (testmini) | Dyna-Math (test) | Avg
Closed-source Models
GPT-4o | - | 69.1 | 54.0 | 49.7 | 63.8 | 76.2 | 50.2 | 63.7 | 61.0
Gemini-1.5-Pro | - | 59.1 | 49.4 | 65.8 | 63.9 | 64.0 | 41.2 | 64.8 | 58.3
Claude-3.5-Sonnet | - | 68.3 | 55.0 | 48.0 | 67.7 | 75.4 | 44.2 | 60.5 | 59.9
Open-source General Vision-Language Models
Molmo | 8B | 45.3 | 28.3 | 18.9 | 51.6 | 58.0 | 18.9 | 41.6 | 37.5
Llava-OV | 7B | 48.8 | 29.5 | 18.7 | 63.2 | 58.6 | 26.2 | 40.3 | 40.8
Llama-3.2-Inst | 11B | 50.7 | 33.0 | 23.7 | 51.5 | 59.3 | 31.6 | 40.5 | 41.5
Qwen2-VL | 7B | 52.1 | 37.0 | 26.9 | 58.2 | 62.0 | 28.2 | 42.1 | 43.8
MAmmoTH-VL | 7B | 50.8 | 33.2 | 25.3 | 66.0 | 62.3 | 34.2 | 44.7 | 45.2
InternVL2.5 | 7B | 55.8 | 38.2 | 30.4 | 64.4 | 62.8 | 39.5 | 49.8 | 48.7
Phi-4-mini | 5.6B | 55.1 | 39.7 | 31.2 | 62.4 | 60.5 | 37.6 | 51.4 | 48.6
DeepSeek-VL2 | 27B | 51.1 | 31.4 | 24.3 | 62.8 | - | - | - | -
Llava-CoT-L | 11B | 50.1 | 31.6 | 20.4 | 54.8 | 60.3 | 30.2 | 44.8 | 41.7
Llava-CoT-M | 7B | 51.4 | 33.0 | 23.7 | 63.8 | 58.6 | 39.4 | 48.3 | 45.5
LlamaV-o1 | 11B | 49.1 | 31.5 | 22.4 | 54.4 | 63.6 | - | - | -
Mulberry | 7B | 55.0 | 36.8 | 23.6 | 63.1 | 60.9 | 31.0 | 45.1 | 45.0
Insight-V | 8B | 50.2 | 30.7 | 20.5 | 59.9 | 60.8 | 28.7 | 47.8 | 42.6
MM-Eureka | 8B | 49.2 | - | - | 67.1 | 60.7 | 40.4 | - | -
MAmmoTH-VL2 | 7B | 54.7 | 40.7 | 26.3 | 68.1 | 64.5 | 42.6 | 55.7 | 50.4
Δ over SoTA | | -1.1 | +1.0 | -4.9 | +2.1 | +0.9 | +3.1 | +4.3 | +1.7
Table 4. Evaluation Results of our model and other baseline models. Most of the baseline results are taken from other papers. The best
and second-best results across all open-source models are highlighted in bold and underlined, respectively.
4.2. Evaluation Setup
To assess the capabilities of MAmmoTH-VL2, we conducted
a comprehensive evaluation across multiple multimodal
benchmarks that specifically test visual reasoning
and knowledge application. Our evaluation framework fo-
cuses on benchmarks that require complex reasoning with
visual context. We evaluate our model on seven key bench-
marks that collectively provide a comprehensive assessment
of multimodal reasoning capabilities:
• MMMU [62]: Tests multimodal understanding across
university-level domains, requiring integration of visual
and textual information.
• MMMU-Pro [63]: Advanced versions of MMMU with
more challenging problems and more distractor options
that require sophisticated visual reasoning.
• MathVista [38]: Evaluates mathematical reasoning with
visual inputs, testing the model’s ability to process visual
information for solving complex math problems.
• MMVet [61]: Assesses general multimodal understanding
across diverse tasks and contexts.
• MathVerse [68]: Focuses on mathematical reasoning
with visual components and relies less on text hints,
requiring complex visual reasoning.
• Dyna-Math [70]: Tests dynamic mathematical reasoning
capabilities with visual context.
For all evaluations, we used greedy decoding in a zero-
shot setting to ensure fair comparison with existing mod-
els. We categorize the comparison models into three
groups: closed-source models (GPT-4o, Gemini-1.5-Pro,
Claude-3.5-Sonnet), open-source vision-language models
(Qwen2-VL [53], LLaVA-OV [28], Molmo [13], etc.),
and reasoning-enhanced vision-language models
(LLaVA-CoT [57], Mulberry [60], etc.). We include Llava-CoT-L,
which is trained from Llama-3.2 [18], and Llava-CoT-M,
which is trained from MAmmoTH-VL [19].
To ensure standardized and reproducible evaluations,
we employed LMMsEval [66], a comprehensive evaluation
framework for multimodal language models. Our approach
allows for direct comparison with models of comparable size,
providing insights into the value of the VisualWebInstruct
dataset. Performance is reported using accuracy
scores for each benchmark, with an average score across all
benchmarks to indicate overall model capability.
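As a concrete instance of the averaging described above, the snippet below recomputes the overall score for MAmmoTH-VL2 from its seven per-benchmark accuracies in Table 4. The function name and dictionary keys are illustrative.

```python
def average_score(scores: dict) -> float:
    """Unweighted mean over benchmarks, rounded to one decimal place."""
    return round(sum(scores.values()) / len(scores), 1)


# Per-benchmark accuracies for MAmmoTH-VL2 as listed in Table 4.
mammoth_vl2 = {
    "MMMU": 54.7,
    "MMMU-Pro standard": 40.7,
    "MMMU-Pro vision": 26.3,
    "MathVista": 68.1,
    "MMVet": 64.5,
    "MathVerse": 42.6,
    "Dyna-Math": 55.7,
}

print(average_score(mammoth_vl2))  # 50.4
```

The average is unweighted, so each benchmark contributes equally regardless of its number of test items.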
4.3. Experimental Results
Here we evaluate our results from different perspectives.
Quantitative Results. Table 4 presents the performance
of MAmmoTH-VL2 compared to various multimodal models
across seven benchmarks. Our analysis reveals several
important findings regarding the effectiveness of models
fine-tuned on VisualWebInstruct.
Overall Performance. MAmmoTH-VL2 achieves an average
accuracy of 50.4% across all benchmarks, outperforming
other open-source vision-language models of comparable
size (7B-11B parameters). This represents a significant
improvement over standard vision-language models
like Qwen2-VL (43.8%), LLaVA-OV (40.8%), and Molmo
(37.5%). It even outperforms very recent models such as
InternVL2.5 [9] and Phi-4-mini-Multimodal [1].
Mathematical Reasoning Capabilities. MAmmoTH-VL2
demonstrates particularly strong performance on mathemat-
ical reasoning tasks. On MathVista, our model achieves
68.1% accuracy, surpassing all the open-source and closed-
source models. The model’s performance on MathVerse
(42.6%) and Dyna-Math (55.7%) further confirms its en-
hanced capability for visual reasoning.
Complex Reasoning Tasks. On MMMU-Pro-std with 10
options, MAmmoTH-VL2 achieves 40.7% accuracy, showing
a significant improvement over other 7B models such as
LLaVA-OV (29.5%) and Qwen2-VL (37.0%). This demonstrates
that our approach effectively enhances the model’s
ability to perform complex reasoning across diverse domains
beyond mathematics.
Gap with Larger and Closed-Source Models. While
MAmmoTH-VL2 outperforms open-source models of com-
parable size, there remains a gap with closed-source models
such as GPT-4o, Gemini-1.5-Pro, and Claude-3.5-Sonnet.
This indicates potential for further improvements through
scaling or enhanced training methodologies.
Comparison with Reasoning-Enhanced Models. Among
the reasoning-enhanced vision-language models like Llava-CoT, Mulberry [60], LlamaV-o1 [51] and Insight-V [14],
MAmmoTH-VL2 demonstrates competitive performance,
achieving results comparable to or better than specialized
models like LLaVA-CoT and Mulberry. For instance,
on MMMU-Pro Vision, our model achieves 26.3% accuracy,
outperforming Llava-CoT-M’s 23.7%. Notably, other
reasoning-enhanced models often utilize complex methodologies
in either the training or inference stage to enhance
their chain-of-thought abilities, which makes development
and deployment more complicated. In contrast,
MAmmoTH-VL2 achieves strong reasoning capabilities
through our straightforward fine-tuning approach
on VisualWebInstruct, offering a simpler yet effective
solution compared to the other approaches.
These results confirm that fine-tuning on VisualWebInstruct
significantly enhances the model’s reasoning
capabilities. The consistent performance improvements
across diverse benchmarks from both math-related
and non-math-related domains demonstrate the effectiveness of
our approach in developing more capable multimodal reasoning
models. We believe our dataset can be utilized to
augment future vision-language models.
4.4. Ablation Study
The ablation study in Table 5 demonstrates the impact
of different training datasets and their combinations on
model performance across multiple visual reasoning bench-
marks. Two base models were evaluated: Llava-OV-mid
and MAmmoTH-VL.
For Llava-OV-mid, the baseline starts at a 26.3% average
score across benchmarks. Training with Llava-CoT data
improves this to 33.6%, while training on VisualWebInstruct
yields an even better 38.1%, with MMVet performance
notably jumping from 32.1% to 57.6%. The combined
training approach (VisualWebInstruct + Llava-CoT)
achieves the best overall performance at 39.7%.
The stronger MAmmoTH-VL model begins with an average
score of 45.4%. Training with VisualWebInstruct
improves the average to 49.0%, showing gains
across multiple benchmarks, particularly in MMMU-Pro vision
and Dyna-Math tests. As with Llava-OV-mid, the combined
training approach works best, reaching a 50.4% average
score, with notable improvements in MMMU (54.7%),
MMMU-Pro standard (40.7%), and Dyna-Math (55.7%).
The key findings indicate strong data complementarity
between VisualWebInstruct and Llava-CoT, with their
combination delivering the best average score for both base models. We
also observe that weaker base models show larger relative
improvements from training. Overall, the ablation study
confirms that our VisualWebInstruct dataset boosts average
performance for both base models, with gains on most
individual benchmarks, demonstrating its effectiveness in
enhancing visual reasoning capabilities regardless of the base model.
Training Data | MMMU (val) | MMMU-Pro (standard) | MMMU-Pro (vision) | MathVista (testmini) | MMVet (test) | MathVerse (testmini) | Dyna-Math (test) | Avg
Training from Llava-OV-mid
- | 40.1 | 21.2 | 12.2 | 36.0 | 32.1 | 18.1 | 24.4 | 26.3
Llava-CoT | 40.8 | 25.8 | 14.6 | 45.7 | 47.5 | 27.2 | 33.9 | 33.6
Ours | 45.3 | 31.5 | 20.9 | 43.9 | 57.6 | 27.4 | 40.3 | 38.1
Ours+Llava-CoT | 47.6 | 31.6 | 20.9 | 48.8 | 51.7 | 34.9 | 42.3 | 39.7
Training from MAmmoTH-VL
- | 50.8 | 34.8 | 25.3 | 66.0 | 62.3 | 34.2 | 44.7 | 45.4
Llava-CoT | 51.4 | 35.2 | 24.6 | 63.8 | 58.7 | 39.4 | 48.3 | 45.9
Ours | 52.6 | 38.6 | 29.0 | 65.9 | 61.8 | 39.4 | 55.7 | 49.0
Ours+Llava-CoT | 54.7 | 40.7 | 26.3 | 68.1 | 64.5 | 42.6 | 55.7 | 50.4
Table 5. Ablation Results of our experiments. We show experimental results from different backbones to show the impact of consistency
filtering and data mixing with Llava-CoT. For each base model, the best performance is highlighted in bold.
5. Related Works
5.1. Multimodal Instruction Data
Creating high-quality multimodal datasets remains a sig-
nificant challenge in advancing MLLMs. Current ap-
proaches face critical limitations, particularly in balanc-
ing quality and scale. Human-annotated datasets provide
high-precision, contextually appropriate data [13, 42, 48,
58] but suffer from prohibitive costs and scalability con-
straints. Meanwhile, methods leveraging existing academic
datasets [34, 52] offer more cost-effective alternatives but
lack the diversity and reasoning complexity needed for ad-
vanced multimodal reasoning tasks. This limitation is par-
ticularly evident in the scarcity of large-scale, reasoning-
focused multimodal datasets that can be efficiently pro-
duced. Our work addresses these challenges by proposing
a novel, scalable methodology for constructing multimodal
instruction datasets that maintain both the quality and rea-
soning complexity.
5.2. Multimodal Large Language Models
Multimodal Large Language Models (MLLMs) have ad-
vanced AI by integrating text and visual processing capa-
bilities. While proprietary models such as GPT-4o [22] and
Gemini [49, 50] achieve state-of-the-art performance, they
remain inaccessible to the broader research community.
To address this gap, connector-based approaches [12, 30]
have emerged, linking visual encoders to language models
through lightweight projection modules.
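As a rough illustration of what such a projection module does, the sketch below maps per-patch vision features into the embedding width of a language model using two linear layers with a GELU in between. It is a toy, dependency-free version under stated assumptions: the class name `MLPProjector`, the tiny dimensions, and the random initialization are all hypothetical, and it is not the implementation of any cited model.

```python
import math
import random


def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(vec, weight, bias):
    """Compute y = W x + b, with W stored as a list of output rows."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weight, bias)]


def init_linear(in_dim, out_dim, rng):
    """Small random weights and zero biases (illustrative initialization)."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
              for _ in range(out_dim)]
    return weight, [0.0] * out_dim


class MLPProjector:
    """Hypothetical connector: Linear -> GELU -> Linear, applied per patch."""

    def __init__(self, vision_dim, llm_dim, seed=0):
        rng = random.Random(seed)
        self.w1, self.b1 = init_linear(vision_dim, llm_dim, rng)
        self.w2, self.b2 = init_linear(llm_dim, llm_dim, rng)

    def __call__(self, patch_features):
        tokens = []
        for patch in patch_features:
            hidden = [gelu(h) for h in linear(patch, self.w1, self.b1)]
            tokens.append(linear(hidden, self.w2, self.b2))
        return tokens


if __name__ == "__main__":
    projector = MLPProjector(vision_dim=8, llm_dim=16)
    patches = [[0.1 * i for i in range(8)] for _ in range(4)]
    tokens = projector(patches)
    print(len(tokens), len(tokens[0]))  # one LLM-width token per patch
```

The projector only changes the feature width: the number of visual tokens passed to the language model equals the number of input patches.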
Recent open-source MLLMs, such as LLAMA [18],
LLaVA [27, 32], MiniGPT-4 [69], and Deepseek-VL [35],
have contributed to advancements in vision-language un-
derstanding. Additionally, Qwen-VL [53] and InternVL [9]
have demonstrated strong performance through efficient de-
sign and diverse pre-training.
Meanwhile, various approaches have been developed
to enhance MLLM reasoning capabilities, including neural
symbolic methods [2, 10], optimized visual encoding
strategies [23, 31], plan-based prompting [39, 43], struc-
tured reasoning frameworks [57], and sequential instruc-
tion tuning [21]. Despite these advancements, these models
face a critical challenge: the scarcity of publicly available
large-scale visual reasoning datasets necessary for enhanc-
ing model reasoning capabilities [4]. Our work addresses
this supervised fine-tuning data bottleneck while building
on the connector-training paradigm, aiming to bridge the
gap between proprietary and open-source multimodal mod-
els to foster more accessible vision-language systems.
5.3. Chain-of-Thought in Large Language Models
Chain-of-Thought (CoT) prompting [55] has revolution-
ized how large language models tackle complex reason-
ing challenges. This technique enables LLMs to navi-
gate difficult problems—including commonsense scenar-
ios [17, 46] and logical puzzles [29, 56]—by following ex-
plicit reasoning pathways. At its core, CoT methodically
decomposes complex questions into manageable sequen-
tial steps, creating a structured framework that guides mod-
els toward systematic solutions [11]. Evidence consistently
demonstrates significant improvements in reasoning perfor-
mance through this approach. Notable advancements in-
clude Prism [44], which implements a distinctive dual-stage
architecture that separates initial perception from subse-
quent reasoning operations, and MSG [5], which pioneered
the forced Chain-of-Thought methodology—establishing a
foundational paradigm shift in structured prompting ap-
proaches that continues to shape current research.
6. Conclusion
In this paper, we explore the possibility of constructing
large-scale multimodal reasoning datasets without relying
on human annotation. Ours is the first work to use
Google Image Search to mine a high-quality visual reasoning
dataset. Our approach is highly effective, achieving
state-of-the-art performance on 5 of the 7 evaluated
benchmarks. In the future, we plan to run multiple
rounds of search to further expand the dataset.
References
[1] Abdelrahman Abouelenin, Atabak Ashfaq, Adam Atkinson,
Hany Awadalla, Nguyen Bach, Jianmin Bao, Alon Benhaim,
Martin Cai, Vishrav Chaudhary, Congcong Chen, Dong
Chen, Dongdong Chen, Junkun Chen, Weizhu Chen, Yen-
Chun Chen, Yi ling Chen, Qi Dai, Xiyang Dai, Ruchao
Fan, Mei Gao, Min Gao, Amit Garg, Abhishek Goswami,
Junheng Hao, Amr Hendy, Yuxuan Hu, Xin Jin, Mah-
moud Khademi, Dongwoo Kim, Young Jin Kim, Gina Lee,
Jinyu Li, Yunsheng Li, Chen Liang, Xihui Lin, Zeqi Lin,
Mengchen Liu, Yang Liu, Gilsinia Lopez, Chong Luo,
Piyush Madan, Vadim Mazalov, Ali Mousavi, Anh Nguyen,
Jing Pan, Daniel Perez-Becker, Jacob Platin, Thomas Portet,
Kai Qiu, Bo Ren, Liliang Ren, Sambuddha Roy, Ning Shang,
Yelong Shen, Saksham Singhal, Subhojit Som, Xia Song,
Tetyana Sych, Praneetha Vaddamanu, Shuohang Wang, Yim-
ing Wang, Zhenghao Wang, Haibin Wu, Haoran Xu, Wei-
jian Xu, Yifan Yang, Ziyi Yang, Donghan Yu, Ishmam Zabir,
Jianwen Zhang, Li Lyna Zhang, Yunan Zhang, and Xiren
Zhou. Phi-4-mini technical report: Compact yet powerful
multimodal language models via mixture-of-loras, 2025. 2,
7
[2] Saeed Amizadeh, Hamid Palangi, Oleksandr Polozov,
Yichen Huang, and Kazuhito Koishida. Neuro-symbolic visual
reasoning: Disentangling "visual" from "reasoning",
2020. 8
[3] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret
Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh.
Vqa: Visual question answering. In Proceedings of the IEEE
international conference on computer vision , pages 2425–
2433, 2015. 1
[4] Tianyi Bai, Hao Liang, Binwang Wan, Yanran Xu, Xi Li,
Shiyu Li, Ling Yang, Bozhou Li, Yifan Wang, Bin Cui, Ping
Huang, Jiulong Shan, Conghui He, Binhang Yuan, and Wen-
tao Zhang. A survey of multimodal large language model
from a data-centric perspective, 2024. 8
[5] Franz Louis Cesista. Multimodal structured generation:
Cvpr’s 2nd mmfm challenge technical report, 2025. 8
[6] Shuaichen Chang, David Palzer, Jialin Li, Eric Fosler-
Lussier, and Ningchuan Xiao. Mapqa: A dataset for
question answering on choropleth maps. arXiv preprint
arXiv:2211.08545 , 2022. 1
[7] Jiaqi Chen, Jianheng Tang, Jinghui Qin, Xiaodan Liang,
Lingbo Liu, Eric P. Xing, and Liang Lin. Geoqa: A geo-
metric question answering benchmark towards multimodal
numerical reasoning, 2022. 1, 2
[8] Jiacheng Chen, Tianhao Liang, Sherman Siu, Zhengqing
Wang, Kai Wang, Yubo Wang, Yuansheng Ni, Wang Zhu,
Ziyan Jiang, Bohan Lyu, et al. Mega-bench: Scaling multimodal
evaluation to over 500 real-world tasks. arXiv preprint
arXiv:2410.10563 , 2024. 1
[9] Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhang-
wei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian,
Zhaoyang Liu, et al. Expanding performance boundaries of
open-source multimodal models with model, data, and test-
time scaling. arXiv preprint arXiv:2412.05271 , 2024. 2, 7,
8
[10] Minkyu Choi, Harsh Goel, Mohammad Omama, Yunhao
Yang, Sahil Shah, and Sandeep Chinchali. Towards Neuro-
Symbolic Video Understanding , page 220–236. Springer Na-
ture Switzerland, 2024. 8
[11] Zheng Chu, Jingchang Chen, Qianglong Chen, Weijiang Yu,
Tao He, Haotian Wang, Weihua Peng, Ming Liu, Bing Qin,
and Ting Liu. Navigate through enigmatic labyrinth a sur-
vey of chain of thought reasoning: Advances, frontiers and
future, 2024. 8
[12] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat
Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale
Fung, and Steven Hoi. Instructblip: Towards general-
purpose vision-language models with instruction tuning,
2023. 8
[13] Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tri-
pathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi,
Niklas Muennighoff, Kyle Lo, Luca Soldaini, Jiasen Lu,
Taira Anderson, Erin Bransom, Kiana Ehsani, Huong Ngo,
YenSung Chen, Ajay Patel, Mark Yatskar, Chris Callison-
Burch, Andrew Head, Rose Hendrix, Favyen Bastani, Eli
VanderBilt, Nathan Lambert, Yvonne Chou, Arnavi Chheda,
Jenna Sparks, Sam Skjonsberg, Michael Schmitz, Aaron
Sarnat, Byron Bischoff, Pete Walsh, Chris Newell, Piper
Wolters, Tanmay Gupta, Kuo-Hao Zeng, Jon Borchardt,
Dirk Groeneveld, Crystal Nam, Sophie Lebrecht, Caitlin
Wittlif, Carissa Schoenick, Oscar Michel, Ranjay Krishna,
Luca Weihs, Noah A. Smith, Hannaneh Hajishirzi, Ross
Girshick, Ali Farhadi, and Aniruddha Kembhavi. Molmo
and pixmo: Open weights and open data for state-of-the-art
vision-language models, 2024. 7, 8
[14] Yuhao Dong, Zuyan Liu, Hai-Long Sun, Jingkang Yang,
Winston Hu, Yongming Rao, and Ziwei Liu. Insight-v: Ex-
ploring long-chain visual reasoning with multimodal large
language models. arXiv preprint arXiv:2411.14432 , 2024. 7
[15] Guhao Feng, Bohang Zhang, Yuntian Gu, Haotian Ye, Di
He, and Liwei Wang. Towards revealing the mystery behind
chain of thought: a theoretical perspective. Advances in Neu-
ral Information Processing Systems , 36:70757–70798, 2023.
1
[16] Jiahui Gao, Renjie Pi, Jipeng Zhang, Jiacheng Ye, Wanjun
Zhong, Yufei Wang, Lanqing Hong, Jianhua Han, Hang Xu,
Zhenguo Li, and Lingpeng Kong. G-llava: Solving geomet-
ric problem with multi-modal large language model, 2023.
1, 2
[17] Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan
Roth, and Jonathan Berant. Did aristotle use a laptop?
a question answering benchmark with implicit reasoning
strategies, 2021. 8
[18] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Ab-
hinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha
Letman, Akhil Mathur, Alan Schelten, Alex Vaughan,
et al. The llama 3 herd of models. arXiv preprint
arXiv:2407.21783 , 2024. 7, 8
[19] Jarvis Guo, Tuney Zheng, Yuelin Bai, Bo Li, Yubo Wang,
King Zhu, Yizhi Li, Graham Neubig, Wenhu Chen, and Xi-
ang Yue. Mammoth-vl: Eliciting multimodal reasoning with
instruction tuning at scale. arXiv preprint arXiv:2412.05237 ,
2024. 2, 7
[20] Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi
Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham.
Vizwiz grand challenge: Answering visual questions from
blind people. In Proceedings of the IEEE conference on
computer vision and pattern recognition , pages 3608–3617,
2018. 1
[21] Hanxu Hu, Simon Yu, Pinzhen Chen, and Edoardo M. Ponti.
Fine-tuning large language models with sequential instruc-
tions, 2024. 8
[22] Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perel-
man, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Weli-
hinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.
arXiv preprint arXiv:2410.21276 , 2024. 2, 4, 8
[23] Peng Jin, Ryuichi Takanobu, Wancai Zhang, Xiaochun Cao,
and Li Yuan. Chat-univi: Unified visual representation em-
powers large language models with image and video under-
standing, 2024. 8
[24] Justin Johnson, Bharath Hariharan, Laurens Van
Der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross
Girshick. Clevr: A diagnostic dataset for compositional
language and elementary visual reasoning. In Proceedings
of the IEEE conference on computer vision and pattern
recognition , pages 2901–2910, 2017. 1, 2
[25] Samira Ebrahimi Kahou, Vincent Michalski, Adam Atkinson,
Ákos Kádár, Adam Trischler, and Yoshua Bengio.
Figureqa: An annotated figure dataset for visual reasoning.
arXiv preprint arXiv:1710.07300 , 2017. 1, 2
[26] Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon
Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is
worth a dozen images, 2016. 1, 2
[27] Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li,
Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Zi-
wei Liu, et al. Llava-onevision: Easy visual task transfer.
arXiv preprint arXiv:2408.03326 , 2024. 8
[28] Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li,
Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Zi-
wei Liu, et al. Llava-onevision: Easy visual task transfer.
arXiv preprint arXiv:2408.03326 , 2024. 2, 5, 7
[29] Hao Li, Jinfa Huang, Peng Jin, Guoli Song, Qi Wu, and Jie
Chen. Weakly-supervised 3d spatial reasoning for text-based
visual question answering. IEEE Transactions on Image Pro-
cessing , 32:3367–3382, 2023. 8
[30] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi.
Blip-2: Bootstrapping language-image pre-training with
frozen image encoders and large language models, 2023. 8
[31] Wentong Li, Yuqian Yuan, Jian Liu, Dongqi Tang, Song
Wang, Jie Qin, Jianke Zhu, and Lei Zhang. Tokenpacker:
Efficient visual projector for multimodal llm, 2024. 8
[32] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee.
Visual instruction tuning, 2023. 8
[33] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee.
Visual instruction tuning. Advances in neural information
processing systems , 36:34892–34916, 2023. 1, 5
[34] Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie
Qian, Zhilin Yang, and Jie Tang. Gpt understands, too, 2023.
8
[35] Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong,
Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Hao
Yang, Yaofeng Sun, Chengqi Deng, Hanwei Xu, Zhenda Xie,
and Chong Ruan. Deepseek-vl: Towards real-world vision-
language understanding, 2024. 8
[36] Pan Lu, Ran Gong, Shibiao Jiang, Liang Qiu, Siyuan Huang,
Xiaodan Liang, and Song-Chun Zhu. Inter-gps: Interpretable
geometry problem solving with formal language and sym-
bolic reasoning. In The 59th Annual Meeting of the Associa-
tion for Computational Linguistics (ACL) , 2021. 2
[37] Pan Lu, Liang Qiu, Jiaqi Chen, Tony Xia, Yizhou Zhao,
Wei Zhang, Zhou Yu, Xiaodan Liang, and Song-Chun Zhu.
Iconqa: A new benchmark for abstract diagram under-
standing and visual language reasoning. arXiv preprint
arXiv:2110.13214 , 2021. 2
[38] Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li,
Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel
Galley, and Jianfeng Gao. Mathvista: Evaluating mathemat-
ical reasoning of foundation models in visual contexts. arXiv
preprint arXiv:2310.02255 , 2023. 1, 2, 6
[39] Bozhi Luan, Hao Feng, Hong Chen, Yonghui Wang, Wen-
gang Zhou, and Houqiang Li. Textcot: Zoom in for enhanced
multimodal text-rich image understanding, 2024. 8
[40] Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty,
and Enamul Hoque. Chartqa: A benchmark for question an-
swering about charts with visual and logical reasoning. arXiv
preprint arXiv:2203.10244 , 2022. 1, 2
[41] Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar.
Docvqa: A dataset for vqa on document images. In Proceed-
ings of the IEEE/CVF winter conference on applications of
computer vision , pages 2200–2209, 2021. 1
[42] Brandon McKinzie, Zhe Gan, Jean-Philippe Fauconnier,
Sam Dodge, Bowen Zhang, Philipp Dufter, Dhruti Shah, Xi-
anzhi Du, Futang Peng, Floris Weers, Anton Belyi, Haotian
Zhang, Karanjeet Singh, Doug Kang, Ankur Jain, Hongyu
Hè, Max Schwarzer, Tom Gunter, Xiang Kong, Aonan
Zhang, Jianyu Wang, Chong Wang, Nan Du, Tao Lei, Sam
Wiseman, Guoli Yin, Mark Lee, Zirui Wang, Ruoming Pang,
Peter Grasch, Alexander Toshev, and Yinfei Yang. Mm1:
Methods, analysis & insights from multimodal llm pre-
training, 2024. 8
[43] Chancharik Mitra, Brandon Huang, Trevor Darrell, and Roei
Herzig. Compositional chain-of-thought prompting for large
multimodal models, 2024. 8
[44] Yuxuan Qiao, Haodong Duan, Xinyu Fang, Junming Yang,
Lin Chen, Songyang Zhang, Jiaqi Wang, Dahua Lin, and Kai
Chen. Prism: A framework for decoupling and assessing the
capabilities of vlms, 2024. 8
[45] Tanik Saikh, Tirthankar Ghosal, Amish Mittal, Asif Ekbal,
and Pushpak Bhattacharyya. Scienceqa: A novel resource
for question answering on scholarly articles. International
Journal on Digital Libraries , 23(3):289–301, 2022. 1, 2
[46] Maarten Sap, Vered Shwartz, Antoine Bosselut, Yejin Choi,
and Dan Roth. Commonsense reasoning for natural language
processing. In Proceedings of the 58th Annual Meeting of
the Association for Computational Linguistics: Tutorial Ab-
stracts , pages 27–33, Online, 2020. Association for Compu-
tational Linguistics. 8
[47] Wenhao Shi, Zhiqiang Hu, Yi Bin, Junhua Liu, Yang Yang,
See-Kiong Ng, Lidong Bing, and Roy Ka-Wei Lee. Math-
llava: Bootstrapping mathematical reasoning for multimodal
large language models, 2024. 2
[48] Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu,
Chunyuan Li, Yikang Shen, Chuang Gan, Liang-Yan Gui,
Yu-Xiong Wang, Yiming Yang, Kurt Keutzer, and Trevor
Darrell. Aligning large multimodal models with factually
augmented rlhf, 2023. 8
[49] Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell,
Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent,
Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking
multimodal understanding across millions of tokens of con-
text. arXiv preprint arXiv:2403.05530 , 2024. 1, 8
[50] Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell,
Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent,
Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking
multimodal understanding across millions of tokens of con-
text. arXiv preprint arXiv:2403.05530 , 2024. 8
[51] Omkar Thawakar, Dinura Dissanayake, Ketan More, Ritesh
Thawkar, Ahmed Heakl, Noor Ahsan, Yuhao Li, Mohammed
Zumri, Jean Lahoud, Rao Muhammad Anwer, et al. Llamav-
o1: Rethinking step-by-step visual reasoning in llms. arXiv
preprint arXiv:2501.06186 , 2025. 7
[52] Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun
Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang,
Shusheng Yang, Adithya Iyer, Xichen Pan, Ziteng Wang,
Rob Fergus, Yann LeCun, and Saining Xie. Cambrian-1:
A fully open, vision-centric exploration of multimodal llms,
2024. 8
[53] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan,
Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin
Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui
Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Jun-
yang Lin. Qwen2-vl: Enhancing vision-language model’s
perception of the world at any resolution, 2024. 7, 8
[54] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten
Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al.
Chain-of-thought prompting elicits reasoning in large lan-
guage models. Advances in neural information processing
systems , 35:24824–24837, 2022. 1
[55] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten
Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny
Zhou. Chain-of-thought prompting elicits reasoning in large
language models, 2023. 8
[56] Siheng Xiong, Yuan Yang, Ali Payani, James C Kerce, and
Faramarz Fekri. Teilp: Time prediction over knowledge
graphs via logical reasoning, 2024. 8
[57] Guowei Xu, Peng Jin, Hao Li, Yibing Song, Lichao Sun, and
Li Yuan. Llava-cot: Let vision language models reason
step-by-step, 2025. 2, 5, 7, 8
[58] Zhiyang Xu, Chao Feng, Rulin Shao, Trevor Ashby, Ying
Shen, Di Jin, Yu Cheng, Qifan Wang, and Lifu Huang.
Vision-flan: Scaling human-labeled tasks in visual instruc-
tion tuning, 2024. 8
[59] An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo
Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang,
Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint
arXiv:2412.15115 , 2024. 5
[60] Huanjin Yao, Jiaxing Huang, Wenhao Wu, Jingyi Zhang,
Yibo Wang, Shunyu Liu, Yingjie Wang, Yuxin Song,
Haocheng Feng, Li Shen, and Dacheng Tao. Mulberry: Em-
powering mllm with o1-like reasoning and reflection via col-
lective monte carlo tree search, 2024. 2, 7
[61] Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang,
Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang.
Mm-vet: Evaluating large multimodal models for integrated
capabilities. In International Conference on Machine Learn-
ing, pages 57730–57754. PMLR, 2024. 6
[62] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi
Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming
Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline
multimodal understanding and reasoning benchmark for ex-
pert agi. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition , pages 9556–
9567, 2024. 1, 2, 6
[63] Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang,
Kai Zhang, Shengbang Tong, Yuxuan Sun, Botao Yu, Ge
Zhang, Huan Sun, et al. Mmmu-pro: A more robust multi-
discipline multimodal understanding benchmark. arXiv
preprint arXiv:2409.02813 , 2024. 6
[64] Xiang Yue, Tianyu Zheng, Ge Zhang, and Wenhu Chen.
Mammoth2: Scaling instructions from the web. Advances
in Neural Information Processing Systems , 37:90629–90660,
2025. 2
[65] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and
Lucas Beyer. Sigmoid loss for language image pre-training.
In Proceedings of the IEEE/CVF international conference on
computer vision , pages 11975–11986, 2023. 5
[66] Kaichen Zhang, Bo Li, Peiyuan Zhang, Fanyi Pu,
Joshua Adrian Cahyono, Kairui Hu, Shuai Liu, Yuanhan
Zhang, Jingkang Yang, Chunyuan Li, and Ziwei Liu. Lmms-
eval: Reality check on the evaluation of large multimodal
models, 2024. 7
[67] Lei Zhang and Yong Rui. Image search—from thousands
to billions in 20 years. ACM Transactions on Multimedia
Computing, Communications, and Applications (TOMM) , 9
(1s):1–20, 2013. 2
[68] Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin,
Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei
Chang, Yu Qiao, et al. Mathverse: Does your multi-modal
llm truly see the diagrams in visual math problems? In Euro-
pean Conference on Computer Vision , pages 169–186, 2024.
6
[69] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mo-
hamed Elhoseiny. Minigpt-4: Enhancing vision-language
understanding with advanced large language models, 2023.
8
[70] Chengke Zou, Xingang Guo, Rui Yang, Junyu Zhang, Bin
Hu, and Huan Zhang. Dynamath: A dynamic visual bench-
mark for evaluating mathematical reasoning robustness of vi-
sion language models. In The Thirteenth International Con-
ference on Learning Representations , 2024. 2, 6
7. Supplementary Material
7.1. Image Number Distribution
Figure: Number of Images Distribution in VisualWebInstruct (y-axis: Percentage of Entries). Total Entries: 906160.

| Images per entry | Entries | Percentage of Entries |
|---|---|---|
| 0 images | 558847 | 61.7% |
| 1 image | 283516 | 31.3% |
| 2 images | 34830 | 3.8% |
| 3 images | 12059 | 1.3% |
| 4 images | 8402 | 0.9% |
| 5+ images | 8506 | 0.9% |
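The percentages in this distribution follow directly from the per-bucket counts. The short sketch below recomputes them and the share of entries with at least one image; the variable and function names are illustrative only.

```python
# Entry counts per bucket, copied from the distribution in Section 7.1.
IMAGE_COUNT_BUCKETS = {
    "0 images": 558847,
    "1 image": 283516,
    "2 images": 34830,
    "3 images": 12059,
    "4 images": 8402,
    "5+ images": 8506,
}


def bucket_percentages(buckets):
    """Percentage of all entries in each bucket, rounded to one decimal."""
    total = sum(buckets.values())
    return {label: round(100.0 * count / total, 1)
            for label, count in buckets.items()}


def entries_with_images(buckets):
    """Number of entries that contain at least one image."""
    return sum(count for label, count in buckets.items() if label != "0 images")


if __name__ == "__main__":
    total = sum(IMAGE_COUNT_BUCKETS.values())
    visual = entries_with_images(IMAGE_COUNT_BUCKETS)
    print(total, bucket_percentages(IMAGE_COUNT_BUCKETS))
    print(visual, round(100.0 * visual / total, 1))
```

The counts sum to the stated total of 906160, and 347313 entries (about 38.3%) contain at least one image, in line with the roughly 40% of visual QA pairs quoted in the abstract.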
7.2. MAmmoTH-VL2 Training Configuration
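| Group | Setting | Value |
|---|---|---|
| Model Architecture | Base Language Model | Qwen/Qwen2.5-7B-Instruct |
| Model Architecture | Vision Encoder | google/siglip-so400m-patch14-384 |
| Model Architecture | Vision-Language Connector | MLP-based projector (2-layer with GELU) |
| Model Architecture | Vision Select Layer | -2 (second-to-last layer) |
| Model Architecture | Patch Merge Type | spatial unpad |
| Model Architecture | Starting Checkpoint | MAmmoTH-VL |
| Data Processing | Image Aspect Ratio | anyres max 4 |
| Data Processing | Image Grid Pinpoints | (1x1),...,(6x6) |
| Data Processing | Group by Modality | Enabled |
| Data Processing | Image Start/End Tokens | Disabled |
| Data Processing | Image Patch Token | Disabled |
| Data Processing | Lazy Preprocessing | Enabled |
| Training Configuration | Training Epochs | 1 |
| Training Configuration | Batch Size | 256 |
| Training Configuration | Maximum Sequence Length | 8,192 tokens |
| Training Configuration | Learning Rate | 1e-5 (language and projector) |
| Training Configuration | Vision Tower Learning Rate | 2e-6 |
| Training Configuration | Weight Decay | 0.0 |
| Training Configuration | Warmup Ratio | 0.03 |
| Training Configuration | LR Scheduler | Cosine |
| Dataset Configuration | Primary Dataset | VisualWebInstruct |
| Dataset Configuration | Additional Dataset | LLaVA-CoT (9:1 ratio) |
| Dataset Configuration | Prompt Template | qwen 25 |
| Tunable Components | Language Model | Enabled |
| Tunable Components | Vision Tower | Enabled |
| Tunable Components | MLP Adapter | Enabled |
| Optimization | Distributed Training | DeepSpeed Zero-3 |
| Optimization | TF32 Precision | Enabled |
| Optimization | Mixed Precision | BF16 |
| Optimization | Gradient Checkpointing | Enabled |
| Optimization | Torch Compile | Enabled (inductor) |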
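To make the learning-rate settings above concrete, the sketch below evaluates a cosine schedule with a 0.03 warmup ratio for the two peak rates (1e-5 and 2e-6). The exact schedule shape is an assumption: a linear warmup from zero followed by cosine decay to zero, which is a common convention but is not confirmed by the table. The step count in the demo is also hypothetical.

```python
import math


def lr_at_step(step, total_steps, peak_lr, warmup_ratio=0.03):
    """Learning rate at a 0-indexed optimizer step (assumed schedule shape)."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return peak_lr * step / warmup_steps
    # Cosine decay from the peak down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Hypothetical step count: one epoch over 906,160 entries at batch size 256,
    # ignoring the additional LLaVA-CoT data mixed in at a 9:1 ratio.
    total_steps = math.ceil(906160 / 256)
    for step in (0, total_steps // 2, total_steps):
        language_lr = lr_at_step(step, total_steps, peak_lr=1e-5)
        vision_lr = lr_at_step(step, total_steps, peak_lr=2e-6)
        print(step, language_lr, vision_lr)
```

Because both parameter groups share the same schedule shape, the vision tower rate stays at one fifth of the language and projector rate throughout training.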
7.3. Prompt for Each Stage
QA Pairs Extraction
"""Analyze this webpage content and extract questions, images, and
complete solution details in Markdown format.
Please format your response as follows:
**Question 1: **
[complete question text]
**Images: **
*[First image URL if available]
*[Second image URL if available]
[continue for each additional image...]
**Solution: **
[Copy the complete solution text from the webpage, including all steps,
explanations, and calculations]
**Images in Solution: **
*[First image URL if available]
*[Second image URL if available]
[continue for each additional image...]
[repeat for each additional question...]
Requirements:
- Keep the complete solution text exactly as shown in the webpage
- Use Markdown formatting throughout the response
- Mark missing content as "Not found"
- For images, include URL only
- For multiple questions, number them sequentially
- Do not summarize or modify the solution text
- Preserve all mathematical notations and formulas
- Keep all step-by-step explanations intact
- Preserve all line breaks and indentation in solution text
- If there is no question in the content, mark it as "Not found"
- If the webpage is empty or missing, return nothing
Webpage content:
{Accessibility Tree }
"""
QA Pairs Validation
"""Please analyze this question-answer pair and its images:
Question: complete question text
Solution: complete solution text
Your tasks:
1. Determine if the question is meaningful and valid.
2. For the question images (if any), determine if each is:
- Properly referenced in the question
- Clear and visible
- Actually helps understand the question
3. For the solution images (if any), determine if each is:
- Helps explain the solution
Notes:
- Image indices start from 0 (e.g., first image is index 0, second is index 1, etc.)
- Images should be marked as valid if they show the actual content being discussed
- Images should be marked as invalid only if they are:
*Completely irrelevant to the question/solution
*Corrupted or unreadable
*Duplicate or redundant
Question Images:
[Images loaded here]
Solution Images (starting a new section, indexes reset to 0):
[Images loaded here]
Please respond in this exact format:
QUESTION VALID: [yes/no]
ANALYSIS: [Brief explanation of why the question is valid/invalid]
QUESTION IMAGES: [comma-separated list of valid image indices starting from 0]
QUESTION IMAGES REASON: [Brief explanation for each image decision]
SOLUTION IMAGES: [comma-separated list of valid image indices starting from 0]
SOLUTION IMAGES REASON: [Brief explanation for each image decision]
CRITICAL RESPONSE FORMAT INSTRUCTIONS:
- You MUST respond using EXACTLY this format with no additional text
- Use ONLY numeric indices for images, starting from 0
- If no images are valid, use an empty string
- Be precise and use actual numbers
- Always use numeric indices (0,1,2...)
- Use empty string for no images (e.g., "SOLUTION IMAGES: ")
- Do not add explanatory text in the indices field
"""
Answer Alignment
"""Given the question and the provided image(s), compare these two answers and
determine if they are aligned.
Question: question
GPT’s Answer: gptanswer
Real Answer: realanswer
Example of Aligned Answers:
Question: What is 2 + 2?
GPT Answer: 4
Real Answer: 4
Example of Misaligned Answers:
Question: What is derivative of x^2?
GPT Answer: 2x + 1
Real Answer: 2x
Are these answers aligned? Respond with just ’Yes’ or ’No’ on the first line. Provide
a brief explanation on the second line."""
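Since the prompt asks for 'Yes' or 'No' on the first line, the verdict can be read with a small parser. This is a minimal sketch with a hypothetical function name; it returns `None` when the first line is neither answer, so that ambiguous responses can be handled separately.

```python
import re

# Optional leading quotes or markup, then a whole-word yes/no.
VERDICT = re.compile(r"^\W*(yes|no)\b", re.IGNORECASE)


def parse_alignment(response):
    """Return True for 'Yes', False for 'No', None if the verdict is unclear."""
    lines = [line.strip() for line in response.splitlines() if line.strip()]
    if not lines:
        return None
    match = VERDICT.match(lines[0])
    if match is None:
        return None
    return match.group(1).lower() == "yes"
```

The word-boundary check keeps a first line such as "Not sure" from being read as "No".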
7.4. Example of QA Pair
Question:
How many angles does a quadrilateral have? A. 1 B. 2 C. 3 D. 4

Web Answer:
Right Answer is: D
SOLUTION
An angle is made up of two rays starting from a common initial point (or a common vertex). A quadrilateral has 4 angles. Consider quadrilateral ABCD. It can be seen here that there are 4 angles here; ∠A, ∠B, ∠C and ∠D.

GPT Answer1:
To determine how many angles a quadrilateral has, let's reason through it step by step:
1. **Definition of a Quadrilateral**: A quadrilateral is a polygon with four sides.
2. **Understanding Angles**: In any polygon, the number of angles is equal to the number of sides because each angle is formed where two sides meet.
3. **Applying the Definition**: Since a quadrilateral has four sides, by definition, it will also have four angles.
4. **Double-check with Examples**: Consider common examples of quadrilaterals like squares, rectangles, and diamonds—all these shapes have four angles.
So, based on the above reasoning, the quadrilateral depicted in the image has four angles.
Answer: D. 4

GPT Answer2:
To determine how many angles a quadrilateral has, let's break it down step by step:
1. **Definition of a Quadrilateral**: A quadrilateral is a polygon with four sides.
2. **Interior Angles in a Quadrilateral**: By definition, any polygon with four sides (a quadrilateral) will have four interior angles.
3. **Examples and Properties**:
- Consider a square, rectangle, or any general quadrilateral. Each has four corners, and at each corner, there is an angle formed between two sides.
- A rectangle, for example, has four right angles. A rhombus has four angles, the sum of which is also 360 degrees.
4. **Summation of Angles**: The sum of the interior angles of any quadrilateral is 360 degrees. This sum indicates there must be four angles, as each angle contributes to this total.
5. **Checking against other Shapes**:
- A triangle, which is a three-sided polygon, has three angles.
- Extending to more sides, a pentagon (five-sided polygon) has five angles.
- This shows the number of angles is directly determined by the number of sides of the polygon.
Based on the above reasoning, a quadrilateral has 4 angles. Hence, the correct answer is:
`Answer: 4`

GPT Answer3:
A quadrilateral is a polygon with four sides. By definition, a polygon has the same number of angles as it does sides. Therefore:
1. A quadrilateral has four sides.
2. Since it has four sides, it will also have four angles.
Thus, a quadrilateral has four angles.
Answer: D. 4

GPT Answer4:
A quadrilateral is a polygon with four sides. By definition, any polygon with \( n \) sides will also have \( n \) angles. Since a quadrilateral has 4 sides, it must also have 4 angles. Let's break this down:
1. Identify the shape: The given shape is a quadrilateral.
2. Determine the number of sides: A quadrilateral has 4 sides.
3. Determine the number of angles: In any polygon, the number of angles is equal to the number of sides. Therefore, a quadrilateral has 4 angles.
Answer: D. 4
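In this example the four generated answers all end in the same final choice. The sketch below shows a simple string heuristic for measuring that kind of agreement: it takes the last token after the final "Answer:" marker in each response and reports the most common one. It is only an illustrative stand-in with hypothetical function names; the alignment check in the prompts above is performed by a model, not by string matching.

```python
import re
from collections import Counter

# Text following an "Answer:" marker, up to a newline or closing backtick.
ANSWER = re.compile(r"Answer:\s*([^\n`]+)")


def final_answer(response):
    """Last token of the final "Answer:" line, e.g. "D. 4" -> "4"."""
    found = ANSWER.findall(response)
    if not found:
        return None
    tokens = found[-1].strip().rstrip(".").split()
    return tokens[-1] if tokens else None


def majority_answer(responses):
    """Most common final answer and the fraction of responses that share it."""
    answers = [a for a in map(final_answer, responses) if a is not None]
    if not answers:
        return None, 0.0
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / len(responses)
```

Applied to the endings of the four answers above ("Answer: D. 4", "`Answer: 4`", "Answer: D. 4", "Answer: D. 4"), the heuristic yields "4" with full agreement.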