arxiv

Paper 2503.10582

VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search

Authors: Yiming Jia, Jiachen Li, Xiang Yue, Bo Li, Ping Nie, Kai Zou, Wenhu Chen

Published: 2025-03-13

Abstract:

Vision-Language Models have made significant progress on many perception-focused tasks; however, their progress on reasoning-focused tasks seems to be limited due to the lack of high-quality and diverse training data. In this work, we aim to address the scarcity of reasoning-focused multimodal datasets. We propose VisualWebInstruct, a novel approach that leverages search engines to create a diverse and high-quality dataset spanning multiple disciplines such as math, physics, finance, and chemistry. Starting with 30,000 meticulously selected seed images, we employ Google Image Search to identify websites containing similar images. We collect and process the HTMLs from over 700K unique URL sources. Through a pipeline of content extraction, filtering, and synthesis, we build a dataset of approximately 900K question-answer pairs, with 40% being visual QA pairs and the rest text QA pairs. Models fine-tuned on VisualWebInstruct demonstrate significant performance gains: (1) training from Llava-OV-mid shows 10-20% absolute point gains across benchmarks, and (2) training from MAmmoTH-VL shows a 5% absolute gain. Our best model, MAmmoTH-VL2, shows state-of-the-art performance within the 10B parameter class on MMMU-Pro-std (40.7%), MathVerse (42.6%), and DynaMath (55.7%). These results highlight the effectiveness of our dataset in enhancing VLMs' reasoning capabilities for complex multimodal tasks.

Paper Content:
VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search

Yiming Jia* (University of Waterloo, University of Toronto), Jiachen Li (UC Santa Barbara), Xiang Yue (CMU), Bo Li (NUS), Ping Nie (Independent), Kai Zou (Netmind.ai), Wenhu Chen (University of Waterloo)
{yiming.jia@mail.utoronto.ca, wenhuchen@uwaterloo.ca}
https://tiger-ai-lab.github.io/VisualWebInstruct
*Published during an internship at University of Waterloo

1. Introduction

Vision-Language Models (VLMs), such as Llava [33] and Gemini [49], are designed to process multimodal inputs, including images, videos, and text. While VLMs have recently demonstrated significant progress in straightforward perceptual tasks such as VQA [3], DocVQA [41], and VizWiz [20], they often struggle with more complex tasks such as MMMU [62], MathVista [38], and MEGA-Bench [8], which require multi-step, deliberate reasoning [15, 54]. One major bottleneck for existing VLMs is the scarcity of reasoning-focused training datasets.

[Figure 1. Overview of our automated data curation approach and major experimental results. Left: seed images are sent to Google Lens, web pages are retrieved, and VisualWebInstruct is extracted from them. Right: bar chart comparing Llava-CoT, Qwen2-VL, and MAmmoTH-VL2 on MMMU, MMMU-Pro, MathVista, MathVerse, and Dyna-Math.]

Current multimodal reasoning datasets exhibit several limitations: (1) Many datasets, such as FigureQA [25], MapQA [6], GeoQA [7], and ChartQA [40], focus narrowly on specific types of scientific images. (2) Some datasets rely on synthetic images generated through predefined rules, such as CLEVR [24] and Geo170K [16], which often results in poor generalization to real-world visual reasoning tasks. (3) Other training datasets, such as AI2D [26] and ScienceQA [45], are relatively small and simplistic, primarily covering elementary-level visual knowledge. Due to these limitations, VLMs fail to acquire diverse reasoning skills, leading to slower progress on reasoning-intensive benchmarks compared to language models.
arXiv:2503.10582v1 [cs.CV] 13 Mar 2025

Dataset | Size | Source & Domains | Coverage
ScienceQA [45] | 21K | Elementary and high school science | Science Q&A, diagrams, K-12 Exam
IconQA [37] | 107K | Abstract diagrams and visual reasoning | Visual reasoning, diagrams
Geo170K [16] | 170K | Synthesized from LLMs | Geometry
CLEVR [24] | 700K | Synthesized from rules | Shapes
FigureQA [25] | 1.3M | Synthesized from rules | Bar, Line, Pie
ChartQA [40] | 23K | Charts from Statista, Pew, etc. | Charts
Math360V [47] | 260K | FigureQA [25], CLEVR [24], IconQA [37], etc. | Math reasoning, diagrams
Mulberry [60] | 260K | Geo3K [36], IconQA [37], ChartQA [40], ScienceQA [45], etc. | Geo, Figure, Medical, K-12 Exam
Llava-CoT [57] | 100K | ChartQA [40], AI2D [26], GeoQA [7], CLEVR [24], etc. | Geo, General VQA, K-12 Exam
VisualWebInstruct | 906K | Internet (Homework Website, Forums, etc.) | All Above + College Exams

Table 1. Comparison between our dataset and the existing datasets. VisualWebInstruct is the most diverse dataset with very broad coverage of disciplines and image types.

Given the difficulty of human annotation, we draw inspiration from WebInstruct [64] to mine naturally existing reasoning-focused instruction data from the internet. While WebInstruct retrieves reasoning-focused text data from Common Crawl (https://commoncrawl.org/), its approach is infeasible for the multimodal domain due to two key challenges: (1) the lack of a comparable large-scale multimodal dataset, and (2) the unreliability of current multimodal retrieval models. To address these challenges, we leverage commercial web image search tools, such as Google Image Search, which offer high coverage and accuracy.

We begin by collecting approximately 30,000 seed images across multiple disciplines, including Accounting, Chemistry, Mathematics, and Physics. These images serve as queries for Google Image Search [67] to identify websites containing similar images. We then download the HTMLs from these websites and extract their accessibility trees, which are processed by an LLM to extract QA pairs (if any) for an initial dataset. However, we found that over half of the extracted questions lack annotated answers for three primary reasons: (1) the websites do not provide answers, (2) some require membership to access them, and (3) some require user interaction to reveal the answers. To address this, we use GPT-4o [22] to synthesize multiple candidate solutions for each question, filtering for consistency among responses. Finally, we align the selected answers with the content from the original webpage to remove potentially inaccurate ones. Through this process, we construct VisualWebInstruct, a dataset containing approximately 900K QA pairs, where 40% are visual QA pairs associated with 163,743 unique images, while the remaining 60% are text-only QA pairs. Most of them are exam-like problems requiring deliberate reasoning.

Table 1 compares VisualWebInstruct with other datasets in terms of source and coverage. Our dataset comprises highly diverse, human-created scientific images spanning multiple disciplines and levels of complexity. Its broad coverage and increased difficulty make it particularly well-suited for improving VLM performance on real-world tasks requiring multi-step reasoning. To evaluate the effectiveness of VisualWebInstruct, we perform supervised fine-tuning on MAmmoTH-VL [19] and Llava-OV-mid [28]. Comprehensive evaluations across seven visual reasoning benchmarks, including MMMU [62], MathVista [38], and Dyna-Math [70], demonstrate substantial performance gains. When fine-tuning Llava-OV-mid, we observe an absolute improvement of 10–20 percentage points across these benchmarks.
When fine-tuning MAmmoTH-VL, our model MAmmoTH-VL2 achieves state-of-the-art performance (within the 10B parameter range) on several benchmarks, including MMMU-Pro-std (40.7%), MMVet (64.5%), MathVerse (42.6%), and Dyna-Math (55.7%). MAmmoTH-VL2's average performance across seven benchmarks surpasses strong competitors such as InternVL2.5 [9] and Phi-4-Mini [1], underscoring the effectiveness of VisualWebInstruct in enhancing VLMs' reasoning capabilities.

Our contributions can be summarized as follows:
• We propose a scalable pipeline for acquiring high-quality multimodal reasoning data from the internet.
• We introduce VisualWebInstruct, a diverse and comprehensive multimodal instruction dataset, which we will publicly release to the research community.
• We develop MAmmoTH-VL2, a 7B-parameter vision-language model fine-tuned on VisualWebInstruct, achieving state-of-the-art performance among models of comparable size and excelling in complex reasoning tasks requiring multi-step deliberation with visual context.

Section 2 describes how we mine the data from the Internet, Section 3 describes how we refine it, and Section 4 presents our experimental results.

[Figure 2. Comprehensive pipeline for VisualWebInstruct dataset generation. The workflow illustrates our multi-stage approach for creating high-quality multimodal instruction data. Stage 1: starting with seed images, we leverage Google Image Search to identify relevant webpages, which are processed into accessibility trees. The raw QA pairs are extracted from the trees and refined through a post-processing step to ensure the validity of the data. Stage 2: we first generate multiple synthesized answers for consistency filtering, then align these with the original web-sourced content to enhance the accuracy of the answers.]

2. Stage 1: Mining Data from the Internet

Our data mining pipeline follows a systematic approach to extract image-rich QA pairs from the internet. We begin with approximately 30K scientific images as seed data spanning multiple disciplines. We employ Google Image Search to identify visually similar content, gathering 758,490 unique URLs. After filtering out irrelevant domains, we construct accessibility trees for the relevant websites to extract meaningful content, preserving both textual and visual information while eliminating non-essential elements. We then leverage the Gemini 1.5 Flash model in a two-stage process: first to automatically extract QA pairs from the accessibility trees, and then to filter these pairs based on quality criteria, including question validity and image relevance, ensuring the educational value and integrity of the final dataset.

2.1. Seed Data Collection

Existing image-rich QA datasets are limited in availability and focus predominantly on mathematics, so a comprehensive QA dataset that incorporates diverse subjects and abundant visual content is essential. Our seed dataset consists of approximately 30,000 images, which were crawled from Stemez (https://stemez.com/subjects/science/) in compliance with copyright regulations. These images span multiple disciplines, including mathematics, physics, accounting, chemistry, engineering, and biology, ensuring both subject diversity and visual richness.

2.2. Google Image Search

Using the seed images, we conducted Google Image searches to find visually similar content across the web. Leveraging Google Lens (Figure 3), we collected approximately 60 URLs per image, resulting in a total of 1,747,634 URLs containing visually similar content.
Many websites with non-permissive licenses implement anti-crawling mechanisms, and we ensured compliance by avoiding data collection from such sources. We applied rigorous deduplication and filtering, removing URLs from domains unlikely to contain educational content (e.g., video platforms and image repositories). This refinement yielded 758,490 unique, high-quality URLs for further processing. By using images as primary search keys, we ensured strong visual and contextual connections between the collected data and our seed dataset, effectively preserving the original distribution while significantly expanding its coverage.

[Figure 3. Example of Google Lens search functionality for circle geometry problems.]

2.3. Accessibility Tree Building

After filtering out irrelevant domains, we processed the HTML content of each remaining URL to construct accessibility trees that capture essential textual and visual information. As illustrated in Figure 4, our implementation focuses on extracting meaningful text content and image elements while filtering out non-essential components such as navigation menus, advertisements, and auxiliary elements. We developed a tree-based structure where each node represents either textual content or an image, preserving the hierarchical relationships present in the original HTML while removing unnecessary markup and styling information. The resulting accessibility trees provide a clean, hierarchical representation of each webpage's content, making subsequent QA pair extraction more efficient and reliable.

[Figure 4. Example of an accessibility tree structure extracted from an educational website. A root node has text and image child nodes; the text nodes read "Question: In the following figure...", "Option: A: 75, B: 60, C: 30, D: 45", and "Answer: Chord AB subtends...".]

2.4. QA Pairs Extraction

After constructing accessibility trees, we prompt the Gemini 1.5 Flash model to identify and extract high-quality QA pairs from webpage content.
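As a rough illustration of the kind of tree described in Section 2.3, the following Python sketch builds a text-and-image tree from HTML using only the standard library. It is not the authors' implementation: the `Node`, `TreeBuilder`, `build_tree`, and `leaves` names, the list of skipped tags, and the assumption that the input HTML is well-formed are all hypothetical choices made for this example.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser

# Assumed set of non-essential containers to drop; the paper does not list them.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}
# Tags that never have a closing tag, so they must not change nesting depth.
VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}


@dataclass
class Node:
    kind: str                      # "root", "element", "text", or "image"
    value: str = ""                # text content, image URL, or tag name
    children: list["Node"] = field(default_factory=list)


class TreeBuilder(HTMLParser):
    """Keeps text and <img> nodes in their original nesting; drops SKIP_TAGS subtrees."""

    def __init__(self) -> None:
        super().__init__()
        self.root = Node("root")
        self.stack = [self.root]   # path from the root to the current element
        self.skip_depth = 0        # > 0 while inside a skipped subtree

    def handle_starttag(self, tag, attrs):
        if self.skip_depth or tag in SKIP_TAGS:
            if tag not in VOID_TAGS:
                self.skip_depth += 1
            return
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.stack[-1].children.append(Node("image", src))
        elif tag not in VOID_TAGS:
            node = Node("element", tag)
            self.stack[-1].children.append(node)
            self.stack.append(node)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1
        elif len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if text and not self.skip_depth:
            self.stack[-1].children.append(Node("text", text))


def build_tree(html: str) -> Node:
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def leaves(node: Node) -> list[tuple[str, str]]:
    """Flatten the tree into (kind, value) pairs in document order."""
    if node.kind in ("text", "image"):
        return [(node.kind, node.value)]
    return [leaf for child in node.children for leaf in leaves(child)]
```

Calling `leaves(build_tree(html))` on a page yields its text and image leaves in document order, with navigation and script content removed; a serialization of this kind is what could then be placed in an extraction prompt. Real-world pages are often not well-formed, so a production version would need a more tolerant parser.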
We designed a structured prompt instructing the model to extract complete question text, identify relevant question-related images, and extract comprehensive solution details while preserving mathematical notations and step-by-step explanations. This approach maintains the educational integrity of the extracted content by preserving its original formatting, mathematical expressions, and logical structure. Through this method, we extracted a total of 421,320 raw QA pairs from the webpages, with approximately 60% containing images.

We then implemented a post-processing stage using the Gemini 1.5 Flash model to ensure dataset quality by evaluating both textual content and images. Our evaluation framework assessed two key criteria: the validity and meaningfulness of each question, and the relevance and clarity of question-related images. By prompting Gemini to verify whether images are properly referenced, clear, visible, and contribute to understanding the question, we established strict validation criteria for retaining QA pairs.

This post-processing step improved dataset quality by removing incomplete, unclear, or irrelevant content. Our analysis shows that out of 421,320 processed pairs, 361,015 (85.7%) were valid, while 60,305 were filtered out as invalid. Similarly, out of 449,859 total images processed, 331,818 (73.76%) were deemed valid and relevant to their corresponding questions.

3. Stage 2: Dataset Refinement

After Stage 1, we obtain a large amount of raw data from the Internet. However, this data contains a notable level of noise. For instance, more than half of the questions lack corresponding answers due to various issues, such as (1) membership requirements, (2) interaction requirements, and (3) the absence of an answer.
Thus, a second round of refinement is necessary to further improve the dataset quality.

3.1. Answer Refinement

We implemented a comprehensive refinement process to ensure consistency and quality in our dataset. This step was critical in addressing potential variations or inconsistencies in the extracted answers, thereby creating a high-fidelity dataset for model training.

Our refinement methodology leveraged GPT-4o's capabilities in a two-stage process. First, for each question and its associated images, we prompted GPT-4o [22] to generate four different answer variations. (We compared GPT-4o and Gemini-1.5 and found that GPT-4o's outputs were significantly more reliable. Therefore, we adopted GPT-4o.) This approach allowed us to obtain multiple perspectives on each question. Next, we employed GPT-4o as an LLM judge to determine whether the synthesized responses aligned with each other. As illustrated in Figure 5, we evaluated whether the conclusions were mutually consistent across these responses. This evaluation was particularly important for questions in domains such as mathematics and physics, where precision and correctness are paramount. Only when more than half of the synthesized responses demonstrated consistency did we retain the question along with the consistent responses. This consistency check served as an additional quality filter, ensuring that our dataset contained accurate and unambiguous answers that could be reliably used for model training.

Through this refinement process, we created a dataset in which all responses were systematically generated by GPT-4o, ensuring a consistent style and level of quality throughout the collection. The resulting dataset comprises 1.04 million QA pairs spanning multiple disciplines, representing one of the largest collections of consistency-verified multimodal instruction data available.

3.2. Answer Alignment

The final step in our quality assurance process involved answer alignment to further enhance accuracy. While the previous refinement step generated consistent answers using GPT-4o, we recognized the importance of validating these against authoritative content from the original web sources.

[Figure 5. Illustration of our consistency checking methodology. An LLM judge checks whether the majority of the synthesized answers are consistent; if yes, the answers are kept, and if no, nothing is output.]

In this step, we used Gemini-2.0-Flash to measure the alignment between GPT-generated responses and the original extracted answers, if available. In cases where the comparison indicated inconsistency, we preserved the original web-sourced answer. Conversely, when the Gemini model determined strong alignment between the generated and web-sourced answers, we retained the GPT-generated version. Through this alignment process, we combined the consistency of model-generated content with the authority of original educational materials in a balanced manner.

3.3. Dataset Statistics

The statistics presented in Table 2 illustrate the distribution of knowledge domains in our dataset, VisualWebInstruct. While the major categories are shown in the table, the "Others" category (6.60%) comprises General Knowledge (2.45%), Computer Science (2.25%), Biology (1.40%), and humanities subjects, including Language/Literature (0.25%), Social Sciences (0.20%), and Arts (0.05%). This distribution reflects the dataset's strong quantitative orientation while ensuring sufficient breadth.

Table 3 summarizes the statistics after each step of the VisualWebInstruct pipeline, showing the data progression through two main stages. Our approach effectively scaled the initial 30,000 seed images into a comprehensive multimodal instruction dataset containing 900K instruction examples.
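The two refinement decisions from Sections 3.1 and 3.2 (majority consistency among the synthesized answers, then a fallback to the web-sourced answer on disagreement) can be sketched in Python as follows. This is only an illustration under stated assumptions: the paper uses LLM judges (GPT-4o and Gemini-2.0-Flash) for both comparisons, whereas this sketch substitutes exact matching on normalized final-answer strings, and the function names `consistent_indices` and `choose_answer` are hypothetical.

```python
from collections import Counter
from typing import Callable, Optional


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace; a crude stand-in for an LLM judge."""
    return " ".join(answer.lower().split())


def consistent_indices(final_answers: list[str]) -> list[int]:
    """Return indices of responses that agree with a strict majority.

    An empty list means the question is dropped: no final answer is shared
    by more than half of the synthesized responses.
    """
    if not final_answers:
        return []
    keys = [normalize(a) for a in final_answers]
    majority, votes = Counter(keys).most_common(1)[0]
    if votes * 2 <= len(keys):     # "more than half" must agree
        return []
    return [i for i, key in enumerate(keys) if key == majority]


def choose_answer(
    generated: str,
    web_answer: Optional[str],
    agrees: Callable[[str, str], bool],
) -> str:
    """Keep the generated answer unless a web answer exists and disagrees."""
    if web_answer is not None and not agrees(generated, web_answer):
        return web_answer
    return generated
```

With four synthesized answers per question, a strict majority means at least three must agree; a 2-2 split is discarded. In `choose_answer`, the `agrees` callback is where an LLM-based alignment judge would be plugged in.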
The final dataset includes 347,313 image-associated QA pairs (approximately 38% of the total) supported by 163,743 unique images. We also conducted thorough decontamination checking to ensure our training dataset does not contain any data from the evaluation benchmarks, thereby maintaining the integrity of our experimental results.

Category | Percentage | Num of QA Pairs
Math | 62.50% | 566K
Physics | 14.50% | 132K
Finance | 7.25% | 66K
Chemistry | 4.80% | 43K
Engineering | 4.35% | 39K
Others | 6.60% | 60K

Table 2. Distribution of categories in VisualWebInstruct.

4. Experiments

This section describes the training and evaluation details of our experiments.

4.1. Experimental Setup

For our experiments, we directly fine-tuned an existing MAmmoTH-VL checkpoint on our VisualWebInstruct dataset. We refer to our resulting model as MAmmoTH-VL2. The architecture consists of a language tower based on Qwen2.5-7B-Instruct [59], a vision tower using SigLip [65], and a projector module connecting these components, following Llava-OneVision [28, 33].

To enhance data diversity, we employed a data mixing strategy that combined our VisualWebInstruct dataset with modified LLaVA-CoT data [57] (with CoT prompting tags removed) in a 9:1 ratio, resulting in approximately 900K samples from VisualWebInstruct and 100K samples from the modified LLaVA-CoT dataset. This mixing strategy empirically improved our model's performance across diverse visual reasoning tasks.

We employed a supervised fine-tuning (SFT) approach with a batch size of 256. The learning rate was set to 1×10^-5 for the language model and projector components, while the vision encoder was fine-tuned with a lower rate of 2×10^-6 to preserve its pre-trained visual recognition capabilities. The model was trained for a single epoch, which proved sufficient given the high quality and diversity of our dataset. Input images were processed at a resolution of 384×384 with appropriate adjustments for varied aspect ratios.
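The hyperparameters above can be collected into a small configuration object, which also makes the implied number of optimizer steps easy to estimate. The sketch below is an assumption-laden illustration, not the authors' training code: the `SFTConfig` class, its field names, and the component keys are hypothetical, and the sample count of 1,000,000 is the approximate 900K + 100K mix stated in the text.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SFTConfig:
    # Values taken from the experimental setup; names are hypothetical.
    batch_size: int = 256
    lr_language_model: float = 1e-5
    lr_projector: float = 1e-5
    lr_vision_encoder: float = 2e-6   # lower rate for the vision tower
    num_epochs: int = 1
    image_resolution: tuple = (384, 384)


def learning_rate_for(component: str, cfg: SFTConfig) -> float:
    """Look up the learning rate for one of the three model components."""
    rates = {
        "language_model": cfg.lr_language_model,
        "projector": cfg.lr_projector,
        "vision_encoder": cfg.lr_vision_encoder,
    }
    return rates[component]


def optimizer_steps(num_samples: int, cfg: SFTConfig) -> int:
    """Steps for the whole run, counting a final partial batch as one step."""
    return math.ceil(num_samples / cfg.batch_size) * cfg.num_epochs


cfg = SFTConfig()
approx_samples = 900_000 + 100_000   # approximate 9:1 mix from the text
total_steps = optimizer_steps(approx_samples, cfg)
```

Under these assumptions, a single epoch over roughly one million samples at batch size 256 corresponds to about 3,900 optimizer steps, and the vision encoder is updated with a learning rate five times smaller than the rest of the model.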
We limited input sequences to a maximum of 8,192 tokens to accommodate detailed reasoning chains while maintaining computational efficiency.

This fine-tuning approach enabled MAmmoTH-VL2 to leverage the strong multimodal reasoning foundation of MAmmoTH-VL while enhancing its performance on our targeted visual reasoning tasks that require multi-step deliberation with visual context.

Processing Stage | Total QA Pairs | Image-Associated QA | Unique Questions | Total Images | Unique Images
Stage 1: Mining Data from the Internet
Seed Data Collection | - | - | - | 30,000 | 30,000
QA Pairs Extraction | 421,320 | 248,643 | 421,320 | 552,269 | 362,728
Post-Processing | 361,015 | 159,059 | 361,015 | 331,818 | 212,530
Stage 2: Dataset Refinement
Answer Refinement | 1,041,598 | 407,218 | 257,201 | 577,455 | 167,493
Answer Alignment | 906,160 | 347,313 | 257,201 | 475,099 | 163,743

Table 3. Statistics of different milestones in the data processing pipeline of VisualWebInstruct.

Model | Size | MMMU (val) | MMMU-Pro (standard) | MMMU-Pro (vision) | MathVista (testmini) | MMVet (test) | MathVerse (testmini) | Dyna-Math (test) | Avg
Closed-source Models
GPT-4o | - | 69.1 | 54.0 | 49.7 | 63.8 | 76.2 | 50.2 | 63.7 | 61.0
Gemini-1.5-Pro | - | 59.1 | 49.4 | 65.8 | 63.9 | 64.0 | 41.2 | 64.8 | 58.3
Claude-3.5-Sonnet | - | 68.3 | 55.0 | 48.0 | 67.7 | 75.4 | 44.2 | 60.5 | 59.9
Open-source General Vision-Language Models
Molmo | 8B | 45.3 | 28.3 | 18.9 | 51.6 | 58.0 | 18.9 | 41.6 | 37.5
Llava-OV | 7B | 48.8 | 29.5 | 18.7 | 63.2 | 58.6 | 26.2 | 40.3 | 40.8
Llama-3.2-Inst | 11B | 50.7 | 33.0 | 23.7 | 51.5 | 59.3 | 31.6 | 40.5 | 41.5
Qwen2-VL | 7B | 52.1 | 37.0 | 26.9 | 58.2 | 62.0 | 28.2 | 42.1 | 43.8
MAmmoTH-VL | 7B | 50.8 | 33.2 | 25.3 | 66.0 | 62.3 | 34.2 | 44.7 | 45.2
InternVL2.5 | 7B | 55.8 | 38.2 | 30.4 | 64.4 | 62.8 | 39.5 | 49.8 | 48.7
Phi-4-mini | 5.6B | 55.1 | 39.7 | 31.2 | 62.4 | 60.5 | 37.6 | 51.4 | 48.6
DeepSeek-VL2 | 27B | 51.1 | 31.4 | 24.3 | 62.8 | - | - | - | -
Llava-CoT-L | 11B | 50.1 | 31.6 | 20.4 | 54.8 | 60.3 | 30.2 | 44.8 | 41.7
Llava-CoT-M | 7B | 51.4 | 33.0 | 23.7 | 63.8 | 58.6 | 39.4 | 48.3 | 45.5
LlamaV-o1 | 11B | 49.1 | 31.5 | 22.4 | 54.4 | 63.6 | - | - | -
Mulberry | 7B | 55.0 | 36.8 | 23.6 | 63.1 | 60.9 | 31.0 | 45.1 | 45.0
Insight-V | 8B | 50.2 | 30.7 | 20.5 | 59.9 | 60.8 | 28.7 | 47.8 | 42.6
MM-Eureka | 8B | 49.2 | - | - | 67.1 | 60.7 | 40.4 | - | -
MAmmoTH-VL2 | 7B | 54.7 | 40.7 | 26.3 | 68.1 | 64.5 | 42.6 | 55.7 | 50.4
Δ over SoTA | | -1.1 | +1.0 | -4.9 | +2.1 | +0.9 | +3.1 | +4.3 | +1.7

Table 4. Evaluation results of our model and other baseline models. Most of the baseline results are taken from other papers. The best and second-best results across all open-source models are highlighted in bold and underlined, respectively.

4.2. Evaluation Setup

To assess the capabilities of MAmmoTH-VL2, we conducted a comprehensive evaluation across multiple multimodal benchmarks that specifically test visual reasoning and knowledge application. Our evaluation framework focuses on benchmarks that require complex reasoning with visual context. We evaluate our model on seven key benchmarks that collectively provide a comprehensive assessment of multimodal reasoning capabilities:
• MMMU [62]: Tests multimodal understanding across university-level domains, requiring integration of visual and textual information.
• MMMU-Pro [63]: Advanced versions of MMMU (standard and vision) with more challenging problems and more distractor options that require sophisticated visual reasoning.
• MathVista [38]: Evaluates mathematical reasoning with visual inputs, testing the model's ability to process visual information for solving complex math problems.
• MMVet [61]: Assesses general multimodal understanding across diverse tasks and contexts.
• MathVerse [68]: Focuses on mathematical reasoning with visual components and relies less on text hints, requiring complex visual reasoning.
• Dyna-Math [70]: Tests dynamic mathematical reasoning capabilities with visual context.

For all evaluations, we used greedy decoding in a zero-shot setting to ensure fair comparison with existing models.
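The Avg column in Table 4 is the unweighted mean of the seven benchmark scores. The short Python check below reproduces it for three rows copied from the table; the dictionary layout and function names are illustrative choices rather than anything from the paper's evaluation code.

```python
# Per-benchmark accuracies copied from Table 4, in column order:
# MMMU, MMMU-Pro (standard), MMMU-Pro (vision), MathVista, MMVet,
# MathVerse, Dyna-Math.
SCORES = {
    "MAmmoTH-VL2": [54.7, 40.7, 26.3, 68.1, 64.5, 42.6, 55.7],
    "InternVL2.5": [55.8, 38.2, 30.4, 64.4, 62.8, 39.5, 49.8],
    "Qwen2-VL": [52.1, 37.0, 26.9, 58.2, 62.0, 28.2, 42.1],
}


def average(scores: list[float]) -> float:
    """Unweighted mean over the benchmarks, rounded to one decimal place."""
    return round(sum(scores) / len(scores), 1)


def average_gap(model: str, baseline: str) -> float:
    """Difference between the two rounded averages, in percentage points."""
    return round(average(SCORES[model]) - average(SCORES[baseline]), 1)


averages = {name: average(values) for name, values in SCORES.items()}
gap_to_internvl = average_gap("MAmmoTH-VL2", "InternVL2.5")
```

For these rows the computed means match the reported averages (50.4, 48.7, and 43.8), and the gap between MAmmoTH-VL2 and InternVL2.5 is the +1.7 shown in the last row of the Avg column.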
We categorize the comparison models into three groups: closed-source models (GPT-4o, Gemini-1.5-Pro, Claude-3.5-Sonnet), open-source vision-language models (Qwen2-VL [53], LLaVA-OV [28], Molmo [13], etc.), and reasoning-enhanced vision-language models (LLaVA-CoT [57], Mulberry [60], etc.). We include Llava-CoT-L, which is trained from Llama-3.2 [18], and Llava-CoT-M, which is trained from MAmmoTH-VL [19].

To ensure standardized and reproducible evaluations, we employed LMMsEval [66], a comprehensive evaluation framework for multimodal language models. Our approach allows for direct comparison with models of comparable size, providing insights into the value of the VisualWebInstruct dataset. Performance is reported using accuracy scores for each benchmark, with an average score across all benchmarks to indicate overall model capability.

4.3. Experimental Results

Here we evaluate our results from different perspectives.

Quantitative Results. Table 4 presents the performance of MAmmoTH-VL2 compared to various multimodal models across seven benchmarks. Our analysis reveals several important findings regarding the effectiveness of models fine-tuned on VisualWebInstruct.

Overall Performance. MAmmoTH-VL2 achieves an average accuracy of 50.4% across all benchmarks, outperforming other open-source vision-language models of comparable size (7B-11B parameters). This represents a significant improvement over standard vision-language models like Qwen2-VL (43.8%), LLaVA-OV (40.8%), and Molmo (37.5%). It even surpasses very recent models such as InternVL2.5 [9] and Phi-4-mini-Multimodal [1].

Mathematical Reasoning Capabilities. MAmmoTH-VL2 demonstrates particularly strong performance on mathematical reasoning tasks. On MathVista, our model achieves 68.1% accuracy, surpassing all the open-source and closed-source models.
The model's performance on MathVerse (42.6%) and Dyna-Math (55.7%) further confirms its enhanced capability for visual reasoning.

Complex Reasoning Tasks. On MMMU-Pro-std with 10 options, MAmmoTH-VL2 achieves 40.7% accuracy, showing a significant improvement over other 7B models such as LLaVA-OV (29.5%) and Qwen2-VL (37.0%). This demonstrates that our approach effectively enhances the model's ability to perform complex reasoning across diverse domains beyond mathematics.

Gap with Larger and Closed-Source Models. While MAmmoTH-VL2 outperforms open-source models of comparable size, there remains a gap with closed-source models such as GPT-4o, Gemini-1.5-Pro, and Claude-3.5-Sonnet. This indicates potential for further improvements through scaling or enhanced training methodologies.

Comparison with Reasoning-Enhanced Models. Among reasoning-enhanced vision-language models such as Llava-CoT, Mulberry [60], LlamaV-o1 [51], and Insight-V [14], MAmmoTH-VL2 demonstrates competitive performance, achieving results comparable to or better than specialized models like LLaVA-CoT and Mulberry. For instance, on MMMU-Pro Vision, our model achieves 26.3% accuracy, outperforming Llava-CoT-M's 23.7%. Notably, other reasoning-enhanced models often rely on complex methodologies in either the training or inference stage to enhance their chain-of-thought abilities, which complicates development and deployment. In contrast, MAmmoTH-VL2 achieves much better reasoning capabilities through our straightforward fine-tuning approach on VisualWebInstruct, offering a simpler yet effective solution.

These results confirm that fine-tuning on VisualWebInstruct significantly enhances the model's reasoning capabilities.
The consistent performance improvements across diverse benchmarks from both math-related and non-math-related domains demonstrate the effectiveness of our approach in developing more capable multimodal reasoning models. We believe our dataset can be used to augment future vision-language models.

4.4. Ablation Study

The ablation study in Table 5 demonstrates the impact of different training datasets and their combinations on model performance across multiple visual reasoning benchmarks. Two base models were evaluated: Llava-OV-mid and MAmmoTH-VL.

For Llava-OV-mid, the baseline starts at a 26.3% average score across benchmarks. Training with Llava-CoT data improves this to 33.6%, while training on VisualWebInstruct yields an even better 38.1%, with MMVet performance notably jumping from 32.1% to 57.6%. The combined training approach (VisualWebInstruct + Llava-CoT) achieves the best overall performance at 39.7%.

The stronger MAmmoTH-VL model begins with an average score of 45.4%. Training with VisualWebInstruct improves the average to 49.0%, showing gains across multiple benchmarks, particularly on the MMMU-Pro vision and Dyna-Math tests. As with Llava-OV-mid, the combined training approach works best, reaching a 50.4% average score, with notable improvements in MMMU (54.7%), MMMU-Pro standard (40.7%), and Dyna-Math (55.7%).

The key findings indicate strong data complementarity between VisualWebInstruct and Llava-CoT, with their combination consistently delivering the best average results. We also observe that weaker base models show larger relative improvements from training. Overall, the ablation study confirms that our VisualWebInstruct dataset significantly boosts model performance across all benchmarks, demonstrating its effectiveness in enhancing visual reasoning capabilities regardless of the base model.
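The observation that weaker base models gain more in relative terms can be checked directly from the average scores reported in the ablation study. The Python sketch below computes absolute gains (percentage points) and relative gains (percent of the baseline average); the data layout and the `gains` helper are illustrative, while the numbers are the reported averages.

```python
# Average scores reported in the ablation study for each base model.
REPORTED_AVG = {
    "Llava-OV-mid": {"baseline": 26.3, "Ours": 38.1, "Ours+Llava-CoT": 39.7},
    "MAmmoTH-VL": {"baseline": 45.4, "Ours": 49.0, "Ours+Llava-CoT": 50.4},
}


def gains(base_model: str, setting: str) -> tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent of baseline)."""
    before = REPORTED_AVG[base_model]["baseline"]
    after = REPORTED_AVG[base_model][setting]
    absolute = round(after - before, 1)
    relative = round(100 * (after - before) / before, 1)
    return absolute, relative


weak_abs, weak_rel = gains("Llava-OV-mid", "Ours")
strong_abs, strong_rel = gains("MAmmoTH-VL", "Ours")
```

Training on VisualWebInstruct alone adds 11.8 points to Llava-OV-mid (about 44.9% relative to its baseline) but 3.6 points to MAmmoTH-VL (about 7.9% relative), which is the pattern the text describes.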
Training Data | MMMU (val) | MMMU-Pro (standard) | MMMU-Pro (vision) | MathVista (testmini) | MMVet (test) | MathVerse (testmini) | Dyna-Math (test) | Avg
Training from Llava-OV-mid
- | 40.1 | 21.2 | 12.2 | 36.0 | 32.1 | 18.1 | 24.4 | 26.3
Llava-CoT | 40.8 | 25.8 | 14.6 | 45.7 | 47.5 | 27.2 | 33.9 | 33.6
Ours | 45.3 | 31.5 | 20.9 | 43.9 | 57.6 | 27.4 | 40.3 | 38.1
Ours+Llava-CoT | 47.6 | 31.6 | 20.9 | 48.8 | 51.7 | 34.9 | 42.3 | 39.7
Training from MAmmoTH-VL
- | 50.8 | 34.8 | 25.3 | 66.0 | 62.3 | 34.2 | 44.7 | 45.4
Llava-CoT | 51.4 | 35.2 | 24.6 | 63.8 | 58.7 | 39.4 | 48.3 | 45.9
Ours | 52.6 | 38.6 | 29.0 | 65.9 | 61.8 | 39.4 | 55.7 | 49.0
Ours+Llava-CoT | 54.7 | 40.7 | 26.3 | 68.1 | 64.5 | 42.6 | 55.7 | 50.4

Table 5. Ablation results of our experiments. We show experimental results from different backbones to show the impact of consistency filtering and data mixing with Llava-CoT. For each base model, the best performance is highlighted in bold.

5. Related Works

5.1. Multimodal Instruction Data

Creating high-quality multimodal datasets remains a significant challenge in advancing MLLMs. Current approaches face critical limitations, particularly in balancing quality and scale. Human-annotated datasets provide high-precision, contextually appropriate data [13, 42, 48, 58] but suffer from prohibitive costs and scalability constraints. Meanwhile, methods leveraging existing academic datasets [34, 52] offer more cost-effective alternatives but lack the diversity and reasoning complexity needed for advanced multimodal reasoning tasks. This limitation is particularly evident in the scarcity of large-scale, reasoning-focused multimodal datasets that can be efficiently produced. Our work addresses these challenges by proposing a novel, scalable methodology for constructing multimodal instruction datasets that maintain both quality and reasoning complexity.

5.2. Multimodal Large Language Models

Multimodal Large Language Models (MLLMs) have advanced AI by integrating text and visual processing capabilities.
While proprietary models such as GPT-4o [22] and Gemini [49, 50] achieve state-of-the-art performance, they remain inaccessible to the broader research community. To address this gap, connector-based approaches [12, 30] have emerged, linking visual encoders to language models through lightweight projection modules. Recent open-source MLLMs, such as Llama [18], LLaVA [27, 32], MiniGPT-4 [69], and Deepseek-VL [35], have contributed to advancements in vision-language understanding. Additionally, Qwen-VL [53] and InternVL [9] have demonstrated strong performance through efficient design and diverse pre-training.

Meanwhile, various approaches have been developed to enhance MLLM reasoning capabilities, including neural symbolic methods [2, 10], optimized visual encoding strategies [23, 31], plan-based prompting [39, 43], structured reasoning frameworks [57], and sequential instruction tuning [21]. Despite these advancements, these models face a critical challenge: the scarcity of publicly available large-scale visual reasoning datasets necessary for enhancing model reasoning capabilities [4]. Our work addresses this supervised fine-tuning data bottleneck while building on the connector-training paradigm, aiming to bridge the gap between proprietary and open-source multimodal models and to foster more accessible vision-language systems.

5.3. Chain-of-Thought in Large Language Models

Chain-of-Thought (CoT) prompting [55] has revolutionized how large language models tackle complex reasoning challenges. This technique enables LLMs to navigate difficult problems, including commonsense scenarios [17, 46] and logical puzzles [29, 56], by following explicit reasoning pathways. At its core, CoT methodically decomposes complex questions into manageable sequential steps, creating a structured framework that guides models toward systematic solutions [11].
Evidence consistently demonstrates significant improvements in reasoning performance through this approach. Notable advancements include Prism [44], which implements a distinctive dual-stage architecture that separates initial perception from subsequent reasoning operations, and MSG [5], which pioneered the forced Chain-of-Thought methodology, establishing a foundational paradigm shift in structured prompting approaches that continues to shape current research.

6. Conclusion

In this paper, we explore the possibility of constructing large-scale multimodal reasoning datasets without relying on human annotation. This is the first paper to use Google Image Search to mine a high-quality visual reasoning dataset. Our approach achieves state-of-the-art performance on 5 out of 7 evaluated benchmarks. In the future, we plan to run multiple rounds of search to further expand the dataset size.

References

[1] Abdelrahman Abouelenin, Atabak Ashfaq, Adam Atkinson, Hany Awadalla, Nguyen Bach, Jianmin Bao, Alon Benhaim, Martin Cai, Vishrav Chaudhary, Congcong Chen, Dong Chen, Dongdong Chen, Junkun Chen, Weizhu Chen, Yen-Chun Chen, Yi ling Chen, Qi Dai, Xiyang Dai, Ruchao Fan, Mei Gao, Min Gao, Amit Garg, Abhishek Goswami, Junheng Hao, Amr Hendy, Yuxuan Hu, Xin Jin, Mahmoud Khademi, Dongwoo Kim, Young Jin Kim, Gina Lee, Jinyu Li, Yunsheng Li, Chen Liang, Xihui Lin, Zeqi Lin, Mengchen Liu, Yang Liu, Gilsinia Lopez, Chong Luo, Piyush Madan, Vadim Mazalov, Ali Mousavi, Anh Nguyen, Jing Pan, Daniel Perez-Becker, Jacob Platin, Thomas Portet, Kai Qiu, Bo Ren, Liliang Ren, Sambuddha Roy, Ning Shang, Yelong Shen, Saksham Singhal, Subhojit Som, Xia Song, Tetyana Sych, Praneetha Vaddamanu, Shuohang Wang, Yiming Wang, Zhenghao Wang, Haibin Wu, Haoran Xu, Weijian Xu, Yifan Yang, Ziyi Yang, Donghan Yu, Ishmam Zabir, Jianwen Zhang, Li Lyna Zhang, Yunan Zhang, and Xiren Zhou.
Phi-4-mini technical report: Compact yet powerful multimodal language models via mixture-of-loras, 2025.
[2] Saeed Amizadeh, Hamid Palangi, Oleksandr Polozov, Yichen Huang, and Kazuhito Koishida. Neuro-symbolic visual reasoning: Disentangling "visual" from "reasoning", 2020.
[3] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In Proceedings of the IEEE international conference on computer vision, pages 2425–2433, 2015.
[4] Tianyi Bai, Hao Liang, Binwang Wan, Yanran Xu, Xi Li, Shiyu Li, Ling Yang, Bozhou Li, Yifan Wang, Bin Cui, Ping Huang, Jiulong Shan, Conghui He, Binhang Yuan, and Wentao Zhang. A survey of multimodal large language model from a data-centric perspective, 2024.
[5] Franz Louis Cesista. Multimodal structured generation: Cvpr's 2nd mmfm challenge technical report, 2025.
[6] Shuaichen Chang, David Palzer, Jialin Li, Eric Fosler-Lussier, and Ningchuan Xiao. Mapqa: A dataset for question answering on choropleth maps. arXiv preprint arXiv:2211.08545, 2022.
[7] Jiaqi Chen, Jianheng Tang, Jinghui Qin, Xiaodan Liang, Lingbo Liu, Eric P. Xing, and Liang Lin. Geoqa: A geometric question answering benchmark towards multimodal numerical reasoning, 2022.
[8] Jiacheng Chen, Tianhao Liang, Sherman Siu, Zhengqing Wang, Kai Wang, Yubo Wang, Yuansheng Ni, Wang Zhu, Ziyan Jiang, Bohan Lyu, et al. Mega-bench: Scaling multi-modal evaluation to over 500 real-world tasks. arXiv preprint arXiv:2410.10563, 2024.
[9] Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271, 2024.
[10] Minkyu Choi, Harsh Goel, Mohammad Omama, Yunhao Yang, Sahil Shah, and Sandeep Chinchali.
Towards Neuro-Symbolic Video Understanding, page 220–236. Springer Nature Switzerland, 2024.
[11] Zheng Chu, Jingchang Chen, Qianglong Chen, Weijiang Yu, Tao He, Haotian Wang, Weihua Peng, Ming Liu, Bing Qin, and Ting Liu. Navigate through enigmatic labyrinth a survey of chain of thought reasoning: Advances, frontiers and future, 2024.
[12] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023.
[13] Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, Jiasen Lu, Taira Anderson, Erin Bransom, Kiana Ehsani, Huong Ngo, YenSung Chen, Ajay Patel, Mark Yatskar, Chris Callison-Burch, Andrew Head, Rose Hendrix, Favyen Bastani, Eli VanderBilt, Nathan Lambert, Yvonne Chou, Arnavi Chheda, Jenna Sparks, Sam Skjonsberg, Michael Schmitz, Aaron Sarnat, Byron Bischoff, Pete Walsh, Chris Newell, Piper Wolters, Tanmay Gupta, Kuo-Hao Zeng, Jon Borchardt, Dirk Groeneveld, Crystal Nam, Sophie Lebrecht, Caitlin Wittlif, Carissa Schoenick, Oscar Michel, Ranjay Krishna, Luca Weihs, Noah A. Smith, Hannaneh Hajishirzi, Ross Girshick, Ali Farhadi, and Aniruddha Kembhavi. Molmo and pixmo: Open weights and open data for state-of-the-art vision-language models, 2024.
[14] Yuhao Dong, Zuyan Liu, Hai-Long Sun, Jingkang Yang, Winston Hu, Yongming Rao, and Ziwei Liu. Insight-v: Exploring long-chain visual reasoning with multimodal large language models. arXiv preprint arXiv:2411.14432, 2024.
[15] Guhao Feng, Bohang Zhang, Yuntian Gu, Haotian Ye, Di He, and Liwei Wang. Towards revealing the mystery behind chain of thought: a theoretical perspective. Advances in Neural Information Processing Systems, 36:70757–70798, 2023.
[16] Jiahui Gao, Renjie Pi, Jipeng Zhang, Jiacheng Ye, Wanjun Zhong, Yufei Wang, Lanqing Hong, Jianhua Han, Hang Xu, Zhenguo Li, and Lingpeng Kong. G-llava: Solving geometric problem with multi-modal large language model, 2023.
[17] Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, and Jonathan Berant. Did aristotle use a laptop? a question answering benchmark with implicit reasoning strategies, 2021.
[18] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
[19] Jarvis Guo, Tuney Zheng, Yuelin Bai, Bo Li, Yubo Wang, King Zhu, Yizhi Li, Graham Neubig, Wenhu Chen, and Xiang Yue. Mammoth-vl: Eliciting multimodal reasoning with instruction tuning at scale. arXiv preprint arXiv:2412.05237, 2024.
[20] Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham. Vizwiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3608–3617, 2018.
[21] Hanxu Hu, Simon Yu, Pinzhen Chen, and Edoardo M. Ponti. Fine-tuning large language models with sequential instructions, 2024.
[22] Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024.
[23] Peng Jin, Ryuichi Takanobu, Wancai Zhang, Xiaochun Cao, and Li Yuan. Chat-univi: Unified visual representation empowers large language models with image and video understanding, 2024.
[24] Justin Johnson, Bharath Hariharan, Laurens Van Der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning.
In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2901–2910, 2017.
[25] Samira Ebrahimi Kahou, Vincent Michalski, Adam Atkinson, Ákos Kádár, Adam Trischler, and Yoshua Bengio. Figureqa: An annotated figure dataset for visual reasoning. arXiv preprint arXiv:1710.07300, 2017.
[26] Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images, 2016.
[27] Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024.
[28] Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024.
[29] Hao Li, Jinfa Huang, Peng Jin, Guoli Song, Qi Wu, and Jie Chen. Weakly-supervised 3d spatial reasoning for text-based visual question answering. IEEE Transactions on Image Processing, 32:3367–3382, 2023.
[30] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models, 2023.
[31] Wentong Li, Yuqian Yuan, Jian Liu, Dongqi Tang, Song Wang, Jie Qin, Jianke Zhu, and Lei Zhang. Tokenpacker: Efficient visual projector for multimodal llm, 2024.
[32] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning, 2023.
[33] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36:34892–34916, 2023.
[34] Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. Gpt understands, too, 2023.
[35] Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Hao Yang, Yaofeng Sun, Chengqi Deng, Hanwei Xu, Zhenda Xie, and Chong Ruan. Deepseek-vl: Towards real-world vision-language understanding, 2024.
[36] Pan Lu, Ran Gong, Shibiao Jiang, Liang Qiu, Siyuan Huang, Xiaodan Liang, and Song-Chun Zhu. Inter-gps: Interpretable geometry problem solving with formal language and symbolic reasoning. In The 59th Annual Meeting of the Association for Computational Linguistics (ACL), 2021.
[37] Pan Lu, Liang Qiu, Jiaqi Chen, Tony Xia, Yizhou Zhao, Wei Zhang, Zhou Yu, Xiaodan Liang, and Song-Chun Zhu. Iconqa: A new benchmark for abstract diagram understanding and visual language reasoning. arXiv preprint arXiv:2110.13214, 2021.
[38] Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255, 2023.
[39] Bozhi Luan, Hao Feng, Hong Chen, Yonghui Wang, Wengang Zhou, and Houqiang Li. Textcot: Zoom in for enhanced multimodal text-rich image understanding, 2024.
[40] Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. arXiv preprint arXiv:2203.10244, 2022.
[41] Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. Docvqa: A dataset for vqa on document images. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 2200–2209, 2021.
[42] Brandon McKinzie, Zhe Gan, Jean-Philippe Fauconnier, Sam Dodge, Bowen Zhang, Philipp Dufter, Dhruti Shah, Xianzhi Du, Futang Peng, Floris Weers, Anton Belyi, Haotian Zhang, Karanjeet Singh, Doug Kang, Ankur Jain, Hongyu Hè, Max Schwarzer, Tom Gunter, Xiang Kong, Aonan Zhang, Jianyu Wang, Chong Wang, Nan Du, Tao Lei, Sam Wiseman, Guoli Yin, Mark Lee, Zirui Wang, Ruoming Pang, Peter Grasch, Alexander Toshev, and Yinfei Yang. Mm1: Methods, analysis & insights from multimodal llm pre-training, 2024.
[43] Chancharik Mitra, Brandon Huang, Trevor Darrell, and Roei Herzig. Compositional chain-of-thought prompting for large multimodal models, 2024.
[44] Yuxuan Qiao, Haodong Duan, Xinyu Fang, Junming Yang, Lin Chen, Songyang Zhang, Jiaqi Wang, Dahua Lin, and Kai Chen. Prism: A framework for decoupling and assessing the capabilities of vlms, 2024.
[45] Tanik Saikh, Tirthankar Ghosal, Amish Mittal, Asif Ekbal, and Pushpak Bhattacharyya. Scienceqa: A novel resource for question answering on scholarly articles. International Journal on Digital Libraries, 23(3):289–301, 2022.
[46] Maarten Sap, Vered Shwartz, Antoine Bosselut, Yejin Choi, and Dan Roth. Commonsense reasoning for natural language processing. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts, pages 27–33, Online, 2020. Association for Computational Linguistics.
[47] Wenhao Shi, Zhiqiang Hu, Yi Bin, Junhua Liu, Yang Yang, See-Kiong Ng, Lidong Bing, and Roy Ka-Wei Lee. Math-llava: Bootstrapping mathematical reasoning for multimodal large language models, 2024.
[48] Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liang-Yan Gui, Yu-Xiong Wang, Yiming Yang, Kurt Keutzer, and Trevor Darrell. Aligning large multimodal models with factually augmented rlhf, 2023.
[49] Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024.
[50] Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024.
[51] Omkar Thawakar, Dinura Dissanayake, Ketan More, Ritesh Thawkar, Ahmed Heakl, Noor Ahsan, Yuhao Li, Mohammed Zumri, Jean Lahoud, Rao Muhammad Anwer, et al. Llamav-o1: Rethinking step-by-step visual reasoning in llms. arXiv preprint arXiv:2501.06186, 2025.
[52] Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, Ziteng Wang, Rob Fergus, Yann LeCun, and Saining Xie. Cambrian-1: A fully open, vision-centric exploration of multimodal llms, 2024.
[53] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-vl: Enhancing vision-language model's perception of the world at any resolution, 2024.
[54] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022.
[55] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models, 2023.
[56] Siheng Xiong, Yuan Yang, Ali Payani, James C Kerce, and Faramarz Fekri.
Teilp: Time prediction over knowledge graphs via logical reasoning, 2024.
[57] Guowei Xu, Peng Jin, Hao Li, Yibing Song, Lichao Sun, and Li Yuan. Llava-cot: Let vision language models reason step-by-step, 2025.
[58] Zhiyang Xu, Chao Feng, Rulin Shao, Trevor Ashby, Ying Shen, Di Jin, Yu Cheng, Qifan Wang, and Lifu Huang. Vision-flan: Scaling human-labeled tasks in visual instruction tuning, 2024.
[59] An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024.
[60] Huanjin Yao, Jiaxing Huang, Wenhao Wu, Jingyi Zhang, Yibo Wang, Shunyu Liu, Yingjie Wang, Yuxin Song, Haocheng Feng, Li Shen, and Dacheng Tao. Mulberry: Empowering mllm with o1-like reasoning and reflection via collective monte carlo tree search, 2024.
[61] Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities. In International Conference on Machine Learning, pages 57730–57754. PMLR, 2024.
[62] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9556–9567, 2024.
[63] Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Botao Yu, Ge Zhang, Huan Sun, et al. Mmmu-pro: A more robust multi-discipline multimodal understanding benchmark. arXiv preprint arXiv:2409.02813, 2024.
[64] Xiang Yue, Tianyu Zheng, Ge Zhang, and Wenhu Chen. Mammoth2: Scaling instructions from the web. Advances in Neural Information Processing Systems, 37:90629–90660, 2025.
[65] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF international conference on computer vision, pages 11975–11986, 2023.
[66] Kaichen Zhang, Bo Li, Peiyuan Zhang, Fanyi Pu, Joshua Adrian Cahyono, Kairui Hu, Shuai Liu, Yuanhan Zhang, Jingkang Yang, Chunyuan Li, and Ziwei Liu. Lmms-eval: Reality check on the evaluation of large multimodal models, 2024.
[67] Lei Zhang and Yong Rui. Image search: from thousands to billions in 20 years. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 9(1s):1–20, 2013.
[68] Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Yu Qiao, et al. Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems? In European Conference on Computer Vision, pages 169–186, 2024.
[69] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models, 2023.
[70] Chengke Zou, Xingang Guo, Rui Yang, Junyu Zhang, Bin Hu, and Huan Zhang. Dynamath: A dynamic visual benchmark for evaluating mathematical reasoning robustness of vision language models. In The Thirteenth International Conference on Learning Representations, 2024.

7. Supplementary Material

7.1. Image Number Distribution

[Figure: Number of Images Distribution in VisualWebInstruct. Bar chart of the percentage of entries by number of images per entry. Total entries: 906160.]

| Images per entry | Entries | Percentage of entries |
|---|---|---|
| 0 images | 558847 | 61.7% |
| 1 image | 283516 | 31.3% |
| 2 images | 34830 | 3.8% |
| 3 images | 12059 | 1.3% |
| 4 images | 8402 | 0.9% |
| 5+ images | 8506 | 0.9% |

7.2.
MAmmoTH-VL2 Training Configuration

Model Architecture
- Base Language Model: Qwen/Qwen2.5-7B-Instruct
- Vision Encoder: google/siglip-so400m-patch14-384
- Vision-Language Connector: MLP-based projector (2-layer with GELU)
- Vision Select Layer: -2 (second-to-last layer)
- Patch Merge Type: spatial unpad
- Starting Checkpoint: MAmmoTH-VL

Data Processing
- Image Aspect Ratio: anyres max 4
- Image Grid Pinpoints: (1x1),...,(6x6)
- Group by Modality: Enabled
- Image Start/End Tokens: Disabled
- Image Patch Token: Disabled
- Lazy Preprocessing: Enabled

Training Configuration
- Training Epochs: 1
- Batch Size: 256
- Maximum Sequence Length: 8,192 tokens
- Learning Rate: 1e-5 (language and projector)
- Vision Tower Learning Rate: 2e-6
- Weight Decay: 0.0
- Warmup Ratio: 0.03
- LR Scheduler: Cosine

Dataset Configuration
- Primary Dataset: VisualWebInstruct
- Additional Dataset: LLaVA-CoT (9:1 ratio)
- Prompt Template: qwen 25

Tunable Components
- Language Model: Enabled
- Vision Tower: Enabled
- MLP Adapter: Enabled

Optimization
- Distributed Training: DeepSpeed Zero-3
- TF32 Precision: Enabled
- Mixed Precision: BF16
- Gradient Checkpointing: Enabled
- Torch Compile: Enabled (inductor)

7.3. Prompt for Each Stage

QA Pairs Extraction

"""Analyze this webpage content and extract questions, images, and complete solution details in Markdown format.
Please format your response as follows:

**Question 1:** [complete question text]
**Images:**
*[First image URL if available]
*[Second image URL if available]
[continue for each additional image...]
**Solution:** [Copy the complete solution text from the webpage, including all steps, explanations, and calculations]
**Images in Solution:**
*[First image URL if available]
*[Second image URL if available]
[continue for each additional image...]

[repeat for each additional question...]
Requirements:
- Keep the complete solution text exactly as shown in the webpage
- Use Markdown formatting throughout the response
- Mark missing content as "Not found"
- For images, include URL only
- For multiple questions, number them sequentially
- Do not summarize or modify the solution text
- Preserve all mathematical notations and formulas
- Keep all step-by-step explanations intact
- Preserve all line breaks and indentation in solution text
- If there is no question in the content, mark it as "Not found"
- If the webpage is empty or missing, return nothing

Webpage content: {Accessibility Tree}
"""

QA Pairs Validation

"""Please analyze this question-answer pair and its images:

Question: complete question text
Solution: complete solution text

Your tasks:
1. Determine if the question is meaningful and valid.
2. For the question images (if any), determine if each is:
   - Properly referenced in the question
   - Clear and visible
   - Actually helps understand the question
3. For the solution images (if any), determine if each is:
   - Helps explain the solution

Notes:
- Image indices start from 0 (e.g., first image is index 0, second is index 1, etc.)
- Images should be marked as valid if they show the actual content being discussed
- Images should be marked as invalid only if they are:
  * Completely irrelevant to the question/solution
  * Corrupted or unreadable
  * Duplicate or redundant

Question Images: [Images loaded here]
Solution Images (starting a new section, indexes reset to 0): [Images loaded here]

Please respond in this exact format:
QUESTION VALID: [yes/no]
ANALYSIS: [Brief explanation of why the question is valid/invalid]
QUESTION IMAGES: [comma-separated list of valid image indices starting from 0]
QUESTION IMAGES REASON: [Brief explanation for each image decision]
SOLUTION IMAGES: [comma-separated list of valid image indices starting from 0]
SOLUTION IMAGES REASON: [Brief explanation for each image decision]

CRITICAL RESPONSE FORMAT INSTRUCTIONS:
- You MUST respond using EXACTLY this format with no additional text
- Use ONLY numeric indices for images, starting from 0
- If no images are valid, use an empty string
- Be precise and use actual numbers
- Always use numeric indices (0,1,2...)
- Use empty string for no images (e.g., "SOLUTION IMAGES: ")
- Do not add explanatory text in the indices field
"""

Answer Alignment

"""Given the question and the provided image(s), compare these two answers and determine if they are aligned.

Question: question
GPT's Answer: gptanswer
Real Answer: realanswer

Example of Aligned Answers:
Question: What is 2 + 2?
GPT Answer: 4
Real Answer: 4

Example of Misaligned Answers:
Question: What is derivative of x^2?
GPT Answer: 2x + 1
Real Answer: 2x

Are these answers aligned? Respond with just 'Yes' or 'No' on the first line. Provide a brief explanation on the second line."""

7.4. Example of QA Pair

Question: How many angles does a quadrilateral have?
A. 1
B. 2
C. 3
D. 4

Web Answer: Right Answer is: D
SOLUTION
An angle is made up of two rays starting from a common initial point (or a common vertex). A quadrilateral has 4 angles. Consider quadrilateral ABCD.
It can be seen here that there are 4 angles here; ∠A, ∠B, ∠C and ∠D.

GPT Answer1: To determine how many angles a quadrilateral has, let's reason through it step by step:
1. **Definition of a Quadrilateral**: A quadrilateral is a polygon with four sides.
2. **Understanding Angles**: In any polygon, the number of angles is equal to the number of sides because each angle is formed where two sides meet.
3. **Applying the Definition**: Since a quadrilateral has four sides, by definition, it will also have four angles.
4. **Double-check with Examples**: Consider common examples of quadrilaterals like squares, rectangles, and diamonds; all these shapes have four angles.
So, based on the above reasoning, the quadrilateral depicted in the image has four angles.
Answer: D. 4

GPT Answer2: To determine how many angles a quadrilateral has, let's break it down step by step:
1. **Definition of a Quadrilateral**: A quadrilateral is a polygon with four sides.
2. **Interior Angles in a Quadrilateral**: By definition, any polygon with four sides (a quadrilateral) will have four interior angles.
3. **Examples and Properties**:
   - Consider a square, rectangle, or any general quadrilateral. Each has four corners, and at each corner, there is an angle formed between two sides.
   - A rectangle, for example, has four right angles. A rhombus has four angles, the sum of which is also 360 degrees.
4. **Summation of Angles**: The sum of the interior angles of any quadrilateral is 360 degrees. This sum indicates there must be four angles, as each angle contributes to this total.
5. **Checking against other Shapes**:
   - A triangle, which is a three-sided polygon, has three angles.
   - Extending to more sides, a pentagon (five-sided polygon) has five angles.
   - This shows the number of angles is directly determined by the number of sides of the polygon.
Based on the above reasoning, a quadrilateral has 4 angles.
Hence, the correct answer is: `Answer: 4`

GPT Answer3: A quadrilateral is a polygon with four sides. By definition, a polygon has the same number of angles as it does sides. Therefore:
1. A quadrilateral has four sides.
2. Since it has four sides, it will also have four angles.
Thus, a quadrilateral has four angles.
Answer: D. 4

GPT Answer4: A quadrilateral is a polygon with four sides. By definition, any polygon with \( n \) sides will also have \( n \) angles. Since a quadrilateral has 4 sides, it must also have 4 angles.
Let's break this down:
1. Identify the shape: The given shape is a quadrilateral.
2. Determine the number of sides: A quadrilateral has 4 sides.
3. Determine the number of angles: In any polygon, the number of angles is equal to the number of sides.
Therefore, a quadrilateral has 4 angles.
Answer: D. 4
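
The example above pairs one web answer with four GPT answers that all end in an `Answer:` line. As a rough illustration of how agreement among such candidates could be tallied, the sketch below extracts the final answer from each response and compares the majority with a reference value. This is a simple string-based stand-in and not the authors' method: the Answer Alignment prompt in Section 7.3 indicates that alignment is judged by a model. The names `final_answer` and `agreement`, the regular expression, and the assumption that the web answer "D" has already been mapped to its option value "4" are all hypothetical.

```python
import re
from collections import Counter
from typing import List, Optional

# Matches "Answer:", an optional backtick, an optional option letter such as "D.",
# and then captures the answer token itself.
_ANSWER = re.compile(r"Answer:\s*`?\s*(?:[A-D]\.\s*)?([^\s`]+)")


def final_answer(response: str) -> Optional[str]:
    """Return the last 'Answer: ...' value in a response, or None if there is none."""
    found = _ANSWER.findall(response)
    return found[-1].rstrip(".,") if found else None


def agreement(responses: List[str], reference: str) -> dict:
    """Tally extracted answers and compare the majority answer with a reference value."""
    if not responses:
        raise ValueError("at least one response is required")
    votes = Counter(final_answer(r) for r in responses)
    majority, count = votes.most_common(1)[0]
    return {
        "majority": majority,
        "support": count / len(responses),
        "matches_reference": majority == reference,
    }


# Shortened versions of the four GPT answers shown above (closing sentences only).
responses = [
    "So, based on the above reasoning, the quadrilateral has four angles. Answer: D. 4",
    "Hence, the correct answer is: `Answer: 4`",
    "Thus, a quadrilateral has four angles. Answer: D. 4",
    "Therefore, a quadrilateral has 4 angles. Answer: D. 4",
]
# The web answer is option D, whose value is 4 (mapping done by hand here).
result = agreement(responses, reference="4")
print(result)
```

A string match like this only works when answers share a surface form; free-form or symbolic answers would need the model-based comparison that the alignment prompt describes.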

---