
arxiv

Paper 2503.10427

VisTai: Benchmarking Vision-Language Models for Traditional Chinese in Taiwan

Authors: Zhi Rui Tam, Ya-Ting Pai, Yen-Wei Lee

Published: 2025-03-13

Abstract:

In this paper, we propose a comprehensive evaluation benchmark for Visual Language Models (VLM) in Traditional Chinese. Our evaluation suite, the first of its kind, contains two complementary components: (1) VisTai-MCQ, a collection of manually curated exam multi-choice questions from 21 academic subjects designed to test the broad knowledge and reasoning capabilities of VLMs; and (2) VisTai-Dialogue, an open dialogue benchmark comprising 131 image-question pairs manually created to evaluate VLMs' ability in free-form dialogue generation within Taiwanese cultural contexts. These benchmarks address a critical gap in the evaluation landscape, where existing benchmarks predominantly focus on English or Simplified Chinese, neglecting the unique linguistic and cultural aspects of Traditional Chinese used in regions like Taiwan and Hong Kong. Our analysis reveals significant performance differences across various VLMs and highlights specific challenges in processing Traditional Chinese visual content.

Paper Content:
Preprint. Under review.

Zhi Rui Tam (Independent Researcher), Ya-Ting Pai∗ (University of Illinois Urbana-Champaign), Yen-Wei Lee∗ (Independent Researcher)

1 Introduction

Vision-Language Models (VLMs) have achieved remarkable success in tasks such as image captioning, visual question answering (VQA), and cross-modal retrieval (Chen et al., 2022; Li et al., 2022; Faysse et al., 2024). However, current multimodal benchmarks predominantly focus on English, and when extended to Chinese, they primarily employ Simplified Chinese scripts common in mainland China (Das et al., 2024; Winata et al., 2024; Wang et al., 2024). Traditional Chinese, widely used in Taiwan and Hong Kong, remains significantly underrepresented. Consequently, VLM capabilities in Traditional Chinese contexts are largely untested and unknown.

Evaluating VLMs in Traditional Chinese presents unique challenges.
Traditional Chinese characters are more complex and often do not map directly to their Simplified counterparts, causing potential inaccuracies when models trained on Simplified scripts encounter Traditional texts (Chen et al., 2024a). Moreover, images from regions like Taiwan frequently incorporate culturally specific content and Traditional Chinese textual elements (such as signs, labels, or documents), introducing further contextual challenges not typically addressed by Western-centric or Simplified-Chinese-oriented datasets. Figure 1 illustrates the two subsets of VisTai. Thus, dedicated benchmarks capturing both linguistic complexity and cultural nuances are essential for accurate evaluation.

∗ Equal contribution. Our code and dataset can be found at https://github.com/TMMMU-Benchmark/evaluation

arXiv:2503.10427v1 [cs.CL] 13 Mar 2025

Figure 1: VisTai contains two subsets: the MCQ subset includes multiple-choice exam questions from 21 subjects, and the Dialogue subset is a set of images captured in real life, each paired with a question.

1.1 Our Contributions

To fill this gap, we introduce two complementary benchmarks for evaluating VLMs in Traditional Chinese and Taiwanese contexts: VisTai (Vision and Language in Taiwan), specifically VisTai-MCQ and VisTai-Dialogue.

VisTai-MCQ. This benchmark centers on structured, exam-style questions, reflecting high-level reasoning and subject-matter knowledge. Inspired by Taiwanese educational assessments, VisTai-MCQ features images (e.g., diagrams, charts) coupled with questions in Traditional Chinese. These questions often require interpretative or logical reasoning, offering a rigorous test bed for multimodal models beyond simple image-description tasks.

VisTai-Dialogue. In contrast, VisTai-Dialogue targets open-ended scenarios rooted in everyday Taiwanese contexts. It includes image-question pairs on topics such as local landmarks, cultural practices, and daily life scenes containing Traditional Chinese text (e.g., street signs, menus).
Rather than multiple-choice prompts, VisTai-Dialogue uses free-form queries, requiring models to produce answers in Traditional Chinese and handle unconstrained, real-world queries.

In summary, our contributions include:

• The first comprehensive vision-language benchmarks specifically tailored to Traditional Chinese, covering both academic-style and open-ended real-world scenarios.
• A curated, high-quality, human-annotated dataset addressing significant gaps in multilingual multimodal AI evaluation.
• An analysis of model performance highlighting challenges specific to Traditional Chinese settings, expanding understanding beyond well-established English and Simplified Chinese evaluations.

2 Related Work

2.1 English-based Multimodal Benchmarks

Vision-language benchmarks like ChartQA (Masry et al., 2022), TextVQA (Singh et al., 2019), and DocVQA (Mathew et al., 2021) have advanced visual text understanding through specialized tasks requiring OCR and reasoning. More recent exam-based evaluations such as MMMU (Yue et al., 2024) test expert-level multimodal reasoning across 30 subjects, pushing beyond surface-level image understanding to complex problem-solving. For open-ended generation, benchmarks like Vibe-Eval (Padlewski et al., 2024) assess multimodal chat models with 269 visually-grounded prompts. However, these benchmarks remain English-centric with limited applicability to other languages and cultural contexts.

2.2 Traditional Chinese and Multilingual Benchmarks

For Traditional Chinese, recent benchmarks focus primarily on text-only evaluation. TMMLU (Hsu et al., 2023) provides 3,300 multiple-choice questions from Taiwanese exams, while TMMLU+ (Tam et al., 2024) expands this to 22,690 questions across 66 subjects. However, both lack multimodal components essential for vision-language evaluation.
TMMBench, included in the release of Breeze2 (Hsu et al., 2025), is a multiple-choice Traditional Chinese visual question benchmark that covers topics related to Taiwan, such as Taiwanese attractions, daily life, and Taiwan's university entrance exams. However, as of this writing, the TMMBench dataset and its details have not been publicly released.

Multilingual efforts include JMMMU (Onohara et al., 2024) for Japanese (though with only 30 questions per subject) and CMMU (He et al., 2024) for Simplified Chinese (with limited public access and insufficient cultural context). M3Exam (Zhang et al., 2023) covers 9 languages, including Simplified Chinese, but uses generic curricula without Traditional Chinese script or Taiwan-specific content.

This review reveals a critical gap: there is no comprehensive benchmark for evaluating vision-language models in Traditional Chinese that addresses both structured reasoning and open-ended generation while incorporating culturally relevant Taiwanese contexts. Our work on VisTai-MCQ and VisTai-Dialogue aims to address this specific need in the field.

3 VisTai-MCQ: Visual Multi-Choice Question Answering

The multi-choice question answering format is widely utilized to assess the knowledge and reasoning capabilities of Large Language Models (LLMs). In this section, we introduce a new visual multi-choice benchmark, VisTai-MCQ, composed of past examination questions from various educational levels in Taiwan, ranging from primary and secondary education to specialized undergraduate courses, such as veterinary medicine.

Our benchmark dataset was constructed using real-world exam papers collected from publicly available sources spanning the years 2013 to 2024. We selected subjects specifically requiring visual comprehension, such as medical diagnostics (e.g., interpreting X-ray and ultrasound images), geometry, electronic circuit design, and chemistry.
The curation process involved manual extraction of questions and images from official exam archives and online PDFs. Each question-image pair was independently reviewed by a second annotator to verify content accuracy, clarity, and completeness, ensuring the absence of contextual errors, missing information, or invalid answer choices. To maintain consistent baseline guessing performance, the answer choices for each question were randomly shuffled, ensuring a uniform probability (approximately 25%) of selecting the correct answer by chance.

In total, VisTai-MCQ includes 21 distinct subjects covering diverse areas such as medical diagnosis, nautical science with cartographic analysis, and technical disciplines. Each subject contains a minimum of 100 test questions, along with 5 few-shot learning examples as a development set, and 10 questions dedicated to validation purposes. Details on image resolutions and aspect ratios are provided in Appendix C.2.

Table 1 compares VisTai-MCQ with existing visual multi-choice benchmarks, such as CMMU (He et al., 2024), JMMMU (Onohara et al., 2024), and ALM-Bench (Vayani et al., 2024). Our dataset stands out due to its robustness and broader subject diversity.

| Dataset | Language | Total Subjects | Test | Validation |
|---|---|---|---|---|
| VisTai-MCQ | Traditional Chinese | 21 | 3,795 | 869 |
| CMMU | Simplified Chinese | 7 | 1,803 | 1,800 |
| MMMU | English | 28 | 1,320 | 900 |
| ALM-Bench | Traditional Chinese | 13 | 52 | 0 |

Table 1: Comparison of VisTai-MCQ with other similar datasets. For ALM-Bench, we only compare the Traditional Chinese subset.

3.1 Evaluation Method

To evaluate the visual reasoning capabilities of Vision-Language Models (VLMs) across different subjects, we employ zero-shot Chain-of-Thought prompting (Kojima et al., 2022). Each VLM is instructed to provide reasoning steps followed explicitly by the final answer choice. However, we observed that certain VLMs occasionally struggled to comply with formatting instructions, resulting in invalid responses. To mitigate evaluation bias due to parsing errors, we subsequently employed an auxiliary LLM parser (gpt-4o-mini) to extract and validate the final answer choice from each VLM response. We report the average score over all 21 subjects as our final score.

However, one limitation of the MCQA form of benchmark is that it does not reflect real-world use of VLMs, which mostly takes a free-form question answering format. We address this limitation by introducing a second benchmark dataset.

4 VisTai-Dialogue: Visual Free Form Dialogue Benchmark

In this section, we introduce a visual free-form generation benchmark designed to bridge the gap between real-world user interactions and typical model evaluation procedures. Specifically, our goal is to design a dataset that reflects real user experiences when interacting with VLMs in Traditional Chinese, where users naturally engage in open-ended dialogues rather than structured question-answering formats.

We manually curated both images and the corresponding dialogue prompts based on real-life scenarios encountered in Taiwan, such as interpreting metro rail maps or calculating shared bills from restaurant menus. Each image was intentionally captured by our researchers to ensure that these images have never been seen during VLM training. The researchers who took the photographs were also tasked with writing challenging questions related to the images, specifically designed to test spatial reasoning skills and local cultural knowledge unique to Taiwan.

To facilitate efficient data collection across diverse regions, we created a Discord server integrated with a data collection bot. We chose Discord as our data collection platform for several practical reasons: participants already had Discord installed, it allowed direct photo uploads from mobile devices, and its threaded conversation structure facilitated discussion.
Contributors, primarily existing Discord users, only needed to join the dedicated server to participate. The custom bot we developed served multiple functions: it notified users when data was received successfully, monitored emoji annotations for quality control, and provided a testbed for experimenting with various rating prompts within submission threads. Additional details of the labeling can be found in Appendix B.

Following the initial submission, our quality control process involved multiple stages:

1. Answer Provision: The researcher who submitted the image was required to provide an answer to their question within a newly created submission thread.
2. Peer Review: Other team members reviewed the provided answers, making corrections or improvements when necessary.
3. Community Moderation: We implemented a simple but effective quality control mechanism using Discord's emoji reactions. Members could flag inappropriate or unsuitable submissions using downvotes, which were automatically recorded by our bot.
4. Filtering Process: Submissions that received more than two downvotes (typically from a user and a moderator) were removed during the dataset cleaning phase.

| VLM Judge | With Image | Without Image |
|---|---|---|
| claude-3-5-sonnet-20241022 | 0.8336 | 0.8282 |
| gemini-2.0-flash-001 | 0.8241 | 0.8194 |
| gpt-4o-2024-08-06 | 0.7985 | 0.8277 |
| gemini-2.0-pro-exp-02-05 | 0.8196 | 0.8320 |
| qwen2.5-vl-72b-instruct | 0.7501 | 0.7341 |
| gpt-4o-mini-2024-07-18 | 0.7349 | 0.7360 |

Table 2: Spearman correlations between human-assigned scores and scores generated by different VLMs serving as automated judges. With Image indicates that the evaluation prompt included the associated image as additional context, whereas Without Image denotes that only the textual question and ground-truth answer were provided to the judging model. Higher correlations indicate better alignment with human judgment.
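
The Spearman correlations in Table 2 compare human scores with judge scores for the same responses. As a minimal sketch of that statistic (not the authors' evaluation code), the following standard-library implementation ranks both score lists, sharing average ranks among ties, and then takes the Pearson correlation of the ranks. The score lists at the bottom are hypothetical placeholders, not data from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical scores for six responses (illustration only).
human_scores = [8.3, 2.0, 5.7, 9.0, 4.3, 6.7]  # e.g. mean of three annotators
judge_scores = [8, 1, 6, 9, 5, 6]              # e.g. scores from a VLM judge
rho = spearman(human_scores, judge_scores)
print(f"Spearman rho = {rho:.4f}")
```

Handling ties with average ranks matters here because judge scores are often integers on a short scale, so many responses share the same score.
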
The data collection phase spanned from July 2024 to February 2025, resulting in a dataset comprising 131 unique image-question pairs with ground truth. The image resolutions and aspect ratios of VisTai-Dialogue are included in Appendix C.3.

4.1 Automatic Evaluation: VLM as judge

To lower the barrier to quick evaluations, we used a VLM as an automated judge to score responses based on questions, images, and ground truth answers. Using LLMs to evaluate free-form generation has become standard practice, exemplified by benchmarks such as AlpacaEval (Li et al., 2023) and MT-Bench (Zheng et al., 2023). Inspired by Vibe-Eval (Padlewski et al., 2024), we crafted an evaluation prompt that explicitly includes the scoring criteria, the original question, the assistant's response, the ground truth and, optionally, the image itself. The LLM judge is instructed to first provide a detailed explanation followed by the final numerical score.

To validate the reliability of LLM-generated scores, we conducted human evaluations across four selected models (gpt-4o, claude-3-5-sonnet, gpt-4o-mini, claude-3-7-sonnet). Each response was independently scored by three human annotators, and the average rating was calculated to establish a human evaluation baseline.

We evaluated several prominent VLMs including gemini-2.0-pro, gemini-2.0-flash, Qwen 2.5 VL 72B, claude-sonnet-3-5-20241022, gpt-4o-2024-08-06, LLaMA Vision 90B, and Qwen VL 72B. Table 2 reports the Spearman correlations between LLM-generated scores and human evaluations. All correlations were statistically significant (p-values near zero), hence p-values are not explicitly reported. Our analysis revealed that incorporating images in the evaluation prompts improved model-human score correlations for three of the six judges in Table 2 (claude-3-5-sonnet, gemini-2.0-flash, and qwen2.5-vl-72b), in contrast to earlier findings by Padlewski et al. (2024), which indicated minimal improvements from multimodal judging.
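
The judging procedure described in Section 4.1 (a prompt holding the criteria, question, response, and ground truth, answered by an explanation and then a numeric score) can be sketched as below. This is an illustrative assumption rather than the paper's actual prompt: the rubric wording, the `Score: <number>` output convention, and the function names `build_judge_prompt` and `parse_score` are all hypothetical, and the call to the judge model (with or without the image) is deliberately left out.

```python
import re

# Hypothetical rubric wording; the paper's actual prompt is not reproduced here.
CRITERIA = "Rate the assistant response from 0 (wrong or unhelpful) to 10 (fully correct)."


def build_judge_prompt(question, response, ground_truth, criteria=CRITERIA):
    """Assemble the text part of a judge prompt; an image may be attached separately."""
    return (
        f"{criteria}\n\n"
        f"[Question]\n{question}\n\n"
        f"[Ground truth]\n{ground_truth}\n\n"
        f"[Assistant response]\n{response}\n\n"
        "First explain your reasoning, then finish with a line of the form "
        "'Score: <number>'."
    )


# Matches a non-negative integer or decimal after 'Score:'.
SCORE_PATTERN = re.compile(r"Score:\s*(\d+(?:\.\d+)?)")


def parse_score(judge_output, low=0.0, high=10.0):
    """Return the last in-range score found in the judge output, or None."""
    matches = SCORE_PATTERN.findall(judge_output)
    if not matches:
        return None
    score = float(matches[-1])  # the final score follows the explanation
    return score if low <= score <= high else None


prompt = build_judge_prompt(
    question="Which exit is closest to the night market?",  # hypothetical example
    response="Exit 2.",
    ground_truth="Exit 1.",
)
print(parse_score("The response names the wrong exit. Score: 2"))
```

Taking the last match rather than the first is a design choice for explanation-first outputs: any numbers mentioned inside the explanation are ignored in favor of the closing score line.
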
Notably, the gemini-2.0-flash-001 model achieved performance comparable to claude-3-5-sonnet-20241022, despite being approximately 36 times less costly.

All reported correlations were statistically significant with p<0.001. The strongest correlation was observed with claude-3-5-sonnet-20241022 (ρ=0.8336, p<0.001) when images were included. Given these results and future cost considerations, we selected gemini-2.0-flash-001 as our primary LLM judge. The complete evaluation of the performance of one model on VisTai-Dialogue cost merely $0.05 USD when using gemini-2.0-flash-001 as the automated judge.

Additionally, we found only a weak correlation between response length and automated judge scores (Spearman's ρ=0.26789, p<0.001), indicating that our evaluation framework does not exhibit substantial length bias in its scoring. Our human-annotated data is also publicly available so that future users can calibrate newer VLMs against human preference. Additional details can be found in Appendix E.1.

5 Benchmark Results

We evaluated 31 vision-language models (VLMs) spanning both closed and open-weight categories, as presented in Table 3. Detailed model specifications are provided in Appendix D. Our analysis reveals a strong correlation between VisTai-Dialogue and VisTai-MCQ performance metrics (Spearman's ρ=0.7746, p<1.23×10^-8; Kendall's τ=0.6086, p<2.63×10^-8), showing consistency in the ranking across the two benchmarks.

The two Traditional Chinese VLMs in our evaluation, Breeze2-3B and Breeze2-8B, demonstrate inferior performance compared to Simplified Chinese counterparts of similar parameter scales, such as InternVL2.5-4B and InternVL2-8B variants. An intriguing observation emerges with Gemini-2.0-flash-thinking, which achieves exceptionally high scores in VisTai-Dialogue (6.51) but relatively modest performance in VisTai-MCQ (0.376).
Conversely, deepseek-ai-deepseek-vl2-small scores 0.318 in VisTai-MCQ while nearly failing in VisTai-Dialogue. Upon manual validation, we discovered this model predominantly outputs bounding boxes for object detection tasks while neglecting other user requirements in VisTai-Dialogue. Interestingly, its smaller counterpart (tiny version) manages to partially address VisTai-Dialogue questions, suggesting potential training anomalies in the small variant.

Examining the performance of smaller VLMs reveals interesting efficiency-capability trade-offs. Models with fewer parameters (≤4B) show consistent degradation in VisTai-Dialogue scores, with OpenGVLab-InternVL2-1B scoring just 2.13 compared to its 8B counterpart's 3.45. However, this performance gap narrows considerably for VisTai-MCQ, where InternVL2-1B achieves 0.2689 versus InternVL2-8B's 0.3431: only a 21.6% reduction despite using 87.5% fewer parameters. This suggests smaller models maintain reasonable factual reasoning capabilities while struggling more with open-ended, multimodal tasks. Notably, the InternVL2 series demonstrates graceful performance scaling, with consistent improvements as parameter count increases from 1B to 8B in both metrics, making them promising candidates for resource-constrained applications.

6 Analysis

6.1 Effect of Image Resolution on Model Performance

We studied how image resolution affects model performance through systematic experiments. For VisTai-MCQ, we tested both directions: upscaling to 2× and 4× the original resolution using the state-of-the-art SwinIR model (Liang et al., 2021), as well as downscaling to 1/2, 1/4, and 1/8 of the original size. For VisTai-Dialogue, due to the initially large image sizes in this dataset, we only conducted downscaling experiments at the scales 1/2, 1/4, and 1/8. Figure 2 illustrates our findings.
For VisTai-MCQ, the performance did not improve with higher resolution, indicating that the original resolution already provided sufficient detail. For VisTai-Dialogue, the model performance stabilized around half the original resolution. Gemini-2.0-flash-lite slightly improved as resolution increased, while gemini-1.5-flash-8B showed minor declines, suggesting different sensitivities to resolution across models.

[Figure 2 plots: left panel "VisTai-MCQ", Score (%) versus Scale (0.125 to 4.0); right panel "VisTai-Dialogue", Score (0-10) versus Scale (0.125 to 1.0). Models shown: meta-llama-Llama-3.2-11B-Vision-Instruct-Turbo, gpt-4o-mini-2024-07-18, gemini-2.0-flash-lite-preview-02-05, OpenGVLab-InternVL2_5-4B, gemini-1.5-flash-8b.]

Figure 2: Effect of image resolution scaling on model performance. Left: Performance on VisTai-MCQ with both upscaled (2×, 4×) and downscaled (1/2, 1/4, 1/8) images. Right: Performance on VisTai-Dialogue with downscaled images.

| Model | VisTai-Dialogue score | VisTai-Dialogue rank | VisTai-MCQ score | VisTai-MCQ rank | Avg Rank |
|---|---|---|---|---|---|
| gemini-2.0-pro-exp-02-05 | 6.72 | 1 | 0.6619 | 1 | 1.0 |
| gemini-2.0-flash-001 | 6.15 | 3 | 0.6596 | 2 | 2.5 |
| gpt-4o-2024-11-20 | 6.12 | 4 | 0.5755 | 4 | 4.0 |
| claude-3-5-sonnet-20241022 | 5.96 | 6 | 0.6019 | 3 | 4.5 |
| gemini-2.0-flash-lite-preview-02-05 | 5.92 | 7 | 0.4992 | 6 | 6.5 |
| Qwen2.5-VL-72B-instruct | 4.87 | 9 | 0.5413 | 5 | 7.0 |
| gemini-2.0-flash-thinking-exp-1219 | 6.51 | 2 | 0.3764 | 13 | 7.5 |
| gpt-4o-2024-08-06 | 5.98 | 5 | 0.4000 | 11 | 8.0 |
| gemini-1.5-pro | 5.05 | 8 | 0.4417 | 8 | 8.0 |
| gpt-4o-mini-2024-07-18 | 4.74 | 10 | 0.4091 | 10 | 10.0 |
| Qwen2-VL-72B-instruct | 4.21 | 13 | 0.4701 | 7 | 10.0 |
| gemini-1.5-flash | 4.26 | 12 | 0.3943 | 12 | 12.0 |
| Qwen2.5-VL-7B-Instruct | 4.54 | 11 | 0.3592 | 14 | 12.5 |
| Llama-3.2-90B-Vision-Instruct | 3.44 | 21 | 0.4119 | 9 | 15.0 |
| InternVL2.5-8B | 3.90 | 16 | 0.3447 | 16 | 16.0 |
| InternVL2-8B-MPO | 3.68 | 18 | 0.3533 | 15 | 16.5 |
| gemini-1.5-flash-8B | 4.18 | 15 | 0.3280 | 21 | 18.0 |
| claude-3-haiku-20240307 | 3.70 | 17 | 0.3291 | 19 | 18.0 |
| InternVL2-8B | 3.45 | 20 | 0.3431 | 17 | 18.5 |
| Qwen2-VL-7B-Instruct | 4.21 | 13 | 0.3004 | 25 | 19.0 |
| InternVL2.5-4B | 3.60 | 19 | 0.3291 | 19 | 19.0 |
| nova-lite-v1 | 3.26 | 22 | 0.3376 | 18 | 20.0 |
| Llama-3.2-11B-Vision-Instruct | 2.58 | 26 | 0.3262 | 22 | 24.0 |
| Breeze2-8B-Instruct | 3.14 | 23 | 0.2915 | 27 | 25.0 |
| Breeze2-3B-Instruct | 2.90 | 25 | 0.2971 | 26 | 25.5 |
| InternVL2-4B | 2.31 | 27 | 0.3081 | 24 | 25.5 |
| CogVLM2-llama3-chinese-chat | 2.96 | 24 | 0.2777 | 30 | 27.0 |
| deepseek-vl2-small | 0.51 | 31 | 0.3181 | 23 | 27.0 |
| InternVL2-2B | 2.22 | 28 | 0.2891 | 28 | 28.0 |
| deepseek-vl2-tiny | 2.01 | 30 | 0.2781 | 29 | 29.5 |
| InternVL2-1B | 2.13 | 29 | 0.2689 | 31 | 30.0 |

Table 3: For VisTai-MCQ we prompt each model in a zero-shot chain-of-thought fashion and report the average score over all 21 subjects. Models are sorted by average rank.

6.2 Comparison between no image provided and image

In the previous section, we examined how scores change with image resolution. Here, we examine performance when image context is entirely omitted. For MCQ tasks, this analysis establishes a baseline that measures how much linguistic knowledge alone contributes to task performance before visual information becomes beneficial. Similarly, for dialogue tasks, image-free scores serve as baseline measurements, indicating how significantly performance drops when visual context is absent.

We conducted this experiment using the same models evaluated in the previous section. Table 4 presents the results, which show consistent score reductions across all four models in both datasets when images are withheld. This performance degradation is expected and desirable, as it indicates that the models appropriately struggle when visual information is referenced but not provided.

| Model | VisTai-MCQ w/ img | VisTai-MCQ w/o img | VisTai-Dialogue w/ img | VisTai-Dialogue w/o img |
|---|---|---|---|---|
| gemini-2.0-flash-lite-preview-02-05 | 0.50 | 0.32 | 5.92 | 2.35 |
| gpt-4o-mini-2024-07-18 | 0.41 | 0.33 | 4.74 | 1.95 |
| gemini-1.5-flash-8b | 0.33 | 0.26 | 4.18 | 1.85 |
| OpenGVLab-InternVL2_5-4B | 0.33 | 0.29 | 3.60 | 1.29 |

Table 4: Performance comparison of various multimodal models across different evaluation tasks. MCQ scores represent accuracy, while dialogue scores represent quality ratings. Higher scores indicate better performance.

[Figure 3 plots: left panel "VisTai-MCQ vs MMMU", MMMU score versus VisTai-MCQ score, Pearson r=0.96; right panel "VisTai-MCQ vs CMMU", CMMU score versus VisTai-MCQ score, Pearson r=0.92. Models shown: Breeze2-3B-Instruct, Breeze2-8B-Instruct, Qwen-2.5-VL-7B-Instruct, Claude-3-Haiku, Gemini-1.5-flash, Gemini-1.5-flash-8B, Gemini-2.0-flash, Gemini-2.0-flash-lite, Gemini-2.0-pro, GPT-4o-mini, LLama-3.2-11B-Vision, Qwen-2.5-VL-72B-Instruct.]

Figure 3: Comparison of VisTai-MCQ scores with MMMU (left) and CMMU (right) on a selected subset of models of varying scales. We observe a rough correlation across the three benchmarks, though some deviations suggest differences in the specific knowledge or reasoning skills each test emphasizes.
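
As a small worked example of the image ablation in Table 4, the sketch below computes the relative score drop when images are withheld and how far the image-free MCQ accuracy sits above the approximately 25% chance level mentioned in Section 3. The numbers are copied from Table 4; the names `relative_drop`, `TABLE_4`, and `CHANCE` are illustrative and not from the paper's code.

```python
CHANCE = 0.25  # approximate guessing accuracy stated for the shuffled choices

# (MCQ w/ img, MCQ w/o img, Dialogue w/ img, Dialogue w/o img), from Table 4.
TABLE_4 = {
    "gemini-2.0-flash-lite-preview-02-05": (0.50, 0.32, 5.92, 2.35),
    "gpt-4o-mini-2024-07-18": (0.41, 0.33, 4.74, 1.95),
    "gemini-1.5-flash-8b": (0.33, 0.26, 4.18, 1.85),
    "OpenGVLab-InternVL2_5-4B": (0.33, 0.29, 3.60, 1.29),
}


def relative_drop(with_image, without_image):
    """Fraction of the with-image score that is lost when the image is withheld."""
    return (with_image - without_image) / with_image


drops = {
    model: {
        "mcq": relative_drop(mcq_w, mcq_wo),
        "dialogue": relative_drop(dlg_w, dlg_wo),
        "mcq_margin_over_chance": mcq_wo - CHANCE,
    }
    for model, (mcq_w, mcq_wo, dlg_w, dlg_wo) in TABLE_4.items()
}

for model, d in drops.items():
    print(
        f"{model}: MCQ drop {d['mcq']:.0%}, "
        f"Dialogue drop {d['dialogue']:.0%}, "
        f"image-free MCQ margin over chance {d['mcq_margin_over_chance']:+.2f}"
    )
```

With the Table 4 numbers, the relative drop is roughly 12% to 36% on VisTai-MCQ and roughly 56% to 64% on VisTai-Dialogue, and the image-free MCQ accuracies (0.26 to 0.33) remain close to the chance level.
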
6.3 Correlations with other benchmarks

As shown in Figure 3, we compare our VisTai-MCQ results with two exam-style benchmarks, MMMU (Yue et al., 2024) and CMMU (He et al., 2024), on a selected set of models spanning different scales. Despite the relatively small subset (due to computational and financial constraints), we observe a broadly consistent trend: models that perform well on MMMU or CMMU also tend to score higher on VisTai-MCQ. However, we also note that some points deviate from the main diagonal, suggesting that the three benchmarks are not fully interchangeable. The models show stronger relative performance on MMMU and CMMU compared to their VisTai-MCQ scores.

6.4 Contingency Plan When Gemini-2.0-Flash Is Retired

Because we use a closed-weight VLM, Gemini 2.0 Flash, as our judge, it will eventually become unavailable as newer versions are released. Consequently, scores assigned by a retired model such as gemini-2.0-flash will become difficult to compare with those of newer VLMs. Many LLM-as-judge works overlook this scenario, typically re-running an entire benchmark using the new model as judge. However, large-scale evaluations make such an approach infeasible.

To address this, we propose a calibration-based contingency plan using Qwen2.5-VL 72B, the highest-performing open-weight VLM in Table 3. Figure 4 (left) shows that the raw scores for Qwen2.5-VL are systematically higher than those from gemini-2.0-flash. We therefore fit a linear mapping (based on five-vote distributions) to align Qwen2.5-VL scores with gemini-2.0-flash. After calibration, instances that gemini-2.0-flash scores as zero now cluster near zero under Qwen2.5-VL; however, the number of perfect scores is reduced (Figure 4, right). After calibration, the Spearman correlation between human scores and the ensemble of Qwen2.5-VL scores improves from 0.7993 to 0.8070.

[Figure 4 plots: two heatmaps of score frequency. Vertical axis: Gemini 2.0 flash score (0 to 10). Horizontal axis: Qwen2.5 VL 72B score (0 to 10) in the left panel, and Qwen2.5 VL 72B score after calibration in the right panel. Color scale: Frequency.]

Figure 4: Comparison of Gemini 2.0 flash and Qwen 2.5 VL 72B score distributions, rounded to the nearest integer (left), and Qwen 2.5 VL 72B scores after calibration against Gemini 2.0 flash (right).

| Model | Gemini | Qwen | Qwen (calibrated) | ∆ (Cal - Gem) |
|---|---|---|---|---|
| gemini-2.0-flash-001 | 6.15 | 6.65 | 5.73 | -0.42 |
| qwen2.5-vl-72b-instruct | 4.87 | 6.18 | 5.12 | +0.25 |
| Llama-Breeze2-8B-Instruct | 3.14 | 4.12 | 3.16 | +0.02 |
| Llama-Breeze2-3B-Instruct | 2.90 | 4.07 | 3.06 | +0.16 |

Table 5: Comparison of scores assigned by different judge models (Gemini, Qwen, and calibrated Qwen) across various evaluated models. The ∆ column shows the difference between calibrated Qwen VL scores and Gemini scores, highlighting the effectiveness of the calibration. Values close to zero indicate better alignment. Higher scores in the first three columns indicate better performance.

Table 5 compares raw vs. calibrated Qwen2.5-VL scores to gemini-2.0-flash on VisTai-Dialogue, demonstrating how calibration improves alignment. This result offers a practical strategy to maintain continuity of evaluation even if the original judge model is no longer available.
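
The calibration idea in Section 6.4 can be sketched as an ordinary least-squares line that maps scores from the replacement judge onto the scale of the original judge. This is a minimal illustration under stated assumptions, not the authors' procedure: the paired scores below are hypothetical, the five-vote distributions mentioned in the text are not modeled, and clipping the mapped score to the 0-10 range is an added assumption.

```python
from statistics import mean


def fit_linear(x, y):
    """Ordinary least squares for y ≈ slope * x + intercept."""
    mx, my = mean(x), mean(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum(
        (a - mx) ** 2 for a in x
    )
    return slope, my - slope * mx


def calibrate(score, slope, intercept, low=0.0, high=10.0):
    """Map a new-judge score onto the old judge's scale, clipped to [low, high]."""
    return min(high, max(low, slope * score + intercept))


# Hypothetical paired scores for the same responses (illustration only).
qwen_scores = [2.0, 4.0, 5.0, 7.0, 9.0, 10.0]    # replacement judge
gemini_scores = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]  # original judge

slope, intercept = fit_linear(qwen_scores, gemini_scores)
calibrated = [calibrate(s, slope, intercept) for s in qwen_scores]
print(f"slope={slope:.3f}, intercept={intercept:.3f}")
print([round(c, 2) for c in calibrated])
```

Once the slope and intercept are fitted on responses scored by both judges, only the replacement judge needs to be run on new model outputs, which is the continuity property the section is after.
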
6.5 Direct Answer vs CoT in VisTai-MCQ

To analyze the impact of reasoning on performance, we compare two prompting strategies: Zero-Shot Chain of Thought (CoT) and Direct Answer prompting. The latter asks models to provide answers without intermediate reasoning steps. Table 6 presents the comparative results.

Interestingly, CoT does not universally benefit all vision-language models (VLMs). The performance gain from reasoning is predominantly observed in models that already demonstrate strong direct answer capabilities. For instance, Gemini-2.0-Flash-001, despite being smaller than Gemini-2.0-Pro (as evidenced by its lower direct answer performance), exhibits a substantially larger improvement margin (13.9 percentage points) when employing reasoning steps.

Our analysis reveals that high-performing models generally gain 2-10 percentage points through CoT prompting. Conversely, lower-performing VLMs show degraded performance with CoT, as evidenced by the negative differences in the lower portions of Table 6. We hypothesize that inferior vision processing capabilities in these models may introduce hallucinations when given extended reasoning space, ultimately diminishing their performance.
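
The CoT versus Direct Answer comparison in Section 6.5 depends on extracting a final choice from each response and averaging accuracy over subjects. The sketch below illustrates one way to do that under loud assumptions: the prompt suffixes are hypothetical wording, and a regular expression stands in for the auxiliary gpt-4o-mini parser that the paper uses, so it will miss free-form answers that an LLM parser would recover.

```python
import re
from collections import defaultdict

# Hypothetical prompt suffixes; the paper's exact wording is not reproduced.
COT_SUFFIX = "Think step by step, then end with 'Answer: <A/B/C/D>'."
DIRECT_SUFFIX = "Reply with only 'Answer: <A/B/C/D>'."

# A single choice letter after 'Answer:', optionally parenthesized, and not
# followed by another letter (so 'Answer: Because ...' does not match).
ANSWER_PATTERN = re.compile(r"Answer:\s*\(?([ABCD])\)?(?![A-Za-z])")


def extract_choice(response):
    """Return the last stated choice letter, or None if the format was not followed."""
    matches = ANSWER_PATTERN.findall(response)
    return matches[-1] if matches else None


def macro_accuracy(records):
    """Mean of per-subject accuracies; records are (subject, predicted, gold)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for subject, predicted, gold in records:
        total[subject] += 1
        correct[subject] += int(predicted == gold)
    return sum(correct[s] / total[s] for s in total) / len(total)


# Hypothetical model outputs: (subject, raw response, gold answer).
cot_outputs = [
    ("physics", "The resistors are in series, so the total is 6. Answer: B", "B"),
    ("physics", "Answer: (C)", "A"),
    ("geometry", "I cannot tell from the image.", "D"),
]
records = [(subj, extract_choice(text), gold) for subj, text, gold in cot_outputs]
print(f"macro accuracy = {macro_accuracy(records):.3f}")
```

Running the same pipeline once per prompting style and subtracting the two macro accuracies gives a per-model difference of the kind reported in Table 6. Unparseable responses count as wrong here, which is one reason a more forgiving parser can change the measured gap.
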
| Model Name | CoT | Direct Answer | Difference |
|---|---|---|---|
| gemini-2.0-flash-001 | 0.660 | 0.521 | 0.139 |
| gemini-2.0-pro-exp-02-05 | 0.662 | 0.569 | 0.093 |
| gpt-4o-2024-11-20 | 0.576 | 0.486 | 0.090 |
| claude-3-5-sonnet-20241022 | 0.602 | 0.519 | 0.083 |
| qwen2.5-vl-72b-instruct | 0.502 | 0.433 | 0.069 |
| gpt-4o-mini-2024-07-18 | 0.409 | 0.350 | 0.059 |
| Llama-Breeze2-3B-Instruct | 0.297 | 0.267 | 0.030 |
| gemini-1.5-flash | 0.394 | 0.371 | 0.023 |
| gemini-1.5-pro | 0.442 | 0.420 | 0.022 |
| gemini-1.5-flash-8b | 0.328 | 0.309 | 0.019 |
| deepseek-ai-deepseek-vl2-tiny | 0.278 | 0.259 | 0.019 |
| OpenGVLab-InternVL2-8B-MPO | 0.353 | 0.338 | 0.015 |
| Qwen-Qwen2-VL-7B-Instruct | 0.300 | 0.287 | 0.013 |
| Qwen2.5-VL-7B-Instruct | 0.359 | 0.346 | 0.013 |
| Llama-3.2-90B-Vision-Instruct-Turbo | 0.412 | 0.404 | 0.008 |
| THUDM-cogvlm2-19B | 0.278 | 0.277 | 0.001 |
| claude-3-haiku-20240307 | 0.329 | 0.330 | -0.001 |
| OpenGVLab-InternVL2-8B | 0.343 | 0.347 | -0.004 |
| OpenGVLab-InternVL2-2B | 0.289 | 0.294 | -0.005 |
| OpenGVLab-InternVL2_5-8B | 0.345 | 0.352 | -0.007 |
| OpenGVLab-InternVL2-4B | 0.308 | 0.315 | -0.007 |
| Llama-Breeze2-8B-Instruct | 0.292 | 0.299 | -0.007 |
| Llama-3.2-11B-Vision-Instruct | 0.326 | 0.335 | -0.009 |
| deepseek-ai-deepseek-vl2-small | 0.318 | 0.327 | -0.009 |
| OpenGVLab-InternVL2-1B | 0.269 | 0.282 | -0.013 |
| OpenGVLab-InternVL2_5-4B | 0.329 | 0.345 | -0.016 |

Table 6: Model Performance Comparison: CoT vs Direct Answer (Sorted by Difference)

7 Conclusion

In this work, we introduce VisTai-MCQ and VisTai-Dialogue, the first comprehensive benchmarks for evaluating Vision-Language Models in Traditional Chinese. Our preliminary results reveal that VLMs specifically trained on Traditional Chinese still significantly underperform compared to their Simplified Chinese counterparts when processing visual content in Traditional Chinese contexts. Our results also show a strong correlation between structured and free-form task performance, which suggests consistent capabilities across evaluation formats even though the two benchmarks measure different aspects of VLMs.
Moreover, our analysis of factors such as image resolution and model scaling provides valuable insights into the practical deployment of these models. Future work should focus on improving model performance on Traditional Chinese content and expanding the benchmark to include more diverse cultural contexts in Taiwan.

Acknowledgments
We would like to thank Yi-Chang Chen, Hsin-Yi Hsieh, Shih-Cheng Huang, and Cheng-Kuang Wu for their support in data creation. We also thank Cheng-Kuang Wu for insightful discussions.

References
Anthropic. The claude 3 model family: Opus, sonnet, haiku. https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf, 2024a.
Anthropic. Claude 3.5 sonnet model card addendum. https://www-cdn.anthropic.com/fed9cc193a14b84131812372d8d5857f8f304c52/Model_Card_Claude_3_Addendum.pdf, 2024b.
Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond, 2023. URL https://arxiv.org/abs/2308.12966.
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report, 2025. URL https://arxiv.org/abs/2502.13923.
Kaibing Chen, Dong Shen, Hanwen Zhong, Huasong Zhong, Kui Xia, Di Xu, Wei Yuan, Yifei Hu, Bin Wen, Tianke Zhang, et al. Evlm: An efficient vision-language model for visual understanding. arXiv preprint arXiv:2407.14177, 2024a.
Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, et al. Pali: A jointly-scaled multilingual language-image model. arXiv preprint arXiv:2209.06794, 2022.
Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, Ji Ma, Jiaqi Wang, Xiaoyi Dong, Hang Yan, Hewei Guo, Conghui He, Botian Shi, Zhenjiang Jin, Chao Xu, Bin Wang, Xingjian Wei, Wei Li, Wenjian Zhang, Bo Zhang, Pinlong Cai, Licheng Wen, Xiangchao Yan, Min Dou, Lewei Lu, Xizhou Zhu, Tong Lu, Dahua Lin, Yu Qiao, Jifeng Dai, and Wenhai Wang. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites, 2024b. URL https://arxiv.org/abs/2404.16821.
Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, Lixin Gu, Xuehui Wang, Qingyun Li, Yimin Ren, Zixuan Chen, Jiapeng Luo, Jiahao Wang, Tan Jiang, Bo Wang, Conghui He, Botian Shi, Xingcheng Zhang, Han Lv, Yi Wang, Wenqi Shao, Pei Chu, Zhongying Tu, Tong He, Zhiyong Wu, Huipeng Deng, Jiaye Ge, Kai Chen, Kaipeng Zhang, Limin Wang, Min Dou, Lewei Lu, Xizhou Zhu, Tong Lu, Dahua Lin, Yu Qiao, Jifeng Dai, and Wenhai Wang. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling, 2025. URL https://arxiv.org/abs/2412.05271.
Rocktim Das, Simeon Hristov, Haonan Li, Dimitar Dimitrov, Ivan Koychev, and Preslav Nakov. Exams-v: A multi-discipline multilingual multimodal exam benchmark for evaluating vision language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 7768–7791, 2024.
Manuel Faysse, Hugues Sibille, Tony Wu, Bilel Omrani, Gautier Viaud, Céline Hudelot, and Pierre Colombo. Colpali: Efficient document retrieval with vision language models. In The Thirteenth International Conference on Learning Representations, 2024.
Zheqi He, Xinya Wu, Pengfei Zhou, Richeng Xuan, Guang Liu, Xi Yang, Qiannan Zhu, and Hua Huang. Cmmu: A benchmark for chinese multi-modal multi-type question understanding and reasoning. arXiv preprint arXiv:2401.14011, 2024.
Wenyi Hong, Weihan Wang, Ming Ding, Wenmeng Yu, Qingsong Lv, Yan Wang, Yean Cheng, Shiyu Huang, Junhui Ji, Zhao Xue, Lei Zhao, Zhuoyi Yang, Xiaotao Gu, Xiaohan Zhang, Guanyu Feng, Da Yin, Zihan Wang, Ji Qi, Xixuan Song, Peng Zhang, Debing Liu, Bin Xu, Juanzi Li, Yuxiao Dong, and Jie Tang. Cogvlm2: Visual language models for image and video understanding, 2024. URL https://arxiv.org/abs/2408.16500.
Chan-Jan Hsu, Chang-Le Liu, Feng-Ting Liao, Po-Chun Hsu, Yi-Chang Chen, and Da-shan Shiu. Advancing the evaluation of traditional chinese language models: Towards a comprehensive benchmark suite. arXiv preprint arXiv:2309.08448, 2023.
Chan-Jan Hsu, Chia-Sheng Liu, Meng-Hsi Chen, Muxi Chen, Po-Chun Hsu, Yi-Chang Chen, and Da-Shan Shiu. The breeze 2 herd of models: Traditional chinese llms based on llama with vision-aware and function-calling capabilities. arXiv preprint arXiv:2501.13921, 2025.
Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199–22213, 2022.
Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International conference on machine learning, pp. 12888–12900. PMLR, 2022.
Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Alpacaeval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval, May 2023.
Jingyun Liang, Jiezhang Cao, Guolei Sun, Kai Zhang, Luc Van Gool, and Radu Timofte. Swinir: Image restoration using swin transformer. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 1833–1844, 2021.
Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. arXiv preprint arXiv:2203.10244, 2022.
Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. Docvqa: A dataset for vqa on document images. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp. 2200–2209, 2021.
Shota Onohara, Atsuyuki Miyai, Yuki Imajuku, Kazuki Egashira, Jeonghun Baek, Xiang Yue, Graham Neubig, and Kiyoharu Aizawa. Jmmmu: A japanese massive multi-discipline multimodal understanding benchmark for culture-aware evaluation. arXiv preprint arXiv:2410.17250, 2024.
OpenAI. Gpt-4o system card, 2024. URL https://arxiv.org/abs/2410.21276.
Piotr Padlewski, Max Bain, Matthew Henderson, Zhongkai Zhu, Nishant Relan, Hai Pham, Donovan Ong, Kaloyan Aleksiev, Aitor Ormazabal, Samuel Phua, et al. Vibe-eval: A hard evaluation suite for measuring progress of multimodal language models. arXiv preprint arXiv:2405.02287, 2024.
David Patterson, Joseph Gonzalez, Urs Hölzle, Quoc Le, Chen Liang, Lluis-Miquel Munguia, Daniel Rothchild, David So, Maud Texier, and Jeff Dean. The carbon footprint of machine learning training will plateau, then shrink, 2022. URL https://arxiv.org/abs/2204.05149.
MediaTek Research, Chan-Jan Hsu, Chia-Sheng Liu, Meng-Hsi Chen, Muxi Chen, Po-Chun Hsu, Yi-Chang Chen, and Da-Shan Shiu. The breeze 2 herd of models: Traditional chinese llms based on llama with vision-aware and function-calling capabilities, 2025. URL https://arxiv.org/abs/2501.13921.
Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8317–8326, 2019.
Zhi Rui Tam, Ya Ting Pai, Yen-Wei Lee, Hong-Han Shuai, Jun-Da Chen, Wei Min Chu, and Sega Cheng. Tmmlu+: An improved traditional chinese evaluation suite for foundation models. In First Conference on Language Modeling, 2024.
Gemini Team. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, 2024. URL https://arxiv.org/abs/2403.05530.
Ashmal Vayani, Dinura Dissanayake, Hasindri Watawana, Noor Ahsan, Nevasini Sasikumar, Omkar Thawakar, Henok Biadglign Ademtew, Yahya Hmaiti, Amandeep Kumar, Kartik Kuckreja, Mykola Maslych, Wafa Al Ghallabi, Mihail Mihaylov, Chao Qin, Abdelrahman M Shaker, Mike Zhang, Mahardika Krisna Ihsani, Amiel Esplana, Monil Gokani, Shachar Mirkin, Harsh Singh, Ashay Srivastava, Endre Hamerlik, Fathinah Asma Izzati, Fadillah Adamsyah Maani, Sebastian Cavada, Jenny Chim, Rohit Gupta, Sanjay Manjunath, Kamila Zhumakhanova, Feno Heriniaina Rabevohitra, Azril Amirudin, Muhammad Ridzuan, Daniya Kareem, Ketan More, Kunyang Li, Pramesh Shakya, Muhammad Saad, Amirpouya Ghasemaghaei, Amirbek Djanibekov, Dilshod Azizov, Branislava Jankovic, Naman Bhatia, Alvaro Cabrera, Johan Obando-Ceron, Olympiah Otieno, Fabian Farestam, Muztoba Rabbani, Sanoojan Baliah, Santosh Sanjeev, Abduragim Shtanchaev, Maheen Fatima, Thao Nguyen, Amrin Kareem, Toluwani Aremu, Nathan Xavier, Amit Bhatkal, Hawau Toyin, Aman Chadha, Hisham Cholakkal, Rao Muhammad Anwer, Michael Felsberg, Jorma Laaksonen, Thamar Solorio, Monojit Choudhury, Ivan Laptev, Mubarak Shah, Salman Khan, and Fahad Khan. All languages matter: Evaluating lmms on culturally diverse 100 languages, 2024. URL https://arxiv.org/abs/2411.16508.
Yuxuan Wang, Yijun Liu, Fei Yu, Chen Huang, Kexin Li, Zhiguo Wan, and Wanxiang Che. Cvlue: A new benchmark dataset for chinese vision-language understanding evaluation. arXiv preprint arXiv:2407.01081, 2024.
Genta Indra Winata, Frederikus Hudi, Patrick Amadeus Irawan, David Anugraha, Rifki Afina Putri, Yutong Wang, Adam Nohejl, Ubaidillah Ariq Prathama, Nedjma Ousidhoum, Afifa Amriani, et al. Worldcuisines: A massive-scale benchmark for multilingual and multicultural visual question answering on global cuisines. arXiv preprint arXiv:2410.12705, 2024.
Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, Zhenda Xie, Yu Wu, Kai Hu, Jiawei Wang, Yaofeng Sun, Yukun Li, Yishi Piao, Kang Guan, Aixin Liu, Xin Xie, Yuxiang You, Kai Dong, Xingkai Yu, Haowei Zhang, Liang Zhao, Yisong Wang, and Chong Ruan. Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal understanding, 2024. URL https://arxiv.org/abs/2412.10302.
Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9556–9567, 2024.
Wenxuan Zhang, Mahani Aljunied, Chang Gao, Yew Ken Chia, and Lidong Bing. M3exam: A multilingual, multimodal, multilevel benchmark for examining large language models. Advances in Neural Information Processing Systems, 36:5484–5505, 2023.
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36:46595–46623, 2023.

A Appendix

B Creating VisTai-Dialogue on Discord
Guidelines: The collected question-image pairs were designed to be challenging for Vision Language Models (VLMs), focusing on realistic scenarios where users might seek assistance (e.g., for visual impairment).
All contributors were instructed to ensure their submissions contained no personally identifiable information (PII), including names or faces of non-public figures, Wi-Fi passwords, or any information that could reveal residential addresses. The dataset comprises question-image pairs contributed by 9 different individuals from various locations across Taiwan. While the geographic distribution covers multiple regions, there is a higher concentration of samples from Taipei (the capital of Taiwan) as the majority of our contributors resided there.

B.1 Human Annotation Process
As illustrated in Figure 5, the steps to label a problem are:
1. Upload an image, use @ + hanji to submit it to the bot on Discord, and include the question and ground truth <standard answer>.
2. The bot will create a discussion thread, or people can wait until the thread appears before adding the standard answer.
3. Format for updating the standard answer: as long as the text mentions "ground truth:", the system will automatically capture the text that follows as the standard answer.
4. The standard answer can be updated repeatedly using step 3.
5. People can type "validate result" to get score evaluations for responses from two models.
Note: When people upload images, they waive ownership rights and agree to make the photos public.

Figure 5: Workflow of adding images, labeling, and evaluating
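Steps 3 and 4 of the annotation process describe capturing whatever follows "ground truth:" and letting later messages overwrite earlier ones. A minimal sketch of that rule follows. It is not the authors' bot code: the function names are hypothetical, and the case-insensitive match on an ASCII colon is an assumption.

```python
import re
from typing import Iterable, Optional

# Assumption: the marker is matched case-insensitively and uses an ASCII colon.
_GROUND_TRUTH = re.compile(r"ground truth:\s*(.*)", re.IGNORECASE | re.DOTALL)


def extract_ground_truth(message: str) -> Optional[str]:
    """Return the text following 'ground truth:' in a message, or None if absent."""
    match = _GROUND_TRUTH.search(message)
    if match is None:
        return None
    return match.group(1).strip()


def latest_ground_truth(thread_messages: Iterable[str]) -> Optional[str]:
    """Scan a thread in order; the most recent marked message wins (step 4)."""
    answer = None
    for message in thread_messages:
        found = extract_ground_truth(message)
        if found is not None:
            answer = found
    return answer


if __name__ == "__main__":
    thread = [
        "ground truth: 台北 101",
        "Nice photo!",
        "Ground Truth: 這是台灣台北著名的「台北 101」",
    ]
    print(latest_ground_truth(thread))
```

Keeping only the latest match mirrors the statement that the standard answer can be updated repeatedly within the same thread.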
Subject Name | Chinese Name | Test | Val | Dev
Accounting | 會計學 | 100 | 29 | 5
Arts | 藝術 | 385 | 91 | 5
Biology | 生物學 | 150 | 8 | 5
Chemistry | 化學 | 172 | 38 | 5
Chinese Literature | 中國文學 | 100 | 45 | 5
Dentistry | 牙醫學 | 220 | 49 | 5
Electronic Circuits | 電子電路 | 388 | 91 | 5
Fundamentals of Physical Therapy | 物理治療基礎 | 150 | 23 | 5
Geography | 地理學 | 100 | 39 | 5
Mathematics | 數學 | 240 | 54 | 5
Mechanics | 力學 | 217 | 49 | 5
Medical | 醫學 | 172 | 37 | 5
Music | 音樂 | 100 | 6 | 5
Natural Science | 自然科學 | 303 | 70 | 5
Navigation | 航海學 | 100 | 16 | 5
Pharmaceutical Chemistry | 藥物化學 | 100 | 45 | 5
Physics | 物理學 | 100 | 23 | 5
Sociology | 社會學 | 348 | 81 | 5
Statistics | 統計學 | 100 | 45 | 5
Structural Engineering | 結構工程 | 100 | 19 | 5
Veterinary Medicine | 獸醫學 | 150 | 11 | 5
Table 7: Overview of subjects with VisTai-MCQ splits

C Additional Details for VisTai-MCQ and VisTai-Dialogue

C.1 Details of subjects for VisTai-MCQ
Table 7 lists each subject in VisTai-MCQ with its Chinese name and the number of questions in the test, validation, and development splits.

C.2 VisTai-MCQ
Figure 6 (left) shows that most images are elongated rectangles in which the width is longer than the height. Figure 7 (left) shows that most images in the MCQ dataset contain fewer than 1M pixels.

C.3 VisTai-Dialogue
Figure 6 (right) shows that most images are elongated rectangles in which the width is longer than the height. Figure 7 (right) shows that most images in the Dialogue dataset contain more than 1M pixels.

[Figure 6 panels: "VisTai-MCQ Aspect Ratio Distribution" (left) and "VisTai-Dialogue Aspect Ratio Distribution" (right); x-axis: Aspect Ratio (width/height); y-axis: Number of Images]
Figure 6: Histograms comparing the aspect ratio distributions (width/height) of the VisTai-MCQ dataset (left) and the VisTai-Dialogue dataset (right). The vertical dashed line on the right indicates an aspect ratio of 1 (i.e., square). Most MCQ images fall below an aspect ratio of 1 (portrait), while the Dialogue dataset shows a broader spread, with many images close to square.
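The two image statistics discussed in C.2 and C.3 are the aspect ratio (width/height) and the total pixel count, with 1M pixels used as the informal dividing line. The sketch below shows how both could be derived from an image's dimensions. The names `ImageStats`, `image_stats`, and `exceeds_one_megapixel` are hypothetical and do not come from the paper's code.

```python
from typing import NamedTuple


class ImageStats(NamedTuple):
    aspect_ratio: float  # width / height, as on the x-axis of Figure 6
    total_pixels: int    # width * height, as on the x-axis of Figure 7


def image_stats(width: int, height: int) -> ImageStats:
    """Compute aspect ratio and total pixel count from image dimensions."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    return ImageStats(aspect_ratio=width / height, total_pixels=width * height)


def exceeds_one_megapixel(stats: ImageStats) -> bool:
    """True if the image has more than 1M (1,000,000) pixels."""
    return stats.total_pixels > 1_000_000


if __name__ == "__main__":
    # Illustrative dimensions only; these are not images from the benchmark.
    for width, height in [(800, 300), (1920, 1080), (3000, 4000)]:
        stats = image_stats(width, height)
        print(width, height, round(stats.aspect_ratio, 2), exceeds_one_megapixel(stats))
```

An aspect ratio above 1 indicates a landscape image and a value below 1 indicates a portrait image, which is the reading the dashed line at 1 in Figure 6 supports.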
[Figure 7 panels: "VisTai-MCQ Resolution Distribution" (left, x-axis scaled by 1e6) and "VisTai-Dialogue Resolution Distribution" (right, x-axis scaled by 1e7); x-axis: Image Resolution (total pixels); y-axis: Number of Images]
Figure 7: Comparison of image resolution distributions (in total pixels) for VisTai-MCQ (left) and VisTai-Dialogue (right). The x-axis shows the total number of pixels in each image, and the y-axis indicates how many images fall within each resolution range. The MCQ dataset tends to contain lower-resolution images, whereas the Dialogue dataset spans a broader range of resolutions.

D Benchmarked VLM Details
For open weight models, we utilize local GPU resources (3090) for models under 11B parameters, while larger models are accessed through the OpenRouter API service. For closed weight models, we access them directly through each organization's official API.

Model | Organization | Size | Arch
Closed Source Models
claude-3-haiku-20240307 (Anthropic, 2024a) | Anthropic | - | -
claude-3.5-sonnet-20241022 (Anthropic, 2024b) | Anthropic | - | -
gemini-1.5-flash (Team, 2024) | Google | - | Dense
gemini-1.5-flash-8b (Team, 2024) | Google | 8B | Dense
gemini-1.5-pro (Team, 2024) | Google | - | MoE
gemini-2.0-pro-exp-02-05 (Team, 2024) | Google | - | -
gemini-2.0-flash-thinking-exp-1219 (Team, 2024) | Google | - | -
gemini-2.0-flash-001 (Team, 2024) | Google | - | -
gemini-2.0-flash-lite-preview-02-05 (Team, 2024) | Google | - | -
gpt-4o-2024-11-20 (OpenAI, 2024) | OpenAI | - | -
gpt-4o-2024-08-06 (OpenAI, 2024) | OpenAI | - | -
gpt-4o-mini-2024-07-18 (OpenAI, 2024) | OpenAI | - | -
nova-lite-v1 | Amazon | - | -
Open Weights Models
Gemma3 27B Instruct | Google | 27.4B | Dense
Gemma3 12B Instruct | Google | 12.2B | Dense
Qwen2.5-VL-72b-instruct (Bai et al., 2025) | Alibaba | 72B | Dense
Qwen2.5-VL-7B-Instruct (Bai et al., 2025) | Alibaba | 7B | Dense
Qwen2-VL-72b-instruct (Bai et al., 2023) | Alibaba | 72B | Dense
Qwen2-VL-7B-Instruct (Bai et al., 2023) | Alibaba | 7B | Dense
InternVL2.5-8B (Chen et al., 2025) | OpenGVLab | 8B | Dense
InternVL2-8B-MPO (Chen et al., 2024b) | OpenGVLab | 8B | Dense
InternVL2.5-4B (Chen et al., 2025) | OpenGVLab | 4B | Dense
InternVL2-8B (Chen et al., 2024b) | OpenGVLab | 8B | Dense
InternVL2-4B (Chen et al., 2024b) | OpenGVLab | 4B | Dense
InternVL2-2B (Chen et al., 2024b) | OpenGVLab | 2B | Dense
InternVL2-1B (Chen et al., 2024b) | OpenGVLab | 1B | Dense
Llama-3.2-90B-Vision-Instruct (Patterson et al., 2022) | Meta | 90B | Dense
Llama-3.2-11B-Vision-Instruct (Patterson et al., 2022) | Meta | 11B | Dense
CogVLM2-llama3-chinese (Hong et al., 2024) | THUDM | 19B | Dense
Breeze2-8B-Instruct (Research et al., 2025) | MediaTek | 8B | Dense
Breeze2-3B-Instruct (Research et al., 2025) | MediaTek | 3B | Dense
deepseek-vl2-tiny (Wu et al., 2024) | DeepSeek | - | MoE
deepseek-vl2-small (Wu et al., 2024) | DeepSeek | - | MoE
Table 8: Overview of evaluated models. For closed source models, sizes are marked with '-' where not publicly disclosed. Dense stands for Decoder only Dense Transformer architecture, MoE stands for Mixture of Experts architecture.

E Prompts

E.1 VisTai-Dialogue Judgment Prompt
In addition to the judgment prompt (Figure 8), we use a 10-point scale for evaluation, with deductions applied from a starting score of 10. Any omissions or mistakes in the model's explanation incur appropriate deductions. The Scoring Guidelines (Figure 9) also define the range represented by each score, so as to better evaluate the model's capabilities. In the Figure 12 example, following the Scoring Method and Scoring Guidelines, the model's explanation accurately aligned with the Ground Truth and addressed every part of the question in clear and structured language, so the evaluation received a perfect score of 10.

E.2 VisTai-MCQ Prompt
In VisTai-MCQ, our prompts can be found in Figure 10 and Figure 11. We divide the evaluation into two prompts, Zero-Shot CoT and Direct-Answer, to examine VLM capabilities.
In the examples in Figure 13 and Figure 14, we use red to mark the model's responses and CoT processes. For CoT prompts, we specifically use "think step by step" to further guide the model's reasoning.

F Are exam questions with figures harder than normal exam questions?
In this section we discuss whether exam questions grounded in visual context are harder than other questions from the same exam without such context. In other words, does adding visual features make the benchmark harder?

Judgement Prompt
請根據使用者詢問的問題[Question]與正確答案[Ground Truth],去評價助手的回覆[Assistant Response]的評分,評分依照下方的評價指導手冊去評分。
Please evaluate the assistant's response based on the user's question [Question] and the correct answer [Ground Truth], and assess the score of the assistant's response [Assistant Response] according to the evaluation guidelines provided below.
[問題] [Question] {question}
[評價助手的回覆] [Assistant Response] {response}
[正確答案] [Ground Truth] {ground truth}
# 評分標註指南 # Scoring Guidelines
{Scoring Guidelines}
## 評分方式 ## Scoring Method
1. 首先將回答與標準答案比較 1. Responses were compared with standard answers.
2. 評估以下幾點: 2. Evaluate the following aspects:
- 是否準確? - Whether the responses were accurate?
- 是否回答了問題的所有部分 - Whether all parts of the question were addressed?
- 是否清晰且有條理? - Whether the responses were clear and well-structured?
- 是否提供有幫助的補充說明? - Whether helpful explanations were provided?
3. 從10分開始扣分: 3. Points were deducted for errors from an initial 10-point score.
- 每個事實錯誤(-1至-2分) - Each factual error deducted 1–2 points.
- 遺漏資訊(-1至-2分) - Missing information deducted 1–2 points.
- 語言組織不佳(-1分) - Incoherent language structure deducted 1 point.
- 補充說明不當或錯誤(-1分) - Inappropriate or incorrect explanations deducted 1 point.
4. 請簡短說明評分理由,包含: 4. Scoring rationale includes concise explanations of:
- 做得好的地方 - Well-addressed components or exceptional performance.
- 缺少或錯誤的部分 - Missing or incorrect information
- 為何給予此分數 - Why this score was awarded?
你的回覆格式應該是如下: Your response should be in the format:
[解釋]: (你的解釋) [Explanations]: (Your explanations)
[評分]: (int分數) [Scoring]: (Int score)
Figure 8: Instruction prompt used in scoring the sample.

Scoring Guidelines
## 評分範圍(0-10分) Score Range (0 - 10 points)
### 10分:完美 10 points: Perfect
- 完全準確無誤 - Fully accurate and error-free
- 回答問題的所有部分 - Address all parts of the questions
- 清晰且條理分明 - Clear and well-structured
- 提供有幫助的補充說明 - Helpful explanations
### 8 - 9分:非常好 8-9 points: Excellent
- 有些微錯誤或遺漏 - Minor errors or omissions
- 主要重點都有涵蓋 - Key points are covered
- 組織良好 - Clear and well-structured
- 提供有用的細節說明 - Helpful explanations
### 6 - 7分:良好 6-7 points: Good
- 有一些小錯誤 - Some minor errors
- 大部分重點都有提到 - Most key points are addressed
- 組織尚可 - Adequate structure
- 有一些相關說明 - Some relevant explanations
### 4 - 5分:普通 4-5 points: Average
- 有數個錯誤 - Several errors
- 遺漏一些重點 - Some key points are omitted
- 基本的組織結構 - Basic structure
- 說明不夠完整 - Incomplete explanations
### 2 - 3分:不佳 2-3 points: Poor
- 有許多錯誤 - Numerous errors
- 遺漏重要資訊 - Critical information is omitted
- 組織不清楚 - Disorganized structure
- 說明不正確或不足 - Incorrect or insufficient explanations
### 0 - 1分:不及格 0-1 points: Failing
- 大部分錯誤 - Mostly incorrect
- 未回答問題重點 - The core question remains unanswered
- 沒有明確的組織 - Absence of structure
- 說明不相關或有誤導 - Irrelevant or misleading explanations
Figure 9: Scoring guidelines outline the criteria for each point

VisTai-MCQ - Zero-Shot COT Prompt
回答以下的多選題問題。並且在回覆的最後記得講格式:答案:$字母 而字母是 ABCDEFG 的其中一個。回答前請先一步一步(think step by step) 想好答案。你必須使用中文回答。
Answer the following multiple-choice questions. Additionally, remember to include the format at the end of your response: Answer: $letter, where the letter is one of A, B, C, D, E, F, G. Before answering, carefully deliberate through the answer step by step. You must respond in Chinese.
[問題]: [Question] {question}
[COT]: [COT:] {COT reasoning}
[答案]: [Answer:] {your answer}
Figure 10: Instruction prompt used in VisTai-MCQ zero-shot COT

VisTai-MCQ - Direct-Answer Prompt
回答以下的多選題問題。並且在回覆的最後記得講格式:答案:$字母 而字母是 ABCD 的其中一個。直接給我答案回覆,Do not think step by step。你必須使用中文回答。
Answer the following multiple-choice question. Remember to include the format at the end of your response: Answer: $letter, where the letter is one of A, B, C, or D. Provide only the answer in Chinese without any additional explanations. Do not think step by step. You must respond in Chinese.
[問題]: [Question] {question}
[答案]: [Answer:] {your answer}
Figure 11: Instruction prompt used in VisTai-MCQ direct answer

VisTai-Dialogue Example
請根據使用者詢問的問題[Question]與正確答案[Ground Truth],去評價助手的回覆[Assistant Response]的評分,評分依照下方的評價指導手冊去評分。
Please evaluate the assistant's response based on the user's question [Question] and the correct answer [Ground Truth], and assess the score of the assistant's response [Assistant Response] according to the evaluation guidelines provided below.
[問題] [Question]
這在台灣哪的建築?能猜出這是什麼時間點拍的嗎?
Where is this building located in Taiwan? Can the time period during which this photo was taken be determined?
[評價助手的回覆] [Assistant Response]
這張照片拍攝的是台北 101大樓,畫面中出現的煙火表演通常是在跨年時舉行的,因此可以推測這張照片大約是跨年夜所拍攝的。台北 101在新年倒數時會舉行盛大的煙火秀,吸引了許多民眾前來觀賞。
This photograph depicts Taipei 101, with fireworks display typically held during New Year's Eve celebrations. It is inferred that this photo was taken on New Year's Eve, as Taipei 101 annually hosts a grand fireworks show during the New Year countdown, drawing large crowds of spectators.
[正確答案] [Ground Truth]
這是台灣台北著名的「台北 101」,此圖是元旦跨年時101的煙火秀
This is the renowned Taipei 101 in Taipei, Taiwan. The image captures the New Year's Eve fireworks display at the building during the annual countdown celebration.
你的回覆格式應該是如下: Your response should be in the format:
[解釋]: (你的解釋) [Explanations]: (Your explanations)
[Explanations]: Completely correct
[評分]: (int分數) [Scoring]: (Int score)
[Scoring]: 10
Figure 12: VisTai-Dialogue example of evaluating a response that receives a perfect 10 points.

VisTai-MCQ - Zero-Shot COT Example
回答以下的多選題問題。並且在回覆的最後記得講格式:答案:$字母 而字母是 ABCDEFG 的其中一個。回答前請先一步一步(think step by step) 想好答案。你必須使用中文回答。
Answer the following multiple-choice questions. Additionally, remember to include the format at the end of your response: Answer: $letter, where the letter is one of A, B, C, D, E, F, G. Before answering, carefully deliberate through the answer step by step. You must respond in Chinese.
[問題]: 甲公司係採用帳齡分析法估計預期信用減損損失,X8年12月31日應收帳款帳齡分析資料如下:X8年5月12日曾將確認無法收回之應收帳款$35,680沖銷;X8年12月31日調整前備抵損失為借餘$970。甲公司X8年度應認列之預期信用減損損失為何?
[Question]: Company A uses the aging analysis method to estimate expected credit losses. The aging analysis data for accounts receivable as of December 31, Year X8 is as follows: On May 12, Year X8, the company wrote off accounts receivable of $35,680 confirmed to be uncollectible; the allowance for credit losses had a debit balance of $970 before adjustment on December 31, Year X8. What amount of expected credit loss should Company A recognize for the Year X8?
A. $37,965 A. $37,965
B. $36,935 B. $36,935
C. $36,995 C. $36,995
D. $38,935 D. $38,935
[COT]: 我將計算預期信用損失。我將每個帳齡類別的餘額乘以相應的損失率,然後將結果相加。最後,我會調整信用損失準備金中的借方餘額。預期信用損失計算如下:
1. 30天以下:$489,000 * 0.5% = $2,445
2. 31-90天:$201,000 * 2% = $4,020
3. 91-180天:$106,000 * 5% = $5,300
4. 181-360天:$48,500 * 20% = $9,700
5. 361天以上:$27,500 * 60% = $16,500
總預期信用損失:$2,445 + $4,020 + $5,300 + $9,700 + $16,500 = $37,965
調整前信用損失準備的借方餘額:$970
因此,應確認的預期信用損失金額為:$37,965 + $970 = $38,935
[答案]: D
[COT]: I will calculate the expected credit loss by multiplying the balance of each aging category by the corresponding loss rate and summing the results.
Finally, adjustments are made for the debit balance in the credit loss reserve. The expected credit loss calculation is as follows:
1. Under 30 days: $489,000 × 0.5% = $2,445
2. 31-90 days: $201,000 × 2% = $4,020
3. 91-180 days: $106,000 × 5% = $5,300
4. 181-360 days: $48,500 × 20% = $9,700
5. Over 361 days: $27,500 × 60% = $16,500
Total expected credit loss: $2,445 + $4,020 + $5,300 + $9,700 + $16,500 = $37,965
Debit balance of credit loss reserve before adjustment: $970
Accordingly, the recognized expected credit loss amount is: $37,965 + $970 = $38,935
[Answer]: D
Figure 13: VisTai-MCQ - Zero-Shot COT with reasoning and correct response

VisTai-MCQ - Direct Answer Example
回答以下的多選題問題。並且在回覆的最後記得講格式:答案:$字母 而字母是 ABCD 的其中一個。直接給我答案回覆,Do not think step by step。你必須使用中文回答。
Answer the following multiple-choice question. Remember to include the format at the end of your response: Answer: $letter, where the letter is one of A, B, C, or D. Provide only the answer in Chinese without any additional explanations. Do not think step by step. You must respond in Chinese.
[問題]: X1年間總公司總計將成本$150,000之商品運交分公司,轉撥價格均按成本加價10%,且分公司不自外界進貨。總公司運交分公司之一批商品,至期末分公司仍未收到,而導致「總公司往來」與「分公司往來」金額不一致。總公司與分公司於X1年底調整前試算表部分資料如下:試問分公司帳上期初存貨餘額為何?
[Question]: During the year X1, the Head Office transferred goods costing $150,000 to the Branch at a transfer price of cost plus 10%, with the Branch not sourcing inventory externally. A shipment of goods dispatched by the Head Office to the Branch remained undelivered at period-end, creating discrepancies between the balances of the Head Office Current Account and Branch Current Account. Selected pre-adjustment trial balance data from both entities as of year-end X1 are provided. What was the Branch's beginning inventory balance?
A. $33,000 A. $33,000
B. $30,000 B. $30,000
C. $0 C. $0
D. $20,000 D. $20,000
[答案]: B [Answer]: B
Figure 14: VisTai-MCQ - direct answer example with incorrect responses (the answer is A)
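Both MCQ prompts ask the model to end its reply with "答案:$字母", so scoring requires pulling the final letter out of free text. The sketch below shows one way this could be done. It is an assumption rather than the authors' scoring script: the regular expression, the choice to keep the last match, and the names `extract_choice` and `accuracy` are all illustrative.

```python
import re
from typing import Optional, Sequence

# Accepts "答案:D", "答案：D", "[答案]: D" and an optional "$" before the letter.
_ANSWER = re.compile(r"答案\]?\s*[:：]\s*\$?\s*([A-G])")


def extract_choice(response: str, valid_letters: str = "ABCD") -> Optional[str]:
    """Return the last answer letter stated in a response, or None if unparseable."""
    matches = _ANSWER.findall(response)
    if not matches:
        return None
    letter = matches[-1]  # CoT text may mention options earlier; keep the final one
    return letter if letter in valid_letters else None


def accuracy(responses: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of responses whose extracted letter equals the gold letter."""
    if len(responses) != len(gold):
        raise ValueError("responses and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(extract_choice(r, "ABCDEFG") == g for r, g in zip(responses, gold))
    return correct / len(gold)


if __name__ == "__main__":
    cot_reply = "…$37,965 + $970 = $38,935 [答案]: D"
    print(extract_choice(cot_reply))
```

In this sketch, replies without a parseable letter count as incorrect, which is one common convention; the paper does not state how such replies were handled.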

---