Paper 2503.10497

MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation

Authors: Weihao Xuan, Rui Yang, Heli Qi, Qingcheng Zeng, Yunze Xiao, Yun Xing, Junjue Wang, Huitao Li, Xin Li, Kunyu Yu, Nan Liu, Qingyu Chen, Douglas Teodoro, Edison Marrese-Taylor, Shijian Lu, Yusuke Iwasawa, Yutaka Matsuo, Irene Li

Published: 2025-03-13

Abstract:

Traditional benchmarks struggle to evaluate increasingly sophisticated language models in multilingual and culturally diverse contexts. To address this gap, we introduce MMLU-ProX, a comprehensive multilingual benchmark covering 13 typologically diverse languages with approximately 11,829 questions per language. Building on the challenging reasoning-focused design of MMLU-Pro, our framework employs a semi-automatic translation process: translations generated by state-of-the-art large language models (LLMs) are rigorously evaluated by expert annotators to ensure conceptual accuracy, terminological consistency, and cultural relevance. We comprehensively evaluate 25 state-of-the-art LLMs using 5-shot chain-of-thought (CoT) and zero-shot prompting strategies, analyzing their performance across linguistic and cultural boundaries. Our experiments reveal consistent performance degradation from high-resource languages to lower-resource ones, with the best models achieving over 70% accuracy on English but dropping to around 40% for languages like Swahili, highlighting persistent gaps in multilingual capabilities despite recent advances. MMLU-ProX is an ongoing project; we are expanding our benchmark by incorporating additional languages and evaluating more language models to provide a more comprehensive assessment of multilingual capabilities.

Paper Content:
arXiv:2503.10497v1 [cs.CL] 13 Mar 2025

Weihao Xuan (1), Rui Yang (2), Heli Qi (3), Qingcheng Zeng (4), Yunze Xiao (5), Yun Xing (6), Junjue Wang (1), Huitao Li (2), Xin Li (2), Kunyu Yu (2), Nan Liu (2), Qingyu Chen (7), Douglas Teodoro (8), Edison Marrese-Taylor (1), Shijian Lu (6), Yusuke Iwasawa (1), Yutaka Matsuo (1), Irene Li (1)

(1) The University of Tokyo, (2) Duke-NUS Medical School, (3) Waseda University, (4) Northwestern University, (5) Carnegie Mellon University, (6) Nanyang Technological University, (7) Yale University, (8) University of Geneva

weihaoxuan@g.ecc.u-tokyo.ac.jp, ireneli@ds.itc.u-tokyo.ac.jp
https://mmluprox.github.io/

1 Introduction

As large language models (LLMs) continue to advance and become widely integrated into various applications [25, 10, 26], assessing their performance across diverse languages and cultures is increasingly critical [24, 9]. Effective multilingual evaluation ensures the fairness, reliability, and global accessibility of LLMs, particularly benefiting users from diverse linguistic and cultural backgrounds who might otherwise face disparities in model performance. Two fundamental benchmarks for evaluating English-centric LLMs are MMLU [13] and MMLU-Pro [22]. Specifically, MMLU-Pro enhances MMLU by increasing the number of answer choices, incorporating more complex reasoning-focused questions, and removing invalid or ambiguous questions, thereby significantly improving the benchmark’s robustness and discriminative power. Among multilingual benchmarks, Global MMLU [16] extends coverage to 42 languages and categorizes questions into culture-sensitive and culture-agnostic subsets. However, since Global MMLU fundamentally relies on the original MMLU framework, it inherits limitations related to question complexity and quality, potentially leading to rapid saturation as LLMs continue to improve.

To address these limitations, we introduce MMLU-ProX, a novel multilingual benchmark that builds upon the challenging, reasoning-focused design of MMLU-Pro while extending its coverage to 13 typologically diverse languages. Our benchmark contains 11,829 questions per language while maintaining MMLU-Pro’s original structure.
Unlike previous approaches that rely on direct machine translation, our framework employs a semi-automatic translation process where initial translations generated by state-of-the-art LLMs undergo rigorous evaluation by expert annotators. This methodology ensures conceptual accuracy, terminological consistency, and cultural relevance across all target languages, mitigating the quality concerns that often plague multilingual benchmarks. In this paper, we present a comprehensive evaluation of 25 state-of-the-art large language models, employing both 5-shot chain-of-thought (CoT) [23] and 0-shot prompting strategies. Our analysis reveals significant performance variations across linguistic and cultural boundaries, highlighting the persistent challenges in developing truly multilingual AI systems. The results demonstrate that even the most advanced models show substantial performance gaps between high-resource and low-resource languages, underscoring the importance of culturally and linguistically inclusive benchmark development. MMLU-ProX offers researchers and practitioners a robust tool for evaluating the cross-lingual reasoning capabilities of LLMs, providing valuable insights for improving model development and deployment in multilingual contexts. By setting a new standard for multilingual evaluation, we aim to accelerate progress toward more equitable and globally accessible language technologies that can serve diverse user populations with comparable levels of performance and reliability.

2 Related Work

2.1 Multilingual Large Language Models

Recent advancements in multilingual large language models (LLMs) have significantly enhanced their ability to process and generate text across diverse languages. More recent models, including GPT-4o [2], Claude-3.7 [4], Gemini [7], and Llama-3 [11], have shown notable improvements in complex reasoning across over 30 languages, including Chinese, Japanese, and Spanish.
To rigorously assess the gap in LLMs’ reasoning ability across languages and push the capabilities of these LLMs, we introduce MMLU-ProX, a new multilingual benchmark designed to test the upper limits of reasoning and knowledge in advanced language models.

2.2 LLM Evaluation Benchmarks

Recent advancements in LLM evaluation have seen a strategic shift toward multilingual benchmarks to better assess cross-linguistic capabilities. Although foundational benchmarks such as GLUE [21] and SuperGLUE [20] focused on English, newer frameworks now emphasize linguistic diversity. The Multilingual Massive Multitask Language Understanding (MMMLU) benchmark [13], which evaluates models in 14 human-translated languages, including Swahili and Yoruba, exposes performance gaps in low-resource languages: even state-of-the-art models score 25% lower in Yoruba compared to English.

Similarly, BIG-bench [17] incorporates tasks spanning 1,000+ languages through its conlang translation and language identification challenges, though results reveal stark disparities between high- and low-resource language performance. Specialized bilingual evaluations such as HellaSwag-Pro [15] test robustness through adversarially filtered Chinese-English common-sense reasoning variants, demonstrating that phrasing perturbations cause accuracy drops exceeding 15% in both languages. These multilingual extensions reveal critical limitations masked by English-centric benchmarks, particularly in semantic consistency across linguistic structures and resistance to prompt variations.

MMLU-ProX translates MMLU-Pro into multiple languages, enabling direct cross-linguistic performance comparisons of language models. This approach uncovers language-specific performance disparities, offering insights into linguistic, cultural, or data-related factors affecting model performance.
3 Benchmark

3.1 Overview

MMLU-ProX extends the challenging MMLU-Pro benchmark to encompass 13 typologically diverse languages: English (EN), Chinese (ZH), Japanese (JA), Korean (KO), French (FR), German (DE), Spanish (ES), Portuguese (PT), Arabic (AR), Thai (TH), Hindi (HI), Bengali (BN), and Swahili (SW). This benchmark maintains the high difficulty level and reasoning focus of MMLU-Pro while enabling rigorous evaluation of language models’ cross-lingual reasoning capabilities. By carefully translating the same set of questions across all languages, MMLU-ProX facilitates direct comparison of model performance across linguistic boundaries while controlling for question difficulty.

3.2 Data Curation

We meticulously preprocessed the MMLU-Pro dataset before translation. First, we identified and removed duplicate question-answer pairs that appeared across different subjects, ensuring each reasoning challenge appeared only once in our benchmark. Second, we performed thorough data cleaning to address grammatical issues in the source material, including run-on words, incorrect hyphenation, and other syntactic anomalies that could potentially confound the translation process. This curation step was critical to establish a clean baseline for our multilingual translations, as source-language errors can propagate and amplify through translation pipelines, particularly in technical and specialized domains that predominate in MMLU-Pro.

3.3 Translation Pipeline

We developed a sophisticated multi-stage translation pipeline to maintain conceptual accuracy and linguistic naturalness across all target languages. The process begins with Claude 3.7 Sonnet [4] generating initial translations, followed by a self-reflection phase where the model evaluates its own output for terminological accuracy and content preservation.
This reflection mechanism allows the identification of potentially problematic translations, particularly technical terms that require specialized domain knowledge.

Given the documented variation in cross-lingual capabilities across different large language models [6], we implemented a verification step using GPT-4o to assess and refine Claude’s translations independently. This secondary review focuses particularly on terminology precision and linguistic fluency in each target language, leveraging the complementary strengths of both models. We implemented additional verification protocols for languages where either model demonstrates known weaknesses to ensure translation quality. The entire pipeline operates with minimal human intervention until the final verification stage, enabling efficient scaling across multiple languages while maintaining quality standards.

3.4 Human Verification

To rigorously validate translation quality, we are conducting systematic human evaluation across all language-subject combinations. For each language and subject category, we uniformly sample 20 entries for expert assessment along three dimensions: accuracy (preservation of source meaning), fluency (naturalness in target language), and completeness (retention of all information). Human evaluators rate each dimension on a 5-point Likert scale, with 5 representing perfect quality and 1 indicating severe deficiencies. We must emphasize that this verification process adheres to stringent quality control protocols and is still ongoing. No translations are incorporated into the final benchmark without passing this verification step. Preliminary results from completed evaluations suggest high overall quality (mean scores above 4.2 across dimensions), but we maintain a conservative stance until all verifications are complete.
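The sampling and scoring protocol described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' tooling: the record layout (`language`, `subject`), the function names, the fixed random seed, and the pass threshold of 4.0 are all hypothetical, since the paper does not publish a threshold value.

```python
import random
from statistics import mean

DIMENSIONS = ("accuracy", "fluency", "completeness")
SAMPLE_SIZE = 20  # entries reviewed per language-subject pair, as stated in the text


def sample_for_review(entries, seed=0):
    """Uniformly sample up to SAMPLE_SIZE entries per (language, subject) pair."""
    rng = random.Random(seed)  # fixed seed is an assumption, for reproducibility
    groups = {}
    for entry in entries:
        groups.setdefault((entry["language"], entry["subject"]), []).append(entry)
    return {
        key: rng.sample(items, min(SAMPLE_SIZE, len(items)))
        for key, items in groups.items()
    }


def mean_scores(ratings):
    """Average 1-5 Likert ratings per dimension, rejecting out-of-range values."""
    for rating in ratings:
        for dim in DIMENSIONS:
            if not 1 <= rating[dim] <= 5:
                raise ValueError(f"{dim} rating out of range: {rating[dim]}")
    return {dim: mean(r[dim] for r in ratings) for dim in DIMENSIONS}


def needs_revision(scores, threshold=4.0):
    """Hypothetical pass rule: every dimension must reach the threshold."""
    return any(score < threshold for score in scores.values())
```

Requiring every dimension to pass, rather than averaging the three, is one possible design: it prevents a fluent but incomplete translation from being accepted on the strength of its fluency score alone.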
Any translations scoring below established thresholds are returned to the pipeline for revision or completely redone with direct expert involvement.

3.5 Total Cost

It is worth noting that developing MMLU-ProX represents a substantial investment in resources. Between professional translation APIs, human verification costs, and computational resources (approximately 4,000 H100 GPU hours for running evaluations across models), the current development cost approaches $60,000 USD at market rates. This investment underscores our commitment to creating a high-quality, reliable benchmark for advancing multilingual AI capabilities.

| Models | Overall | EN | ZH | JA | KO | FR | DE | ES | PT | AR | TH | HI | BN | SW |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-3B [24] | 29.6 | 44.7 | 37.5 | 32.0 | 27.8 | 37.7 | 32.4 | 37.1 | 35.9 | 28.3 | 26.7 | 16.3 | 15.0 | 12.7 |
| Qwen2.5-7B [24] | 42.5 | 56.9 | 49.7 | 45.4 | 41.8 | 49.1 | 47.6 | 50.0 | 49.6 | 41.1 | 40.1 | 32.2 | 30.6 | 18.6 |
| Qwen2.5-14B [24] | 50.9 | 64.1 | 57.4 | 53.5 | 52.1 | 58.6 | 56.0 | 58.1 | 57.7 | 50.5 | 49.2 | 40.2 | 37.3 | 27.5 |
| Qwen2.5-32B [24] | 58.3 | 68.7 | 62.6 | 61.7 | 59.1 | 65.2 | 63.4 | 65.1 | 64.7 | 58.5 | 56.8 | 49.1 | 48.1 | 35.0 |
| Qwen2.5-72B [24] | **62.0** | 70.3 | **65.9** | **63.4** | 62.1 | 67.1 | **65.9** | 66.5 | **66.6** | **62.1** | 60.1 | **58.0** | **57.6** | 40.1 |
| QwQ-32B§ [19] | 60.2 | **70.7** | 65.7 | 62.8 | **62.6** | **67.4** | 63.0 | **66.7** | 65.3 | 62.0 | **61.7** | 49.1 | 52.7 | 32.8 |
| Llama3.1-8B [11] | 29.2 | 43.5 | 33.4 | 28.7 | 22.4 | 35.4 | 33.9 | 35.7 | 33.2 | 21.5 | 27.3 | 23.5 | 19.5 | 20.9 |
| Llama3.1-70B [11] | 52.9 | 62.1 | 54.6 | 51.7 | 49.5 | 58.2 | 55.4 | 58.3 | 57.3 | 48.6 | 52.8 | 48.9 | 45.7 | 44.6 |
| Llama3.1-405B [11] | 60.1 | 68.8 | 62.5 | 59.9 | 51.6 | 65.1 | 64.4 | 64.9 | 64.3 | 55.4 | 59.1 | **58.0** | 54.9 | **52.1** |
| Llama3.3-70B [11] | 57.1 | 65.7 | 58.4 | 57.0 | 54.5 | 62.1 | 59.8 | 61.5 | 61.4 | 51.0 | 56.0 | 55.4 | 50.1 | 49.0 |
| DeepSeek-R1-7B§ [12] | 26.1 | 46.0 | 37.5 | 4.4 | 5.8 | 37.2 | 35.1 | 38.1 | 37.7 | 27.7 | 21.5 | 24.3 | 13.3 | 10.8 |
| DeepSeek-R1-8B§ [12] | 23.6 | 41.2 | 28.2 | 22.4 | 13.7 | 27.6 | 29.1 | 32.5 | 31.3 | 20.5 | 18.6 | 19.6 | 11.8 | 10.6 |
| DeepSeek-R1-14B§ [12] | 45.3 | 61.3 | 53.4 | 47.8 | 46.4 | 51.5 | 51.1 | 52.6 | 54.9 | 45.1 | 43.1 | 33.9 | 30.1 | 17.3 |
| DeepSeek-R1-32B§ [12] | 54.7 | 68.5 | 58.8 | 51.0 | 56.9 | 60.6 | 61.3 | 61.6 | 63.9 | 57.3 | 54.8 | 47.2 | 44.1 | 25.0 |
| DeepSeek-R1-70B§ [12] | 55.4 | 67.3 | 54.1 | 53.5 | 50.4 | 61.5 | 60.1 | 62.1 | 61.0 | 53.6 | 53.6 | 52.5 | 47.6 | 43.2 |
| Phi4-mini-3.8B [1] | 27.8 | 46.6 | 34.4 | 22.2 | 23.8 | 37.5 | 37.1 | 36.6 | 35.6 | 26.6 | 13.8 | 19.3 | 12.2 | 15.5 |
| Phi4-14B [1] | 55.2 | 63.7 | 58.8 | 54.7 | 54.5 | 62.9 | 62.2 | 63.0 | 62.5 | 54.6 | 49.9 | 49.4 | 43.7 | 37.9 |
| Gemma2-2B [18] | 20.6 | 29.5 | 22.9 | 19.7 | 18.3 | 23.0 | 21.6 | 23.8 | 22.3 | 17.1 | 18.5 | 17.4 | 15.2 | 18.0 |
| Gemma2-9B [18] | 42.7 | 51.8 | 43.3 | 41.1 | 40.1 | 47.0 | 45.3 | 47.7 | 47.7 | 38.5 | 40.3 | 39.3 | 35.5 | 37.9 |
| Gemma2-27B [18] | 50.7 | 58.0 | 50.4 | 48.4 | 48.2 | 54.5 | 53.2 | 54.6 | 54.0 | 48.8 | 49.2 | 48.0 | 45.2 | 46.3 |
| Mistral-7B-v0.2 [14] | 19.7 | 31.7 | 22.4 | 19.3 | 17.4 | 27.1 | 26.4 | 26.8 | 26.1 | 13.4 | 11.8 | 10.8 | 10.0 | 12.3 |
| Mistral-Small-24B [3] | 46.4 | 61.2 | 54.8 | 51.8 | 50.3 | 58.2 | 56.5 | 57.7 | 57.8 | 30.2 | 36.1 | 32.0 | 30.8 | 26.4 |
| Aya-23-8B* [5] | 17.6 | 23.8 | 21.5 | 18.3 | 18.6 | 22.2 | 19.0 | 22.1 | 22.5 | 19.3 | 8.1 | 16.4 | 7.8 | 8.8 |
| Aya-23-35B* [5] | 25.4 | 34.6 | 30.0 | 26.7 | 27.6 | 32.5 | 30.4 | 32.5 | 30.8 | 28.2 | 13.9 | 22.9 | 10.6 | 9.9 |
| InternLM3-8B [8] | 26.6 | 43.8 | 43.9 | 30.2 | 26.8 | 34.8 | 34.2 | 35.9 | 34.4 | 21.9 | 5.5 | 12.6 | 7.8 | 14.1 |

Table 1: Performance (%) of 25 language models on MMLU-ProX using 5-shot chain-of-thought (CoT) prompting across 13 languages. The highest score in each column is in bold. Language codes: EN=English, ZH=Chinese, JA=Japanese, KO=Korean, FR=French, DE=German, ES=Spanish, PT=Portuguese, AR=Arabic, TH=Thai, HI=Hindi, BN=Bengali, SW=Swahili. § indicates reasoning-enhanced models; * indicates models specifically designed for multilingual performance.

4 Experiments

4.1 Setups

We evaluate 25 state-of-the-art large language models on MMLU-ProX across 13 linguistically diverse languages. The models span various architectures, parameter scales, and training paradigms, including both open-source (Qwen [19], Llama [11], DeepSeek [12], Phi [1], Gemma [18], Mistral [3], Aya [5], and InternLM [8]) and proprietary models. We assess model performance using two prompting strategies: 5-shot chain-of-thought (CoT) reasoning and zero-shot generation.
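A minimal sketch of how the two prompting strategies could be assembled and scored is given here for orientation. Everything specific in it is an assumption rather than the paper's published harness: the prompt wording, the record keys (`question`, `options`, `cot`), the helper names, and the `A`-`J` letter range, which presumes at most ten answer options per question.

```python
import re
import string

LETTERS = string.ascii_uppercase


def format_question(question, options):
    """Render a question followed by its lettered answer options."""
    lines = [f"Question: {question}", "Options:"]
    lines += [f"({LETTERS[i]}) {opt}" for i, opt in enumerate(options)]
    return "\n".join(lines)


def build_prompt(item, shots=()):
    """Build a prompt; an empty `shots` sequence yields the zero-shot case."""
    # Each demonstration carries a worked reasoning chain (hypothetical `cot` key).
    blocks = [
        format_question(s["question"], s["options"])
        + f"\nAnswer: Let's think step by step. {s['cot']}"
        for s in shots
    ]
    # With demonstrations, invite step-by-step reasoning; otherwise ask directly.
    suffix = "Answer: Let's think step by step." if shots else "Answer:"
    blocks.append(format_question(item["question"], item["options"]) + "\n" + suffix)
    return "\n\n".join(blocks)


def extract_choice(response):
    """Return the last letter stated as 'answer is (X)', or None if absent."""
    matches = re.findall(r"answer is \(?([A-J])\)?", response)
    return matches[-1] if matches else None
```

Calling `build_prompt(item, shots=five_examples)` gives the few-shot CoT variant, while `build_prompt(item)` gives the zero-shot one. Taking the last match in `extract_choice` is a deliberate choice, since a reasoning chain may mention candidate answers before settling on a final one.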
Following MMLU-Pro [22], for the CoT approach, we provide 5-shot examples of reasoning chains specific to each question category, instructing models to verbalize their reasoning process before providing an answer. This allows us to evaluate their step-by-step problem-solving abilities across languages with minimal guidance. For zero-shot answering, we provide no examples, and models produce answers directly. All evaluations use the same set of questions across languages to ensure fair comparison.

4.2 Results

4.2.1 Overall Performance

Tables 1 and 2 present comprehensive results for the 5-shot CoT and zero-shot answering approaches, respectively. The overall performance shows that larger models consistently outperform their smaller counterparts within the same family. For instance, Qwen2.5-72B (62.0%) substantially outperforms Qwen2.5-3B (29.6%) in overall CoT performance, demonstrating a clear correlation between model scale and multilingual reasoning capabilities. Among models of comparable size, the Qwen2.5 and Llama3 families generally achieve superior performance, followed by DeepSeek R1-distilled and Phi4, with Mistral, Gemma2, Aya-23, and InternLM3 showing relatively lower performance on this benchmark. Notably, Phi4-14B (55.2%) performs exceptionally well relative to its size, even outperforming some larger models like Gemma2-27B (50.7%), indicating that training strategies also play a crucial role in multilingual performance. Among the large models we tested, some (such as Qwen2.5-72B and Llama3.3-70B) perform comparably or better with zero-shot answering than with 5-shot CoT. Qwen2.5-32B shows similar performance with zero-shot answering (58.0%) and CoT (58.3%), while Qwen2.5-72B demonstrates higher accuracy with zero-shot answering (63.1%) than with CoT (62.0%). This pattern suggests that for well-trained large models, especially on higher-resource languages, the additional context provided by few-shot examples may not always be necessary for optimal performance.

| Models | Overall | EN | ZH | JA | KO | FR | DE | ES | PT | AR | TH | HI | BN | SW |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-32B [24] | 58.0 | 69.3 | 64.1 | 60.9 | 58.7 | 64.4 | 62.4 | 65.2 | 64.8 | 56.6 | 58.0 | 49.0 | 49.5 | 31.0 |
| Qwen2.5-72B [24] | **63.1** | **72.3** | **66.0** | **64.5** | **62.1** | **68.8** | **67.7** | **68.2** | **68.7** | **62.1** | **62.3** | **59.7** | **58.7** | 38.9 |
| QwQ-32B§ [19] | 51.8 | 65.4 | 54.0 | 51.1 | 59.1 | 61.0 | 60.8 | 60.5 | 62.9 | 56.1 | 56.0 | 28.2 | 34.4 | 23.4 |
| Llama3.3-70B [11] | 57.6 | 66.5 | 61.9 | 57.0 | 53.0 | 63.3 | 58.9 | 64.2 | 63.5 | 53.8 | 57.8 | 47.6 | 48.8 | **52.1** |
| DeepSeek-R1-70B§ [12] | 49.0 | 62.0 | 49.7 | 41.8 | 40.7 | 54.0 | 52.5 | 59.6 | 58.8 | 49.3 | 56.2 | 42.5 | 23.2 | 46.9 |
| Mistral-Small-24B [3] | 45.0 | 64.5 | 57.6 | 50.2 | 49.6 | 54.8 | 60.1 | 53.9 | 52.3 | 42.0 | 32.7 | 27.5 | 27.7 | 12.3 |
| Aya-23-35B* [5] | 17.6 | 31.1 | 24.5 | 18.5 | 16.2 | 29.2 | 20.7 | 28.9 | 21.2 | 15.4 | 3.7 | 14.0 | 0.8 | 5.1 |

Table 2: Performance (%) of 7 language models on MMLU-ProX using zero-shot prompting across 13 languages. The highest score in each column is in bold. Language codes: EN=English, ZH=Chinese, JA=Japanese, KO=Korean, FR=French, DE=German, ES=Spanish, PT=Portuguese, AR=Arabic, TH=Thai, HI=Hindi, BN=Bengali, SW=Swahili. § indicates reasoning-enhanced models; * indicates models specifically designed for multilingual performance.

4.2.2 Impact of Reasoning-Enhanced Post-training

Our analysis reveals notable patterns when comparing reasoning-enhanced models with their base counterparts. DeepSeek-R1-distilled models, which are built upon Qwen2.5 and Llama bases, show interesting performance characteristics across languages. DeepSeek-R1-32B (54.7%) underperforms Qwen2.5-32B (58.3%) on overall CoT performance, suggesting that reasoning-specific fine-tuning may not universally enhance multilingual capabilities without careful attention to linguistic diversity.
Similarly, the QwQ-32B model, another reasoning-enhanced model built upon Qwen2.5-32B, achieves 60.2% overall performance with CoT, outperforming both its base model and DeepSeek-R1-32B. This indicates that different reasoning-enhancement approaches may have varying impacts on cross-lingual reasoning abilities. QwQ-32B performs well in CoT reasoning (60.2%) but shows a significant drop in zero-shot answering (51.8%), suggesting complex interactions between reasoning enhancement techniques and evaluation methods.

We observe that reasoning-enhanced models generally maintain the linguistic performance hierarchy of their base architectures, with similar relative gaps between high- and low-resource languages. Different reasoning enhancement methods affect language performance in various ways. For instance, QwQ-32B performs excellently on English, Chinese, and other high-resource languages, while also showing notable improvements on traditionally medium-to-low-resource languages like Thai and Bengali, with varying performance on Hindi and Swahili. This suggests that reasoning enhancement techniques may have differential impacts across the language spectrum, underscoring the importance of incorporating linguistic diversity in reasoning-focused training.

4.2.3 Cross-lingual Performance

A striking observation from our results is the consistent performance degradation from high-resource languages to lower-resource languages. For most models, English performs best, followed by European languages (FR, DE, ES, PT), then East Asian languages (ZH, JA, KO), with South Asian and other low-resource languages performing worse, although individual languages (such as Arabic and Thai) may vary in relative ranking across different models. The best-performing model on English (QwQ-32B) achieves 70.7% accuracy with CoT, but drops to 52.7% for Bengali and 32.8% for Swahili, highlighting the persistent gaps in multilingual capabilities. Similarly, the best overall model, Qwen2.5-72B, achieves 70.3% on English but only 57.6% for Bengali and 40.1% for Swahili, demonstrating that even the most capable models struggle with lower-resource languages. Some models, such as the Llama3 series, show relatively strong performance on Swahili (with Llama3.1-405B reaching 52.1%), indicating that certain training strategies may have unique advantages in specific low-resource languages. Performance gaps between high- and low-resource languages are more pronounced in smaller models. For example, DeepSeek-R1-7B shows a 35.2 percentage-point difference between English (46.0%) and Swahili (10.8%), while DeepSeek-R1-70B exhibits a smaller but still substantial gap of 24.1 percentage points.

Notably, despite Aya models being specifically trained for multilingual performance with focused capacity on fewer languages (compared to their predecessor, Aya 101), they still struggle on MMLU-ProX, suggesting that multilingual reasoning capabilities require more than just language-focused pretraining.

4.2.4 Impact of Prompting Strategy

The comparison between Tables 1 and 2 reveals nuanced interactions between prompting strategies and languages. The impact of prompting strategy varies by model. For example, Qwen2.5-72B benefits from zero-shot answering across most languages, while QwQ-32B benefits more from CoT reasoning across all languages, particularly for low-resource languages. Llama3.3-70B shows a mixed pattern, with languages like Hindi performing better with CoT and Swahili performing better with zero-shot answering. This complexity in the interaction between models and prompting strategies further emphasizes the importance of comprehensive evaluation of multilingual capabilities and suggests that models’ reasoning abilities may be influenced by multiple factors simultaneously.
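The arithmetic behind the gap and prompting comparisons above can be checked with a short script. The scores are copied from Tables 1 and 2 for three models; the function names `prompting_delta` and `en_sw_gap` are hypothetical helpers introduced only for this illustration.

```python
LANGS = ["EN", "ZH", "JA", "KO", "FR", "DE", "ES", "PT", "AR", "TH", "HI", "BN", "SW"]

# Per-language accuracy (%) copied from Table 1 (5-shot CoT).
COT = {
    "Qwen2.5-72B": [70.3, 65.9, 63.4, 62.1, 67.1, 65.9, 66.5, 66.6, 62.1, 60.1, 58.0, 57.6, 40.1],
    "QwQ-32B": [70.7, 65.7, 62.8, 62.6, 67.4, 63.0, 66.7, 65.3, 62.0, 61.7, 49.1, 52.7, 32.8],
    "Llama3.3-70B": [65.7, 58.4, 57.0, 54.5, 62.1, 59.8, 61.5, 61.4, 51.0, 56.0, 55.4, 50.1, 49.0],
}

# Per-language accuracy (%) copied from Table 2 (zero-shot).
ZERO_SHOT = {
    "Qwen2.5-72B": [72.3, 66.0, 64.5, 62.1, 68.8, 67.7, 68.2, 68.7, 62.1, 62.3, 59.7, 58.7, 38.9],
    "QwQ-32B": [65.4, 54.0, 51.1, 59.1, 61.0, 60.8, 60.5, 62.9, 56.1, 56.0, 28.2, 34.4, 23.4],
    "Llama3.3-70B": [66.5, 61.9, 57.0, 53.0, 63.3, 58.9, 64.2, 63.5, 53.8, 57.8, 47.6, 48.8, 52.1],
}


def prompting_delta(model):
    """Zero-shot minus CoT accuracy per language, in percentage points."""
    return {
        lang: round(zero - cot, 1)
        for lang, zero, cot in zip(LANGS, ZERO_SHOT[model], COT[model])
    }


def en_sw_gap(scores):
    """English minus Swahili accuracy, in percentage points."""
    return round(scores[LANGS.index("EN")] - scores[LANGS.index("SW")], 1)
```

With these numbers, `prompting_delta("Qwen2.5-72B")` is positive for 10 of the 13 languages, zero for Korean and Arabic, and -1.2 for Swahili; every delta for QwQ-32B is negative; and Llama3.3-70B loses 7.8 points on Hindi while gaining 3.1 on Swahili. `en_sw_gap(COT["Qwen2.5-72B"])` returns 30.2, the English-Swahili difference for the best overall model.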
Despite differences in absolute performance, the relative ranking of languages remains largely consistent across both prompting strategies, suggesting that the cross-lingual performance gaps are intrinsic to the models rather than artifacts of specific evaluation methods. These findings underscore the value of MMLU-ProX as a diagnostic tool for identifying and addressing cross-lingual performance disparities, ultimately contributing to the development of more equitable language technologies.

5 Conclusion

We introduced MMLU-ProX, a multilingual benchmark spanning 13 diverse languages for evaluating cross-lingual capabilities in LLMs. Our semi-automatic translation methodology combines LLM-generated translations with expert validation to ensure quality across languages. Comprehensive evaluation of 25 state-of-the-art models reveals substantial performance disparities, with even the best model (Qwen2.5-72B) showing a gap of 30.2 percentage points between English (70.3%) and Swahili (40.1%) performance. We find that larger models consistently outperform smaller ones, reasoning-enhanced training yields inconsistent benefits across languages, and different prompting strategies have varying effectiveness depending on language resource levels. Future work will focus on expanding MMLU-ProX to include additional languages beyond the current 13, and evaluating newer state-of-the-art models such as Gemma 3 as they emerge. MMLU-ProX provides valuable insights for developing more equitable language technologies across linguistic boundaries.

Acknowledgments

This research was supported by several organizations. The Japan Society for the Promotion of Science (JSPS) provided KAKENHI funding under Grant Number 24K20832. Additional support was received from JST ACT-X, Grant Number JPMJAX24CU. We also acknowledge the contributions of NVIDIA through their Academic Grant Program and Google via the Gemma Academic Program.
The authors sincerely thank all of these agencies and organizations for their invaluable assistance.

References

[1] Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J Hewett, Mojan Javaheripi, Piero Kauffmann, et al. Phi-4 technical report. arXiv preprint arXiv:2412.08905, 2024.
[2] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
[3] Mistral AI. Mistral-small-24b-instruct-2501, 2025.
[4] Anthropic. Claude 3.7 sonnet, February 2025. Large language model.
[5] Viraat Aryabumi, John Dang, Dwarak Talupuru, Saurabh Dash, David Cairuz, Hangyu Lin, Bharat Venkitesh, Madeline Smith, Jon Ander Campos, Yi Chern Tan, et al. Aya 23: Open weight releases to further multilingual progress. arXiv preprint arXiv:2405.15032, 2024.
[6] Daniel Deutsch, Eleftheria Briakou, Isaac Caswell, Mara Finkelstein, Rebecca Galor, Juraj Juraska, Geza Kovacs, Alison Lui, Ricardo Rei, Jason Riesa, et al. Wmt24++: Expanding the language coverage of wmt24 to 55 languages & dialects. arXiv preprint arXiv:2502.12404, 2025.
[7] Machel Reid et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. ArXiv, abs/2403.05530, 2024.
[8] Zheng Cai et al. Internlm2 technical report, 2024.
[9] Manuel Faysse, Patrick Fernandes, Nuno M. Guerreiro, António Loison, Duarte M. Alves, Caio Corro, Nicolas Boizard, João Alves, Ricardo Rei, Pedro Henrique Martins, Antoni Bigata Casademunt, François Yvon, André Martins, Gautier Viaud, Céline Hudelot, and Pierre Colombo. Croissantllm: A truly bilingual french-english language model. ArXiv, abs/2402.00786, 2024.
[10] Fan Gao, Hang Jiang, Rui Yang, Qingcheng Zeng, Jinghui Lu, Moritz Blum, Tianwei She, Yuang Jiang, and Irene Li. Evaluating large language models on Wikipedia-style survey generation. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Findings of the Association for Computational Linguistics: ACL 2024, pages 5405–5418, Bangkok, Thailand, August 2024. Association for Computational Linguistics.
[11] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
[12] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
[13] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding, 2021.
[14] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7b, 2023.
[15] Xiaoyuan Li, Moxin Li, Rui Men, Yichang Zhang, Keqin Bao, Wenjie Wang, Fuli Feng, Dayiheng Liu, and Junyang Lin. Hellaswag-pro: A large-scale bilingual benchmark for evaluating the robustness of llms in commonsense reasoning. arXiv preprint, 2025.
[16] Shivalika Singh, Angelika Romanou, Clémentine Fourrier, David I. Adelani, Jian Gang Ngui, Daniel Vila-Suero, Peerat Limkonchotiwat, Kelly Marchisio, Wei Qi Leong, Yosephine Susanto, Raymond Ng, Shayne Longpre, Wei-Yin Ko, Sebastian Ruder, Madeline Smith, Antoine Bosselut, Alice Oh, Andre F. T. Martins, Leshem Choshen, Daphne Ippolito, Enzo Ferrante, Marzieh Fadaee, Beyza Ermis, and Sara Hooker. Global mmlu: Understanding and addressing cultural and linguistic biases in multilingual evaluation, 2025.
[17] Aarohi Srivastava, Abhinav Rastogi, Ajay Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. TMLR, 2023.
[18] Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118, 2024.
[19] Qwen Team. Qwq-32b: Embracing the power of reinforcement learning, March 2025.
[20] Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. Superglue: A stickier benchmark for general-purpose language understanding systems. In NeurIPS, 2019.
[21] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. Glue: A multi-task benchmark and analysis platform for natural language understanding. In ICLR, 2019.
[22] Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark, 2024.
[23] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed H. Chi, F. Xia, Quoc Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. ArXiv, abs/2201.11903, 2022.
[24] An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024.
[25] Rui Yang, Ting Fang Tan, Wei Lu, Arun James Thirunavukarasu, Daniel Shu Wei Ting, and Nan Liu. Large language models in health care: Development, applications, and challenges. Health Care Science, 2(4):255–263, 2023.
[26] Rui Yang, Qingcheng Zeng, Keen You, Yujie Qiao, Lucas Huang, Chia-Chun Hsieh, Benjamin Rosand, Jeremy Goldwasser, Amisha Dave, Tiarnan Keenan, et al. Ascle - a python natural language processing toolkit for medical text generation: development and evaluation study. Journal of Medical Internet Research, 26:e60601, 2024.

---