arxiv

Paper 2503.13399

MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research

Authors: James Burgess, Jeffrey J Nirschl, Laura Bravo-Sánchez, Alejandro Lozano, Sanket Rajan Gupte, Jesus G. Galaz-Montoya, Yuhui Zhang, Yuchang Su, Disha Bhowmik, Zachary Coman, Sarina M. Hasan, Alexandra Johannesson, William D. Leineweber, Malvika G Nair, Ridhi Yarlagadda, Connor Zuraski, Wah Chiu, Sarah Cohen, Jan N. Hansen, Manuel D Leonetti, Chad Liu, Emma Lundberg, Serena Yeung-Levy

Published: 2025-03-17

Abstract:

Scientific research demands sophisticated reasoning over multimodal data, a challenge especially prevalent in biology. Despite recent advances in multimodal large language models (MLLMs) for AI-assisted research, existing multimodal reasoning benchmarks only target up to college-level difficulty, while research-level benchmarks emphasize lower-level perception, falling short of the complex multimodal reasoning needed for scientific discovery. To bridge this gap, we introduce MicroVQA, a visual-question answering (VQA) benchmark designed to assess three reasoning capabilities vital in research workflows: expert image understanding, hypothesis generation, and experiment proposal. MicroVQA consists of 1,042 multiple-choice questions (MCQs) curated by biology experts across diverse microscopy modalities, ensuring VQA samples represent real scientific practice. In constructing the benchmark, we find that standard MCQ generation methods induce language shortcuts, motivating a new two-stage pipeline: an optimized LLM prompt structures question-answer pairs into MCQs; then, an agent-based ‘RefineBot’ updates them to remove shortcuts. Benchmarking on state-of-the-art MLLMs reveals a peak performance of 53%; models with smaller LLMs only slightly underperform top models, suggesting that language-based reasoning is less challenging than multimodal reasoning; and tuning with scientific articles enhances performance. Expert analysis of chain-of-thought responses shows that perception errors are the most frequent, followed by knowledge errors and then overgeneralization errors. These insights highlight the challenges in multimodal scientific reasoning, showing MicroVQA is a valuable resource advancing AI-driven biomedical research. MicroVQA is available at https://huggingface.co/datasets/jmhb/microvqa, and project page at https://jmhb0.github.io/microvqa.

Author affiliations: James Burgess (1, *), Jeffrey J Nirschl (1, *), Laura Bravo-Sánchez (1, *), Alejandro Lozano (1), Sanket Rajan Gupte (1), Jesus G. Galaz-Montoya (1), Yuhui Zhang (1), Yuchang Su (2), Disha Bhowmik (3), Zachary Coman (3), Sarina M. Hasan (4), Alexandra Johannesson (5), William D. Leineweber (1), Malvika G Nair (3), Ridhi Yarlagadda (3), Connor Zuraski (1), Wah Chiu (1), Sarah Cohen (3), Jan N. Hansen (1), Manuel D Leonetti (6), Chad Liu (6), Emma Lundberg (1, 5, 6), Serena Yeung-Levy (1, 6)

(1) Stanford University, (2) Tsinghua University, (3) University of North Carolina at Chapel Hill, (4) Princeton University, (5) KTH Royal Institute of Technology, (6) Chan Zuckerberg Biohub Network

(*) These authors contributed equally. Correspondence: {jmhb,jnirschl,lmbravo}@stanford.edu

arXiv:2503.13399v1 [cs.CV] 17 Mar 2025

Figure 1. A scientific experimentation workflow drives discovery: researchers analyze experiments, develop hypotheses, and design further experiments to test their ideas. We release MicroVQA, a visual question answering (VQA) benchmark to test these three tasks in the context of biological microscopy. Each of the 1,042 samples is created by a biology expert, and transformed into a multiple choice question (MCQ).
Panel 1, Expert visual understanding (perception; "What is unusual about the result?"): "How is the Seipin localized within the endoplasmic reticulum (ER)?"; "Do cells treated with BafA1 (left) express more p26 compared to control (right)?"
Panel 2, Hypothesis generation (assessment; "Why does this happen in my experiment?"): "Which mechanism might explain why ASLV particles show signs of merging in a CryoEM image?"; "Which glial cell is likely responsible for this abnormal reticular fiber pattern?"
Panel 3, Experimental proposal (action; "How do I test my hypothesis?"): "What experiment could you perform to test if CCNB1 protein levels relate to cell cycle stages in human A-431 cells?"; "What experimental change can increase the likelihood of achieving a high-resolution structure?"

1. Introduction

The transformative potential of AI in scientific discovery lies in its ability to interpret and reason over complex, multimodal data while integrating specialist domain knowledge. Recently, multimodal large language models (MLLMs) have demonstrated notable capabilities in data comprehension, recall, and multi-step reasoning across various modalities on exam-style questions [1, 3, 6, 49, 69, 81]. This progress has generated enthusiasm for scientific AI applications, ranging from multimodal chat AI assistants [33, 70, 85, 96] to AI agents capable of orchestrating complex experiments [12, 16, 26, 55]. However, while MLLMs perform well on general exam tasks, they often lack the specialized, contextual reasoning needed for real scientific challenges, where tasks require advanced image-based reasoning, analysis, and hypothesis-driven experimentation.

In recent years, efforts to apply scientific MLLMs across a range of domains have laid an essential groundwork. Advances have been achieved in image understanding [41, 43, 47, 50, 57], literature analysis [41, 47, 66, 77], and problem-solving at school and early college levels [53, 87]. In applied settings, AI has facilitated scientific tasks in biology [28, 60, 63, 68], chemistry [12, 38, 55], software [63, 73], and diagnostic pathology and radiology [52]. However, there has been less progress on multimodal tasks requiring reasoning in research-level settings. The gap is due in part to a lack of multimodal benchmarks for scientific reasoning – unlike college-level tasks where existing exams can form benchmarks, research tasks require expert curation.

Microscopy provides a compelling use case for developing such a benchmark. It requires more than pattern recognition or factual recall – it demands the ability to synthesize visual information with experimental context, formulate hypotheses, and propose experimental follow-ups.
To evaluate these advanced capabilities, we introduce MicroVQA, a visual-question answering (VQA) benchmark tailored for multimodal scientific reasoning. Grounded in the real-world scientific practice of biological microscopy, MicroVQA incorporates 1,042 multiple-choice questions (MCQs) manually created by biological researchers across diverse disciplines, with each question taking more than 30 minutes to generate. Careful expert curation ensures the benchmark is relevant to real scientific applications.

MicroVQA is designed to evaluate three key reasoning tasks crucial to scientific inquiry: (1) expert image understanding – the ability to recognize scientifically relevant features and integrate contextual information about sample preparation; (2) hypothesis generation – the formulation of scientifically plausible explanations from experimental data; and (3) experiment proposal – the ability to suggest further experiments that could validate or refute hypotheses. These capabilities form a common workflow in scientific experimentation. They demand both abductive reasoning – inferring the best explanation from multiple possibilities – and deductive reasoning – moving from general principles to specific conclusions [81].

We tested standard methods to map open scientific VQA samples to multiple-choice questions (MCQs), but found that they induce language shortcuts – the MCQs failed to truly test multimodal abilities. To ensure question quality and difficulty, we developed a two-stage pipeline for MCQ generation: first, an optimized LLM prompt structures QA pairs into well-formatted MCQs; then, a novel agent-based ‘RefineBot’ system increases question difficulty by rewriting questions without language shortcuts. This ensures that MicroVQA’s MCQs better test true scientific capabilities, rather than simple pattern recognition.
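
The refinement stage of this pipeline can be illustrated with a minimal Python sketch of the evaluator, rewriter, and checker loop that Sec. 4.2 describes. This is not the authors' implementation: the `MCQ` class, the three callable signatures, the `refine_mcq` name, and the default `max_iters` value are all assumptions made for illustration, and the agents, which are LLMs in the paper, appear here only as plain callables.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class MCQ:
    """Hypothetical container for one multiple-choice question."""
    stem: str
    options: Tuple[str, ...]  # all candidate answers, including the correct one
    answer_idx: int           # index of the correct option


# Hypothetical agent signatures (assumptions, not the paper's API).
Evaluator = Callable[[MCQ], Tuple[int, str]]  # text-only guess and a reflection
Rewriter = Callable[[MCQ, str], MCQ]          # revised MCQ given the reflection
Checker = Callable[[MCQ, MCQ], bool]          # is the revision consistent with the original?


def refine_mcq(original: MCQ, evaluate: Evaluator, rewrite: Rewriter,
               check: Checker, max_iters: int = 5) -> Optional[MCQ]:
    """Rewrite an MCQ until the evaluator can no longer answer it without the image.

    Returns the refined MCQ, or None if the checker rejects a rewrite or the
    evaluator still succeeds after max_iters rewrites.
    """
    current = original
    for _ in range(max_iters):
        guess, reflection = evaluate(current)
        if guess != current.answer_idx:
            return current  # the language shortcut no longer works
        candidate = rewrite(current, reflection)
        if not check(original, candidate):
            return None  # rewrite drifted from the original question-answer pair
        current = candidate
    # Evaluate the last rewrite once more before giving up.
    guess, _ = evaluate(current)
    return current if guess != current.answer_idx else None
```

A `None` result corresponds to the failure case in Sec. 4.2, where the paper notes that the procedure can be rerun with a different seed.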
We benchmarked frontier MLLMs on MicroVQA, finding that the best performers achieved a modest peak of 53%, which shows a significant gap between current models and expert-level scientific reasoning. The variation between different MLLMs was small; notably, smaller LLMs only slightly underperform their larger counterparts, suggesting that the key challenge for our MCQs is not language-only reasoning, but multimodal reasoning or knowledge. We also find that finetuning MLLMs on scientific publications improves MicroVQA performance. Finally, a group of biomedical researchers performed a detailed qualitative analysis of MLLM failures, uncovering three major failure modes: expert perception errors are the most common, followed by knowledge errors, and then overgeneralization reasoning errors. This further supports that visual reasoning is a key challenge in MicroVQA.

In summary, our contributions are as follows:
• We introduce MicroVQA, a benchmark for multimodal reasoning in biological microscopy. We defined three key tasks for scientific research – expert image understanding, hypothesis generation, and experiment proposal. Then, expert researchers curated 1,042 diverse VQA samples.
• We develop a novel two-stage MCQ generation pipeline to address the challenge of creating MCQs from long-format scientific questions without language shortcuts.
• We provide quantitative and qualitative benchmarking of frontier MLLMs, highlighting areas for improvement.

2. Related work

MLLM reasoning benchmarks. While perception focuses on identifying an image’s visual features, visual reasoning often integrates prior knowledge to derive new conclusions using logical inference strategies like deduction, induction, abduction, and analogical reasoning [81]. Reasoning in MLLMs has focused on image-text VQA [25, 44, 53, 81, 87, 88]. The most prominent examples in science are MMMU [87, 88], ScienceQA [53], and MathVista [54].
While they do test reasoning, they are derived from exams up to the college level, while our benchmark emphasizes research-level difficulty. Other popular multimodal benchmarks test commonsense inference [71, 90], spatial or relational awareness [35, 91] and integrating prior knowledge [56]. Another line of work is visual abductive reasoning [8, 32], or finding the most likely explanation for an observation from incomplete information; this relates closely to hypothesis generation in this work.

MLLMs in science. Many works consider multimodal VQA in scientific or medical domains like microscopy images [50], biological images [57], medical images [30, 34], and chemical structure drawings [43]. Still, they mostly test more straightforward perception and recognition. Figure comprehension for scientific articles does require more reasoning with multimodal data [41, 45–47, 61]; however, the images here are curated for publication, and the emphasis is on understanding content, rather than performing practical tasks. On the other hand, our benchmark includes realistic experiment images and tests more advanced reasoning like hypothesis generation and experiment planning. Scientific tasks have received much more attention in language-only LLMs [9, 15, 22, 37, 41, 58, 62, 63, 66, 67, 73, 80, 95], with GPQA being a notable work targeting PhD-level multi-step reasoning [62], but these do not test multimodal reasoning, which is important to many scientific areas [26]. In terms of models, most MLLMs in the biomedical domain are targeted at clinical tasks [34, 42, 59, 65, 75, 92].

Multiple-choice question construction. A challenge in creating exams for education research and benchmarks in machine learning is generating multiple-choice questions with quality distractors [4, 27].
Recent works have explored LLM-based generation methods, including zero-shot distractor prompting, few-shot prompting, LLM fine-tuning with high-quality examples, and prompting for known error types [10, 23, 24, 94]. However, a major challenge is that generated distractors are not sufficiently vision-centric – MLLMs can often select the correct answer from the distractors without access to the image [74]. While this is partly explained by language biases on truly vision-centric MCQs [7, 29], others likely do fail to test visual understanding due to poor MCQ construction [41] – this informs our RefineBot MCQ generation method in Sec. 4.2.

3. The MicroVQA benchmark

Here, we present the MicroVQA benchmark for visual-question-answering (VQA) which advances multimodal scientific reasoning by tackling two major evaluation challenges. First, defining tasks is challenging because scientific research encompasses many complex activities [33] – we therefore define three tasks important to biomedical discovery. Second, VQA samples cannot be readily sourced from existing exams or classification datasets – we therefore assembled a team of biologists to curate challenging reasoning problems. In Sec. 3.1, we provide an overview of the MicroVQA benchmark; Sec. 3.2 details the key tasks; and finally, Sec. 3.3 demonstrates that MicroVQA is at the forefront of measuring research-level reasoning in MLLMs. In the later Sec. 4, we describe our approach for generating multiple-choice questions.

Table 1. MicroVQA benchmark attributes.

Dataset feature                 Value
Total questions                 1,042
Multi-image questions           423
Avg. MCQ question length        66
Avg. MCQ answer length          15
Avg. raw question length        158
Avg. raw answer length          52
Unique image sets               255
Image modalities                Light, Fluoro, Electron
Image scales                    Tissue, Cell, Subcell, Atomic
Organisms                       31
Research areas                  33
Expert question creators        12
Time to create 1 question       30-40 mins
Time to quality check 1 MCQ     5 mins

Figure 2. MicroVQA taxonomy of sub-tasks.
(1) Expert visual understanding. 1.1 Compare image sets, e.g., "How is the mitochondrial morphology different in image 1 vs image 2?" 1.2 Identify abnormalities, e.g., "Are the nuclei unhealthy, and what about the shape tells you that?"
(2) Hypothesis generation. 2.1 Causal mechanisms, e.g., "What gene dysregulation could lead to the observed cytoplasm shape?" 2.2 Functional implications, e.g., "Given the unexpected localization of the centrosome, what will be the impact on liver function?"
(3) Experimental proposal. 3.1 Suggest new experiment, e.g., "What new experiment could I do to test if Gene Y is causing these problems?" 3.2 Address technical issues, e.g., "The staining did not target my objective, and the SNR was low. How can I improve it?"

3.1. Overview of MicroVQA

MicroVQA is a comprehensive dataset of 1,042 VQA triplets, manually curated by a team of expert biological researchers from diverse fields, and key attributes are summarized in Tab. 1. These expert-crafted questions test reasoning across three key tasks: expert understanding, hypothesis generation, and experiment proposal, and we additionally provide a taxonomy of sub-tasks in Fig. 2. The questions cover a broad spectrum, ranging from biological issues, such as “What is unusual about the cell shape?”, to technical imaging challenges like “Is this feature due to low signal-to-noise ratio (SNR) or is it real?”. We provide the ‘raw’ questions and answers written by experts, which are often long and nuanced, along with converted multiple-choice questions (MCQs) suitable for MLLM evaluation.

The images encompass the most common microscopy modalities used in human biology: brightfield, fluorescence, and electron microscopy. The sample types span the full range of microscopic scales – tissue, cellular, subcellular, and atomic – and emphasize organisms relevant to human biology and medically motivated tasks, namely human and mouse.
Over 60% of the samples have multiple images because comparison is essential in microscopy research, and because multichannel fluorescence images cannot be represented in standard RGB. All these attributes are provided as metadata tags to facilitate deeper error analysis, thereby enhancing MicroVQA’s value as a resource for advancing AI capabilities in scientific research.

3.2. MicroVQA scientific reasoning tasks

To define specific reasoning tasks aimed at advancing biomedical research, we established the following criteria: (1) the tasks should use image-text inputs and text outputs suitable for MLLMs, (2) they should require higher-order reasoning beyond simple image processing, and (3) they should emphasize core experimental activities such as experimentation, analysis, and action [33], rather than auxiliary tasks like literature review or writing. The tasks were developed through interviews with nine co-author PIs and postdocs specializing in microscopy research (Appendix D). We identified three essential capabilities that we now introduce, with examples shown in Sec. 3.

Expert visual understanding. After performing an experiment, scientists must interpret data within its experimental context. In biological microscopy, this involves identifying patterns like protein distributions or cell morphologies, while also assessing technical aspects such as artifacts and noise levels. The task requires MLLMs to perform anomaly detection and image comparison, going beyond simple perception in two key ways: analysis must consider sample preparation context, and expert knowledge is needed to evaluate biological features and technical artifacts.

Hypothesis generation. The next step is proposing mechanistic hypotheses to explain experimental data.
For example, when a drug treatment causes mitochondrial fragmentation in cells, a hypothesis might be that the drug disrupted calcium homeostasis, activating the DRP1 protein through a multi-step pathway, leading to fragmentation. This requires abductive reasoning [8, 20, 32, 81] as one must select from many possible hypotheses given incomplete information. The process requires the MLLM to integrate relevant domain knowledge with experimental context and visual features to reason about which hypothesis best explains the observations.

Table 2. Comparing scientific multimodal benchmarks close to MicroVQA in terms of reasoning level or difficulty beyond college level. We show difficulty level, scientific domain, dataset source, and size. Compared to others, MicroVQA either has higher difficulty level, or it has higher reasoning level (which is established in Fig. 3). Compared to others at the same reasoning level, namely MMMU-Pro, it has similar size.

Benchmark          Difficulty level   Domain       Source                   Size
MicroVQA           research           microscopy   expert-curated           1,042
MMSci [47]         research           science      paper figures            7,132
LabBench [41]      research           biology      webQA                    181
PathVQA [30]       graduate           pathology    textbooks                16.3k
OmnimedVQA∗ [34]   graduate           medical      classification dataset   127.9k
Microbench [50]    graduate           microscopy   classification dataset   17.2k
MMMU [87]          undergraduate      general      textbooks, webQA         11k
MMMU Pro [88]      undergraduate      general      MCQ dataset              1,730
Science QA [53]    grade-school       science      exams                    16.8k

Figure 3. Composition of scientific MLLM benchmarks regarding estimated Bloom’s taxonomy [11]. Higher levels are more cognitively challenging. MicroVQA has more questions at higher levels compared to other benchmarks, for example, MMMU [87] and ScienceQA [53], while perception-driven medical benchmarks like OmniMedVQA are at lower levels.
[Stacked bar chart, "Question Distribution of Bloom's Levels Across Benchmarks". Horizontal axis: proportion of dataset (%). Bars: ScienceQA, MMMU, MMMU-Pro, LabBench, MMSci, MicroBench (Perception), OmniMedVQA, PathVQA, MicroVQA, grouped by difficulty (grade-school, undergraduate, graduate, research). Legend, Bloom's levels: 1 - Recall, 2 - Comprehension, 3 - Application, 4 - Analysis, 5 - Evaluation.]

Experiment proposal. The final step is designing experiments to validate hypotheses. In microscopy, this often involves selecting appropriate assays and controls, requiring knowledge of suitable protocols and deductive reasoning about whether proposed experiments will provide evidence for or against the hypothesis. This task also may involve adjusting a particular experiment to address technical issues, like how to ensure a good signal-to-noise ratio, and this requires both image understanding and knowledge of protocols.

In Appendix F.7.1, we show an example question in each task with an expert-annotated ‘reasoning trace’ – it shows the type of reasoning required for typical questions.

3.3. Analysis of MicroVQA benchmark

Scientific discovery requires complex reasoning beyond basic perception and knowledge, and it requires expertise beyond school-level education. Here we show that MicroVQA addresses a gap: prior multimodal scientific benchmarks do not test high-level reasoning at research-level difficulty.

To assess reasoning levels in MLLM benchmarks, we use Bloom’s taxonomy [11], which hierarchically classifies cognitive difficulty from recall to evaluation.
We fine-tune GPT-4o to classify Bloom’s taxonomy on related MLLM benchmarks with results in Fig. 3; in Tab. 2 we additionally show key attributes of the same benchmarks. Fig. 3 shows that benchmarks having research-level and graduate-level difficulty are dominated by lower reasoning questions (levels 1-2), while our MicroVQA has higher level reasoning (levels 3-4). This is not surprising since OmnimedVQA and MicroBench derive questions from prior classification datasets, while our task definition and data collection approach does reflect high-level reasoning.

On the other hand, some benchmarks have higher Bloom’s reasoning level closer to MicroVQA – MMMU, MMMU-Pro, and ScienceQA – but their difficulty level is lower – undergraduate or grade-school. The most comparable benchmark for reasoning is the undergraduate-level MMMU-Pro, and its dataset size is 1,730 compared to our 1,042: this shows that for reasoning-intensive benchmarks, sizes beyond 1,000 are very high.

4. MCQ generation process

4.1. Creation of raw VQA samples by experts

Expert generation of raw VQA samples. Creating VQA triples suitable for research-level reasoning tasks is time-consuming and requires expertise. We recruited 12 human experts to each create approximately 90 VQA samples, taking about 30–40 minutes per sample. Each submission included an image set, a question, a detailed response, and contextual information like experimental context and motivation. They are raw VQA triples, (v0, q0, a0) – ‘raw’ means the question and answer can be much longer and more detailed than typical VQA benchmarks. This allowed expert annotators to include enough detail to match what is a valuable input and response for an MLLM. Appendix E shows materials that prompt question creators. The images were sourced from the contributor’s own experiments, image databases, or research papers published since 2024, and all image licenses permit redistribution.
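
As a rough check on the curation effort, the figures quoted above (12 experts, approximately 90 samples each, 30–40 minutes per sample) can be turned into a total-hours range. The short Python calculation below is only back-of-the-envelope arithmetic on those quoted numbers; the function name `expert_hours` is an illustrative assumption.

```python
def expert_hours(n_experts: int, samples_per_expert: int, minutes_per_sample: float) -> float:
    """Total expert time in hours for creating raw VQA samples."""
    return n_experts * samples_per_expert * minutes_per_sample / 60


# Numbers quoted in Sec. 4.1: 12 experts, ~90 samples each, 30-40 minutes per sample.
low = expert_hours(12, 90, 30)   # 540.0 hours
high = expert_hours(12, 90, 40)  # 720.0 hours
print(f"Estimated creation effort: {low:.0f}-{high:.0f} hours")
```

The resulting range of 540 to 720 hours is consistent with the "~600 h" estimate of expert time shown in Fig. 4.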
Quality control. We had three quality control criteria: questions should be challenging and not solvable with undergraduate-level knowledge; they should be aligned with the defined tasks; and there should be diversity in image content. Questions were reviewed against these criteria for each contributor, and feedback was provided for further VQA samples. These review rounds occurred after submitting the first 10 questions, and the first 40 questions.

4.2. MCQ generation from raw VQA samples

To benchmark MLLMs, we transform raw expert-created VQA samples to multiple-choice questions (MCQs) and propose new methods for MCQ generation.

Motivation: naive MCQ generation is inadequate. The standard approach for VQA benchmarks is to zero-shot prompt an LLM with raw VQA samples [4], but this has two problems. The first is that generated MCQs do not follow established educational MCQ design principles (Appendix E.2), likely because the raw VQA questions and answers are often long and variable. Our stage 1 ‘exam alignment’, which we describe in the next section, addresses this challenge. The second challenge is that generated MCQs do not properly test MLLM capabilities. GPT-4o scores 93% on these MCQs even without the image, despite most of the dataset questions clearly requiring image analysis, and we investigate this by inspecting the chain-of-thought outputs in Fig. 4 and Appendix E.2. Many distractors can be easily eliminated based on general biology knowledge, or because they are too vague compared to the correct option. We hypothesize that MCQ and distractor generation is challenging when the target answer string is long, and where the subject matter is specialized, suggesting that other scientific benchmarks may encounter similar issues. Overall, this motivates a more involved approach to distractor generation.
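
The no-image check described above can be sketched in a few lines of Python: answer every MCQ from text alone and compare the accuracy with random guessing. This is an illustrative sketch rather than the authors' evaluation code. The `TextMCQ` layout and the function names are assumptions, and `pick_longest` is a deliberately crude stand-in for a text-only LLM that exploits a single shortcut, namely a correct option that is longer and more specific than its distractors.

```python
from typing import Callable, Sequence, Tuple

# Assumed layout: (question stem, answer options, index of the correct option).
# No image is passed, so any accuracy above chance comes from text alone.
TextMCQ = Tuple[str, Sequence[str], int]


def no_image_accuracy(mcqs: Sequence[TextMCQ],
                      answer_text_only: Callable[[str, Sequence[str]], int]) -> float:
    """Fraction of MCQs answered correctly without access to the image."""
    correct = sum(answer_text_only(stem, options) == gold for stem, options, gold in mcqs)
    return correct / len(mcqs)


def chance_accuracy(mcqs: Sequence[TextMCQ]) -> float:
    """Expected accuracy of uniform random guessing over each question's options."""
    return sum(1 / len(options) for _, options, _ in mcqs) / len(mcqs)


def pick_longest(stem: str, options: Sequence[str]) -> int:
    """Toy shortcut solver (hypothetical): always choose the longest option."""
    return max(range(len(options)), key=lambda i: len(options[i]))


if __name__ == "__main__":
    toy = [
        ("Where is the protein localized?",
         ["Nucleus", "Mainly nuclear with weaker, variable cytoplasmic signal", "Membrane"], 1),
        ("What explains the fragmentation?",
         ["Drug-induced calcium imbalance activating DRP1", "Noise"], 0),
    ]
    gap = no_image_accuracy(toy, pick_longest) - chance_accuracy(toy)
    print(f"Accuracy above chance without images: {gap:.2f}")
```

A large gap between the two numbers, as with the 93% no-image score reported above, indicates that the question set can be solved through language shortcuts.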
Stage 1: Exam-style MCQ generation. Here we ensured that multiple-choice questions (MCQs) conformed to established design principles for biomedical exams (Fig. 4). This effort was led by a physician-scientist co-author trained in medical board exam item writing, who reviewed educational literature on MCQ design [64] and Bloom’s Taxonomy for assessing cognitive skills in biomedical exams [5, 19, 89]. We began by manually transforming 50 user-submitted raw question-answer pairs, (q0, a0), into one-best-answer MCQs with k distractors, (q1, a1, d1). These ‘Gold Standard’ MCQs retained the original questions’ meanings while strictly adhering to NBME standards [64] and minimizing cues that enable test-wise guessing. We used these gold-standard MCQs in a supervised learning setup to develop a general LLM prompt, p, that maps raw question-answer pairs to exam-style MCQs. We leverage the DSPy framework [39, 40], which optimizes p to match the gold-standard MCQ outputs. Additionally, the optimization process incorporated supervision from LLM-based quality metrics: content similarity, NBME-aligned formatting, and the absence of extraneous clues. Further details are in Appendix E.2.

Figure 4. Constructing the MicroVQA multiple choice questions. (0) We defined tasks with domain biological scientists and created 1,061 raw VQA samples. (1) The raw samples were aligned to an exam-style MCQ by manually transforming a small set and optimizing an LLM prompt to match that alignment. (2) MCQs are further improved using RefineBot, a new iterative method to make MCQs more challenging. The lower panel shows an example MCQ from raw VQA to final: the annotations highlight key changes that we further explore in Appendix E.2, where red indicates issues, and green indicates good attributes.
[Upper panel, pipeline diagram. (0) Raw VQA creation: define taxonomy with experts, question collection, review and feedback; estimated expert time ~600 h. (1) Exam-style MCQ generation: supervised prompt optimization (DSPy) from gold-standard MCQs, generated MCQs, and metrics, yielding an optimized prompt, followed by expert review. (2) Agent-based MCQ refinement (RefineBot): evaluator + reflector, rewriter, and checker agents with feedback, turning easy MCQs into hard ones.]
[Lower panel, example. Annotation labels in the figure include: long answer; easy distractor: too vague; underspecified compared to answer; unnecessary detail; justification to answer; preserves QA objective; removes unnecessary detail; answer more concise; hard distractor: plausible and subtle.]
Raw question: "The image shows SiHa cells that have been incubated with an antibody against a target protein (green channel) and imaged using fluorescence microscopy. What can you tell about the subcellular location and expression levels of the target protein based on this image?"
Raw answer: "The target protein is mainly expressed in the nucleus, and to a lesser degree, the cytoplasm. Its expression appears to differ between the cells, as there is single-cell variation in the signal intensity both in the nucleus and cytosol."
MCQ stage 1: "A fluorescence microscopy image of SiHa cells incubated with an antibody against a target protein (green channel) was analyzed. Based on the image, what can be inferred about the subcellular localization and expression levels of the target protein? A) The target protein is exclusively localized in the nucleus with uniform expression across all cells. B) The target protein is predominantly expressed in the cytoplasm with variable expression levels across cells. C) The target protein is mainly expressed in the nucleus and less in the cytoplasm, with single-cell variation. D) The target protein is evenly distributed throughout the cell with consistent expression levels in all cells." Correct answer: C
MCQ stage 2: "A fluorescence microscopy image showing eukaryotic cells with a target protein highlighted in the green channel was reviewed. From the visual data, what can be inferred about the localization and expression levels of the protein within these cells? A) The target protein is primarily observed in the nuclear region with limited cytoplasmic presence, with fluctuations in expression between cells. B) The protein exhibits predominant localization across the cell surface, maintaining uniform intensity in all cells. C) A notable cytoplasmic accumulation of the protein with consistent signal levels is observed across all cells. D) The protein is detected evenly throughout both the cytoplasm and nucleus but shows significant variability in the cellular signal intensity. E) The fluorescence strongly indicates localization in the cytoplasmic matrix, with a concurrent uniformity in expression intensity among the observed cells." Correct answer: A

Stage 2: RefineBot question refinement. Although Stage 1 produces exam-aligned MCQs, many remain easily solvable due to language shortcuts (see Appendix D). To enhance difficulty, we introduce RefineBot, a method that increases MCQ complexity (Fig. 4). The key idea is that weaknesses in MCQ construction are revealed by the chain-of-thought (CoT) answers. The first LLM agent, the evaluator/reflector, answers the MCQ with CoT and then reflects on the strategies used. For example, a question about processes inside a virus might have a distractor referring to surface processes – this can be eliminated without referencing the image. The LLM summarizes its solution strategy into a reflection text: in this case, that the distractor was implausible based on knowledge that the stated process does not occur inside the virus, and so does not match the question. It passes the reflection to the rewriter LLM agent. The rewriter revises the question stem and generates new distractors to invalidate the identified strategies: in this example it may create distractors referring to processes that do occur inside a virus.
To prevent significant changes to the question-answer pair over iterations, we employ an LLM checker to ensure semantic consistency with the original pair. The revised MCQ is returned to the evaluator; if it still answers correctly, the loop continues. The process stops if the rewriter fails the checker or after n iterations. If it fails, RefineBot can be rerun with a different seed, often succeeding, which is a form of inference-time scaling [13]. All agents are GPT-4o-0806, but to mitigate potential bias against GPT-4o in final evaluation, we also use Claude-3.5-Sonnet-0620 as the evaluator agent; we assess biases in the experiments section.

Final MCQ quality check. Since MCQ generation involves LLM processing, we need to verify the correctness of the final answer. Each MCQ is manually reviewed by the same expert who created that particular question (between 80 and 120 per person). For any question with issues, that expert makes minimal changes to the question to correct it.

5. Experiments

5.1. Benchmarking MLLMs with MicroVQA

We evaluate state-of-the-art multimodal large language models (MLLMs) on the MicroVQA benchmark. We include open and closed models spanning the categories: reasoning, large, small, and medical – medical is the closest domain to microscopy with specialist MLLMs. We utilize standard chain-of-thought prompting [88] (details in Appendix F). We report mean accuracy in multiple-choice VQA for each task in Tab. 3. We also perform and analyze a no-image ablation in Appendix F.5.

MicroVQA is challenging for all MLLMs. Our evaluation reveals a substantial gap between current MLLMs and the upper bound, with the leading model, o1, at 52.8. There is surprisingly little variation in performance between models: most closed and open models score above 40, and the lowest score (Llama-3.2-11b) is likely due to instruction-following issues.
A human baseline (Appendix F.3) scores only 50.3 – this shows that biology experts are specialized to subdomains, and are challenged by different subdomains. In fact, this demonstrates that experts could benefit from MLLMs that can solve MicroVQA’s tasks. The performance variation across the 3 tasks is small – the highest score is 56.4 on ‘expert visual understanding (V)’, compared to the highest overall of 52.8. The ‘hypothesis generation’ task is the hardest for all models and the gap is strongest for smaller models. For example, the gap between hypothesis generation and expert visual understanding is 3.7 for the strongest large model, but 10.7 for the strongest small model.

Smaller models are surprisingly competitive. For all models with a lower-parameter equivalent – Gemini-1.5, QwenVL, VILA, and Llama – the drop in performance due to size is less than 3 for all except Llama. Moreover, these drops are smaller than on other multimodal reasoning benchmarks like MMMU, where, for example, Gemini drops by 9.6 between Pro and Flash-8B [87]. Typically, smaller models have the same vision encoder, but a smaller LLM [79]. Since the size of the language model has a small impact, we hypothesize that for the ‘solved’ part of MicroVQA, the language-only reasoning is relatively simple compared to domains like math that require multi-step logic. This suggests that other aspects of MLLM reasoning are more challenging, like multimodal reasoning. If true, this suggests that future work could focus on stronger image representations, and this idea is supported by the qualitative error analysis in Sec. 5.2.

Specialist biomedical training does improve performance. While no specialist MLLMs have been developed

Table 3. VQA accuracy on MicroVQA by task: expert visual understanding (V), hypothesis generation (H), experiment proposal (E). Models marked ∗ were used in MCQ generation, which may affect comparative performance (see Sec. 5.1).
The model † is the base LLaVA for LLaVA-Med. Best values are bolded.

Model                                      Overall   V      H      E
Reasoning
  o1 [36]                                  52.8      55.4   50.2   53.0
Large models
  ∗Claude-3.5-Sonnet [6]                   51.7      54.1   50.2   50.4
  Gemini-Pro-1.5 [69]                      51.1      52.0   50.2   50.9
  Pixtral-Large [2]                        49.8      50.8   49.5   48.7
  Grok-2-Vision [84]                       48.4      50.3   46.4   48.7
  Qwen-2-vl-72b-Instruct [79]              47.5      49.2   45.7   47.8
  VILA1.5-40b [48]                         47.5      47.2   47.9   47.4
  ∗GPT-4o [1]                              45.6      48.7   43.1   44.8
  Llama-3.1-Nemotron-70b-Instruct [83]     44.2      44.9   43.3   44.8
  Llama-3.2-90b-Vision-Instruct [21]       42.4      44.9   42.1   38.7
Small models
  Qwen-2-VL-7b [79]                        48.8      54.1   43.3   49.6
  Claude-3.5-Haiku [6]                     47.1      48.0   43.8   51.7
  Gemini-Flash-1.5-8b [69]                 46.7      48.7   43.6   49.1
  GPT-4o-mini [1]                          46.2      48.5   43.6   47.0
  Pixtral-12b [2]                          45.6      46.9   44.8   44.8
  VILA1.5-13b [48]                         41.8      41.8   47.5   40.9
  Llama-3.2-11b-vision-instruct [21]       30.3      32.4   29.3   28.7
Medical
  LLaVA-Med-Mistral-7B [42]                43.0      37.3   47.1   41.6
  †LLaVA-Mistral-7B [49]                   39.8      31.6   43.1   37.1
Random                                     22.0      21.9   21.8   21.9
Human                                      50.3      52.7   47.5   51.4

Table 4. Ablation study on MicroVQA MCQ generation stages (shown in Fig. 4). Accuracy is high because MCQs have shortcuts (Sec. 4.1) after ‘Stage 1’ exam alignment, but is lower after ‘Stage 2’. The final column is the relative decrease in accuracy. Models marked ∗ were used in Stage 2 generation and have the biggest accuracy drops (bolded). They are grouped with different models from the same provider, marked ∗∗.

Model                  Stage 1   Stage 2   Relative drop
∗GPT-4o                79.7      46.8      -41.2
∗∗GPT-4o-mini          75.6      46.2      -39.0
∗Claude-3.5-Sonnet     82.2      51.7      -37.1
∗∗Claude-3.5-Haiku     77.3      47.1      -39.0
o1                     81.6      52.8      -35.3
Pixtral-Large          80.1      49.8      -37.8
Gemini-Pro-1.5         79.1      51.1      -35.4

for microscopy research in particular, LLaVA-Med is fine-tuned on scientific articles from PubMed that include data types overlapping with MicroVQA – especially (tissue) pathology. In Tab. 3, we compare LLaVA-Med against its base model, LLaVA-Mistral-7B, and find overall stronger performance by 4.5 points. This suggests two clear opportunities for improving MicroVQA performance.
The first is to instruct-tune with a dataset that better aligns with the MicroVQA data domain: not only tissue pathology, but also fluorescence and electron microscopy, which are available in scientific articles [93]. The second is to simply begin training from a stronger open model like Pixtral. Beyond that, there is more to explore in MLLM specialization, such as tuning the base image encoder for microscopy [51].

[Figure 5 plots: accuracy of Gemini-1.5-Pro, VILA1.5-40B, and LLaVA-Med-Mistral-7B by Bloom’s level (2 to 5) and by sub-task (1.1 Compare image sets, 1.2 Identify abnormalities, 2.1 Causal mechanisms, 2.2 Functional implications, 3.1 Suggest experiment, 3.2 Address technical issue).]

Figure 5. Performance by sub-task and Bloom’s level for best models: Gemini-1.5-Pro (closed source), VILA1.5-40B (open-source), and LLaVA-Med-Mistral-7B (medical).

RefineBot MCQ generation is very effective but introduces small model biases. We introduced a new method for refining MCQs to make them more challenging – RefineBot. Tab. 4 shows that between stage 1 and stage 2 (before and after RefineBot), accuracy drops for all models by between 35% and 42% in relative terms. Firstly, this validates that RefineBot is a valuable tool for making benchmarks harder. However, the most significant drops are for models used by RefineBot (GPT-4o & Claude-3.5-Sonnet, and their smaller versions, GPT-4o-mini & Claude-3.5-Haiku). The adversarial process introduced a small bias against these models (and we hypothesise the bias would be higher if using only one model instead of two). Despite making fair evaluation slightly more challenging, our use of frontier LLMs in the refinement process is well-motivated, as these models are best positioned to generate challenging, high-quality MCQs that probe the boundaries of current capabilities. Moreover, the bias seems modest – less than 10 points of relative drop – while the benefit is a huge increase in difficulty across all tested MLLMs, as seen in Tab. 4.
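As a worked check of the relative-drop column, the sketch below recomputes it from the rounded accuracies in Tab. 4, assuming the column is defined as (Stage 2 − Stage 1) / Stage 1 expressed in percent. That definition is an assumption, and small deviations from the reported values are expected because the inputs are already rounded. The names `TAB4`, `relative_drop`, and `bias_gap` are illustrative only.

```python
# Rows copied from Tab. 4:
# (model, Stage 1 accuracy, Stage 2 accuracy, reported relative drop,
#  True if the model was used in RefineBot or comes from the same provider)
TAB4 = [
    ("GPT-4o",            79.7, 46.8, -41.2, True),
    ("GPT-4o-mini",       75.6, 46.2, -39.0, True),
    ("Claude-3.5-Sonnet", 82.2, 51.7, -37.1, True),
    ("Claude-3.5-Haiku",  77.3, 47.1, -39.0, True),
    ("o1",                81.6, 52.8, -35.3, False),
    ("Pixtral-Large",     80.1, 49.8, -37.8, False),
    ("Gemini-Pro-1.5",    79.1, 51.1, -35.4, False),
]


def relative_drop(stage1: float, stage2: float) -> float:
    """Relative change in accuracy from Stage 1 to Stage 2, in percent (assumed definition)."""
    return 100.0 * (stage2 - stage1) / stage1


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Recompute the column and compare it with the reported values.
recomputed = {name: relative_drop(s1, s2) for name, s1, s2, _, _ in TAB4}
max_abs_diff = max(abs(recomputed[name] - reported) for name, _, _, reported, _ in TAB4)

# Average drop for RefineBot-related models versus the remaining models.
mean_related = mean(recomputed[name] for name, _, _, _, related in TAB4 if related)
mean_other = mean(recomputed[name] for name, _, _, _, related in TAB4 if not related)
bias_gap = mean_related - mean_other  # negative: related models drop more on average

for name, value in recomputed.items():
    print(f"{name:<20} {value:6.1f}")
print(f"mean gap (related - other): {bias_gap:.1f}")
```

Comparing group means is only one way to summarize the bias; the per-model differences are what the text refers to when it bounds the effect at less than 10 points.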
Metadata-based analysis supports deeper error analysis. Fig. 5 shows accuracy broken down by attribute tag. In Appendix F we explore results further – namely that multi-image reasoning has surprisingly good performance, higher-level Bloom’s questions are harder, and (not shown) rare image modalities perform worse.

5.2. Error analysis on MicroVQA

To understand model errors, three biomedical experts reviewed the chain-of-thought (CoT) responses of 30 samples on Claude 3.5 Sonnet, with each review requiring at least 45 minutes of careful analysis. Errors were classified into a major category, and minor contributing errors were also noted. Expert perception errors accounted for 50% of errors, 30% were knowledge errors, and 13% were due to overgeneralization; the final 7% were text hallucination or general reasoning errors. Appendix F.6 shows examples for the most common types. We expand on this analysis with an automated version across the complete dataset in Appendix F.7.

Expert perception errors occurred when the model misinterpreted visual features, leading to incorrect inferences. For example, in Fig. 13, the model judges the small electron-dense objects as ribosomes, rather than stress granules – ribosomes tend to be smaller, have a more regular shape, and lower contrast. Language bias may have contributed to the preference for ribosomes since they are very commonly studied in EM literature available to LLM training. Overall, perception errors were the dominant error type (50%), which suggests that future methods should improve vision representations in MLLMs for microscopy.

Knowledge errors highlight gaps in nuanced biomedical knowledge. One example in Fig. 17 deals with understanding how protein localization changes during cell signaling. An experiment gives information about a transmembrane signaling protein, and the question asks to interpret the image and explain the function in these cells.
The model correctly perceives a punctate distribution, but assumes the role of coordinating intracellular vesicle trafficking. This represents a shallow interpretation that could have been improved with specialized biomedical knowledge. These errors could be mitigated by improving knowledge in MLLMs, either in training or as a database at inference.

Overgeneralization errors (and simplification errors) reflect the model’s tendency to apply broad scientific principles without regard for specific context – arguably they are reasoning errors. In one example in Fig. 21, the question is to compare the risk of malignancy of a tumor, schwannoma, to other nerve sheath tumors. During reasoning, the model proposes a simplified question – what is true about schwannoma – which it answers while ignoring the comparison.

6. Conclusion

MLLMs hold great potential for advancing scientific discovery. We aim to inspire the development of broadly applicable scientific AI systems. Future directions include training models for more robust image representations, integrating knowledge bases, exploring open evaluation methods, and studying LLMs’ reasoning over hypotheses. Beyond microscopy, we hope our benchmark serves as a blueprint for AI evaluation in expert-driven domains. More generally, we aspire for different application domains – biomedicine, chemistry, materials science – to share methods towards the common goal of building scientific AI systems.

References

[1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
[2] Pravesh Agrawal, Szymon Antoniak, Emma Bou Hanna, Devendra Chaplot, Jessica Chudnovsky, Saurabh Garg, Theophile Gervet, Soham Ghosh, Amélie Héliou, Paul Jacob, et al. Pixtral 12b. arXiv preprint arXiv:2410.07073, 2024.
[3] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35:23716–23736, 2022.
[4] Elaf Alhazmi, Quan Z Sheng, Wei Emma Zhang, Munazza Zaib, and Ahoud Alhazmi. Distractor generation for multiple-choice questions: A survey of methods, datasets, and evaluation. arXiv preprint arXiv:2402.01512, 2024.
[5] Lorin W. Anderson and David R. Krathwohl, editors. A Taxonomy for Learning, Teaching, and Assessing. A Revision of Bloom’s Taxonomy of Educational Objectives. Allyn & Bacon, New York, 2 edition, 2001.
[6] Anthropic. The claude 3 model family: Opus, sonnet, haiku. Model card, Anthropic, 2024.
[7] Nishant Balepur, Abhilasha Ravichander, and Rachel Rudinger. Artifacts or abduction: How do llms answer multiple-choice questions without the question? arXiv preprint arXiv:2402.12483, 2024.
[8] Chandra Bhagavatula, Ronan Le Bras, Chaitanya Malaviya, Keisuke Sakaguchi, Ari Holtzman, Hannah Rashkin, Doug Downey, Scott Wen-tau Yih, and Yejin Choi. Abductive commonsense reasoning. arXiv preprint arXiv:1908.05739, 2019.
[9] Owen Bianchi, Maya Willey, Chelsea X Alvarado, Benjamin Danek, Marzieh Khani, Nicole Kuznetsov, Anant Dadu, Syed Shah, Mathew J Koretsky, Mary B Makarious, et al. Cardbiomedbench: A benchmark for evaluating large language model performance in biomedical research. bioRxiv, pages 2025–01, 2025.
[10] Semere Kiros Bitew, Johannes Deleu, Chris Develder, and Thomas Demeester. Distractor generation for multiple-choice questions with predictive prompting and large language models. arXiv preprint arXiv:2307.16338, 2023.
[11] Benjamin S. Bloom. Taxonomy of Educational Objectives: The Classification of Educational Goals. Handbook I: The Cognitive Domain. David McKay Co Inc, New York, 1956.
[12] Daniil A Boiko, Robert MacKnight, Ben Kline, and Gabe Gomes. Autonomous chemical research with large language models. Nature, 624(7992):570–578, 2023.
[13] Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V Le, Christopher Ré, and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling. arXiv preprint arXiv:2407.21787, 2024.
[14] James Burgess, Jeffrey J Nirschl, Maria-Clara Zanellati, Alejandro Lozano, Sarah Cohen, and Serena Yeung-Levy. Orientation-invariant autoencoders learn robust representations for shape profiling of cells and organelles. Nature Communications, 15(1):1022, 2024.
[15] Hengxing Cai, Xiaochen Cai, Junhan Chang, Sihang Li, Lin Yao, Changxin Wang, Zhifeng Gao, Yongge Li, Mujie Lin, Shuwen Yang, et al. Sciassess: Benchmarking llm proficiency in scientific literature analysis. arXiv preprint arXiv:2403.01976, 2024.
[16] Anne E Carpenter, Beth A Cimini, and Kevin W Eliceiri. Smart microscopes of the future. nature methods, 20(7):962–964, 2023.
[17] Nathan H Cho, Keith C Cheveralls, Andreas-David Brunner, Kibeom Kim, André C Michaelis, Preethi Raghavan, Hirofumi Kobayashi, Laura Savy, Jason Y Li, Hera Canaj, et al. Opencell: Endogenous tagging for the cartography of human cellular organization. Science, 375(6585):eabi6983, 2022.
[18] Kenneth Clark, Bruce Vendt, Kirk Smith, John Freymann, Justin Kirby, Paul Koppel, Stephen Moore, Stanley Phillips, David Maffitt, Michael Pringle, et al. The cancer imaging archive (tcia): maintaining and operating a public information repository. Journal of digital imaging, 26:1045–1057, 2013.
[19] Alison Crowe, Clarissa Dirks, and Mary Pat Wenderoth. Biology in bloom: implementing bloom’s taxonomy to enhance student learning in biology. CBE Life Sci. Educ., 7(4):368–381, 2008.
[20] Igor Douven. Abduction, 2011.
[21] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
[22] Kehua Feng, Keyan Ding, Weijie Wang, Xiang Zhuang, Zeyuan Wang, Ming Qin, Yu Zhao, Jianhua Yao, Qiang Zhang, and Huajun Chen. Sciknoweval: Evaluating multi-level scientific knowledge of large language models. arXiv preprint arXiv:2406.09098, 2024.
[23] Wanyong Feng, Jaewook Lee, Hunter McNichols, Alexander Scarlatos, Digory Smith, Simon Woodhead, Nancy Otero Ornelas, and Andrew Lan. Exploring automated distractor generation for math multiple-choice questions via large language models. arXiv preprint arXiv:2404.02124, 2024.
[24] Nigel Fernandez, Alexander Scarlatos, Simon Woodhead, and Andrew Lan. Divert: Distractor generation with variational errors represented as text for math multiple-choice questions. arXiv preprint arXiv:2406.19356, 2024.
[25] Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023.
[26] Shanghua Gao, Ada Fang, Yepeng Huang, Valentina Giunchiglia, Ayush Noori, Jonathan Richard Schwarz, Yasha Ektefaie, Jovana Kondic, and Marinka Zitnik. Empowering biomedical discovery with ai agents. arXiv preprint arXiv:2404.02831, 2024.
[27] Mark J Gierl, Okan Bulut, Qi Guo, and Xinxin Zhang. Developing, analyzing, and using distractors for multiple-choice tests in education: A comprehensive review. Review of educational research, 87(6):1082–1116, 2017.
[28] Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Anil Palepu, Petar Sirkovic, Artiom Myaskovsky, Felix Weissenberger, Keran Rong, Ryutaro Tanno, et al. Towards an ai co-scientist. arXiv preprint arXiv:2502.18864, 2025.
[29] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017.
[30] Xuehai He, Yichen Zhang, Luntian Mou, Eric Xing, and Pengtao Xie. Pathvqa: 30000+ questions for medical visual question answering. arXiv preprint arXiv:2003.10286, 2020.
[31] Marco Y Hein, Duo Peng, Verina Todorova, Frank McCarthy, Kibeom Kim, Chad Liu, Laura Savy, Camille Januel, Rodrigo Baltazar-Nunez, Madhurya Sekhar, et al. Global organelle profiling reveals subcellular localization and remodeling at proteome scale. Cell, 188(4):1137–1155, 2025.
[32] Jack Hessel, Jena D Hwang, Jae Sung Park, Rowan Zellers, Chandra Bhagavatula, Anna Rohrbach, Kate Saenko, and Yejin Choi. The abduction of sherlock holmes: A dataset for visual abductive reasoning. In European Conference on Computer Vision, pages 558–575. Springer, 2022.
[33] Tom Hope, Doug Downey, Daniel S Weld, Oren Etzioni, and Eric Horvitz. A computational inflection for scientific discovery. Communications of the ACM, 66(8):62–73, 2023.
[34] Yutao Hu, Tianbin Li, Quanfeng Lu, Wenqi Shao, Junjun He, Yu Qiao, and Ping Luo. Omnimedvqa: A new large-scale comprehensive evaluation benchmark for medical lvlm. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22170–22183, 2024.
[35] Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700–6709, 2019.
[36] Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card.
arXiv preprint arXiv:2412.16720, 2024.
[37] Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William W Cohen, and Xinghua Lu. Pubmedqa: A dataset for biomedical research question answering. arXiv preprint arXiv:1909.06146, 2019.
[38] Yeonghun Kang and Jihan Kim. ChatMOF: an artificial intelligence system for predicting and generating metal-organic frameworks using large language models. Nat. Commun., 15(1):4705, 2024.
[39] Omar Khattab, Keshav Santhanam, Xiang Lisa Li, David Hall, Percy Liang, Christopher Potts, and Matei Zaharia. Demonstrate-search-predict: Composing retrieval and language models for knowledge-intensive NLP. arXiv preprint arXiv:2212.14024, 2022.
[40] Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts. Dspy: Compiling declarative language model calls into self-improving pipelines. arXiv preprint arXiv:2310.03714, 2023.
[41] Jon M Laurent, Joseph D Janizek, Michael Ruzo, Michaela M Hinks, Michael J Hammerling, Siddharth Narayanan, Manvitha Ponnapati, Andrew D White, and Samuel G Rodriques. Lab-bench: Measuring capabilities of language models for biology research. arXiv preprint arXiv:2407.10362, 2024.
[42] Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, and Jianfeng Gao. Llava-med: Training a large language-and-vision assistant for biomedicine in one day. Advances in Neural Information Processing Systems, 36, 2024.
[43] Junxian Li, Di Zhang, Xunzhi Wang, Zeying Hao, Jingdi Lei, Qian Tan, Cai Zhou, Wei Liu, Yaotian Yang, Xinrui Xiong, et al. Chemvlm: Exploring the power of multimodal large language models in chemistry area. CoRR, 2024.
[44] Lin Li, Guikun Chen, Hanrong Shi, Jun Xiao, and Long Chen. A survey on multimodal benchmarks: In the era of large ai models.
arXiv preprint arXiv:2409.18142, 2024.
[45] Lei Li, Yuqi Wang, Runxin Xu, Peiyi Wang, Xiachong Feng, Lingpeng Kong, and Qi Liu. Multimodal arxiv: A dataset for improving scientific comprehension of large vision-language models. arXiv preprint arXiv:2403.00231, 2024.
[46] Shengzhi Li and Nima Tajbakhsh. Scigraphqa: A large-scale synthetic multi-turn question-answering dataset for scientific graphs. arXiv preprint arXiv:2308.03349, 2023.
[47] Zekun Li, Xianjun Yang, Kyuri Choi, Wanrong Zhu, Ryan Hsieh, HyeonJung Kim, Jin Hyuk Lim, Sungyoung Ji, Byungju Lee, Xifeng Yan, et al. Mmsci: A multimodal multi-discipline dataset for phd-level scientific comprehension. arXiv preprint arXiv:2407.04903, 2024.
[48] Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Mohammad Shoeybi, and Song Han. Vila: On pre-training for visual language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26689–26699, 2024.
[49] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024.
[50] Alejandro Lozano, Jeffrey Nirschl, James Burgess, Sanket Rajan Gupte, Yuhui Zhang, Alyssa Unell, and Serena Yeung. Micro-bench: A microscopy benchmark for vision-language understanding. Advances in Neural Information Processing Systems, 37:30670–30685, 2024.
[51] Alejandro Lozano, Min Woo Sun, James Burgess, Liangyu Chen, Jeffrey J Nirschl, Jeffrey Gu, Ivan Lopez, Josiah Aklilu, Austin Wolfgang Katzer, Collin Chiu, et al. Biomedica: An open biomedical image-caption archive, dataset, and vision-language models derived from scientific literature. arXiv preprint arXiv:2501.07171, 2025.
[52] Ming Y Lu, Bowen Chen, Drew F K Williamson, Richard J Chen, Melissa Zhao, Aaron K Chow, Kenji Ikemura, Ahrong Kim, Dimitra Pouli, Ankush Patel, Amr Soliman, Chengkuan Chen, Tong Ding, Judy J Wang, Georg Gerber, Ivy Liang, Long Phi Le, Anil V Parwani, Luca L Weishaupt, and Faisal Mahmood. A multimodal generative AI copilot for human pathology. Nature, 634(8033):466–473, 2024.
[53] Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 35:2507–2521, 2022.
[54] Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255, 2023.
[55] Andres M. Bran, Sam Cox, Oliver Schilter, Carlo Baldassari, Andrew D White, and Philippe Schwaller. Augmenting large language models with chemistry tools. Nature Machine Intelligence, pages 1–11, 2024.
[56] Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/cvf conference on computer vision and pattern recognition, pages 3195–3204, 2019.
[57] M Maruf, Arka Daw, Kazi Sajeed Mehrab, Harish Babu Manogaran, Abhilash Neog, Medha Sawhney, Mridul Khurana, James P Balhoff, Yasin Bakis, Bahadir Altintas, et al. Vlm4bio: A benchmark dataset to evaluate pretrained vision-language models for trait discovery from biological images. arXiv preprint arXiv:2408.16176, 2024.
[58] Adrian Mirza, Nawaf Alampara, Sreekanth Kunchapu, Benedict Emoekabu, Aswanth Krishnan, Mara Wilhelmi, Macjonathan Okereke, Juliane Eberhardt, Amir Mohammad Elahi, Maximilian Greiner, et al.
Are large language models superhuman chemists? arXiv preprint arXiv:2404.01475, 2024.
[59] Michael Moor, Oishi Banerjee, Zahra Shakeri Hossein Abad, Harlan M Krumholz, Jure Leskovec, Eric J Topol, and Pranav Rajpurkar. Foundation models for generalist medical artificial intelligence. Nature, 616(7956):259–265, 2023.
[60] Siddharth Narayanan, James D Braza, Ryan-Rhys Griffiths, Manu Ponnapati, Albert Bou, Jon Laurent, Ori Kabeli, Geemi Wellawatte, Sam Cox, Samuel G Rodriques, et al. Aviary: training language agents on challenging scientific tasks. arXiv preprint arXiv:2412.21154, 2024.
[61] Shraman Pramanick, Rama Chellappa, and Subhashini Venugopalan. Spiqa: A dataset for multimodal question answering on scientific papers. arXiv preprint arXiv:2407.09413, 2024.
[62] David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. Gpqa: A graduate-level google-proof q&a benchmark. arXiv preprint arXiv:2311.12022, 2023.
[63] Loïc A Royer. Omega—harnessing the power of large language models for bioimage analysis. nature methods, 21(8):1371–1373, 2024.
[64] Billings Melissa S, DeRuchie Kristine, Hussie Kieran, Kulesher Allison, Merrell Jacquelyn, Morales Amy, Paniagua Miguel A, Sherlock Jennifer, Swygert Kimberly, and Tyson Julie. Nbme item-writing guide: Constructing written test questions for the health sciences, 2021.
[65] Khaled Saab, Tao Tu, Wei-Hung Weng, Ryutaro Tanno, David Stutz, Ellery Wulczyn, Fan Zhang, Tim Strother, Chunjong Park, Elahe Vedadi, et al. Capabilities of gemini models in medicine. arXiv preprint arXiv:2404.18416, 2024.
[66] Michael D Skarlinski, Sam Cox, Jon M Laurent, James D Braza, Michaela Hinks, Michael J Hammerling, Manvitha Ponnapati, Samuel G Rodriques, and Andrew D White. Language agents achieve superhuman synthesis of scientific knowledge. arXiv preprint arXiv:2409.13740, 2024.
[67] Liangtai Sun, Yang Han, Zihan Zhao, Da Ma, Zhennan Shen, Baocai Chen, Lu Chen, and Kai Yu. Scieval: A multi-level large language model evaluation benchmark for scientific research. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 19053–19061, 2024.
[68] Kyle Swanson, Wesley Wu, Nash L Bulaong, John E Pak, and James Zou. The virtual lab: Ai agents design new sars-cov-2 nanobodies with experimental validation. bioRxiv, pages 2024–11, 2024.
[69] Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
[70] Surendrabikram Thapa and Surabhi Adhikari. Chatgpt, bard, and large language models for biomedical research: opportunities and pitfalls. Annals of biomedical engineering, 51(12):2647–2651, 2023.
[71] Tristan Thrush, Ryan Jiang, Max Bartolo, Amanpreet Singh, Adina Williams, Douwe Kiela, and Candace Ross. Winoground: Probing vision and language models for visio-linguistic compositionality. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5238–5248, 2022.
[72] Peter J Thul, Lovisa Åkesson, Mikaela Wiking, Diana Mahdessian, Aikaterini Geladaki, Hammou Ait Blal, Tove Alm, Anna Asplund, Lars Björk, Lisa M Breckels, et al. A subcellular map of the human proteome. Science, 356(6340):eaal3321, 2017.
[73] Minyang Tian, Luyu Gao, Shizhuo Dylan Zhang, Xinan Chen, Cunwei Fan, Xuefei Guo, Roland Haas, Pan Ji, Kittithat Krongchon, Yao Li, et al. Scicode: A research coding benchmark curated by scientists. arXiv preprint arXiv:2407.13168, 2024.
[74] Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, et al.
Cambrian-1: A fully open, vision-centric exploration of multimodal llms. arXiv preprint arXiv:2406.16860, 2024.
[75] Tao Tu, Shekoofeh Azizi, Danny Driess, Mike Schaekermann, Mohamed Amin, Pi-Chuan Chang, Andrew Carroll, Charles Lau, Ryutaro Tanno, Ira Ktena, et al. Towards generalist biomedical ai. NEJM AI, 1(3):AIoa2300138, 2024.
[76] Matheus P Viana, Jianxu Chen, Theo A Knijnenburg, Ritvik Vasan, Calysta Yan, Joy E Arakaki, Matte Bailey, Ben Berry, Antoine Borensztejn, Eva M Brown, et al. Integrated intracellular organization and its variations in human ips cells. Nature, 613(7943):345–354, 2023.
[77] David Wadden, Kejian Shi, Jacob Morrison, Aakanksha Naik, Shruti Singh, Nitzan Barzilay, Kyle Lo, Tom Hope, Luca Soldaini, Shannon Zejiang Shen, et al. Sciriff: A resource to enhance language model instruction-following over scientific literature. arXiv preprint arXiv:2406.07835, 2024.
[78] Fei Wang, Xingyu Fu, James Y Huang, Zekun Li, Qin Liu, Xiaogeng Liu, Mingyu Derek Ma, Nan Xu, Wenxuan Zhou, Kai Zhang, et al. Muirbench: A comprehensive benchmark for robust multi-image understanding. arXiv preprint arXiv:2406.09411, 2024.
[79] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.
[80] Xiaoxuan Wang, Ziniu Hu, Pan Lu, Yanqiao Zhu, Jieyu Zhang, Satyen Subramaniam, Arjun R Loomba, Shichang Zhang, Yizhou Sun, and Wei Wang. Scibench: Evaluating college-level scientific problem-solving abilities of large language models. arXiv preprint arXiv:2307.10635, 2023.
[81] Yiqi Wang, Wentao Chen, Xiaotian Han, Xudong Lin, Haiteng Zhao, Yongfei Liu, Bohan Zhai, Jianbo Yuan, Quanzeng You, and Hongxia Yang.
Exploring the reasoning abilities of multimodal large language models (mllms): A comprehensive survey on emerging trends in multimodal reasoning. arXiv preprint arXiv:2401.06805, 2024.
[82] Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, et al. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark. arXiv preprint arXiv:2406.01574, 2024.
[83] Zhilin Wang, Alexander Bukharin, Olivier Delalleau, Daniel Egert, Gerald Shen, Jiaqi Zeng, Oleksii Kuchaiev, and Yi Dong. Helpsteer2-preference: Complementing ratings with preferences. arXiv preprint arXiv:2410.01257, 2024.
[84] xAI. Bringing grok to everyone, 2024. Accessed: 2025-03-13.
[85] Tong Xie, Yuwei Wan, Wei Huang, Yufei Zhou, Yixuan Liu, Qingyuan Linghu, Shaozhou Wang, Chunyu Kit, Clara Grazian, Wenjie Zhang, et al. Large language models as master key: unlocking the secrets of materials science with gpt. arXiv preprint arXiv:2304.02213, 2023.
[86] Kaining Ying, Fanqing Meng, Jin Wang, Zhiqian Li, Han Lin, Yue Yang, Hao Zhang, Wenbo Zhang, Yuqi Lin, Shuo Liu, et al. Mmt-bench: A comprehensive multimodal benchmark for evaluating large vision-language models towards multitask agi. arXiv preprint arXiv:2404.16006, 2024.
[87] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9556–9567, 2024.
[88] Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Ming Yin, Botao Yu, Ge Zhang, et al. Mmmu-pro: A more robust multi-discipline multimodal understanding benchmark. arXiv preprint arXiv:2409.02813, 2024.
3, 4, 7, 15 [89] Nikki B Zaidi, Charles Hwang, Sara Scott, Stefanie Stallard, Joel Purkiss, and Michael Hortsch. Climbing bloom’s tax- onomy pyramid: Lessons from a graduate histology course. Anat. Sci. Educ. , 10(5):456–464, 2017. 5, 8 [90] Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. From recognition to cognition: Visual commonsense rea- soning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages 6720–6731, 2019. 3 [91] Chi Zhang, Feng Gao, Baoxiong Jia, Yixin Zhu, and Song- Chun Zhu. Raven: A dataset for relational and analogical vi- sual reasoning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages 5317– 5327, 2019. 3 [92] Kai Zhang, Rong Zhou, Eashan Adhikarla, Zhiling Yan, Yixin Liu, Jun Yu, Zhengliang Liu, Xun Chen, Brian D Davi- son, Hui Ren, et al. A generalist vision–language foundation model for diverse biomedical tasks. Nature Medicine , pages 1–13, 2024. 3 [93] Sheng Zhang, Yanbo Xu, Naoto Usuyama, Hanwen Xu, Jaspreet Bagga, Robert Tinn, Sam Preston, Rajesh Rao, Mu Wei, Naveen Valluri, et al. Biomedclip: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs. arXiv preprint arXiv:2303.00915 , 2023. 7, 3, 17 [94] Yuhui Zhang, Yuchang Su, Yiming Liu, Xiaohan Wang, James Burgess, Elaine Sui, Chenyu Wang, Josiah Aklilu, Alejandro Lozano, Anjiang Wei, et al. Automated generation of challenging multiple-choice questions for vision language model evaluation. arXiv preprint arXiv:2501.03225 , 2025. 3 [95] Haiteng Zhao, Chang Ma, FangZhi Xu, Lingpeng Kong, and Zhi-Hong Deng. Biomaze: Benchmarking and enhanc- ing large language models for biological pathway reasoning. arXiv preprint arXiv:2502.16660 , 2025. 3 [96] Zhiling Zheng, Oufan Zhang, Ha L Nguyen, Nakul Rampal, Ali H Alawadhi, Zichao Rong, Teresa Head-Gordon, Chris- tian Borgs, Jennifer T Chayes, and Omar M Yaghi. 
Chatgpt research group for optimizing the crystallinity of mofs and cofs. ACS Central Science, 9(11):2161–2170, 2023. 2

MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research
Supplementary Material

Appendices A-C include acknowledgments, author contributions, ethics, and discussion of limitations and future work. The remaining appendices correspond to sections in the main paper. Appendix D gives benchmark details for Sec. 3, Appendix E gives MCQ generation details for Sec. 4, and Appendix F gives experiment details for Sec. 5.

Table of Contents
A. Acknowledgments and author contributions
B. Ethics
C. Limitations and future work
D. Benchmark details
    D.1. Accessing MicroVQA benchmark
    D.2. Dataset Schema and Structure
    D.3. Training contamination mitigation
    D.4. Benchmark attribute details
    D.5. Related benchmarks
    D.6. Bloom’s taxonomy
    D.7. Methodology for proposing tasks
    D.8. Guidance for question creators
    D.9. Microscopy image representations
E. MCQ generation details
    E.1. Question collection process details
    E.2. Experiment on naive MCQ generation
    E.3. Stage 1 MCQ generation (exam alignment)
    E.4. Stage 2 MCQ generation (RefineBot) details
    E.5. MCQ generation results: stage 1 exam alignment
    E.6. MCQ generation results: stage 2 RefineBot
F. Experiments details
    F.1. Evaluation prompts
    F.2. Model details
    F.3. Human baseline on MicroVQA
    F.4. More experiments on attribute-level information
    F.5. Language shortcut ablations
    F.6. Error analysis
    F.7. Reasoning trace analysis

A. Acknowledgments and author contributions

Acknowledgements: We thank Christian Kindermann for feedback on motivation for benchmarking and reasoning taxonomies; Siddharth Doshi for feedback on motivation for applications in scientific chat systems; and Krisjanis Mazurs for feedback on structuring user interviews. We also thank Assaf Zaritsky, Andrew S Moore, Pedro Guedes Dias, Angus Toland, Helen Healy, Andrew Kassianos, and Loic Royer for discussions on defining the VQA tasks.
Funding acknowledgements: We gratefully acknowledge funding from HAI Hoffman-Yee (E.L., S.Y-L., J.M.H.B.), Schmidt Futures (E.L.), the Bridge2AI Program (NIH Common Fund; OT2 OD032742; E.L.), Göran Gustafsson Foundation and Knut and Alice Wallenberg Foundation (KAW 2021.0346; E.L.) as well as to MU (HPA). S.C. received funding from NIH/NIGMS R35GM133460 and CZI A23-0264-001. J.N.H. was supported by an EMBO Postdoctoral Fellowship (ALTF 556-2022).

Author contributions:
• Project conception: JB, JN, LBS, AL, SYL
• Defining the VQA tasks: JB, JN, JGG, WC, SC, JNH, CL, SYL
• Managing benchmark collection processes: JN, JB
• Creating benchmark questions: DB, ZC, SMH, AJ, WDL, MGN, RY, CZ, JNH, CL
• MCQ conversion pipeline: JB, JN, LBS
• Evaluations: JB, SRG, YS
• Qualitative error analysis: JN, JGG, CZ
• Quantitative error analysis: LBS
• Paper writing and figures: JB, JN, LBS
• Supervision: SYL, EL, CL, MDL, CNH, SC, WC

B. Ethics

Ethical use of biomedical data: MicroVQA was developed with a commitment to ethical practices in handling biomedical research data. All microscopy images included in the dataset were either original, unpublished data or sourced from open-access articles (e.g., CC-BY licensed) published after January 2024. The dataset does not include patient-identifiable information, and MicroVQA adheres to applicable privacy and research ethics guidelines. The dataset does not constitute human subjects research.

Potential societal impacts: Biomedical AI models, including those trained on MicroVQA, have the potential to influence research and healthcare significantly, both positively and negatively. While these models can accelerate scientific discovery, they may also perpetuate biases in training data or result in uneven performance across demographic or biological groups.
To mitigate these risks, MicroVQA includes metadata annotations to support the analysis of potential biases and emphasizes diverse data sourcing. We will engage with the research community to address emerging ethical concerns.

Data licensing and usage: MicroVQA is distributed under the Creative Commons Attribution-ShareAlike 4.0 International (https://creativecommons.org/licenses/by-sa/4.0/) license to promote transparency and collaboration in the research community. This license reflects our efforts to balance open access with the ethical use of data, enabling both academic and commercial applications of the dataset while adhering to original license requirements. Full licensing details are available at https://creativecommons.org/licenses/by-sa/4.0/.

C. Limitations and future work

While MicroVQA significantly advances research-level reasoning benchmarks, we identify a number of limitations and avenues for future work.

Open vs closed evaluation: Like most VQA benchmarks, we used multiple-choice (MCQ) evaluation. However, downstream applications will operate in an open setting, without options and requiring more detail. MLLM research does have some open benchmarks [86], though impartial evaluation is a significant challenge. Our early experiments with open prompting suggest that models like GPT-4o tend to give very vague answers and depend strongly on the text part of the input prompt.

Dataset scale: Our final dataset has 1,042 samples, which is smaller than some other VQA benchmarks like MicroBench [50] and OmniMedVQA [34]. MicroVQA’s scale is sufficient to compare overall scores between models; however, statistical power becomes limiting when comparing different models on subsets of the dataset. How does this compare to other benchmarks? As we can see from Tab. 2 and Fig. 3, the larger datasets tend to be those with lower-level reasoning according to Bloom’s taxonomy.
Their samples tend to be derived automatically from metadata tags or classification labels. Benchmarks with comparably high-level reasoning, like MMMU-Pro, have a similar scale to ours (1,700). In the language-only domain, the research-level reasoning benchmark GPQA has 448 samples. This trend reflects a clear, practical limitation: it is very time-consuming for humans to create difficult reasoning questions, and the pool of capable experts is small.

Breadth of dataset attributes in microscopy: We aimed for MicroVQA to cover a broad range of the microscopy discipline; for example, we have samples from all four relevant scales (tissue, cellular, subcellular, and atomic). However, each human annotator is an expert in a subset of microscopy and biology, so there are practical limits on dataset coverage. We therefore chose to focus on the most common imaging modalities (brightfield, fluorescence, and electron microscopy) while skipping less common modalities like Raman spectroscopy. In terms of sample types, we emphasized those most relevant to human biology (human and mouse), while having only a few samples from rare organisms like the bottlenose dolphin.

Coverage of scientific reasoning: Looking even more broadly, the motivation of this work is to advance reasoning in scientific research generally. However, we focus on microscopy in particular because of a clear trade-off between breadth and depth. Our dataset strikes a good balance, given that microscopy represents a large portion of visual data in biomedicine and biology. Future work can reuse our framework to generate reasoning VQA benchmarks in other fields of visual biomedicine and biology, such as medicine [34] or ecology [93].

Resolving MCQ shortcuts: A key contribution of this paper is RefineBot, which generates multiple-choice questions (MCQs) that are too challenging to solve using text-only shortcuts. Our approach used the outputs of existing MLLMs as a signal to make distractors harder.
However, this cannot guarantee the removal of shortcuts; more generally, harder distractors may exist. Therefore, the next generation of LLMs may be able to identify and exploit new shortcuts. If that occurs, we can run RefineBot again using the new LLM, creating MicroVQA-2. This strategy could be used for other VQA evaluations.

D. Benchmark details

D.1. Accessing MicroVQA benchmark

MicroVQA is an expert-curated benchmark for multimodal scientific reasoning, specifically in microscopy-based biological research. The dataset comprises 1,042 multiple-choice questions (MCQs) created by biological researchers to represent authentic scientific practice. Each MCQ is paired with an associated microscopy image and metadata describing the experimental context. A demonstration subset of 28 instances is available for initial exploration.

The MicroVQA dataset is publicly accessible on HuggingFace at https://huggingface.co/datasets/jmhb/microvqa. It is distributed under the Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license, allowing reuse and share-alike redistribution with attribution. The dataset is available in Apache Arrow and Parquet formats for efficient data handling. These formats support fast serialization, memory mapping, and streaming from the web, facilitating seamless integration into machine learning pipelines.

D.2. Dataset Schema and Structure

The dataset schema defines the fields included for each data instance. Figure 6 provides a schema summary, and Figure 7 shows a sample data instance.
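As a small illustration of working with this schema, the sketch below splits the flattened `questions` string of one instance into a stem, a list of options, and the index of the correct option. It assumes the layout seen in the sample instance (Figure 7): a `Question:` prefix, options labelled `A)`, `B)`, and so on, and a trailing `Correct Answer:` marker. The function name `parse_mcq` and the abridged sample string are hypothetical and are not part of the released code.

```python
import re


def parse_mcq(raw: str) -> dict:
    """Split a flattened MCQ string into stem, options, and answer index.

    Assumes the layout 'Question: <stem> A) ... B) ... Correct Answer: <letter>) ...',
    optionally wrapped in quote marks or backticks.
    """
    text = re.sub(r"^\s*Question:\s*", "", raw)
    text = text.strip("`'\"\u2018\u2019 \n")  # drop wrapping quotes/backticks

    body, marker, answer = text.partition("Correct Answer:")
    if not marker:
        raise ValueError("no 'Correct Answer:' marker found")

    # Option labels are a capital letter plus ')' preceded by whitespace, so
    # text such as '(H&E)-stained' in the stem is not mistaken for a label.
    parts = re.split(r"(?:^|\s+)([A-Z])\)\s+", body)
    stem = parts[0].strip()
    labels = parts[1::2]
    choices = [p.strip() for p in parts[2::2]]

    match = re.match(r"\s*([A-Z])\)", answer)
    if match is None or match.group(1) not in labels:
        raise ValueError("could not identify the correct option")

    return {
        "question": stem,
        "choices": choices,
        "correct_idx": labels.index(match.group(1)),
    }


# Abridged, hypothetical sample in the style of the example data instance.
SAMPLE = (
    "Question: '''An (H&E)-stained micrograph shows cellular formations. "
    "What pattern do these cells most closely align with? "
    "A) Bundles of elongated cells in a crisscross pattern "
    "B) Circular groups of cells with mucin in central spaces "
    "C) Rows of cells with rounded nuclei "
    "Correct Answer: A) Bundles of elongated cells in a crisscross pattern'''"
)

mcq = parse_mcq(SAMPLE)
print(mcq["question"])
print(mcq["choices"], mcq["correct_idx"])
```

The returned fields mirror the (question, choices, correct_idx) inputs of the RefineBot pseudocode in Appendix E.4. Records could presumably be obtained with `datasets.load_dataset("jmhb/microvqa")` from the HuggingFace `datasets` library; split names are not stated here, so they should be checked on the dataset page.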
{
    "image_id": datasets.Value("string"),
    "image": datasets.Image(decode=True, id=None),
    "label": datasets.ClassLabel(),
    "label_name": datasets.Value("string"),
    "domain": datasets.Value("string"),
    "subdomain": datasets.Value("string"),
    "modality": datasets.Value("string"),
    "submodality": datasets.Value("string"),
    "stain": datasets.Value("string"),
    "microns_per_pixel": datasets.Value("string"),
    "questions": datasets.Value("string"),
}
Figure 6. Example data schema.

D.3. Training contamination mitigation

To mitigate contamination, all data in MicroVQA were sourced from original, unpublished microscopy images or open-access articles (e.g., CC-BY-licensed) published after January 2024. Additionally, canary strings were embedded in the dataset to detect potential contamination during fine-tuning or model evaluation.

{
    "image_id": "d54bb7ec-284f-4218-a47d-af87bb371de5",
    "image": datasets.Image(decode=True, id=None),
    "label": datasets.ClassLabel(),
    "label_name": "pathology",
    "domain": "pathology",
    "subdomain": "gastrointestinal pathology",
    "modality": "light microscopy",
    "submodality": "brightfield microscopy",
    "stain": "H&E",
    "microns_per_pixel": "Not provided",
    "questions": "Question: ‘‘‘A hematoxylin and eosin (H&E)-stained micrograph shows cellular formations within a gastrointestinal tissue sample with unique organization and cellular patterns. What pattern and morphology do these cells most closely align with? A) Bundles of elongated cells with pale pink cytoplasm and uniform oval nuclei in a crisscross pattern B) Circular groups of cells with mucin in central spaces and small, dark nuclei C) Dense arrays of polygonal cells with small nucleoli and vacuolated cytoplasm D) Rows of cells with conspicuous cytoplasm and rounded nuclei E) Diffused arrangement of cells with transparent cytoplasm and irregularly clustered chromatin Correct Answer: A) Bundles of elongated cells with pale pink cytoplasm and uniform oval nuclei in a crisscross pattern‘‘‘
}
Figure 7. Example data instance.

D.4. Benchmark attribute details

The MicroVQA benchmark is designed to evaluate multimodal reasoning capabilities in biomedical research contexts. The dataset incorporates diverse attributes such as organism, research subject, imaging modalities, and biological length scales (image scales) to ensure a comprehensive assessment. These metadata not only enhance the scientific relevance but also enable analysis of reasoning across different biological and microscopy contexts. This section describes the methodology used to define, compute, and annotate these key attributes.

The benchmark comprises 1,042 total questions, including 423 questions containing multiple images combined into panels. A collage of sample images is shown in Fig. 25. The average MCQ token lengths were computed using the o200k_base tokenizer, with separate calculations for the raw input and formatted multiple-choice questions. Imaging modalities are broadly categorized into light, fluorescence, and electron microscopy. Biological length scales span four levels: tissue (millimeter scale), cellular (micrometer scale), subcellular (nanometer scale), and atomic (angstrom scale), enabling reasoning across diverse structural hierarchies.

To annotate attributes such as organism, specimen, and research subject, we manually reviewed random samples from the dataset and created a taxonomy to cover the most likely categories. A biomedical domain expert manually labeled these categories for 50 raw input question-answer pairs and performed initial prompt engineering. We used DSPy to optimize the prompt and few-shot examples on the labeled dataset. The input context included the taxonomy of organisms (Fig. 8) and research subjects (Fig. 9). These annotations were the gold standard to guide DSPy prompt engineering for automated attribute labeling on the full dataset. Prompt instructions included fallback rules (e.g., tagging "None" for indeterminate attributes) to ensure accuracy and minimize hallucination.
The optimized pipeline annotated all questions while ensuring the validity of the dataset attributes.

D.5. Related benchmarks

In Sec. 3.3 and Tab. 2, we compare MicroVQA to related scientific multimodal benchmarks, focusing on its intersection of high-difficulty and advanced reasoning questions. Beyond these key attributes, we examine the sources of the questions. Most benchmarks rely on existing question repositories, such as exams, web-based QA datasets, textbooks, or other MCQ datasets. This approach bypasses the need for MCQ generation methods, as the questions and options are pre-tested and curated, making these benchmarks more representative of established tasks. Another common strategy involves reformatting classification datasets into MCQs using question templates, which increases task diversity but often limits scenario uniqueness and language variability. In contrast, MicroVQA is built by directly querying experts for challenging, domain-specific research questions. By starting with long-form questions and answers, our approach requires an innovative MCQ generation method, ensuring uniquely tailored questions that push the boundaries of domain-specific reasoning and knowledge assessment.

Taxonomy of organisms in biology

    # Taxonomy of Organisms in Biology and Biomedical Research:
    Apis mellifera (Honeybee)
        Social insect with complex behavior. Keywords: honeybee, social behavior, ecology, pollination.
    Arabidopsis thaliana (Thale Cress)
        Model plant organism. Keywords: Arabidopsis, plant genetics, photosynthesis.
    Caenorhabditis elegans (C. elegans)
        Transparent nematode worm. Keywords: worm, development, aging, programmed cell death.
    Danio rerio (Zebrafish)
        Fish with transparent embryos. Keywords: zebrafish, vertebrate development, embryology.
    ...
    Drosophila melanogaster (Fruit Fly)
        Insect with rapid life cycle. Keywords: fruit fly, genetics, development.
    Escherichia coli (E. coli)
        Bacterium used in genetics and molecular biology. Keywords: bacteria, gene expression, plasmids.
    ...
    Homo sapiens (Human Cells)
        Cultured human cells (e.g., HeLa, HEK293, A431, U2OS). Keywords: human cell lines, immortalized cell lines
    In vitro (None)
        Purified molecules and reactions in controlled environments. No living cells or organisms. Keywords: purified proteins, purified molecules, cell-free systems
    ...

Figure 8. Examples of taxonomy classes used as context to LLM to assign an organism to a question. A YAML file with the full taxonomy will be released with the code.

D.6. Bloom’s taxonomy

Bloom’s Taxonomy is a model for evaluating cognitive skill levels in educational assessments. It has six hierarchical levels of learning: recall, understand/comprehend, apply, analyze, evaluate, and create. Multiple-choice questions can test five of Bloom’s levels but not the sixth level, “create”. These levels provide a structured framework for assessing lower-order and higher-order thinking skills, which are essential for designing robust benchmarks like MicroVQA.

Initial attempts to use LLMs to classify Bloom’s levels for biomedical visual question answering (VQA) benchmarks revealed a systematic overestimation of cognitive skill levels. This discrepancy limits the naive use of LLMs for automated Bloom’s classification, perhaps due to the nuances of applying Bloom’s taxonomy in domain-specific contexts. To ensure accurate classification of Bloom’s level, a physician-scientist with NBME training in item writing and familiarity with Bloom’s classification manually labeled Bloom’s level for 200 examples from public MCQ questions, MicroVQA, MicroBench, and ScienceQA. These examples were used to fine-tune GPT-4o-mini (gpt-4o-mini-2024-07-18) to classify Bloom’s level and align with expert judgments.

The fine-tuned Bloom’s classifier was applied to MicroVQA and numerous public science/biomedical benchmarks to assess the cognitive skills evaluated by each question.
Each question was mapped to one of Bloom’s levels, providing insights into the distribution of cognitive tasks across the benchmark. This analysis revealed that while existing benchmarks predominantly tested lower-order skills (e.g., recall and comprehension), MicroVQA has a greater emphasis on questions that require higher-order reasoning (e.g., analyzing, evaluating), filling a critical gap in multimodal scientific reasoning assessments.

Taxonomy of research subjects

    # Taxonomy of Research Subjects in Biology:
    Anatomy
        Study of the structure of organisms and their parts. Keywords: organs, tissues, morphology, physical organization.
    Biochemistry
        Chemical processes within living organisms. Keywords: proteins, enzymes, nucleic acids, metabolic pathways.
    ...
    Cell and Molecular Biology
        Structure and function of cells and their molecules. Keywords: cellular processes, gene expression, molecular pathways.
    ...
    Developmental Biology
        Process by which organisms grow and develop. Keywords: embryology, morphogenesis, genetic regulation.
    ...
    Genetics
        Study of heredity and gene function. Keywords: DNA, inheritance, genetic variation, gene expression.
    Immunology
        Immune system in health and disease. Keywords: antibodies, immune response, pathogens, vaccines.
    Microbiology
        Study of microorganisms. Keywords: bacteria, viruses, fungi, parasites, microbial ecology.
    Neurobiology
        Structure and function of the nervous system. Keywords: neurons, brain, neural circuits, cognition.
    ...
    Structural Biology
        Molecular structure of biological macromolecules. Keywords: protein folding, nucleic acid structures, crystallography.
    ...

Figure 9. Examples of taxonomy classes used as context to LLM to assign a research subject to a question. A YAML file with the full taxonomy will be released with the code.

D.7. Methodology for proposing tasks

Here, we expand on the task proposal methodology introduced in Sec. 3.2.
To define specific reasoning tasks aimed at advancing biomedical research, we conducted a comprehensive expert consultation process. Through structured interviews with nine Principal Investigators and postdocs specializing in microscopy research, including participants from both academia and industry, we developed a framework for tasks that would meaningfully integrate multimodal large language models (MLLMs) into microscopy research workflows.

The task selection process followed specific criteria: tasks must utilize image-text inputs and text outputs suitable for MLLMs; they should require higher-order reasoning beyond simple image processing; and they should emphasize core experimental activities (experimentation, analysis, and action) rather than auxiliary tasks like literature review or writing [33].

Our methodology involved a two-phase interview process. The first phase consisted of initial one-hour individual sessions with eight research groups featuring open-ended, unstructured discussions about key challenges in microscopy experiments, potential applications of MLLMs, current workflows and unmet needs, and possible AI system integration points. One week later, we conducted follow-up interviews to consolidate findings and build consensus.

To avoid anchoring biases, we intentionally withheld predefined categories during initial interviews. This approach enabled unbiased task identification and allowed us to capture a broad spectrum of potential use cases. Through this iterative process, we identified three essential capabilities that align with the scientific life cycle for microscopy experiments: advanced image interpretation, hypothesis generation, and experimental design/hypothesis testing.

D.8. Guidance for question creators

Question creators were given detailed guidance as PDFs, which we share at https://github.com/jmhb0/microvqa under the folder benchmark/human collection.
During the submission process, we completed two rounds of quality control to ensure difficulty and alignment to the prompts, as described in Sec. 4.1.

D.9. Microscopy image representations

MLLMs are trained primarily on three-channel RGB images; however, microscopy images are more diverse. To ensure alignment with how common MLLMs process images, we required that the image artifact for a single question be represented as no more than six RGB images. Where multiple images were needed, they could be multiple image files or a single concatenated image. For videos, users could submit up to six frames; for 3D data, they could submit slices or z-projections.

An important difference is multichannel images. Many fluorescence microscopy images are multichannel, where the image collection ensures that each channel represents a certain structure. For example, channel 0 may be the nucleus and channel 1 may be the mitochondria. There are a few approaches to representing them. Some questions represent this situation as separate grayscale images for each channel. Others show a ‘composite image’, which is a single image where each channel is mapped to one of the RGB channels; e.g., a common visualization will put the nucleus in blue and the mitochondria in green. Some questions do both: grayscale images for each channel, along with a composite image merging them.

Finally, we allowed screenshots from research papers or image software, since this is a realistic way that future LLM users would use AI tools.

E. MCQ generation details

E.1. Question collection process details

Data collection used a structured submission form shared with collaborators. The form required users to provide the following:
• Image or Image Set: Users uploaded microscopy images, optionally including multiple images for comparative questions.
• Experiment Details: A brief description of the experimental setup, imaging modalities, and any relevant biological context.
• Question Text: Each question was aligned with one of three predefined use cases: identifying interesting features, explaining mechanisms, or proposing next steps.
• Answer and Distractors: Contributors provided a correct answer and biologically plausible incorrect answers.

Quality control was performed in multiple rounds. To familiarize users with the task and data requirements, each user submitted an initial form with 4-6 questions for manual quality control and feedback. Feedback was provided to users on improving question specificity and on ensuring that the questions required image interpretation rather than relying solely on general knowledge. Contributors were encouraged to refine their submissions based on the feedback to better align with the tasks and goals of the MicroVQA benchmark.

Users were encouraged to submit data from multiple sources, imaging modalities (e.g., fluorescence, electron microscopy), and biological scales (e.g., tissue, cellular, subcellular) to promote diversity. Contributors could upload microscopy images from their experiments or use CC-BY publicly available images. Common image sources included preprints or publications released after January 2024 (to mitigate contamination) and established image repositories [14, 17, 18, 31, 72, 76]. This ensured that all images were either original or sourced from open-license publications, enabling their use in an open-source benchmark.

E.2. Experiment on naive MCQ generation

In Sec. 4.2, we discuss how the naive approach to MCQ generation (simply zero-shot prompting GPT) leads to MCQs that are very easy for frontier models to solve, which ultimately motivated our two-stage MCQ generation approach. Here, we discuss those experiments in more detail.

We create a naive version of MicroVQA using a standard prompt [34, 82]:

    You are an expert in molecular and cell biology, and in microscopy. I will give you an original biology-related question and its answer, your task is to rephrase an equivalent question with identical answer. The question related to an image, and we don’t show the image. Meanwhile, I want to transfer this QA-pair into a multi-choice question. Please generate 5 incorrect options to construct the candidate options.
    {{QUESTION}}
    {{ANSWER}}

Then, we evaluated this naive MicroVQA on the top closed-source models with the image. To study the over-reliance on language shortcuts, we also evaluated a text-only version. In the text-only setting, we add the following to the prompt: “If an image is mentioned, ignore this information and try your best to answer the question.” Tab. 5 shows the results of this experiment.

Table 5. Performance on MicroVQA MCQs with naive MCQ generation. We report overall VQA and VQA without the image.

    Model               VQA    VQA-no-image
    GPT-4o              85.1   82.7
    Claude-3.5-Sonnet   91.4   88.4
    Gemini-1.5-Pro      88.5   82.4

All models score over 80% on questions that are from a specialized domain. We hypothesize that this reflects a problem with the non-trivial task of MCQ generation from raw VQAs, for several reasons. First, the high ‘VQA-no-image’ scores across all models suggest that selecting the correct option does not require image understanding, despite the raw VQA samples being designed to incorporate image-based reasoning. An alternative explanation is that the questions themselves may be too easy for advanced models. Second, we test this alternative explanation by performing open evaluation on a subset of the raw VQA queries; that is, we queried GPT-4o with the questions but without the options. Here, the model rarely gave good responses, even after careful prompting. This suggests that models are leveraging test-taking strategies to eliminate distractors [27]. Third, we examined the chain-of-thought responses in the VQA-no-image case, which suggested the models often eliminate distractors based on information from the text question.
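To make the size of the language shortcut concrete, the sketch below computes how many accuracy points each model gains from seeing the image, using only the scores reported in Tab. 5. The names `TABLE5` and `image_gain` are illustrative and not taken from the released code.

```python
# Scores copied from Tab. 5: (VQA, VQA-no-image), accuracy in percent.
TABLE5 = {
    "GPT-4o": (85.1, 82.7),
    "Claude-3.5-Sonnet": (91.4, 88.4),
    "Gemini-1.5-Pro": (88.5, 82.4),
}


def image_gain(with_image: float, no_image: float) -> float:
    """Accuracy points gained by providing the image, rounded to 0.1."""
    return round(with_image - no_image, 1)


gains = {model: image_gain(*scores) for model, scores in TABLE5.items()}

for model, gain in gains.items():
    print(f"{model}: +{gain} points from the image")
```

Under naive MCQ generation the image contributes only a few points (at most 6.1, for Gemini-1.5-Pro), which is consistent with the hypothesis that the options can mostly be resolved from the text alone.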
Therefore, we conclude that the naive strategy generates flawed MCQs. What exactly are the problems with MCQ generation? Qualitatively, the types of errors are the same as those we discuss in Appendix E.6, so we refer readers to that section for a deeper understanding.

E.3. Stage 1 MCQ generation (exam alignment)

The first stage transformed raw, open-ended questions and long-form answers into well-structured MCQs aligned with established principles of biomedical exam design. A physician-scientist trained in medical board exam item writing initiated the process by reviewing biomedical educational literature, NBME guidelines [64], and Bloom’s taxonomy for assessing cognitive skills [5, 19, 89]. The expert manually converted 50 raw question-answer pairs into one-best-answer MCQs with biologically relevant distractors using the literature review and their prior training. This manual effort ensured that the MCQs preserved the original question’s intent and scientific validity while adhering to educational standards. Distractors were crafted to require reasoning and prevent reliance on superficial cues like word association or general knowledge, thus maintaining the depth and complexity of the original content.

To scale this process, the expert designed an initial LLM prompt for automated Stage 1 MCQ generation. This prompt and the manually curated training data were input for DSPy-based prompt optimization. Using the MIPROv2 optimizer, the prompt was refined within a Chain of Thought (CoT) framework combined with retrieval-augmented generation (RAG) pipelines powered by o1-mini. Optimization was guided by a custom metric evaluated by a separate LLM judge (o1-mini). The judge assessed each revised MCQ across three criteria: (1) semantic similarity to the original question and answer, (2) adherence to NBME formatting standards, and (3) minimization of extraneous information that could provide unintended clues.
The optimization metric was weighted to prioritize preserving the original meaning and scientific validity, with lesser emphasis on formatting and extraneous text. Additional implementation details are provided in the source code.

E.4. Stage 2 MCQ generation (RefineBot) details

The goal of RefineBot is to take MCQs that are too easy for MLLMs, especially where they are leveraging language-only shortcuts, and to rewrite them to make them more challenging. It is the third panel in Fig. 4. For more details on what types of improvements are achieved, see Appendix E.6. The complete code for RefineBot is released.

A single RefineBot iteration. Below, we show the pseudo-code for a single-question revision. The input is a question string, a list of options, and the index of the correct answer in the options. It returns a tuple (code, MCQ) where code is a string indicating whether the run was successful, and MCQ is either the revised multiple-choice question if the rewrite was successful or otherwise None.
    def revise_mcq(question: str, choices: list, correct_idx: int, max_iters: int = 5):
        # The helpers called below (evaluate_without_image, analyze_why_answerable,
        # rewrite_question, check_meaning_preserved) are LLM calls defined elsewhere.

        # Store original version
        original = {'question': question, 'choices': choices, 'correct_idx': correct_idx}
        reflections = []

        for i in range(max_iters):
            # Try answering without image
            result = evaluate_without_image(question, choices, correct_idx)

            # Exit as soon as the evaluator can no longer answer from text alone
            if not result.can_answer:
                code = 'SUCCESS_NO_CHANGE' if i == 0 else 'SUCCESS_REWRITE'
                return code, (question, choices, correct_idx)

            if i == max_iters - 1:
                return 'FAIL_ITERATIONS', None

            # Reflect on strategies used to solve (one entry per iteration)
            reflection = analyze_why_answerable(result)
            reflections.append(reflection)

            # Rewrite based on all reflections from all iterations
            new_version = rewrite_question(reflections, original)

            # Verify that the rewrite preserved the meaning
            if not check_meaning_preserved(original, new_version):
                return 'FAIL_REWRITE', None

            # Update current version
            question = new_version['question']
            choices = new_version['choices']
            correct_idx = new_version['correct_idx']

        return 'FAIL_UNKNOWN', None

This process is also described in the main text. One detail to note: the rewrite_question function receives the reflections from all prior iterations, not just the current one. In this way, the rewriting function can consider every strategy identified so far. All LLM calls are to gpt-4o-2024-08-06.

To better illustrate the method, we step through a single example. This way, we can show the prompts and the corresponding responses.

1. The evaluation of a question with the evaluate_without_image prompt and response:

Prompt
--------------------------------------------------------------------------------
The following question is supposed to be paired with an image. We will not provide the image, so answer to the best of your ability.
In an electron microscopy image of a mouse primary embryonic cortical neuron expressing full-length human mHTT (BACHD), "eye-ball" shaped features are observed in the top left corner that appear anomalous. What is the most likely cause of these features?
Think step by step and then output the answer in the format of "The answer is (X)" at the end.
(a) Aggregation of human mHTT protein within the neuron
(b) Ice contamination from liquid nitrogen during sample preparation
(c) Formation of intracellular vesicles during neuron development
(d) Artifacts arising from excessive heavy metal staining
--------------------------------------------------------------------------------
Response (target answer is b)
--------------------------------------------------------------------------------
To determine the cause of the "eye-ball" shaped features observed in the electron microscopy image, we can analyze each option step by step:
(a) Aggregation of human mHTT protein within the neuron: Aggregations of proteins can indeed appear as anomalous structures under electron microscopy. However, these typically have irregular shapes and may not consistently resemble "eye-ball" shapes.
(b) Ice contamination from liquid nitrogen during sample preparation: Ice contamination can occur during sample preparation, particularly if samples are not handled properly when flash-frozen. This can lead to anomalous features appearing in microscopy images. These features might appear as circular or irregular shapes, potentially looking like "eye-balls."
(c) Formation of intracellular vesicles during neuron development: Intracellular vesicles are normal structures within neurons and other cells. Although they can appear during development, they are typically well characterized and documented, and are less likely to be described as anomalous in the context given.
(d) Artifacts arising from excessive heavy metal staining: Over staining can indeed produce artifacts in electron microscopy images. While staining artifacts are plausible, they might not specifically result in "eye-ball" shapes unless the staining process consistently targets specific round structures.
Considering the nature of the descriptors ("eye-ball" shaped, anomalous), option (b) ice contamination is often a source of unexpected features in microscopy images that could fit the appearance described, as ice can form circular shapes.
Therefore, the most likely cause of these features is (b) Ice contamination from liquid nitrogen during sample preparation.
The answer is (b).

2. If the answer is incorrect, then the question is already hard enough, and we can exit. Otherwise, we call the analyze_why_answerable function, which reflects on how this answer was achieved. It continues in the same conversation:

--------------------------------------------------------------------------------
User
--------------------------------------------------------------------------------
That is correct. Explain how you were able to answer the question without access to the image - only the question_stem and choices. What strategies did you use? Then, briefly summarize these "language-only strategies" into a short list.
Finally, say whether your reasoning was closer to:
- ’deduction’: you answered with high confidence because you had enough information.
- ’abduction’: you answered with lower confidence you made an informed guess.
Be concise in your final response.
--------------------------------------------------------------------------------
Assistant
--------------------------------------------------------------------------------
To answer the question without the image, I relied entirely on analyzing the given information and the plausible characteristics of each option.
Here’s a breakdown of my strategy:
1. **Keyword and Context Analysis**: I focused on key descriptors such as "eye-ball" shaped features and their being "anomalous" in an electron microscopy image of neurons expressing human mHTT. This guided me to consider characteristics commonly seen under electron microscopy.
2. **Knowledge of the Subject Matter**: I applied general knowledge about electron microscopy artifacts and protein behavior. This helped me weigh the likelihood of each option based on what typically causes specific image features.
3. **Process of Elimination**: By understanding the nature of each option, including potential artifacts or structures (such as vesicles or ice) in microscopy, I systematically assessed the validity of each scenario in relation to the described features.
4. **Logical Inference**: I inferred which scenario most commonly results in anomalies resembling "eye-ball" features, favoring the simplest explanation.
Summary of "language only strategies":
- Analyze keywords and context
- Apply relevant subject matter knowledge
- Use process of elimination
- Make logical inferences
The reasoning was closer to ’abduction’ as I made an informed guess based on patterns and typical causes, given that I lacked direct visual confirmation.

3. Next, we call rewrite_question. Here, the templated variable {{conversations}} holds the full conversations produced by the function analyze_why_answerable in the previous call and in all other iterations. The {{n_chat}} variable just indicates how many conversations are included:

Below, I will display {{n_chat}} chat conversations between a ’user’ and an LLM ’assistant’.
In each conversation
- a user asks the assistant to answer a multichoice VQA question, however they do not provide the image. They only get the question_stem and choices.
- the question_stem and correct answer are similar to other conversations, however the distractors are different
- the assistant then answers correctly.
- the user then asks the assistant to explain how it answered the question with only the text.
- the assistant then summarizes what strategy they used to answer the question.
Altogether, these conversations give examples of language shortcuts that need to be avoided when constructing the question_stem and choices.
Here are the conversations:
{{conversations}}
Your task is to rewrite the question_stem and choices so that a different LLM ’assistant’ cannot use the language-only strategies that were identified in these past conversations.
Include an ’explanation’ about why your new set of distractors are better.
Your revised choices should include the correct answer at the ’correct_index’.
Your revised question_stem and choices should not significantly change the meaning of the question and correct answer.
You are free to change the distractors a lot to achieve this task.
Include {{n_choices}} choices.
The choices should not be made too different in length and detail to the original question stem and choices.
Note that the question stem may contain important cues that cannot be removed. E.g. if a question asks about the "green stain" in an image, you cannot change it to "stain" because it introduces ambiguity.
The choices should not include the letters in front like (a), (b), ...
Return a json

The output of this function is a JSON and is:

{
  "question_stem": "In an electron microscopy image of a neuron expressing human mHTT (BACHD) protein, distinct, spherical features are prominent and seem to be atypical. What is the most probable explanation for these features?",
  "choices": [
    "Concentration of cellular waste products leading to distinct morphological anomolies",
    "Formation of air bubbles introduced during sample immobilization processes",
    "Ice contamination from liquid nitrogen during sample preparation",
    "Inclusion bodies formed by aggregated proteins expressing during stress response",
    "Cellular morphological changes during mitotic division"
  ],
  "correct_index": 2,
  "explanation": "The revised options introduce plausible scientific phenomena that could also result in spherical or distinct anomalies. This set of distractors is more varied and grounded in legitimate alternative explanations without leading the responder towards the correct answer through language patterns. For example, air bubbles and protein inclusion bodies might mimic sphere-like structures. Additionally, by adding an unrelated phenomenon like cellular changes during division, the set confuses the language pattern strategy previously used. Thus, the correct choice of \"Ice contamination from liquid nitrogen\" stands out less and necessitates either visual confirmation or real understanding of sample preparation outcomes to identify correctly."
}

Notice that the response includes an explanation of why it made the changes, and the changes relate to what was in the reflection, adding interpretability.

4. Finally, we check that the rewrite has not changed the semantic meaning of the question and answer.

Prompt
********************************************************************************
Below are two question-answer pairs. The question-answer pairs are part of VQA triplets, and both pairs use the same image. Are these question-answer pairs semantically equivalent? Or are they significantly different? Give a true/false and also an explanation.
QUESTION 1:
In an electron microscopy image of a mouse primary embryonic cortical neuron expressing full-length human mHTT (BACHD), "eye-ball" shaped features are observed in the top left corner that appear anomalous. What is the most likely cause of these features?
ANSWER 1:
Ice contamination from liquid nitrogen during sample preparation
QUESTION 2:
In an electron microscopy image of a neuron expressing human mHTT (BACHD) protein, distinct, spherical features are prominent and seem to be atypical. What is the most probable explanation for these features?
ANSWER 2:
Ice contamination from liquid nitrogen during sample preparation
********************************************************************************
Response, is_equivalent: True
********************************************************************************
is_equivalent: Both question-answer pairs revolve around electron microscopy images depicting neurons expressing the human mHTT (BACHD) protein. The main focus in each question is on the presence of unusual spherical (or ’eye-ball’ shaped) features. In both cases, the identical explanation is provided: ice contamination from liquid nitrogen during sample preparation. While the wording differs slightly, the core components, contexts, and the ultimate explanation remain consistent between the two pairs. Thus, they can be considered semantically equivalent as they describe the same phenomenon with the same reasoning for the observed anomalies.

5. If the meaning check fails, then we exit with a fail, because we observe that once the meaning deviates it becomes more distorted with further iterations. Otherwise, we update the MCQ and start the loop again; the first function call in the new loop checks whether the revised MCQ is now hard enough or needs to be rewritten again.
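The meaning check in step 4 can be sketched in a few lines. This is a minimal illustration, not the released RefineBot code: the function name `check_meaning_preserved_sketch`, the injected `llm` callable, and the request for a JSON reply with `is_equivalent` and `explanation` keys are all assumptions made here so that the reply can be parsed. The prompt wording follows the prompt shown above.

```python
import json
from typing import Callable

# Prompt text follows the meaning-check prompt shown above; the final sentence
# asking for a JSON reply is an assumption added to make parsing simple.
MEANING_CHECK_TEMPLATE = (
    "Below are two question-answer pairs. The question-answer pairs are part of "
    "VQA triplets, and both pairs use the same image. Are these question-answer "
    "pairs semantically equivalent? Or are they significantly different? "
    "Give a true/false and also an explanation. "
    "Return a JSON object with the keys is_equivalent and explanation.\n"
    "QUESTION 1:\n{q1}\nANSWER 1:\n{a1}\n"
    "QUESTION 2:\n{q2}\nANSWER 2:\n{a2}\n"
)


def check_meaning_preserved_sketch(
    original: dict, revised: dict, llm: Callable[[str], str]
) -> bool:
    """Return True only if the (hypothetical) LLM judge says the pairs match.

    `original` and `revised` use the keys from the pseudo-code above:
    'question', 'choices', 'correct_idx'. Only the question and the correct
    answer are compared, since distractors are expected to change.
    """
    prompt = MEANING_CHECK_TEMPLATE.format(
        q1=original["question"],
        a1=original["choices"][original["correct_idx"]],
        q2=revised["question"],
        a2=revised["choices"][revised["correct_idx"]],
    )
    reply = llm(prompt)
    try:
        parsed = json.loads(reply)
    except json.JSONDecodeError:
        # Conservative choice: an unparsable reply counts as a failed check.
        return False
    return isinstance(parsed, dict) and parsed.get("is_equivalent") is True
```

Passing the LLM as a callable keeps the sketch independent of any particular API client. Treating malformed replies as failures matches the conservative exit described in step 5, where a failed check ends the run.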
Calibrating the preservation of meaning. Since there is a risk that the rewriter will change the semantic meaning of the question and answer, it is crucial that check_meaning_preserved is effective. For topics requiring domain expertise, this requires prompt engineering with a domain expert providing feedback. This is one of the key human bottlenecks in tuning the prompts in this system.

Other sources of MCQ invalidation. While the check_meaning_preserved function checks that the question and answer keep their meaning, an additional risk is that the rewrite introduces distractors that are more correct than the target answer. In our manual review (we reviewed all 1,042 MCQs), this was very rare. However, if this were a significant problem in future systems, it may be addressable by tuning the rewriter bot, with an expert human evaluator in the prompt engineering loop.

Extending RefineBot beyond language shortcuts. We constructed RefineBot specifically to address the issue of language shortcuts, where the question is answerable without using the image. This is appropriate because we work with vision-centric VQA, meaning the questions should require image interpretation. However, the framework should work in other settings, namely language-only QA. This would require adjusting the prompts slightly. For example, the current reflection prompt asks how the question was successfully answered without using the image; also, the current rewriter prompt directs the LLM to address language shortcuts in its rewrite. Therefore, adapting RefineBot to language-only MCQs would require making these prompts more general.

On the other hand, the method could also be updated for the more general VQA setting, to improve questions that do not have language shortcuts but are easy for some other reason.
Here, the evaluator and reflector agents, which are currently LLMs, could be switched for an MLLM, which also ingests the image (though this would be expensive for current state-of-the-art models). We hypothesize that current MLLMs are so weak at image perception that this is not necessary for MicroVQA, and that simply removing language shortcuts will suffice for creating a challenging benchmark. As MLLMs improve at perceiving microscopy images, this general VQA RefineBot could become more useful for ensuring that the benchmark remains challenging.

Repeated RefineBot iterations. A single run of RefineBot may exit as a FAIL, either because the rewrite changed the meaning of the question or because the maximum number of iterations was reached. However, RefineBot can give very different outputs when changing the random seed, and we find that simply running again with a different random seed can lead to a successful run. Therefore, if the bot fails, we rerun it.

If an MCQ exited RefineBot successfully, then the evaluator LLM was fooled by the question (got the answer incorrect). Yet, the question could still be suboptimal in detectable ways: the LLM could be capable of narrowing the MCQ down to two options but have guessed the wrong one; moreover, the question could be easier for different models. To ensure it is difficult enough overall, we require that the MCQ is answered incorrectly for two random seeds from two different models: gpt-4o-2024-08-06 and claude-3-5-sonnet-20241022. If not, then it fails, and we run RefineBot again with a different random seed.

We discuss in the experiments (Sec. 5.1) how this process leads to a small evaluation bias against the models used as evaluators. That section shows that despite the bias, this process makes the MCQs generally harder for all models. Future versions of RefineBot could mitigate this by including a larger family of evaluator models.

Finally, some MCQs fail to pass the RefineBot check.
In these cases, we simply keep the stage 1 MCQ.

E.5. MCQ generation results: stage 1 exam alignment

The Stage 1 pipeline produced one-best-answer MCQs that preserved the content, testing focus, and validity of the expert-generated question-answer pairs. It generated an initial set of distractors following best practices outlined in educational and NBME guidelines. A key design goal of Stage 1 was to preserve the original question content without altering or introducing new information. Secondary objectives included ensuring proper formatting according to NBME guidelines and minimizing information that might provide clues to the correct answer without reference to the image.

While the pipeline ensured alignment with these goals, preserving the original content occasionally resulted in MCQs containing extra information, such as image descriptions or specific biological terms (e.g., gene or protein names), that could inadvertently help narrow the options. For example, stage 1 question stems sometimes included protein names, disease descriptions, or descriptions of image features, which could reduce the challenge. Figure 11 illustrates Stage 1 outputs that are appropriately formatted but may preserve details from the original input, making questions less challenging. While the distractors were biologically plausible, some options were overly easy to eliminate. These limitations are addressed in Stage 2.

The o1-mini-2024-09-12 LLM judge, employed during DSPy optimization, compared the Stage 1 MCQs to the original inputs to evaluate (1) semantic similarity and (2) adherence to MCQ formatting. The results showed that 96% of questions preserved the original meaning, and 100% adhered to NBME formatting guidelines. Manual analysis of the 4% labeled as at least partially dissimilar revealed that these cases typically involved open-ended original questions being reformulated into closed-ended MCQs.
This transformation, a necessary step for creating one-best-answer MCQs, was not a failure but rather an expected outcome of Stage 1. In rare instances where the original submission included multiple correct answers, the output of Stage 1 retained one correct answer and excluded others to comply with the one-best-answer format. The LLM judge flagged these cases as partially dissimilar, but these changes were essential for adhering to the MCQ structure.

A common issue in MCQ design is that correct answers tend to be longer and more detailed than distractors. To assess this, we computed the ratio of tokens in incorrect options to those in the correct answer, where values below 1 indicate longer correct answers and values above 1 indicate shorter correct answers. Stage 1 MCQs showed a ratio of 0.87, indicating that correct answers were only slightly longer than distractors. By comparison, naive MCQ generation yielded a ratio of 0.46, meaning correct answers were roughly twice as long as distractors, making them easier to identify. Additionally, naive MCQ generation preserved the original meaning in only 89% of cases and met NBME formatting guidelines in just 70%, highlighting the efficacy of the Stage 1 process at preserving information and ensuring proper MCQ formatting.

E.6. MCQ generation results: stage 2 RefineBot

The second stage of our MCQ generation method focuses on increasing the difficulty of the questions and options. The key idea is that we want to remove language shortcuts: information in the MCQ that allows answering the question without access to the image. To aid in understanding, we construct a toy example in Fig. 10 with the three types of language shortcuts that we identify. The question is paired with an image in which the stain marks mitochondria, and we want the VQA to require actually inferring that from the image. In shortcut 1, the text ‘gives away’ the image content so it’s trivial to answer the question.
In shortcut 2, the LLM generates implausible or weak distractors. Shortcut 3, ‘language bias’, is a known problem in VQA [29].

Figure 10. Three types of language shortcut relevant to MicroVQA. The target VQA has an image that is fluorescence microscopy stained with TOMM20, which would show a pattern consistent with visualizing mitochondria.

Moving to real MCQs, in Fig. 4 we illustrate the transformation of a question at each stage, highlighting issues and corresponding improvements. While stage 1 produces exam-style questions that preserve the QA objective, the wording of the questions and distractors often enables MLLMs to exploit shortcuts rather than demonstrate knowledge. To address this, stage 2 (RefineBot) employs a chain-of-thought process. We analyze additional examples of improvements achieved in stage 2, showing how RefineBot improves both the style and content. The key strategies correspond to the three shortcut types that we identified:
1. (Visual giveaway) Over-description of the image eliminates the need for image interpretation. For example, in Fig. 11 (middle row), stage 1 includes details that reveal the specifics of the image, allowing the model to propose common issues without analyzing the image. Stage 2 replaces these specifics with general terms in the question stem.
2. (Weak distractors) Distractors are unrelated to the specifics of the scenario described by the question. In Fig. 11 (bottom row), stage 2 increases distractor complexity by incorporating elements tied to the question’s context, making them less predictable.
3. (Language bias) Overly precise wording in the question stem can make it easy to identify the correct answer through pattern matching. For example, in Fig. 11 (top row), changing ‘eye-ball shaped’ to ‘spherical’ retains the question’s intent while creating more challenging distractors referencing plausible alternative processes.
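As a concrete companion to the answer-length check reported in Appendix E.5 (ratios of 0.87 for Stage 1 and 0.46 for naive generation), the following sketch shows one way such a ratio could be computed. The text does not state the tokenizer or the aggregation, so two assumptions are made here: tokens are whitespace-separated words, and the per-question ratio (mean distractor length over correct-answer length) is averaged over MCQs. The function names are hypothetical.

```python
from statistics import mean


def length_ratio(choices: list[str], correct_idx: int) -> float:
    """Mean distractor length divided by correct-answer length for one MCQ.

    Assumption: a 'token' is a whitespace-separated word. Values below 1 mean
    the correct answer is longer than the average distractor.
    """
    n_correct = len(choices[correct_idx].split())
    if n_correct == 0:
        raise ValueError("correct answer is empty")
    distractor_lengths = [
        len(choice.split())
        for i, choice in enumerate(choices)
        if i != correct_idx
    ]
    if not distractor_lengths:
        raise ValueError("MCQ has no distractors")
    return mean(distractor_lengths) / n_correct


def dataset_length_ratio(mcqs: list[tuple[list[str], int]]) -> float:
    """Average the per-question ratio over (choices, correct_idx) pairs.

    Assumption: the reported number is a simple mean over questions.
    """
    return mean(length_ratio(choices, idx) for choices, idx in mcqs)


# Stage 1 options from the worked example in Appendix E.4 (correct answer is b).
example_choices = [
    "Aggregation of human mHTT protein within the neuron",
    "Ice contamination from liquid nitrogen during sample preparation",
    "Formation of intracellular vesicles during neuron development",
    "Artifacts arising from excessive heavy metal staining",
]
example_ratio = length_ratio(example_choices, correct_idx=1)
```

With a subword tokenizer the absolute values would shift, but the comparison between generation methods is the quantity of interest.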
Figure 11 (text recovered from the figure; each row pairs an image with its stage 1 and stage 2 MCQ):

Top row, MCQ stage 1
Question: In an electron microscopy image of a mouse primary embryonic cortical neuron expressing full-length human mHTT (BACHD), "eye-ball" shaped features are observed in the top left corner that appear anomalous. What is the most likely cause of these features?
A) Aggregation of human mHTT protein within the neuron
B) Ice contamination from liquid nitrogen during sample preparation
C) Formation of intracellular vesicles during neuron development
D) Artifacts arising from excessive heavy metal staining
Correct answer: B

Top row, MCQ stage 2
Question: In an electron microscopy image of a neuron expressing human mHTT (BACHD) protein, distinct, spherical features are prominent and seem to be atypical. What is the most probable explanation for these features?
A) Concentration of cellular waste products leading to distinct morphological anomalies
B) Formation of air bubbles introduced during sample immobilization processes
C) Ice contamination from liquid nitrogen during sample preparation
D) Inclusion bodies formed by aggregated proteins expressing during stress response
E) Cellular morphological changes during mitotic division
Correct answer: C

Top row annotations: "Very specific wording"; "Also involves sample processing, correct answer stands out less"; "Plausible spherical structure" (twice); "Wording maintains meaning while relating to distractors".

Middle row, MCQ stage 1
Question: In a confocal microscopy image of HEK293T cells, the gene PSMB7 is endogenously tagged with GFP. One cell appears significantly larger than the others and contains multiple nuclei. What is the most likely reason for the presence of multi-nucleated cells in this culture?
A) Increased rates of apoptosis in the culture
B) Cell fusion events leading to multinucleation
C) Normal occurrence in immortalized cell lines
D) Contamination with another cell type
Correct answer: C

Middle row, MCQ stage 2
Question: In a confocal microscopy image that displays an immortalized line of epithelial cells tagged with a fluorescent protein, one cell is notably larger than its neighbors and contains multiple nuclei. Considering possible observations, what could be the leading cause of multinucleation in this scenario?
A) Enhanced nuclear division without cytokinesis completion during cell replication
B) Cellular response to an excessive nutrient-rich environment
C) Abnormal chromosomal segregation during meiosis
D) The accidental integration of another distinct cell line into the culture
E) Natural behavior in certain transformed eukaryotic cells
Correct answer: E

Middle row annotations: "Easy to rule out: unconnected to question"; "Gives away image specifics"; "More general wording preserves content"; "Relate better to question".

Bottom row, MCQ stage 1
Question: Confocal microscopy images of U2OS cells were prepared by staining nuclei with DAPI (blue), microtubules with an α-tubulin antibody (red), and the protein TREX1 with a green antibody. In one of the images, the cells in the upper and lower parts appear to have blurred nuclei and partially visible microtubules, making the cells look smaller than expected. What is the most likely reason for this observation?
A) Overfixation of cells leading to excessive cross-linking
B) Underexposure during image acquisition causing insufficient signal
C) Out-of-focus sections resulting in blurred cellular structures
D) Non-specific antibody binding causing background fluorescence
Correct answer: C

Bottom row, MCQ stage 2
Question: Microscopy images of eukaryotic cells stained for nuclei (blue), cytoskeleton (red), and a nuclear-associated protein (green) show areas with unclear nuclear boundaries and faint cytoskeletal signals. The cells appear more compressed than anticipated. What might be the most plausible cause for this observation?
A) Inadequate washing leading to excess dye retention
B) Slight misalignment of the optical components causing image distortion
C) Improper focal settings creating unfocused regions
D) Irregular light source intensity impacting image clarity
E) Low temperature causing cellular contraction affecting image details
Correct answer: C

Bottom row annotations: "Hard distractor: links question and image content"; "Easy to rule out: unrelated to the question"; "Gives away image modality"; "Maintains key information".

Figure 11. Examples of changes to questions and options between stage 1 and stage 2 (RefineBot) of our MCQ generation method. Elements that need to be improved are shown in red, and improvements in green.

F. Experiments details

F.1. Evaluation prompts

All prompts except for LLaVA and LLaVA-Med used the chain-of-thought prompt template from the MMMU-Pro code [88]:

The following is a multiple choice question (with answers). Think step by step and then output the answer in the format of \"The answer is (X)\" at the end.
{{QUESTION}}
Options:
{{CHOICES}}

An example complete question is:

The following is a multiple choice question (with answers). Think step by step and then output the answer in the format of "The answer is (X)" at the end.
A fluorescence microscopy image displays cultured mouse primary neurons stained with a specific marker localized to mitochondria. The fluorescence signal exhibits an uneven distribution across the neuronal cell, and there is an absence of signal within the nucleus. What is the most probable cause for the uneven distribution of the mitochondrial signal?
Options:
(1): Mitochondrial transport along axonal and dendritic processes ensures uneven distribution based on cellular energy demands.
(2): Differential expression of mitochondrial proteins leads to variable fluorescence signal across the neuron.
(3): Mitochondrial biogenesis occurs at specific neuronal regions, causing localized fluorescence accumulation.
(4): Mitochondrial anchoring to the cytoskeleton restricts their movement, resulting in uneven distribution.
(5): Mitochondrial dynamics, including fission and fusion, allow mitochondria to move and distribute throughout the neuron.

The answer is then extracted with the following regex pattern:
    answer is \(?([0-9])\)?

For LLaVA-Med, we added an extra parsing strategy. This was because we observed that while LLaVA-Med often gave reasonable responses to questions, it failed to follow instructions well. First, it did not consistently follow the CoT prompt to think step by step. It also did not use the right answer formatting: "The answer is (X)". Since we were more interested in biomedical reasoning than in adherence to VQA formatting, we decided to handle this model differently. Specifically, to encourage CoT reasoning, we fixed the prefix of the response to:
    Response: Let’s think through this step by step and evaluate each option.
Then we added a suffix to the end:
    After carefully considering all the options, the most likely answer is (
Then we continued generation, and did regular regex parsing to get the final answer.

We also find that a few models make minor but consistent errors in formatting the answer, and we elect to fix them. For example, x-ai/grok-2-vision-1212 commonly adds two asterisks before the answer, like “The answer is **(4)**” instead of “The answer is (4)” (GPT-4o also did this sometimes). Another common error that we saw but did not handle (because it would be difficult to do rigorously) was shown by meta-llama/llama-3.2-11b-vision-instruct: the response would write out one of the MCQ options but not state its number.

F.2. Model details

Our main benchmarking experiments in Tab. 3 refer to models that have multiple versions.
The exact model versions used for inference are shown in Tab. 6.

Table 6. AI model API endpoints and their sources

Model Name                        API Endpoint                               Source & Details
o1                                o1-2024-12-17                              OpenAI API
Claude-3.5-Sonnet                 *anthropic/claude-3.5-sonnet-20240620      OpenRouter API
Gemini-Pro-1.5                    google/gemini-pro-1.5                      OpenRouter API
Pixtral-Large                     mistralai/pixtral-large-2411               OpenRouter API
Grok-2-Vision                     x-ai/grok-2-vision-1212                    OpenRouter API
Qwen-2-vl-72b-Instruct            qwen/qwen-2-vl-72b-instruct                OpenRouter API
VILA1.5-40b                       VILA1.5-40b                                HuggingFace, local inference
GPT-4o                            gpt-4o-2024-08-06                          OpenAI API
Llama-3.1-Nemotron-70b-Instruct   nvidia/llama-3.1-nemotron-70b-instruct     OpenRouter API
Llama-3.2-90b-Vision-Instruct     meta-llama/llama-3.2-90b-vision-instruct   OpenRouter API
Qwen-2-VL-7b                      qwen/qwen-2-vl-7b-instruct                 OpenRouter API
Claude-3.5-Haiku                  anthropic/claude-3.5-haiku                 OpenRouter API
Gemini-Flash-1.5-8b               google/gemini-flash-1.5-8b                 OpenRouter API
GPT-4o-mini                       gpt-4o-mini-2024-07-18                     OpenAI API
Pixtral-12b                       mistralai/pixtral-12b                      OpenRouter API
VILA1.5-13b                       VILA1.5-13b                                HuggingFace, local inference
Llama-3.2-11b-vision-instruct     meta-llama/llama-3.2-11b-vision-instruct   OpenRouter API
LLaVA-Med-Mistral-7B              LLaVA-Med-Mistral-7B                       HuggingFace, local inference
LLaVA-Mistral-7B                  LLaVA-Mistral-7B                           HuggingFace, local inference

For LLaVA and LLaVA-Med, although the LLaVA version number (1.6) differs from that of LLaVA-Med (1.5), we believe that LLaVA-Med was either trained from this LLaVA version or from one that is very similar. They have identical architectures, and the version numbers are very close; it is possible that LLaVA-Med copied the number incorrectly. Neither the LLaVA-Med codebase nor its GitHub issues clarify this at the time of writing.

Note that for OpenRouter function calls to Anthropic, the date is not included in the string. We believe they use timestamps for older model versions, but not for the latest version. Reproducing our results may therefore require checking whether the model has been updated.

F.3. Human baseline on MicroVQA

To perform a human baseline, we recruited co-authors who created questions. The conditions were:
• Questions were provided in a Google Form.
• Participants did not view any question that they created.
• Internet access was allowed, but LLM access was not. This is because realistic biological practice is knowledge-intensive, and may require looking up facts or references.
• Time limit was 3 minutes per question. They were free to take as many breaks as they wanted.
• They answered the multiple choice question without further written justification.
These conditions (3 minutes and internet access) give sufficient time for an expert to make a reasonable attempt. Since the score was far below 100%, we claim that an LLM chat assistant able to answer these questions could be of value to real experts. Each participant only took a subset of the questions (about 100 each). The final score is the aggregate.

F.4. More experiments on attribute-level information

Figure 12. Performance by image modality type for the best models: Gemini-1.5-Pro (closed source), VILA1.5-40B (open-source), and LLaVA-Med-Mistral-7B (medical). The plot reports accuracy for electron microscopy, fluorescence microscopy, and light microscopy.

We described our metadata features in Sec. 3, and we can now leverage them to benchmark models on different attributes. Fig. 5 (left) shows all sub-tasks, including ‘compare image sets’, which is common in biology for comparing ‘treatment’ and ‘control’ groups. All models perform relatively well on it, despite multi-image reasoning being understudied in MLLMs [78]. Fig. 5 (right) shows that for Gemini and LLaVA-Med, higher Bloom’s levels, corresponding to more advanced reasoning, lead to worse performance. This supports the intuitive result that higher-level reasoning is more challenging, although VILA does not follow this trend. A final result in Fig.
12 shows that all models performed worse on cryo-electron microscopy images than on brightfield or fluorescence microscopy images. This may be due to cryo-EM’s challenging features (low contrast, low signal-to-noise ratio) and its under-representation in pretraining datasets [93].

F.5. Language shortcut ablations

In Tab. 4 we do language shortcut ablations for some top-performing models and some lower-performing models – both no-image and choices-only ablations.

Table 7. Ablation study on MicroVQA MCQs. Column 2 is overall accuracy. Cols 3-4 are the no-image ablation accuracy and drop compared to overall accuracy. Cols 5-6 are the choices-only ablation accuracy and drop compared to overall accuracy. See the text for discussion.

Model | Overall accuracy | No-image accuracy | No-image drop | Choices-only accuracy | Choices-only drop
o1 | 52.8 | 49.2 | -3.6 | 37.7 | -15.1
Claude-3.5-Sonnet | 51.7 | 46.0 | -5.8 | 44.0 | -7.7
Gemini-Pro-1.5 | 51.1 | 47.2 | -3.8 | 36.8 | -14.3
Pixtral-Large | 49.8 | 46.3 | -3.6 | 36.7 | -13.1
Grok-2-Vision | 48.4 | 46.3 | -2.1 | 40.5 | -7.9
GPT-4o-mini | 46.2 | 44.2 | -1.9 | 34.3 | -11.9
Pixtral-12b | 45.6 | 43.7 | -1.9 | 31.8 | -13.8
Random | 22.0 | | | |

The no-image ablation is a common test in VQA evaluations [29, 74]. The MLLM only receives the text prompt, and we add the following sentence to the prompt template: “If an image is mentioned ignore this information and try your best to answer the question.”

The choices-only ablation has been explored in (non-visual) question answering [7]. The model does not receive the image or the question and must guess an answer from only the choices. We add this sentence to the prompt template: “However I will not give you the question text or the images, I will only give you the choices, so please try your best to answer the question.”

Both ablations lower the scores, but they are still above random. While this is initially surprising, it is actually consistent with other benchmarks.
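As a quick sanity check on Table 7, the drops can be recomputed from the rounded accuracies. The sketch below is illustrative only: the numbers are transcribed from Table 7, while the function and variable names are hypothetical and not taken from the MicroVQA codebase. Recomputed drops differ from the printed ones by at most 0.1 point in a few rows, presumably because the printed drops were derived from unrounded accuracies (an assumption on our part).

```python
# Accuracies (%) transcribed from Table 7: (overall, no-image, choices-only).
TABLE7_ACCURACY = {
    "o1": (52.8, 49.2, 37.7),
    "Claude-3.5-Sonnet": (51.7, 46.0, 44.0),
    "Gemini-Pro-1.5": (51.1, 47.2, 36.8),
    "Pixtral-Large": (49.8, 46.3, 36.7),
    "Grok-2-Vision": (48.4, 46.3, 40.5),
    "GPT-4o-mini": (46.2, 44.2, 34.3),
    "Pixtral-12b": (45.6, 43.7, 31.8),
}

# Drops as printed in Table 7: (no-image drop, choices-only drop).
TABLE7_REPORTED_DROPS = {
    "o1": (-3.6, -15.1),
    "Claude-3.5-Sonnet": (-5.8, -7.7),
    "Gemini-Pro-1.5": (-3.8, -14.3),
    "Pixtral-Large": (-3.6, -13.1),
    "Grok-2-Vision": (-2.1, -7.9),
    "GPT-4o-mini": (-1.9, -11.9),
    "Pixtral-12b": (-1.9, -13.8),
}

RANDOM_BASELINE = 22.0  # "Random" row of Table 7


def ablation_drops(overall, no_image, choices_only):
    """Return (no-image drop, choices-only drop) in accuracy points, rounded to 0.1."""
    return (round(no_image - overall, 1), round(choices_only - overall, 1))


def recompute_all(table):
    """Recompute both drops for every model in the table."""
    return {model: ablation_drops(*acc) for model, acc in table.items()}


def max_discrepancy(recomputed, reported):
    """Largest absolute gap between recomputed and printed drops."""
    return max(
        abs(r - p)
        for model in recomputed
        for r, p in zip(recomputed[model], reported[model])
    )


def margin_over_random(table, baseline):
    """Gap between the lowest ablated accuracy in the table and the random baseline."""
    lowest = min(min(no_image, choices_only) for _, no_image, choices_only in table.values())
    return round(lowest - baseline, 1)


if __name__ == "__main__":
    drops = recompute_all(TABLE7_ACCURACY)
    for model, (no_image_drop, choices_drop) in drops.items():
        print(f"{model}: no-image {no_image_drop}, choices-only {choices_drop}")
    print("max discrepancy vs. printed drops:", max_discrepancy(drops, TABLE7_REPORTED_DROPS))
    print("margin over random:", margin_over_random(TABLE7_ACCURACY, RANDOM_BASELINE))
```

The margin helper makes the "still above random" observation concrete: even the lowest ablated score in the table stays several points above the random baseline.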
A key challenge in VQA benchmarks is for questions to be “vision-centric”: that is, answering the question requires using the image features. We still have very strong evidence that MicroVQA is vision-centric, which we discuss below.

First, let’s recall the language shortcuts by considering a VQA sample with an image of a green field and the question “what color is the grass?”.
• ‘Language bias’ is well-known for enabling models to make educated guesses about MCQs without access to the image [29]. For the question “what color is the grass?”, “green” is a likely guess without the image, although there are other possible answers, like “brown”.
• ‘Visual giveaway’ means there are too many image details in the question, for example if the question were “what color is the green grass?” This removes any need to look at the image.
• ‘Weak distractors’ means the other answer choices are easily eliminated. For example, for the question “what color is the grass?”, a bad distractor would be “pink”. This also removes any need to look at the image.
(In Appendix E.6, we also discussed language shortcuts with biology examples as they relate to MCQ construction.)

Now we argue that MicroVQA is still vision-centric despite the ablation results.

1. Language bias allows cheating on vision-centric questions. If a question is answerable due to a ‘language bias’, it may still require perception. A model may guess that the grass is green, but it is only a guess – the grass really could be brown [29]. Therefore questions answered correctly under the ‘no-image’ ablation may still be vision-centric. On the other hand, problems with ‘visual giveaway’ and ‘weak distractor’ shortcuts really do render the question not vision-centric. The RefineBot method we employed in MCQ generation was designed to target language shortcut removal. Anecdotally, our qualitative analysis (from Appendix E.6) found that RefineBot tends to focus on fixing the visual giveaway and weak distractor shortcuts. Ideally we would perform a quantitative analysis to identify which language shortcuts are due to language bias versus the others; however, analyzing these questions is very time-consuming.

2. The unsolved questions are hard because of perception. The ‘no-image’ ablation only tells us about the ‘easy’ questions, but half the questions are too hard for current models. Our qualitative analysis of 30 failure cases demonstrates that half of them made critical perception errors, which strongly suggests they require image understanding.

In contrast to the ‘vision-centric’ claims from the Cambrian paper [74]. They show that many VQA benchmarks have no-image ablation scores only slightly underperforming the with-image scores, and they conclude that such VQA benchmarks “do not properly measure vision-centric capabilities”. But that conclusion cannot be drawn from only the no-image ablation (following our point 2 above). For benchmarks where MLLMs score far from 100%, the failures could mostly be due to poor image understanding. This is very plausible since visual capabilities in MLLMs are viewed as lagging behind language capabilities. Likewise, our experience with our microscopy data was that visual understanding was poor, but that language knowledge was impressive.

In point 1 above, we argued that MLLMs can use language biases to pass the no-image ablation, even if they are vision-centric questions. This is actually in agreement with the claim from Cambrian that those questions don’t test vision-centric capabilities.

Finally, some non-vision-centric questions are acceptable. The goal of MicroVQA is to source realistic problems from microscopy research. While this area does lean heavily on image analysis, not every question will realistically require perception, so this is acceptable.

F.6. Error analysis

Here we show samples from the manual qualitative error analysis (details in the main paper). To see all 30 examples, refer to the GitHub repo.
F.6.1 Perception errors

Q: A cryo-electron tomography (cryoET) slice of mammalian cells was acquired at a resolution of 1.3 Å and binned by 4 (resulting in a 5.3 Å per pixel). Within the tomogram, dark circular like structures are observed in an organelle. What is the modest likely identity of these structures?
A) Ribosomes
B) Phase-separated condensates such as stress granules
C) Lysosomes
D) Peroxisomes
Subject: Cell and molecular biology
Bloom’s level: 4 (Analysis)
Specimen: mammalian cells
Image key 5: This is a slice of cryoET tomogram of mammalian cells acquired at 1.3 A but binned by 4 (5.3 A/pix)
VLM Answer:
Let me think about this step by step:
1) First, let's look at what we see in the image:
- A cryo-electron tomography slice showing cellular structures
- Circular-like dark structures visible
- The resolution is quite high (5.3 Å after binning)
2) Let's consider the size and appearance of these structures:
- They appear as dense, small, roughly circular objects
- The size is consistent with ribosomes (approximately 25-30 nm)
- They show the characteristic electron density pattern of ribosomes
3) Let's analyze each option:
- Ribosomes: Match the size, shape, and density pattern
- Phase-separated condensates: Usually less defined, more amorphous
- Lysosomes: Much larger, membrane-bound organelles
- Peroxisomes: Also larger, membrane-bound organelles
4) Additional considerations:
- Ribosomes are often found associated with mitochondria
- The resolution is sufficient to resolve ribosomal structures
- The contrast and density are typical of ribosome imaging in cryoET
The answer is (1)
Error type: Perception
Error reason: These are granules characterized by heterogeneous, electron-dense aggregates. Ribosomes would be smaller, more regular in shape, and not have as high contrast.
Figure 13

Q: A hematoxylin and eosin (H&E)-stained micrograph shows cellular formations within a gastrointestinal tissue sample with unique organization and cellular patterns. What pattern and morphology do these cells most closely align with?
A) Bundles of elongated cells with pale pink cytoplasm and uniform oval nuclei in a crisscross pattern
B) Circular groups of cells with mucin in central spaces and small, dark nuclei
C) Dense arrays of polygonal cells with small nucleoli and vacuolated cytoplasm
D) Rows of cells with conspicuous cytoplasm and rounded nuclei
E) Diffused arrangement of cells with transparent cytoplasm and irregularly clustered chromatin
Subject: Anatomy
Bloom’s level: 4 (Analysis)
Specimen: gastrointestinal lesion biopsy
Image key 157: H&E-stained micrograph of gastrointestinal lesion biopsy
VLM Answer:
Let me analyze this step by step:
1) Looking at the H&E-stained micrograph:
- It shows a circular cross-section of gastrointestinal tissue
- There's a clear organizational pattern along the periphery
- The center shows pink-stained connective tissue (lamina propria)
2) Focusing on the cellular arrangement:
- The cells are arranged in circular groups
- These groups form glandular structures
- There are clear central spaces (lumens) in these structures
- The nuclei appear dark and are located at the base of the cells
3) Comparing with the options:
- Option 1 describes elongated cells in crisscross pattern - doesn't match
- Option 2 describes circular groups with mucin in central spaces and small, dark nuclei - matches perfectly
- Option 3 describes polygonal cells with vacuoles - doesn't match
- Option 4 describes rows of cells - doesn't match
- Option 5 describes diffused arrangement - doesn't match
4) The image shows typical intestinal glands (crypts) with:
- Circular arrangement
- Central lumens containing mucin
- Small, dark nuclei at the base of the cells
The answer is (2)
Error type: Perception
Error reason: No. The intestinal glands are not the lesion. There is a pink spindle cell lesion with a fascicular arrangement of cells
Figure 14

Q: An H&E-stained image of a nerve sheath tumor displays unique cell arrangements and marked changes within its structure. Given these observations, what is the most likely reason for the observed morphological alternations?
A) Indication of rapid cellular proliferation
B) Sign of tumor shrinkage post-treatment
C) Result of genetic changes within the tumor cells
D) Outcome of a chronic process affecting the tumor
E) Sign of metastasis to another location
Subject: Pathology
Bloom’s level: 4 (Analysis)
Specimen: schwannoma tissue
VLM Answer:
Let me analyze this step by step:
1) Looking at the H&E-stained images, I can observe:
- Dense cellular proliferation
- Distinctive "herringbone" or swirling pattern
- High cellularity with numerous nuclei
- Evidence of active growth pattern
2) Key characteristics visible:
- The cells are arranged in fascicles
- There's a high nuclear to cytoplasmic ratio
- The pattern suggests active tumor growth
- No signs of necrosis or treatment effect
- No obvious signs of metastatic spread
3) Analysis of the options:
- Option 1 matches with the high cellularity and growth pattern
- Option 2 doesn't fit as there's no evidence of shrinkage
- Option 3 while possible, isn't directly evident in morphology
- Option 4 doesn't match the active growth pattern seen
- Option 5 isn't supported by the histological features shown
4) The morphological features strongly suggest active tumor growth with:
- Characteristic fascicular pattern
- High cellular density
- Well-preserved tumor architecture
- Signs of active proliferation
The answer is (1)
Error type: Perception
Error reason: No. One image shows cystic change, which can be seen in chronic, longstanding schwannomas. The other shows Verocay bodies and fascicular growth, which do not indicate rapid cellular proliferation.
Figure 15

Q: In a fluorescence microscopy image comparing wounded and unwounded liver tissues stained to visualize markers of immune response – neutrophil derived migrasomes (Ly6G, yellow) and platelets (CD41, magenta), what is most likely observed about the distribution of these markers in unwounded tissue?
A) Slight overlap without any notable biological consequences
B) Substantial overlap implying persistent immune activation
C) No overlap, indicating complete segregation of cellular functions
D) Randomized distribution suggesting passive circulation
E) Significant separation highlighting cellular non-interaction
Subject: Physiology
Bloom’s level: 2 (Comprehension)
Specimen: mouse liver tissue (wounded and unwounded)
VLM Answer:
Let me think about this step by step:
1) The image shows two types of staining:
- Yellow (Ly6G) for neutrophil-derived migrasomes
- Magenta (CD41) for platelets
2) Looking at the unwounded tissue (presumably the left image):
- There are distinct yellow dots (migrasomes)
- There are magenta structures (platelets)
- The distribution appears fairly separated
- The markers rarely overlap
3) This makes biological sense because:
- In healthy/unwounded tissue, immune cells and platelets typically maintain their distinct roles
- Without injury, there's no strong stimulus for interaction
- The cells are likely performing their normal surveillance functions
4) Analyzing the options:
- Option 1 suggests slight overlap - more than what's visible
- Option 2 suggests high overlap - not seen
- Option 3 indicates complete segregation - matches the image
- Option 4 suggests random distribution - the pattern looks more organized
- Option 5 also suggests separation - similar to option 3
5) Option 3 best matches what we observe: "No overlap, indicating complete segregation of cellular functions"
The answer is (3)
Error type: Perception
Error reason: The markers rarely overlap, but there is very focal colocalization in some puncta. The selected answer argues for “complete segregation of cellular functions”, which is a too extreme or absolute conclusion based on the image.
Figure 16

F.6.2 Knowledge errors

Q: A double immunofluorescence experiment was conducted on nerve cells, highlighting Sema4B (red marker) and Sox9, an astrocyte marker (green marker). Sema4B was found primarily surrounding the nuclear area dispersed in the cytoplasm. Based on this pattern, what potential role might Sema4B play in these cells?
A) Sema4B contributes to the regulation of metabolic pathways within the cytoplasm.
B) Sema4B is associated with intracellular trafficking processes.
C) Sema4B acts as a signaling molecule interacting with cell surface receptors.
D) Sema4B aids in transporting proteins across cell compartments.
E) Sema4B is directly involved in gene transcription regulation inside the nucleus.
Subject: Neurobiology
Bloom’s level: 3 (Application)
Specimen: Nerve cells
VLM Answer:
Let me solve this step by step:
1) Looking at the immunofluorescence images:
- Sox9 (green) marks astrocytes
- Sema4B (red) shows a particular distribution pattern
- The merged image shows both markers plus nuclear staining (blue)
2) Analyzing Sema4B's distribution:
- It appears as red punctate staining
- Located primarily around nuclear areas
- Shows dispersed pattern in cytoplasm
- Not concentrated in nucleus itself
3) This distribution pattern suggests:
- The protein is involved in cytoplasmic processes
- Shows vesicular-like pattern
- Located in areas suggesting trafficking or transport
4) Looking at the options:
- Option 4 mentions protein transport
- This matches the vesicular-like pattern
- The perinuclear and cytoplasmic distribution is consistent with protein trafficking
- Other options don't align as well with the observed distribution pattern
5) Reasoning:
- The punctate pattern and distribution is characteristic of proteins involved in cellular trafficking
- The location around nucleus (where protein synthesis occurs) and throughout cytoplasm supports a transport role
- This pattern is less consistent with metabolic, signaling, or transcriptional roles
The answer is (4)
Error type: Misconception (knowledge)
Error reason: Sema4B is a transmembrane protein that plays a role in cellular signaling. TM proteins can be internalized via endocytosis and does not mean that its main function is intracellular trafficking.
Figure 17

Q: A cryo-Electron Tomography image of fibroblasts from a 70-year-old patient with Alzheimer’s disease shows three mitochondria gathered together. What is the most likely reason for this mitochondrial clustering.
A) Increased mitochondrial biogenesis
B) Enhanced mitochondrial fusion leading to interconnected networks
C) Mitochondrial fission due to cellular stress
D) Impaired mitophagy causing accumulation of damaged mitochondria
Subject: Neurobiology
Bloom’s level: 4 (Analysis)
Specimen: Fibroblasts from an Alzheimer’s patient
VLM Answer:
Let's solve this step by step:
1) First, let's consider the context:
- This is from a 70-year-old patient with Alzheimer's disease
- Alzheimer's is known to affect cellular function, including mitochondrial health
- We're observing clustered mitochondria in the image
2) Analyzing each option:
- Option 1 (Increased biogenesis): Would show new, healthy mitochondria, not clustering
- Option 2 (Enhanced fusion): Would show elongated, connected mitochondria
- Option 3 (Mitochondrial fission): Would show smaller, fragmented mitochondria
- Option 4 (Impaired mitophagy): Would show accumulated, potentially damaged mitochondria
3) Looking at the image:
- We see multiple mitochondria clustered together
- They appear to be intact but gathered in groups
- This suggests they're not being properly cleared from the cell
4) Considering Alzheimer's disease:
- It's known to impair cellular cleanup mechanisms
- Mitophagy is a key quality control process for mitochondria
- When impaired, damaged mitochondria accumulate instead of being removed
5) Conclusion:
The clustering is most likely due to impaired mitophagy, as this would explain why multiple mitochondria are gathering instead of being properly cleared from the cell, which is a known issue in Alzheimer's disease.
The answer is (4)
Error type: Misconception (knowledge)
Error reason: Impaired mitophagy may result in larger overall mitochondria, not an increase in small mitochondrial aggregates. Fission is a better answer in this case.
Figure 18

Q: A microscopic analysis of tissue from an individual suffering from recurrent seizures involves using hematoxylin & eosin stain alongside a vimentin stain. Cells with stronger vimentin expression are significantly larger than surrounding entities. Which identity is most probable for these prominent vimentin-expressing cells?
A) Dystrophic neurons typically found around glial scars
B) Enlarged reactive astrocytes near injury sites
C) Aberrant large cells characteristic of cortical malformation
D) Macrophages adapted to engulf cellular debris
E) Fibroblast proliferation in response to brain injury
Subject: Neurobiology
Bloom’s level: 3 (Application)
Specimen: Resection specimen for epilepsy
VLM Answer:
Let me solve this step by step:
1) First, let's analyze what we're seeing in the images:
- The H&E stain (top image) shows tissue with scattered cells and some enlarged cells
- The vimentin stain (bottom image) shows strong brown staining of certain larger cells
2) Key information from the question:
- Patient has recurrent seizures
- Vimentin-positive cells are significantly larger
- Looking for most probable identity of these cells
3) Important facts about vimentin:
- Vimentin is an intermediate filament protein
- Strongly expressed in reactive astrocytes
- Also expressed in mesenchymal cells
4) In the context of seizures:
- Seizures can cause brain injury
- Brain injury triggers astrogliosis
- Astrocytes become reactive and enlarged in response to injury
5) Looking at the options:
- Option 1: Neurons don't typically express vimentin
- Option 2: Reactive astrocytes do express vimentin and enlarge during injury
- Option 3: Cortical malformation cells wouldn't necessarily express vimentin
- Option 4: Macrophages are typically smaller
- Option 5: Fibroblasts would be rare in this context
6) The morphology and staining pattern of the large cells, combined with the clinical context of seizures, most strongly suggests reactive astrocytes.
The answer is (2)
Error type: Misconception (knowledge)
Error reason: The AI is able to interpret the question and image, but lacks the knowledge to recognize the link between recurrent seizures, cortical dysplasia, and large ballooned cells.
Figure 19

Q: A certain specialized stain reveals unusual cell movement crossing boundaries typical of neuronal migration in a tissue resection from a case of structural brain anomaly. Which condition is primarily associated with this type of cellular pattern?
A) Focal cortical dysplasia
B) Congenital Zika syndrome
C) Schizencephaly
D) Type II lissencephaly
E) Walker-Warburg syndrome
Subject: Neurobiology
Bloom’s level: 4 (Analysis)
Specimen: Brain parenchyma
VLM Answer:
Let me solve this step by step:
1) The image shows a histological section with specialized staining that reveals abnormal cellular organization and migration patterns.
2) The question asks about unusual cell movement crossing typical neuronal migration boundaries.
3) Looking at the options:
- Focal cortical dysplasia involves disorganized neurons in localized areas
- Congenital Zika syndrome primarily affects neural progenitor cells
- Schizencephaly involves clefts in the cerebral hemispheres
- Type II lissencephaly shows abnormal neuronal migration and positioning
- Walker-Warburg syndrome is a severe form of congenital muscular dystrophy with brain malformations
4) Key characteristics of focal cortical dysplasia include:
- Abnormal neuronal migration
- Disorganized cortical layering
- Abnormal cell types crossing normal boundaries
- Disrupted cerebral cortex architecture
5) The histological image shows features consistent with focal cortical dysplasia, including:
- Disrupted cellular organization
- Abnormal migration patterns
- Cells crossing typical boundaries
The answer is (1)
Error type: Misconception (knowledge)
Error reason: Misconception about the scope and extent of neuronal migration patterns in focal cortical dysplasia (FCD) versus Type II lissencephaly. FCD will have abnormal cortical lamination, but the neurons are still within the expected gray/white matter of normal neuronal migration. They do not extend beyond and cross established neuronal boundaries such as the glial limitans.
Figure 20

F.6.3 Overgeneralization or oversimplification errors

Q: In a histological examination, an H&E-stained image shows Verocay bodies and cystic degeneration with pseudoepithelial metaplasia, indicative of schwannoma. How does the risk of malignant transformation in schwannoma compare to other peripheral nerve sheath tumors?
A) Schwannomas exhibit a lower risk of malignant transformation compared to other peripheral nerve sheath tumors.
B) Schwannomas and neurofibromas have similar risks of malignant transformation.
C) The risk of malignant transformation in schwannomas is dependent on specific genetic factors.
D) Schwannomas have an increased risk of malignancy in the presence of multiple tumors.
E) Schwannomas rarely undergo malignant transformation, regardless of tumor location or size.
Subject: Pathology
Bloom’s level: 2 (Comprehension)
Specimen: schwannoma tissue
VLM Answer:
Let me think about this step by step:
1) The images show typical histological features of schwannoma with:
- Verocay bodies (palisading nuclei)
- Cystic degeneration
- Pseudoepithelial metaplasia
2) Regarding malignant transformation in schwannomas:
- Schwannomas are typically benign tumors
- They rarely undergo malignant transformation
- This is true regardless of their size or location
- Even multiple schwannomas (as in schwannomatosis) rarely become malignant
3) Comparing to other peripheral nerve sheath tumors:
- Neurofibromas, especially plexiform types, have a higher risk of malignant transformation
- This is particularly true in the setting of NF1 (neurofibromatosis type 1)
- Malignant peripheral nerve sheath tumors (MPNSTs) more commonly arise from neurofibromas than schwannomas
4) The statement about genetic factors or increased risk with multiple tumors isn't accurate for schwannomas, though it may be true for other nerve sheath tumors
5) Looking at the options, the key is that schwannomas maintain their benign nature in most cases, regardless of circumstances.
The answer is (5)
Error type: Overgeneralization
Error reason: The selected choice does not address the question asked “compared to other peripheral nerve sheath tumors?”. The “best answer” for the question should compare to other tumors as specified in the question. The AI has “oversimplified” the question to ask a simpler question than the one asked.
Figure 21

Q: Confocal microscopy images depict live HEK293 cells subjected to CRISPR/Cas9-mediated knockdown of a specific gene, alongside wildtype cells. The cells are stained with BODIPY-C12 (cyan) to highlight lipid droplets and Hoechst to stain the nuclei. In wildtype cells, distinct organellar structures are visible, whereas these structures are absent in the gene knockdown cells.
What is the most likely cellular outcome resulting from the knockdown of this gene in HEK293 cells?
A) Disruption of mitochondrial fission
B) Impaired peroxisome biogenesis
C) Altered lipid metabolism pathways
D) Reduced endoplasmic reticulum function
E) Decreased nuclear transport processes
Subject: Cell and molecular biology
Bloom’s level: 4 (Analysis)
Specimen: HEK293 cells (wildtype and PEX19 knockdown)
VLM Answer:
Let me solve this step by step:
1) First, let's analyze what we're seeing in the images:
- Live HEK293 cells are shown
- BODIPY-C12 staining (cyan) shows lipid droplets
- Hoechst staining shows nuclei (grey/white)
- The images show a time series (left to right)
2) Key observations:
- The cyan staining (BODIPY-C12) indicates the presence of lipid droplets
- There's a difference between wildtype and knockdown cells in terms of organellar structures
- The staining specifically targets lipids
3) Consider the options:
- Mitochondrial fission would primarily affect energy metabolism
- Peroxisome biogenesis affects lipid breakdown
- Lipid metabolism directly relates to lipid droplet formation
- ER function is broader than just lipid effects
- Nuclear transport isn't directly related to lipid droplets
4) Reasoning:
- Since BODIPY-C12 specifically stains lipids
- The visible differences are in lipid-containing structures
- The most direct connection would be to lipid metabolism
- Changes in lipid metabolism would directly affect lipid droplet formation and distribution
The answer is (3)
Error type: Overgeneralization
Error reason: Assumes that the changes in lipid droplets are directly linked to altered lipid metabolism pathways, without fully integrating the implications of peroxisome biogenesis in lipid metabolism and organelle dynamics.
Figure 22

F.6.4 Hallucination or errors due to added details

Q: A CryoET tomographic slice image and its corresponding 3D annotation of cellulose fibers in vitro show differing amounts of cellulose fibers. Beyond the 2D nature of the slice versus the 3D nature of the annotation, which mechanism most likely accounts for the observed differences?
A) Errors in the alignment during image reconstruction leading to artifacts.
B) Overlapping fibers being misinterpreted as single fibers due to resolution limits.
C) Variability in manual annotation leading to subjective bias.
D) Sample dehydration causing shrinkage of fibers in the image.
E) Background noise in the image causing misidentification of fibers.
Subject: Cell and molecular biology
Bloom’s level: 4 (Analysis)
Specimen: cellulose fibers in vitro
VLM Answer:
Let me solve this step by step:
1) First, let's analyze what we're seeing:
- Left image: CryoET tomographic slice showing cellulose fibers in grayscale
- Right image: 3D annotation of the fibers in blue
- There appears to be a difference in how many fibers are visible/annotated
2) Consider the limitations of CryoET imaging:
- Resolution is limited
- Multiple fibers can appear to overlap in 2D views
- Depth information can be challenging to interpret
3) Evaluate each option:
- (1) Alignment errors typically cause obvious artifacts, which aren't apparent here
- (2) Resolution limits could cause overlapping fibers to appear as one in the 2D slice
- (3) Manual annotation variability wouldn't explain systematic differences
- (4) Sample dehydration would affect both views equally
- (5) Background noise might affect fiber identification but isn't the primary issue
4) The key observation is that the 2D slice shows fewer distinct fibers than the 3D annotation, likely because:
- Multiple fibers that overlap in the 2D view can be properly separated in 3D
- The resolution limits of the microscope make it difficult to distinguish overlapping fibers in the 2D view
The answer is (2)
Error type: Hallucination
Error reason: The question does not state there are systematic differences. This detail is added during the AI CoT reasoning. A better answer would rule out easy or obvious causes of differences (random or systematic). Although fibers can overlap in Cryo-EM, it is generally considered to have a reasonably high resolution.
Figure 23

F.7. Reasoning trace analysis

F.7.1 Human step-by-step reasoning

To better understand how subject matter experts approach questions in MicroVQA, we have a cell biologist and microscopist manually annotate their reasoning for each of the three main use cases.

Reasoning example for task 1 ‘expert visual understanding’.
Question: What subcellular structure could the bright green spots represent?
Answer: This staining pattern most likely refers to cytoplasmic bodies.
Reasoning:
- Find green spots (no reasoning required)
- Rule out staining artefacts
- Look at marker channels to assess whether cells appear intact
- Compare the green spots to the marker channels for nuclei or microtubules
- Check that green spots are always inside the cells
- Yes, inside cells, so these are no staining artefact.
- Look up which spotty staining patterns in cells exist - Nuclear bodies - Nuclear speckles - Nucleoli - Cytoplasmic bodies - Centrosome - Vesicles - Intermediate filaments - Spotty microtubule staining - Explore which patterns can be excluded - Check whether green spots are inside or outside of nuclei to know what patterns to explore - Outside of nuclei, thus nuclear subcellular localizations can be excluded, and cytoplasmic spotty patterns need to be explored - Check whether green spots are placed on microtubules - No, thus cannot be spotty microtubule stain - Assess localization > if accumulating around microtubule organizing center could be likely vesicles or centriolar satellites - No, not accumulating at centrosome - Assess size and shape of the spots - Small size and round, can exclude intermediate filaments - Count green spots per cell and check that not more than 10 - Less likely vesicles, likely cytoplasmic bodies Reasoning example for task 2 ‘hypothesis generation’. *Question * The sperm cells swim in circles. Wouldn’t they need to swim straight to reach the egg? *Answer * Sperm motility and its relation to reaching the egg and fertility is very complex and still not fully understood. Yet, it is known that sperm swimming in circles in a shallow observation chamber like you might have applied, may swim on helical or twisted-ribbon structures in an in vivo or 3D space and thereby might in fact efficiently swim forward. *Reasoning * - Explore the swimming trajectories to validate what the user means with swimming in circles. - Detect trajectory, assess how it would be described: curved or circular swimming trajectory - Explore what is reported on swimming trajectories of sperm cells. - Sperm cell swimming trajectories are species dependent - Validate species: - Assess whether the user mentioned the species in the prompt. - Assess look of cells in image (morphology, geometry) and compare to other images with known species eventually - Likely human sperm. 
  - Sperm cell swimming trajectories are experimental setup dependent: shallow versus deep observation chamber
    - Validate setup:
      - Brightfield microscopy, sperm cells are always swimming parallel, little to no off-focus blurring of the tail.
      - Assess what images from different experimental setups look like for human sperm
      - Likely shallow cover-slip-based observation chamber
    - Consequences of setup for trajectory
      - In shallow observation chambers sperm cannot freely beat and thus beat parallel to the glass surface, with no 3D component.
      - Sperm cells hitting a glass surface align their beat with the glass and this way can get trapped if asymmetrically beating.
      - Curved trajectory in 2D may relate to a progressive/straight swimming path in 3D
  - Sperm cell swimming trajectories are stimulus dependent and maturation-state dependent.
    - Validate setup and explore metadata provided:
      - Check literature and compare image to available images of sperm from healthy donors (if available).
      - Compare detected swimming trajectory to swimming trajectories of sperm explored in a similar species and experimental setup where sperm cells are perturbed (if available).
      - Are there indications of any treatment under which the detected swimming trajectory would be abnormal, and what would these look like (if available)?
      - No indications of any treatment, neither in prompt nor in image.
- Explore what is known on sperm cell trajectories to reach the egg in humans.
  - Not much is known since in vivo studies are not possible.
  - Sperm cells need to be able to develop hyperactive motility to swim to the egg
    - How do hyperactive sperm cells look?
      - Asymmetric beating
    - What trajectory would they swim in a 2D observation chamber?
      - Curved trajectory
    - Does that fit the swimming trajectory presented?
      - Yes
- Assess whether, when summarizing all the obtained information, the conclusion from the observed swimming trajectory to sperm not reaching the egg is valid.
  - No.
- Compose summarizing reply to support the conclusion.

Reasoning example for task 3 ‘experiment proposal’.

*Question*
Can you suggest a simple experiment to find out whether centrosomal multiplication is occurring here?

*Answer*
One option could be to co-stain the EDC3 antibody with a centrosomal marker like PCNT or CEP250.

*Reasoning*
- Extract hypothesis to be tested from the question
  - EDC3 is labeling centrosomes and thus, centrosomal multiplication is occurring in the pictured cells
- Detect the experimental setup
  - Immunofluorescent antibody staining on a protein of interest (EDC3) in a human cultured cell type
- Compare to similar experimental setups that support such a hypothesis
  - Immunofluorescent antibody staining of centrosomes, co-labeling cell markers, performing microscopy, counting centrosomes
  - Acquiring or creating a transgenic cell line with GFP-labeled centrosomal proteins, followed by microscopy
- Determine which methods are compatible with EDC3 staining
  - Both compatible with co-staining EDC3
- Determine which experiment is simplest for someone performing the shown experimental setup and, if equal, which one can be realized in less time
  - Immunofluorescent staining of EDC3 and centrosomes
- Find centrosome markers for human cells and assess suitability for this cell type
- Final decision on experimental setup

F.7.2 LLM-based error analysis

To better understand how the model approaches questions in MicroVQA, we extend our manual reasoning trace analysis from a subset of the dataset to all questions by leveraging zero-shot prompting of an LLM. The automatic error categorization achieves 63% accuracy when measured against manually assigned categories. Although imperfect, automated error categorization is scalable and preserves the overall distribution of human-identified errors, enabling broader analysis. We also classify correctly answered questions to quantify overall changes across experiments.
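The comparison between automatic and manual categories can be illustrated with a minimal sketch. It assumes the two label lists are aligned per question; the function names and the toy labels are hypothetical and are not data or code from the paper. It shows how per-item agreement can be imperfect while the overall category distribution stays the same.

```python
from collections import Counter

# Category names follow the error-tagging prompt in this section.
ERROR_CATEGORIES = ("Perception", "Overgeneralization", "Hallucination", "Other")


def agreement_accuracy(manual, automatic):
    """Fraction of items where the automatic tag equals the manual tag."""
    if len(manual) != len(automatic):
        raise ValueError("label lists must have the same length")
    if not manual:
        raise ValueError("need at least one labelled item")
    matches = sum(m == a for m, a in zip(manual, automatic))
    return matches / len(manual)


def category_distribution(labels):
    """Share of each error category in a list of labels."""
    if not labels:
        raise ValueError("need at least one label")
    counts = Counter(labels)
    return {c: counts.get(c, 0) / len(labels) for c in ERROR_CATEGORIES}


# Toy, made-up labels for illustration only (not results from the paper).
manual = ["Perception", "Perception", "Overgeneralization", "Hallucination", "Other"]
automatic = ["Perception", "Overgeneralization", "Perception", "Hallucination", "Other"]

accuracy = agreement_accuracy(manual, automatic)
manual_dist = category_distribution(manual)
automatic_dist = category_distribution(automatic)
```

In this toy case two of the five items are tagged differently, so per-item agreement is 3/5, yet both label sets contain the same number of items per category. This is the sense in which an imperfect classifier can still preserve the overall distribution.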
Below, we present the prompts and category definitions used for classification.

Below is a multiple choice question with options and the reasoning that led a model to an incorrect response. Originally the model was also shown an image with the question. Your task is to use the reasoning trace to tag the error type:
- Perception: the image was not interpreted correctly.
- Overgeneralization: the details of the question were ignored and the general case was applied.
- Hallucination: details were added during reasoning that weren’t in the question or extracted from the image.
- Other: the error does not fit the above categories.

Question: {{question}}
Correct answer: {{correct_answer}}
Reasoning trace: {{reasoning}}

Below is a multiple choice question with options and the reasoning that led a model to a correct response. Originally the model was also shown an image with the question. Your task is to use the reasoning trace and determine if the question was answered because of these reasons:
- No image: the image is unnecessary because the correct answer doesn’t rely on interpreting visual cues.
- Visual giveaway: the image is unnecessary because critical visual information is already described in the question itself.
- Language bias: the question has information that makes the correct option obvious.
- Weak distractors: the distractors are easy to rule out according to the reasoning trace.
- Good question: the question is well-constructed and needs the image and specialized knowledge to answer.
- Other: the question is hard to answer or doesn’t fit the other classes.
Question: {{question}}
Correct answer: {{correct_answer}}
Reasoning trace: {{reasoning}}

Without image (ablation)
  Correct: No image 14.6%, Other correct 0.3%, Visual giveaway 0.3%, Language bias 0.7%, Weak distractors 20.9%, Good question 4.3%
  Incorrect: Perception 0.6%, Other wrong 5.3%, Hallucination 8.2%, Overgeneralization 44.9%
With image
  Correct: Good question 13.9%, Visual giveaway 1.0%, Weak distractors 13.5%, No image 12.4%, Language bias 0.9%, Other correct 0.2%
  Incorrect: Overgeneralization 27.3%, Perception 19.6%, Other wrong 7.4%, Hallucination 5.0%

Figure 24. Reasoning trace categorization with an LLM (GPT-4o): Correct (blue) vs. incorrect (red) answers. Comparison of Claude-Sonnet’s responses without (left) and with (right) image access.

We apply this system to examine the impact of providing or withholding the image for each question (Fig. 24). While overall accuracy remains comparable across both conditions, the distribution of errors and successful reasoning patterns differs meaningfully. As expected, hallucination errors decrease when the image is available (8.2% to 5.0%), perception errors are rare when the model lacks visual input (0.6%), and the proportion of questions that do not require an image remains stable (14.6% vs. 12.4%). Most errors arise from questions requiring visual and contextual understanding, highlighting the importance of multimodal reasoning. Overgeneralization errors decrease when the image is provided (44.9% to 27.3%), as the model is better grounded in the question context. However, many questions remain challenging even with visual input, as perception errors increase (0.6% to 19.6%). This suggests that effectively solving our dataset requires models to both maintain the specifics of the question’s context and accurately interpret visual information. Additionally, note that the proportion of questions answered correctly through language shortcuts (e.g., language bias and visual giveaways) remains low in both settings, suggesting the effectiveness of the RefineBot strategy. However, addressing weak distractors remains a persistent challenge for specialized datasets.
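The comparisons drawn from Fig. 24 can be checked with simple arithmetic. The sketch below uses the percentages as transcribed from the figure, grouped into correct and incorrect categories according to the two classification prompts; the dictionary layout and function names are assumptions made for illustration. The transcribed values do not sum to exactly 100 in either condition, so the totals should be read as approximate.

```python
# Percentages transcribed from Figure 24 (GPT-4o categorization of
# Claude-Sonnet reasoning traces). The correct/incorrect grouping follows
# the two classification prompts; the layout itself is an assumption.
FIG24 = {
    "without_image": {
        "correct": {
            "No image": 14.6, "Other correct": 0.3, "Visual giveaway": 0.3,
            "Language bias": 0.7, "Weak distractors": 20.9, "Good question": 4.3,
        },
        "incorrect": {
            "Perception": 0.6, "Other wrong": 5.3, "Hallucination": 8.2,
            "Overgeneralization": 44.9,
        },
    },
    "with_image": {
        "correct": {
            "Good question": 13.9, "Visual giveaway": 1.0, "Weak distractors": 13.5,
            "No image": 12.4, "Language bias": 0.9, "Other correct": 0.2,
        },
        "incorrect": {
            "Overgeneralization": 27.3, "Perception": 19.6, "Other wrong": 7.4,
            "Hallucination": 5.0,
        },
    },
}


def outcome_share(condition, outcome):
    """Sum of category percentages for one outcome ('correct' or 'incorrect')."""
    return round(sum(FIG24[condition][outcome].values()), 1)


def shift(outcome, category):
    """Change in percentage points when the image is added (with minus without)."""
    before = FIG24["without_image"][outcome][category]
    after = FIG24["with_image"][outcome][category]
    return round(after - before, 1)


correct_without = outcome_share("without_image", "correct")
correct_with = outcome_share("with_image", "correct")
shifts = {
    name: shift("incorrect", name)
    for name in ("Overgeneralization", "Hallucination", "Perception")
}
```

Under these transcribed values, the correct categories add up to roughly 41.1% without the image and 41.9% with it, which matches the statement that overall accuracy is comparable while the mix of error types changes.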
Figure 25. Collage of images from MicroVQA.

---