arxiv

Paper 2503.10533

The Impact of Item-Writing Flaws on Difficulty and Discrimination in Item Response Theory

Authors: Robin Schmucker, Steven Moore

Published: 2025-03-13

Abstract:

High-quality test items are essential for educational assessments, particularly within Item Response Theory (IRT). Traditional validation methods rely on resource-intensive pilot testing to estimate item difficulty and discrimination. More recently, Item-Writing Flaw (IWF) rubrics emerged as a domain-general approach for evaluating test items based on textual features. However, their relationship to IRT parameters remains underexplored. To address this gap, we conducted a study involving over 7,000 multiple-choice questions across various STEM subjects (e.g., math and biology). Using an automated approach, we annotated each question with a 19-criteria IWF rubric and studied relationships to data-driven IRT parameters. Our analysis revealed statistically significant links between the number of IWFs and IRT difficulty and discrimination parameters, particularly in life and physical science domains. We further observed how specific IWF criteria can impact item quality more and less severely (e.g., negative wording vs. implausible distractors). Overall, while IWFs are useful for predicting IRT parameters--particularly for screening low-difficulty MCQs--they cannot replace traditional data-driven validation methods. Our findings highlight the need for further research on domain-general evaluation rubrics and algorithms that understand domain-specific content for robust item validation.

Paper Content:
Robin Schmucker
Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA, USA
rschmuck@cs.cmu.edu

Steven Moore
Human-Computer Interaction, Carnegie Mellon University, Pittsburgh, PA, USA
stevenmo@cs.cmu.edu

Keywords: item response theory, item-writing flaws, item analysis, automated qualitative coding, large language models

1. INTRODUCTION

Multiple-choice questions (MCQs) are recognized as an effective and widely used form of assessment across diverse educational domains. Ensuring these questions are of high quality is critical for maintaining validity, reliability, and overall soundness of assessing student learning [36, 11]. In both standardized testing (e.g., GRE, MCAT, SAT) and classroom assessments, rigorous evaluation is applied to retain only the most reliable MCQs [19]. This process allows educators and researchers to make targeted improvements, revising or discarding flawed items to better measure student learning. Among the established methods for evaluating MCQ quality, Item Response Theory (IRT) is often considered the gold standard [5, 40]. By quantifying item performance through parameters such as difficulty and discrimination, IRT provides valuable insights into how students interact with different questions.

While IRT has proven effective at capturing statistical dimensions of item performance, it does not fully explain why certain questions might vary in difficulty or discrimination. It requires substantial student response data and operates post hoc, often identifying poor-quality questions only after students have encountered them [47]. Additionally, IRT parameters may overlook qualitative aspects of question design, such as pedagogical soundness and specific flaws that decrease assessment integrity. Expert review and rubric-based evaluations help address these limitations by detecting specific question design flaws that may skew assessment outcomes [31, 8]. While researchers acknowledge that such flaws influence item performance, a systematic examination of how qualitative shortcomings in item design interact with quantitative IRT measures across different domains remains limited. Empirical evidence linking specific flaws to changes in item discrimination and difficulty could clarify why certain questions perform poorly.

To address this gap, the present study integrates IRT analysis with the standardized Item-Writing Flaws (IWF) rubric [58], an instrument for expert evaluation of MCQ quality. To explore relationships between IRT- and IWF-based evaluations, we analyze datasets across diverse educational domains: life and earth sciences, physical sciences, and mathematics, spanning middle and high school grade levels. These datasets combine 7,126 MCQs with response data of 448,000 students within a large-scale online learning platform. For each question, we compute difficulty and discrimination parameters and automatically apply the 19-criterion IWF rubric. By comparing these quantitative and qualitative evaluations, we aim to demonstrate how specific design flaws influence item performance across subject domains. We investigate three primary research questions:

RQ1 How does the frequency of IWFs correlate with IRT difficulty and discrimination parameters across MCQs from different educational domains?

RQ2 Which IWF criteria are most strongly associated with low-quality questions, as indicated by IRT difficulty and discrimination parameters?

RQ3 To what extent can IWF-based item features be used to (i) predict IRT difficulty and discrimination parameters and (ii) filter out low-quality items?

Beyond generating insights into the connections between IRT and IWFs, this paper introduces a general-purpose analysis methodology that leverages AI-enabled capabilities to scale qualitative evaluations proposed in the learning sciences literature [38, 32, 60, 1], alongside statistical measurements derived from large-scale student data. In the present context, this hybrid framework offers a more comprehensive and rigorous assessment of MCQ quality, equipping educators, test developers, and researchers with actionable insights to design more effective and equitable assessments.

arXiv:2503.10533v1 [cs.CL] 13 Mar 2025

2. RELATED WORK

2.1 IRT-based Item Validation

Item validation is critical to ensure that assessments measure intended constructs, have appropriate difficulty and discrimination properties, and provide fair evaluations across diverse populations (e.g., gender and age groups) [21]. Within the IRT framework, item validation involves conducting pilot studies to collect sufficient student response data for reliable parameter estimation, an expensive and time-consuming process [6]. To reduce the amount of student data needed for reliable estimation, recent work proposed multi-armed bandit algorithms as a more data-efficient approach to adaptively refine item parameters [53]. Additionally, to ensure fairness and equity of assessments, differential item functioning (DIF) is analyzed to verify that items do not advantage or disadvantage any particular group of test-takers [17].

As an alternative to student data-driven validation methods, researchers have explored natural language processing (NLP) techniques to predict IRT parameters based on an item's syntactic and semantic features [2]. Several studies have applied neural networks, such as LSTMs and Transformers, to analyze item text and estimate discrimination and difficulty (e.g., [13, 9, 46]). These predictions can help mitigate the cold-start problem, reducing the amount of student response data needed for reliable parameter identification [35]. In parallel with the development of the Duolingo English Test, researchers have introduced methods to accelerate IRT parameter initialization, iterative calibration, and assessment item validation [61, 54, 53].

The present study utilizes large-scale student response data to estimate IRT parameters for 7,126 questions and employs an automated approach that combines rule-based methods and LLMs to annotate each question with a 19-criteria learning science rubric for Item-Writing Flaws (IWF) [37].
Unlike prior work focused on improving IRT parameter prediction from textual features, our study aims to enhance our understanding of relationships between IWF-based and IRT-based item validation methods, providing insights into how linguistic characteristics influence psychometric properties.

2.2 Learning Science Rubrics

Rubrics play a central role in education by providing a structured means of evaluating quality, whether in student submissions or instructional and assessment materials [3]. When applied by trained individuals, rubrics help ensure consistency and replicability by offering standardized and interpretable evaluation criteria [28]. As a result, rubrics have been used to assess the quality of student-facing resources, including hints, short-answer questions, and multiple-choice questions (MCQs) [43, 24, 40]. For example, Arif et al. [5] employed six question-level metrics (including relevance, answerability, and difficulty) to evaluate the quality of LLM-generated MCQs. However, some rubric criteria, such as relevance, involve a degree of subjectivity, which may affect inter-rater reliability and make replication more challenging. Factors such as language preference, prior knowledge, and personal definitions of difficulty may lead to inconsistencies in applying the same criteria [55]. Additionally, even relatively short rubrics can be time-consuming and cumbersome to apply across large content pools, limiting scalability [22].

Despite the challenges of rubric-based evaluations, one prominent instrument that has been widely adopted for MCQ assessment is the Item-Writing Flaws (IWF) rubric [22, 58, 16]. Applicable across subject areas, the IWF rubric detects flaws such as gratuitous detail, grammatical cues, and implausible or disproportionately long distractors.
Two previous studies in the domain of medical education have shown that the presence of IWFs correlates with psychometric properties such as difficulty and discrimination, with flawed items introducing construct-irrelevant variance that can disadvantage students and reduce test validity [18, 48]. Compared to simpler automated measures (e.g., diversity, perplexity), the IWF rubric provides a more targeted and pedagogically grounded assessment of MCQ quality [37]. To address the time-intensive nature of manually applying the IWF's 19 distinct criteria to each MCQ, recent research has focused on automation, enabling IWF rubric application at scale [39, 37]. This automated approach achieved an overall accuracy of 94% on a dataset of 271 MCQs spanning five educational domains, each annotated with a gold-standard human application of the IWF rubric. In addition to accelerating the evaluation process, this method enhances consistency and detail in assessments by mitigating some of the inherent subjectivity in human-applied rubrics [42].

Rubrics are widely used in education, whether for grading assignments or evaluating educational technologies, but they often lack quantitative evidence to demonstrate their effectiveness [26]. In this work, we address this gap by providing quantitative evidence that the IWF rubric criteria influence both question difficulty and discrimination. Unlike previous research that compared automatically identified IWF criteria with human-applied labels, our approach relies on an automated application that has already been validated [39, 37]. Consequently, we apply these verified annotation methods to thousands of real MCQs drawn from a variety of domains, using student response data to go beyond mere frequency counts. This enables us to offer deeper insights into how specific IWFs differentially affect question quality.

Table 1: Dataset overview. The first three rows indicate the number of concepts, questions and multiple-choice questions (MCQs). The next three rows describe the number of students, practice sessions and responses providing data to fit the IRT parameters. The last two rows show the average number of students and questions in each concept.

                 All     Life/Earth  Physical  Math
# of concepts    1,033   563         336       134
# of questions   13,158  7,212       4,331     1,649
# of MCQs        7,126   3,792       2,206     1,128
# of students    448k    265k        169k      44k
# of sessions    1.9M    1.1M        0.6M      0.15M
# of responses   21.1M   12.6M       7.0M      1.6M
# stud./conc.    1,848   1,983       1,902     1,155
# quest./conc.   12.8    12.8        12.9      12.3

3. METHODOLOGY

3.1 Study Context and Dataset

Our analyses utilize a dataset from the CK-12 Foundation, a US-based nonprofit that provides millions of students with free access to educational resources. CK-12 actively develops and hosts the FlexBook 2.0 system (footnote 1), an online tutoring platform offering courses across diverse subjects and grade levels. Each course consists of a series of concepts, each analogous to a textbook chapter and typically covering one to four learning objectives. Each concept is associated with a broader lesson topic and with a practice section designed to develop and assess students' understanding of that concept. For instance, in a Life Science course, a lesson might be "Genetics," with "Punnett Squares" as a concept within it. We focus on popular concepts within middle and high school courses, spanning topics in physical sciences (e.g., physics, chemistry), mathematics (e.g., algebra, geometry), and life and earth sciences, using data from 2023 and 2024.

Overall, our study uses data from 448,000 students interacting with 13,158 questions to fit IRT parameter sets for 1,033 distinct concepts (Table 1). All questions were written by human domain experts.
As the Item-Writing Flaw (IWF) rubric studied in this paper is designed specifically for multiple-choice questions (MCQs) [58], we assess the relationships between IWFs and question-specific difficulty and discrimination parameters based on the 7,126 MCQs within the content pool. The following discusses the IRT parameter estimation and IWF annotation processes in detail.

Footnote 1: https://www.ck12.org

3.2 Item Response Theory

Item Response Theory (IRT) is a methodological framework commonly used in high-stakes assessments, such as college entrance exams (e.g., SAT and GRE) [17]. Formally, IRT models interactions between students and a set of test items (i.e., questions) under binary response outcomes ($\{0,1\}$). The idea is to assign each student a latent ability parameter that explains their response probabilities, estimated using probabilistic inference. The relationship between student ability and response correctness probabilities is modeled by fitting a sigmoid function for each item, commonly referred to as the item response function (IRF).

For each item $j$, its IRF reaches its steepest slope at a specific point on the x-axis, representing the item's difficulty $\delta_j$. The steepness of the IRF reflects the item's discrimination property, denoted as $\alpha_j$. Given student ability $\theta_i$, along with item difficulty and discrimination parameters, the probability of student $i$ answering item $j$ correctly is defined as

$$P(X_{i,j} = 1 \mid \theta_i, \alpha_j, \delta_j) = \frac{1}{1 + e^{-\alpha_j(\theta_i - \delta_j)}}, \qquad (1)$$

where $X_{i,j}$ indicates the binary response outcome. $X$ is the potentially sparse item response matrix capturing all interactions between students and items. Given student response data $X$, the parameters of the IRT model defined by Equation 1 are fitted via maximum likelihood estimation.

Our study utilizes the R package MIRT [15] to estimate a separate set of IRT parameters for each of the 1,033 concepts within our dataset.
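As a minimal sketch of the item response function in Equation 1, assuming the standard two-parameter logistic form, the snippet below evaluates the probability of a correct response for a few ability values. The function name `irf_2pl` and the item parameters are hypothetical and chosen only for illustration; they are not taken from the study's data or from the MIRT package.

```python
import numpy as np

def irf_2pl(theta, alpha, delta):
    """Probability of a correct response under the two-parameter logistic IRF."""
    theta = np.asarray(theta, dtype=float)
    return 1.0 / (1.0 + np.exp(-alpha * (theta - delta)))

# Hypothetical item with discrimination 1.5 and difficulty 0.5 (illustrative values)
abilities = np.array([-2.0, 0.5, 3.0])
probs = irf_2pl(abilities, alpha=1.5, delta=0.5)

# A student whose ability equals the item difficulty answers correctly with p = 0.5
print(probs)
```

At $\theta_i = \delta_j$ the curve passes through 0.5 and has slope $\alpha_j / 4$, which is why a larger $\alpha_j$ separates students just below and just above the difficulty level more sharply.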
Following guidance from de Ayala [17], we ensure robust IRT parameter estimation by focusing on concepts that meet the following criteria: at least 500 students (each submitting a minimum of 5 responses) and at least 10 questions (each receiving a minimum of 500 responses). While we use data from multiple question types (e.g., short-answer and multiple-choice) for the initial parameter estimation process, the subsequent analysis of the relationship between IWFs and IRT parameters focuses solely on MCQs (details in Table 1).

3.3 Item-Writing Flaws Application

We evaluate the quality of MCQs based on the 19-criteria IWF rubric [58] (see Table 2). This learning science rubric is domain-agnostic, and prior research has validated its utility in medical education, mathematics, and science domains [22, 58, 16]. Given the immense resources required for domain experts to annotate the more than 7,000 MCQs in our dataset, we utilize the Scalable Automatic Question Usability Evaluation Toolkit (SAQUET) [37], an open-source method that facilitates the automated application of the IWF rubric. SAQUET has been shown to closely align with expert human application of the rubric, achieving an overall accuracy of 94% when applied to MCQs across five subject areas [37]. Compared to human evaluators, it is more likely to classify an MCQ as having an IWF, making it a stricter tool that errs on the side of caution. Additionally, the study demonstrated that SAQUET offers a more comprehensive evaluation of question quality than traditional automated approaches, such as perplexity or cognitive complexity. The toolkit combines rule-based approaches with Large Language Model (LLM) verifications (via GPT-4o [25]) to determine whether an item satisfies or violates each of the rubric's 19 criteria [39].
For surface-level flaws or those that pertain to wording, such as identifying the presence of "none of the above" options or vague terms, SAQUET employs rule-based techniques that rely on verb tense detection, keyword matching, and other straightforward heuristics. For criteria requiring domain-specific or pedagogical judgment, such as evaluating whether the text contains gratuitous information, the system incorporates a final verification step using the LLM to provide a "judgment call". This step involves prompting the LLM (GPT-4o) to confirm or refute the suspected flaw based on the item's content.

Table 2: Definitions of the 19 criteria within the Item-Writing Flaw (IWF) rubric [58].

IWF Criteria              Definition
Ambiguous/Unclear         The question text and options should be written in clear, precise language to avoid confusion
Implausible Distractors   All incorrect answer choices should be realistic and plausible
None of the Above         Avoid using any variation of "none of the above" since it primarily tests students' ability to identify wrong answers
Longest Option Correct    The correct answer should not be noticeably longer or contain more detail than the other options, as this can unintentionally guide students to the correct answer
Gratuitous Information    Avoid unnecessary details in the question text that do not contribute to answering the question
True/False Question       Avoid answer choices structured as a series of true or false statements
Convergence Cues          Ensure answer choices do not contain overlapping words that might hint at the correct answer
Logical Cues              Avoid clues in the stem and the correct option that can help the test-wise student to identify the correct answer
All of the Above          Avoid using any variation of "all of the above" as students can guess the correct answer just by recognizing one correct option
Fill in the Blank         Avoid missing words in the middle of a sentence, as this forces students to rely on partial information
Absolute Terms            Avoid extreme words like "always" or "never" in answer choices, as these are usually false
Word Repeats              Ensure words or phrases from the question text are not repeated only in the correct answer, as this can inadvertently reveal the right choice
Unfocused Stem            The question text should be clear and specific so that students can understand it without needing to read the answer choices
Complex or K-type         Avoid overly complex questions that require selecting from a number of possible combinations of the responses, as they may test strategy rather than knowledge
Grammatical Cues          All options should be grammatically consistent with the question text and should be parallel in style and form
Lost Sequence             Arrange options in a logical order (e.g., chronological or numerical) to improve readability and fairness
Vague Terms               Avoid the use of vague words (e.g., frequently, occasionally) in the options as their meaning can be subjective
More than One Correct     In single-answer multiple-choice questions, ensure there is a single best answer to avoid ambiguity
Negative Wording          Avoid usage of negative wording in the question text, as it can confuse students

The output of SAQUET is a labeled dataset in which each item (i.e., MCQ) is annotated with a vector $\mathbf{x}_i \in \{0,1\}^{19}$ of binary indicators, specifying the presence or absence of each flaw characterized by the 19-criteria rubric.

3.4 Analysis Methodology

After using student data for IRT parameter estimation and SAQUET for IWF rubric application, we define our analysis dataset as $D = \{(\alpha_i, \delta_i, \mathbf{x}_i)\}_{i=1}^{7126}$. Here, each MCQ $i$ in the content pool is characterized by its discrimination parameter $\alpha_i$, difficulty parameter $\delta_i$, and a binary vector $\mathbf{x}_i \in \{0,1\}^{19}$ indicating which flaws apply. We further define domain-specific datasets ($D_{\text{Life/Earth}}$, $D_{\text{Physical}}$, $D_{\text{Math}}$) to study potential differences across subject areas (Table 1).
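To make the binary flaw vector concrete, the following is a toy sketch of keyword-based annotation for four of the rubric's criteria. It is not SAQUET's implementation: the `CRITERIA` list, the `annotate` function, and the keyword patterns (for example, treating "not" or "except" in the stem as negative wording) are simplifying assumptions made purely for illustration.

```python
import re

# Hypothetical subset of the rubric; the full rubric has 19 criteria.
CRITERIA = ["none_of_the_above", "all_of_the_above", "absolute_terms", "negative_wording"]

def annotate(stem, options):
    """Return a binary flaw vector for one MCQ using simple keyword heuristics."""
    opts = [o.lower() for o in options]
    flags = {
        "none_of_the_above": any("none of the above" in o for o in opts),
        "all_of_the_above": any("all of the above" in o for o in opts),
        "absolute_terms": any(re.search(r"\b(always|never)\b", o) for o in opts),
        # Assumed heuristic: a negated stem contains "not" or "except"
        "negative_wording": re.search(r"\b(not|except)\b", stem.lower()) is not None,
    }
    return [int(bool(flags[c])) for c in CRITERIA]

x = annotate(
    "Which of these is NOT a mammal?",
    ["Whale", "Shark", "Bat", "None of the above"],
)
# For a binary vector, the L1 norm is simply the number of flagged criteria.
num_flaws = sum(x)
```

Here `x` flags the "none of the above" option and the negated stem, so `num_flaws` is 2. Summing the vector yields the per-item flaw count, i.e., the L1 norm of the binary indicator vector.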
Using these datasets, we address our research questions through a mixed methodology that combines traditional regression analysis with modern machine learning algorithms.

For RQ1, we employ linear regression analysis to study how the number of flaws relates to each MCQ's difficulty and discrimination parameters. In particular, we fit two models

$$\delta_i = \beta_0 + \beta_1 \lVert \mathbf{x}_i \rVert_1 + \epsilon_i, \qquad \alpha_i = \gamma_0 + \gamma_1 \lVert \mathbf{x}_i \rVert_1 + \eta_i \qquad (2)$$

where $\delta_i$ and $\alpha_i$ are the difficulty and discrimination for MCQ $i$. The predictor variable is the total number of identified flaws $\lVert \mathbf{x}_i \rVert_1$. The coefficients $\beta_1$ and $\gamma_1$ capture the direction and magnitude of the association between the number of flaws and each IRT parameter. The error terms $\epsilon_i$ and $\eta_i$ account for unexplained variance. By fitting these models separately for the full dataset and each domain-specific subset, we investigate how flaw prevalence relates to difficulty and discrimination across educational domains.

For RQ2, we use linear regression analysis to identify which IWF criteria are most strongly associated with MCQ difficulty and discrimination. Specifically, for each IWF criterion $f \in \{1, \ldots, 19\}$ we fit two models:

$$\delta_i = \beta_{0,f} + \beta_{1,f} x_{i,f} + \epsilon_{i,f}, \qquad \alpha_i = \gamma_{0,f} + \gamma_{1,f} x_{i,f} + \eta_{i,f} \qquad (3)$$

where $x_{i,f}$ indicates the presence of flaw $f$ in MCQ $i$. The coefficients $\beta_{1,f}$ and $\gamma_{1,f}$ quantify the relationship between each IWF criterion and the difficulty and discrimination parameters, respectively. The error terms $\epsilon_{i,f}$ and $\eta_{i,f}$ account for residual variance. By estimating these models, we examine the extent to which each IWF contributes to variations in difficulty and discrimination across the datasets.

For RQ3, we investigate the extent to which IWF rubric-based evaluations, derived solely from item text, can serve as a proxy for traditional validation methods that require student response data to estimate IRT parameters. Specifically, we assess the predictive power of the flaw indicator vector $\mathbf{x}_i$ in two tasks: (i) predicting an item's difficulty ($\delta_i$) and discrimination ($\alpha_i$); and (ii) predicting items with low discrimination ($\alpha_i < 0.5$), low difficulty ($\delta_i < -2$), and high difficulty ($\delta_i > 2$) [6]. To this end, we train machine learning models to determine whether rubric-based flaw annotations provide sufficient predictive power to support automated item pre-screening across educational domains. We do not train models for identifying high-discrimination questions, as high discrimination is not an item flaw.

Table 3: Hyperparameters considered during model training.

Model                    Parameters
Regression               penalty weight $\in \{10^i\}_{i=-4}^{4}$; penalty: l2
Random Forest            n_estimators $\in \{50, 100, 200, 300\}$; max_depth $\in \{\text{None}, 5, 10, 20\}$; min_samples_split $\in \{2, 5, 10\}$
Gradient Boosting        n_estimators $\in \{50, 100, 200, 300\}$; learning_rate $\in \{0.001, 0.01, 0.1, 0.2, 0.3\}$; max_depth $\in \{3, 5, 7, 10\}$
Multi-layer Perceptron   hidden_layer_sizes $\in \{10, 50, 100\}^{\{1,2\}}$; activation $\in \{\text{relu}, \text{tanh}\}$; learning_rate_init $\in \{10^i\}_{i=-4}^{0}$

Our evaluations consider a diverse range of parametric and non-parametric machine learning algorithms, including linear/logistic regression, random forest, gradient boosting, and multi-layer perceptron (MLP), using implementations from the Python package scikit-learn [41]. For the regression tasks, we evaluate model fit using root mean squared error (RMSE) and assess predictive power using explained variance ($R^2$) and Pearson correlation ($r$). For the classification tasks, we measure performance using accuracy (ACC), area under the curve (AUC), and F1-score. Given the class imbalance, where approximately 90% of items exhibit "benign" IRT parameter values, AUC and F1-score are particularly relevant, as they provide a more robust evaluation of model performance in imbalanced classification tasks.

Our results report average performance metrics across a 5-fold cross-validation. In each fold, 80% of the items in the dataset are used for model training and grid search-based hyperparameter selection, and 20% are used for testing. Thus, all results are based solely on predictions for items that were not observed during training. Table 3 outlines the hyperparameter spaces considered for each algorithm.

4. RESULTS

Using data from 448,000 students, we fitted IRT models for each of the 1,033 concepts. Assessing discrimination and difficulty parameters of all 7,126 MCQs, we flagged 773 (10.8%) for low discrimination, 789 (11.1%) for low difficulty, and 134 (1.9%) for high difficulty (Table 4). Across the domain-specific datasets, we observed that Life/Earth Sciences and Math showed the highest and lowest proportions of flagged questions, respectively (low discrimination 12.1% vs. 7.3%, low difficulty 14.2% vs. 4.3%, and high difficulty 2.1% vs. 1.6%). These findings suggest that Math MCQs within individual concepts have more homogeneous difficulty levels compared to Science MCQs. Overall, we find that the vast majority of MCQs exhibit desirable IRT parameters. This implies that our item screening models have to manage class imbalance when trying to predict whether an item has desirable difficulty and discrimination (RQ3).

Table 4: Analysis overview. The first section details the total number of questions and those flagged for low discrimination and low/high difficulty based on IRT analysis. The second section reports the total number of IWFs identified and the average number per question. The last section highlights the five most common IWFs and their prevalence across domains.

                       All      Life/Earth  Physical  Math
# of questions         7,126    3,792       2,206     1,128
- low discrimination   773      459         232       82
- low difficulty       789      538         203       48
- high difficulty      134      78          38        18

# of IWFs              10,537   5,647       3,062     1,828
IWFs per quest.        1.479    1.489       1.388     1.621

ambiguous/unclear      31.3%    27.8%       30.0%     45.9%
fill in the blank      22.4%    29.2%       18.4%     7.8%
multiple correct       14.1%    14.3%       14.4%     12.7%
none of the above      12.5%    15.9%       10.8%     4.1%
lost sequence          10.1%    2.8%        13.6%     28.1%

In the IWF application, we found that most questions had either no flaws or very few, with 82.5% containing at most two (Figure 1). Among the three domains, Life/Earth Sciences featured the highest proportion of flawless MCQs at 22.0%. Math exhibited the highest average number of IWFs per question at 1.62. Still, all three domains demonstrated a similar distribution of IWF numbers, with an overall average of 1.48 IWFs per question. Additional details are provided in Table 4, which highlights the prevalence of the five most common IWFs within each domain. The most frequent flaw identified across all domains was "ambiguous/unclear language" in the question text or answer options, affecting 31.3% of all MCQs. We found "fill-in-the-blank" (fitb) and "none-of-the-above" (nota) formulations to be more prevalent in the Life/Earth (29.2% fitb, 15.9% nota) and Physical Science (18.4% fitb, 10.8% nota) domains, compared to Math (7.8% fitb, 4.1% nota). The "lost-sequence" flaw, which indicates that answer options break chronological or numerical order, was considerably more common in Math MCQs at 28.1%. We continue by assessing the impact of IWF numbers and specific IWF criteria on MCQs' IRT difficulty and discrimination parameters (RQ1 and RQ2).

[Figure 1 plot omitted. Panels: Life/Earth, Physical, Math. x-axis: # of IWFs (0 to 6); y-axis: # of questions.]

Figure 1: Histograms illustrating the number of IWFs identified per question across the three domain-specific datasets.

4.1 RQ1: IWF Correlations with IRT

We conducted regression analyses to study how the number of IWFs relates to IRT discrimination and difficulty parameters across aggregated and domain-specific datasets (Table 5).
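The regression of an IRT parameter on the flaw count (Equation 2) can be sketched as an ordinary least-squares fit. The example below is an illustrative assumption, not the authors' analysis code: `fit_simple_ols` is a hypothetical helper, and the flaw counts and discrimination values are synthetic numbers generated only to show the mechanics.

```python
import numpy as np

def fit_simple_ols(num_flaws, y):
    """Fit y = b0 + b1 * num_flaws by ordinary least squares; return (b0, b1)."""
    x = np.asarray(num_flaws, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones_like(x), x])  # intercept column plus flaw count
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return float(coef[0]), float(coef[1])

# Synthetic data for illustration only (not the study's data)
rng = np.random.default_rng(0)
flaws = rng.integers(0, 5, size=500)  # number of flaws per item
alpha = 1.2 - 0.08 * flaws + rng.normal(0.0, 0.3, size=500)  # simulated discrimination

gamma0, gamma1 = fit_simple_ols(flaws, alpha)
print(f"intercept={gamma0:.3f}, slope={gamma1:.3f}")
```

A negative slope means that each additional flaw is associated with lower discrimination on average. The confidence intervals and p-values reported in the paper would additionally require standard errors, which a statistics package such as statsmodels provides.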
First, focusing on the aggregated dataset containing all 7,126 MCQs, we observed a significant negative relationship between IWF frequencies and discrimination parameters (γ̂1 = −0.080, p < 0.001), indicating that items with higher discrimination were less likely to contain IWFs. This pattern was consistent across Life/Earth (γ̂1 = −0.075, p < 0.001) and Physical Sciences (γ̂1 = −0.139, p < 0.001), suggesting that well-discriminating items in these domains were generally written with fewer flaws. In Math, the relationship was not significant (γ̂1 = −0.016, p = 0.393). The relationship between IWF frequencies and difficulty parameters showed mixed results. In Life/Earth, the domain with the most questions, there was a significant negative association (β̂1 = −0.093, p < 0.001), indicating that easier items were more prone to contain flaws. However, in Physical Sciences and Math, we did not find significant relationships between difficulty parameters and IWF frequencies (p = 0.417 and p = 0.239, respectively). This suggests that in these domains, the number of IWFs may not be a reliable predictor of item difficulty.

Table 5: Linear regression analysis examining relationships between the number of IWFs and IRT discrimination and difficulty parameters across domains. We report estimated coefficients, 95% confidence intervals, and corresponding p-values.

Parameter       Statistic    All         Life/Earth  Physical    Math
Discrimination  Coefficient  -0.080      -0.075      -0.139      -0.016
                95% CI       ±0.011      ±0.012      ±0.020      ±0.037
                p-value      < 0.001     < 0.001     < 0.001     0.393
Difficulty      Coefficient  -0.042      -0.093      0.019       0.030
                95% CI       ±0.025      ±0.036      ±0.045      ±0.050
                p-value      0.001       < 0.001     0.417       0.239

4.2 RQ2: Identifying High-Impact IWFs

The second regression analysis aimed to identify which specific IWF criteria are most strongly associated with question discrimination and difficulty parameters. For each dataset and IWF criterion f ∈ {1, ..., 19}, Figure 2 presents the estimated discrimination (γ̂1,f) and difficulty coefficients (β̂1,f) along with their 95% confidence intervals. Statistically significant coefficients (p < 0.05) are highlighted in green. Examining the combined dataset of 7,126 MCQs, we found significant associations between IRT discrimination and 15 of the 19 IWF criteria, while 13 criteria were significantly associated with difficulty. Among domain-specific datasets, Math exhibited the highest number of significant discrimination coefficients (12), despite having the smallest sample size (N = 1,128), suggesting stronger associations with IWF criteria compared to Life/Earth (9) and Physical Sciences (11). In contrast, difficulty coefficients were more frequently significant for Life/Earth (10) and Physical Sciences (10) than for Math (6), highlighting differences in how IWFs impact IRT parameters across educational domains.

[Figure 2: Impact of IWF Criteria on IRT Discrimination/Difficulty. Panels: All, Life/Earth, Physical, Math. Horizontal axes: discrimination coefficient (γ̂1,f) and difficulty coefficient (β̂1,f). Vertical axis: IWF criteria (ambiguous/unclear, implausible distractors, none of the above, longest option correct, gratuitous info, true/false question, convergence cues, logical cues, all of the above, fill-in-blank, absolute terms, word repeats, unfocused stem, complex or k-type, grammatical cues, lost sequence, vague terms, more than one correct, negative worded). Legend: significant (p < 0.05), not significant (p ≥ 0.05).]

Figure 2: Linear regression analysis examining the strength of association between each IWF criterion and IRT discrimination and difficulty parameters across the domain-specific datasets. The figure indicates estimated coefficients and 95% confidence intervals, and highlights statistically significant relationships (p < 0.05) in green.

Table 6: Regression task. We train models that employ IWF features to predict MCQs' discrimination and difficulty parameters. All Pearson correlation coefficients (r) are statistically significant at p < 0.001.

Parameter       Model          All                    Life/Earth             Physical               Math
                               RMSE   R²     r        RMSE   R²     r        RMSE   R²     r        RMSE   R²     r
Discrimination  Lin. Regr.     0.491  0.121  0.348    0.396  0.129  0.359    0.476  0.178  0.422    0.666  0.071  0.269
                Rnd. Forest    0.487  0.138  0.373    0.396  0.131  0.362    0.475  0.185  0.430    0.666  0.071  0.268
                Grad. Boost.   0.485  0.142  0.377    0.395  0.132  0.364    0.472  0.192  0.439    0.666  0.071  0.268
                MLP            0.486  0.141  0.375    0.396  0.130  0.361    0.474  0.187  0.433    0.666  0.072  0.273
Difficulty      Lin. Regr.     1.128  0.115  0.338    1.161  0.189  0.435    1.099  0.102  0.319    0.914  0.017  0.141
                Rnd. Forest    1.067  0.209  0.457    1.109  0.259  0.510    1.054  0.174  0.420    0.917  0.012  0.156
                Grad. Boost.   1.064  0.213  0.462    1.111  0.257  0.507    1.062  0.161  0.405    0.914  0.018  0.136
                MLP            1.062  0.216  0.465    1.106  0.264  0.514    1.043  0.192  0.438    0.908  0.030  0.176

Shifting our focus to individual IWFs, we found that the flaws most negatively associated with IRT discrimination and difficulty parameters were “longest option correct” (γ̂1,f = −0.370, β̂1,f = −0.691), “more than one correct” (γ̂1,f = −0.366, β̂1,f = −0.928), and “all of the above” (γ̂1,f = −0.322, β̂1,f = −0.806). These flaws likely introduce textual cues that inadvertently hint at the correct answer, diminishing the quality of test items. We observed that the “lost sequence” criterion had a positive discrimination coefficient (γ̂1,f = 0.314) and was also significant for two of the three domain-specific datasets.
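
The coefficients and confidence intervals reported in Table 5 and Figure 2 can be read as slopes of simple linear regressions of an IRT parameter on a flaw count or a flaw indicator. The following is a minimal sketch of such a fit, not the exact procedure used in this study: the function name, the toy data, and the normal-approximation interval (z = 1.96) are illustrative assumptions.

```python
import math
import statistics


def ols_slope_with_ci(x, y, z=1.96):
    """Fit y = b0 + b1 * x by ordinary least squares.

    Returns (b1, half_width), where half_width is the half-width of an
    approximate 95% confidence interval for the slope. The interval uses
    a normal approximation (an assumption of this sketch).
    """
    n = len(x)
    mean_x = statistics.fmean(x)
    mean_y = statistics.fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    # Residual sum of squares; the variance estimate uses n - 2 degrees of freedom.
    rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    se_b1 = math.sqrt(rss / (n - 2) / sxx)
    return b1, z * se_b1


# Hypothetical toy data: number of flaws per item and the item's discrimination.
iwf_count = [0, 0, 1, 1, 2, 2, 3, 3]
discrimination = [1.30, 1.10, 1.15, 0.95, 1.00, 0.90, 0.85, 0.75]

slope, half_width = ols_slope_with_ci(iwf_count, discrimination)
print(f"slope = {slope:.3f} (+/- {half_width:.3f})")
```

When x is a 0/1 indicator for a single flaw, the fitted slope equals the difference in the mean parameter value between flagged and unflagged items, which offers one way to interpret the per-criterion coefficients.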

Several IWFs were associated with increased question difficulty, including “convergence cues” (β̂1,f = 0.679), “grammatical cues” (β̂1,f = 0.526), and “negative wording” (β̂1,f = 0.454), suggesting that these flaws may contribute to cognitive load or confusion beyond the intended subject knowledge assessment.

4.3 RQ3: IWF-Based IRT Predictions

Using the IWF annotations as input features, we trained machine learning models to predict questions' difficulty and discrimination parameters. The performance of the resulting models, as shown in Table 6, varied across educational domains and predicted parameters. For the discrimination parameter, when trained on the dataset comprising all 7,126 MCQs, the models achieved Pearson correlation coefficients (r) ranging from 0.348 to 0.377 and explained variance (R²) ranging from 0.121 to 0.142, indicating moderate predictive strength. For the difficulty parameter, the non-linear models performed best on this dataset: the MLP reached r = 0.465 and R² = 0.216, followed by Gradient Boosting (r = 0.462, R² = 0.213) and Random Forest (r = 0.457, R² = 0.209), suggesting more effective utilization of the IWF features. Notably, on the aggregated dataset the non-linear models (Random Forest, Gradient Boosting, and MLP) outperformed the linear regression model for both parameters, indicating that modeling non-linear interactions between IWF features can improve predictive accuracy. We observed substantial differences between the domain-specific models. For instance, in Life/Earth sciences, the Random Forest model achieved a Pearson correlation of r = 0.510 and explained variance of R² = 0.259 for difficulty, with the MLP slightly higher (r = 0.514, R² = 0.264). However, in Math, all models struggled with both discrimination (max R² = 0.072) and difficulty predictions (max R² = 0.030).

Table 7: Classification task. We train models that employ IWF features to predict MCQs with low discrimination and low/high difficulty. To highlight class imbalance, we include a baseline assigning all MCQs to the majority class (first row of each task).

Task        Model              Life/Earth             Physical
                               ACC    AUC    F1       ACC    AUC    F1
Disc. Low   Majority baseline  0.879  0.500  0.000    0.895  0.500  0.000
            Log. Regr.         0.880  0.736  0.249    0.909  0.784  0.435
            Rnd. Forest        0.890  0.746  0.344    0.907  0.781  0.403
            Grad. Boost.       0.888  0.741  0.354    0.910  0.799  0.432
            MLP                0.882  0.730  0.364    0.908  0.779  0.400
Diff. Low   Majority baseline  0.858  0.500  0.000    0.908  0.500  0.000
            Log. Regr.         0.910  0.818  0.649    0.934  0.778  0.516
            Rnd. Forest        0.910  0.809  0.636    0.932  0.760  0.498
            Grad. Boost.       0.910  0.825  0.644    0.933  0.784  0.506
            MLP                0.908  0.808  0.639    0.932  0.757  0.514
Diff. High  Majority baseline  0.979  0.500  0.000    0.983  0.500  0.000
            Log. Regr.         0.979  0.684  0.000    0.983  0.789  0.000
            Rnd. Forest        0.979  0.706  0.000    0.983  0.618  0.000
            Grad. Boost.       0.979  0.688  0.000    0.983  0.727  0.000
            MLP                0.979  0.681  0.021    0.983  0.675  0.000

We next evaluated the utility of IWF features for predicting MCQs with low discrimination and low/high difficulty in the Life/Earth and Physical Science datasets, where the prior regression analysis confirmed the predictive power of IWFs. Table 7 shows that, for Life/Earth and Physical Sciences respectively, models trained on IWF features achieve AUC scores of up to 0.746 (random forest) and 0.799 (gradient boosting) for low discrimination, and 0.825 and 0.784 (both gradient boosting) for low difficulty. While the AUC scores suggest strong predictive performance, F1 scores remain relatively low for low-discrimination MCQs (peaking at 0.364 for Life/Earth and 0.435 for Physical Sciences), indicating challenges due to class imbalance (Table 4). F1 scores for low-difficulty questions are considerably higher, with logistic regression achieving 0.649 for Life/Earth and 0.516 for Physical Sciences, suggesting that IWFs are particularly informative for identifying low-difficulty MCQs. In contrast, none of the classifiers trained to predict high-difficulty MCQs outperformed, in accuracy, a baseline that always predicts the majority class. This is likely due to class imbalance and the fact that IWFs are not designed to assess the knowledge required to answer domain-specific questions.

[Figure 3: Life/Earth: Precision/Recall of Low Diff. MCQ Detection. Horizontal axis: classification threshold (0.0 to 1.0). Vertical axis: score (0.0 to 1.0). Curves: recall and precision for Log. Regr., Rnd. Forest, Grad. Boost., and MLP. Marked threshold: 0.62.]

Figure 3: Precision and recall curves for predicting low-difficulty Life/Earth Science MCQs for different classifiers. The curves show the trade-off between precision and recall across classification thresholds. Using a threshold of 0.62, logistic regression achieves a precision of 0.801 and a recall of 0.472.

Since the IWF-based classification models demonstrated the highest predictive performance for identifying low-difficulty MCQs in the Life/Earth Science dataset, we conducted a follow-up analysis to assess their potential for automated item pre-screening. Figure 3 illustrates the trade-offs between precision and recall across different classification thresholds. Recall represents the proportion of low-difficulty MCQs correctly identified by the models, while precision reflects the fraction of flagged MCQs that genuinely belong to the low-difficulty category. By setting the classification threshold to 0.62, the logistic regression model achieves a high precision of 0.801 while maintaining a moderate recall of 0.472. This balance underscores the practical utility of IWF-based classifiers in supporting experts in test item development by enabling early identification of low-difficulty questions, potentially lowering the need for student data collection.

5. DISCUSSION

Our study integrated statistical and machine learning methodologies with large-scale student data, capturing interactions with thousands of questions. This approach provided insights into how qualitative aspects of question design influence traditional measures of question performance derived from item response theory (IRT) [17].
Specifically, we examined the relationships between the standard 19-criteria IWF rubric [58] for MCQs and IRT parameters across various educational domains (e.g., math and natural sciences). Our findings offer quantitative evidence demonstrating how the frequency and specific types of IWFs relate to question discrimination and difficulty. Additionally, we validated the utility of IWF evaluations as features for predicting IRT parameters and for identifying low-difficulty questions.

Across the three domains we examined, the frequency of IWFs relates negatively to item discrimination (significantly so in Life/Earth and Physical Sciences, though not in Math; Table 5), whereas its relationship to item difficulty appears to be domain-specific. MCQs with fewer flaws tended to show higher discrimination, indicating that IWFs can diminish a question's reliability. As noted in prior work, certain IWFs may inadvertently aid students in guessing the correct answer (e.g., “longest option correct” or “all of the above”) [58, 22], while others add confusion unrelated to the content itself. In either case, such flaws can distort how accurately the question discriminates between more- and less-knowledgeable students. In Life/Earth Sciences, we observed that easier items contained more flaws, suggesting the frequent presence of flaws that effectively simplify questions; however, this trend did not surface in Physical Sciences or Math. This discrepancy underscores the possibility that IWFs and item difficulty interact in a domain-dependent manner. It may also reflect variations in how effectively the automated IWF detection methods operate across different subject areas, aligning with previous findings that showed stronger performance in Humanities and Healthcare than in Chemistry [37]. Consequently, while IWF frequency appears to be a reliable indicator of item discrimination overall, its utility in predicting item difficulty likely hinges on both the domain and the strengths or limitations of automated detection techniques.
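
To make concrete what it means for an item to discriminate between more- and less-knowledgeable students, the following minimal sketch evaluates a standard two-parameter logistic (2PL) item response function. The 2PL form is an assumption made here for illustration, since this section does not restate the exact IRT specification used in the study; the function name and the ability values are hypothetical.

```python
import math


def p_correct(theta, discrimination, difficulty):
    """2PL probability that a student with ability theta answers correctly."""
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))


# Hypothetical abilities for a weaker and a stronger student.
weak_student, strong_student = -1.0, 1.0

# Two hypothetical items with equal difficulty but different discrimination.
for a in (0.5, 2.0):
    gap = p_correct(strong_student, a, 0.0) - p_correct(weak_student, a, 0.0)
    print(f"discrimination={a}: success-probability gap = {gap:.3f}")
```

Under this model, a lower discrimination value shrinks the gap in success probability between the two students, which is the sense in which a flaw that lowers discrimination makes an item less informative about student ability.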

Our findings indicate that most IWF criteria are significantly associated with both item discrimination and difficulty, though certain flaws exhibit particularly strong and consistent effects. Specifically, flaws such as “longest option correct” and “all of the above” show the most substantial negative associations with both metrics. This is likely because they introduce cues that enable students to guess the correct answer without engaging with the intended knowledge. In contrast, the “lost sequence” flaw had a positive effect on discrimination across multiple datasets, suggesting that sequence-based tasks may require more focused reasoning skills, thus better distinguishing between higher- and lower-performing students. Additionally, flaws such as “convergence cues”, “grammatical cues”, and “negative wording” were associated with higher item difficulty. This suggests that these flaws may elevate cognitive load by requiring students to navigate complex text structures rather than directly demonstrating their domain knowledge. Consistent with previous work, the presence of “all of the above” as an answer choice decreased the difficulty of the question [48]. While some flaws consistently diminish both question quality and rigor, our findings highlight how specific IWFs exert their influence differently: some make questions easier to guess, some introduce additional cognitive demands that may confuse students, and others appear to have more nuanced effects, warranting further investigation into their role in shaping assessment validity and fairness.

Machine learning models trained to predict question discrimination and difficulty parameters based on IWF annotations achieve moderate predictive power, with performance varying across subject domains (Table 6). Notably, prediction accuracy was higher for Life/Earth and Physical Science questions compared to Math, particularly for difficulty estimation.
This aligns with our regression analysis, which revealed stronger associations between IWFs and difficulty parameters in the Science domains (Table 2). In the regression tasks, non-linear models (e.g., Gradient Boosting and MLP) generally outperformed linear models, with Math as the main exception, highlighting the need to capture complex interactions between individual IWF features. To assess the practical utility of IWF features for item screening, we evaluated classification models designed to identify questions with low discrimination, low difficulty, and high difficulty. Our results suggest that by selecting a classification threshold that balances precision and recall, IWF-based models can assist domain experts in identifying low-difficulty questions early. In particular, criteria such as “all of the above” and “longest option correct” showed strong associations with low item difficulty, likely explaining why our classifiers performed substantially better at predicting low-difficulty MCQs compared to high-difficulty ones. The latter task likely requires models to assess the specific knowledge needed to answer a question within a given domain, underscoring the limitations of IWF features for difficulty prediction.

By examining how qualitative question design guidelines [58] align with robust statistical measurements derived from large student datasets, our study contributes to ongoing research efforts on characterizing effective instructional design principles [29]. From a learning science perspective, rubrics serve as distilled representations of expert knowledge used to assess the quality of educational materials and instruction (e.g., [44, 59, 32]). Understanding the relationship between expert evaluation rubrics and student learning is crucial, especially as AI-driven learning technologies increasingly rely on textual descriptions of effective pedagogical strategies [56, 52, 45, 27].

6. LIMITATIONS AND FUTURE WORK

While this study established relationships between a domain-general IWF rubric [58] and statistical measures of question difficulty and discrimination derived from IRT [17], several limitations should be considered. First, our analysis was limited to science and mathematics courses within a large-scale online tutoring platform at the middle and high school levels. Future research should explore the applicability of these findings across other subject areas, including language, humanities, and social sciences. Additionally, further validation is needed in higher and professional education contexts, particularly in medical education, where MCQ-based assessments are widely used [18, 48]. Finally, beyond education, IWFs may influence the reliability of other assessments, such as psychological evaluations of personality traits and mental states, where MCQs play a central role [49]. Investigating these broader implications would enhance our understanding of IWFs across diverse testing environments.

Although the IWF rubric provides a domain-general methodology for experts to assess the pedagogical soundness of test items without relying on student data [58], our findings indicate that its features are only moderate predictors of MCQ difficulty and discrimination parameters (Table 6). By design, IWFs focus on broad design principles, such as ensuring that all distractors are plausible, but do not capture domain-specific nuances related to the knowledge required to solve a particular test item. An item may fully adhere to IWF guidelines yet still exhibit high or low difficulty levels depending on the complexity of the subject knowledge it assesses. To address these limitations, future research could explore hybrid approaches that combine the interpretability of IWF-based evaluations with the predictive power of deep learning models, which estimate IRT parameters based on semantic analyses of question text [2].
Another promising direction is the development of enhanced evaluation rubrics that integrate human domain expertise with data-driven insights generated by machine learning algorithms, thereby improving their predictive accuracy [30, 34, 7].

Across the courses examined in this study, the IWF analysis identified an average of 1.48 writing flaws per MCQ (Table 4). While many of these flaws may have minimal impact on student learning outcomes, addressing them remains essential for ensuring content quality. Future work will focus on developing AI-assisted content authoring tools to support domain experts in MCQ generation and refinement [37]. Recent advancements in LLM-enabled pipelines for question generation and validation offer promising directions [10, 33, 23]. To enhance the efficiency of question validation, future research will explore natural language processing and reinforcement learning algorithms to reduce the amount of student response data required for reliable IRT parameter identification [35, 61, 54, 53].

Lastly, we emphasize the broader utility of evaluation methodologies that integrate generative AI to scale qualitative assessments based on learning science rubrics with statistical measures derived from student data. This hybrid approach can generate robust and actionable insights for improving educational practice. Future research will extend this framework to evaluate other types of educational materials, such as hints [44, 51, 57], textbooks [59], and illustrations [4]. Additional directions include examining the predictive validity of rubric-based evaluations in educational domains such as project-based learning [14, 20, 1], discourse analysis [32, 12], and programming education [14, 50].

7. CONCLUSION

This paper explored relationships between the 19-criteria Item-Writing Flaws (IWFs) rubric, a domain-general qualitative method for question validation [58], and item response theory (IRT), a traditional, data-driven approach to assessing question quality [17]. Using an automated method, we applied the rubric to over 7,000 multiple-choice questions spanning mathematics, physical sciences, and life/earth science domains, analyzing how the number and types of IWFs relate to question difficulty and discrimination parameters.

Three key findings emerged. First, a higher number of IWFs was associated with lower item difficulty and discrimination in life/earth sciences, while the relationship was less consistent in mathematics and physical sciences. Second, specific IWF criteria strongly correlated with question difficulty, such as “longest option correct” for easier items and “convergence cues” for harder ones, demonstrating how superficial textual cues can compromise an otherwise well-designed question. Third, while models trained on IWF features did not match the precision of IRT-based methods, they showed promise for preliminary screening, particularly in identifying low-difficulty questions.

These findings show the dual role of domain-agnostic and domain-specific factors in developing high-quality test items. On the one hand, a rubric that flags generic writing flaws can serve as a scalable “first pass”, helping content authors identify potential design issues before pilot testing. On the other hand, IWF features alone are only moderate predictors of IRT parameters, with predictive strength varying across educational domains. This highlights that IWF-based evaluation cannot replace traditional student data-dependent methods, such as those embodied in IRT.
Future work could explore hybrid approaches that integrate the interpretability of human-readable rubrics with the flexibility of machine-learning models capable of capturing semantic information related to domain-specific knowledge, with the aim of enhancing the accuracy of IRT parameter predictions. This systematic alignment of qualitative rubrics with quantitative validation not only helps improve item quality at scale but also ensures that computer-assisted assessments support fair, reliable, and pedagogically meaningful testing.

Acknowledgments

We thank Microsoft for support in the form of Azure computing and access to the OpenAI API through a grant from their Accelerate Foundation Model Academic Research Program. We thank the CK-12 Foundation (ck12.org) for providing access to their learning materials and to data on student responses to those materials. This research was supported in part by the AFOSR under award FA95501710218.

References

[1] G. Aher, R. Schmucker, T. Mitchell, and Z. C. Lipton. AI mentors for student projects: Spotting early issues in computer science proposals. arXiv preprint arXiv:2503.05782, 2025.
[2] S. AlKhuzaey, F. Grasso, T. R. Payne, and V. Tamma. Text-based question difficulty prediction: A systematic review of automatic approaches. International Journal of Artificial Intelligence in Education, 34(3):862–914, 2024.
[3] D. Allen and K. Tanner. Rubrics: Tools for making learning goals and evaluation criteria explicit for both teachers and learners. CBE—Life Sciences Education, 5(3):197–203, 2006.
[4] A. Angra and S. M. Gardner. The graph rubric: Development of a teaching, learning, and research tool. CBE—Life Sciences Education, 17(4):ar65, 2018.
[5] T. Arif, S. Asthana, and K. Collins-Thompson. Generation and assessment of multiple-choice questions from video transcripts using large language models. In Proceedings of the Eleventh ACM Conference on Learning @ Scale, pages 530–534, 2024.
[6] F. B. Baker. The basics of item response theory. ERIC, 2001.
[7] A. Barany, N. Nasiar, C. Porter, A. F. Zambrano, A. L. Andres, D. Bright, M. Shah, X. Liu, S. Gao, J. Zhang, et al. ChatGPT for education research: exploring the potential of large language models for qualitative codebook development. In International Conference on Artificial Intelligence in Education, pages 134–149. Springer, 2024.
[8] S. P. Bates, R. K. Galloway, J. Riise, and D. Homer. Assessing the quality of a student-generated question repository. Physical Review Special Topics-Physics Education Research, 10(2):020105, 2014.
[9] L. Benedetto. A quantitative study of NLP approaches to question difficulty estimation. In International Conference on Artificial Intelligence in Education, pages 428–434. Springer, 2023.
[10] S. Bhandari, Y. Liu, Y. Kwak, and Z. A. Pardos. Evaluating the psychometric properties of ChatGPT-generated questions. Computers and Education: Artificial Intelligence, 7:100284, 2024.
[11] R. J. Boland, N. A. Lester, and E. Williams. Writing multiple-choice questions. Academic Psychiatry, 34:310–316, 2010.
[12] C. Borchers, K. Yang, J. Lin, N. Rummel, K. R. Koedinger, and V. Aleven. Combining dialog acts and skill modeling: What chat interactions enhance learning rates during AI-supported peer tutoring? In Proceedings of the 17th International Conference on Educational Data Mining, 2024.
[13] M. Byrd and S. Srivastava. Predicting difficulty and discrimination of natural language questions. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 119–130, 2022.
[14] V. Cateté, E. Snider, and T. Barnes. Developing a rubric for a creative CS principles lab. In Proceedings of the 2016 ACM Conference on Innovation and Technology in Computer Science Education, pages 290–295, 2016.
[15] R. P. Chalmers. mirt: A multidimensional item response theory package for the R environment. Journal of Statistical Software, 48:1–29, 2012.
[16] E. Costello, J. Holland, and C. Kirwan. The future of online testing and assessment: question quality in MOOCs. International Journal of Educational Technology in Higher Education, 15(1):1–14, 2018.
[17] R. J. De Ayala. The theory and practice of item response theory. Guilford, New York, NY, USA, 2013.
[18] S. M. Downing. The effects of violating standard item writing principles on tests and students: the consequences of using flawed test items on achievement examinations in medical education. Advances in Health Sciences Education, 10:133–143, 2005.
[19] A. H. Elgadal and A. A. Mariod. Item analysis of multiple-choice questions (MCQs): assessment tool for quality assurance measures. Sudan Journal of Medical Sciences, 16(3):334–346, 2021.
[20] M. Goyal, C. Gupta, and V. Gupta. A meta-analysis approach to measure the impact of project-based learning outcome with program attainment on student learning using fuzzy inference systems. Heliyon, 8(8), 2022.
[21] T. Haladyna. Developing and validating test items. Routledge, 2013.
[22] T. M. Haladyna, S. M. Downing, and M. C. Rodriguez. A review of multiple-choice item-writing guidelines for classroom assessment. Applied Measurement in Education, 15(3):309–333, 2002.
[23] J. He-Yueya, N. D. Goodman, and E. Brunskill. Evaluating and optimizing educational content with large language model judgments. arXiv preprint arXiv:2403.02795, 2024.
[24] A. Horbach, I. Aldabe, M. Bexte, O. L. de Lacalle, and M. Maritxalar. Linguistic appropriateness and pedagogic usefulness of reading comprehension questions. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 1753–1762, 2020.
[25] A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024.
[26] G. Janssen, V. Meier, and J. Trace. Building a better rubric: Mixed methods rubric revision. Assessing Writing, 26:51–66, 2015.
[27] I. Jurenka, M. Kunesch, K. R. McKee, D. Gillick, S. Zhu, S. Wiltberger, S. M. Phal, K. Hermann, D. Kasenberg, A. Bhoopchand, et al. Towards responsible development of generative AI for education: An evaluation-driven approach. arXiv preprint arXiv:2407.12687, 2024.
[28] V. Kind. Development of evidence-based, student-learning-oriented rubrics for pre-service science teachers' pedagogical content knowledge. International Journal of Science Education, 41(7):911–943, 2019.
[29] K. R. Koedinger, J. L. Booth, and D. Klahr. Instructional complexity and the science to constrain it. Science, 342(6161):935–937, 2013.
[30] P. W. Koh, T. Nguyen, Y. S. Tang, S. Mussmann, E. Pierson, B. Kim, and P. Liang. Concept bottleneck models. In International Conference on Machine Learning, pages 5338–5348. PMLR, 2020.
[31] G. Kurdi, J. Leo, B. Parsia, U. Sattler, and S. Al-Emari. A systematic review of automatic question generation for educational purposes. International Journal of Artificial Intelligence in Education, 30:121–204, 2020.
[32] X. Liu, J. Zhang, A. Barany, M. Pankiewicz, and R. S. Baker. Assessing the potential and limits of large language models in qualitative coding. In International Conference on Quantitative Ethnography, pages 89–103. Springer, 2024.
[33] Y. Liu, S. Bhandari, and Z. A. Pardos. Leveraging LLM-respondents for item evaluation: a psychometric analysis. arXiv preprint arXiv:2407.10899, 2024.
[34] J. M. Ludan, Q. Lyu, Y. Yang, L. Dugan, M. Yatskar, and C. Callison-Burch. Interpretable-by-design text classification with iteratively generated concept bottleneck. arXiv preprint arXiv:2310.19660, 2023.
[35] A. D. McCarthy, K. P. Yancey, G. T. LaFlair, J. Egbert, M. Liao, and B. Settles. Jump-starting item parameters for adaptive language tests. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 883–899, 2021.
[36] P. McCoubrie. Improving the fairness of multiple-choice questions: a literature review. Medical Teacher, 26(8):709–712, 2004.
[37] S. Moore, E. Costello, H. A. Nguyen, and J. Stamper. An automatic question usability evaluation toolkit. In International Conference on Artificial Intelligence in Education, pages 31–46. Springer, 2024.
[38] S. Moore, H. A. Nguyen, N. Bier, T. Domadia, and J. Stamper. Assessing the quality of student-generated short answer questions using GPT-3. In European Conference on Technology Enhanced Learning, pages 243–257. Springer, 2022.
[39] S. Moore, H. A. Nguyen, T. Chen, and J. Stamper. Assessing the quality of multiple-choice questions using GPT-4 and rule-based methods. In European Conference on Technology Enhanced Learning, pages 229–245. Springer, 2023.
[40] N. Mulla and P. Gharpure. Automatic question generation: a review of methodologies, datasets, evaluation metrics, and applications. Progress in Artificial Intelligence, 12(1):1–32, 2023.
[41] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[42] M. J. Peeters. Measuring rater judgments within learning assessments—part 2: A mixed approach to creating rubrics. Currents in Pharmacy Teaching and Learning, 7(5):662–668, 2015.
[43] T. W. Price, Y. Dong, R. Zhi, B. Paaßen, N. Lytle, V. Cateté, and T. Barnes. A comparison of the quality of data-driven programming hint generation algorithms. International Journal of Artificial Intelligence in Education, 29:368–395, 2019.
[44] T. W. Price, Y. Dong, R. Zhi, B. Paaßen, N. Lytle, V. Cateté, and T. Barnes. A comparison of the quality of data-driven programming hint generation algorithms. International Journal of Artificial Intelligence in Education, 29:368–395, 2019.
[45] R. Puech, J. Macina, J. Chatain, M. Sachan, and M. Kapur. Towards the pedagogical steering of large language models for tutoring: A case study with modeling productive failure. arXiv preprint arXiv:2410.03781, 2024.
[46] D. Reyes, A. Jimenez, P. Dartnell, S. Lions, and S. Ríos. Multiple-choice questions difficulty prediction with neural networks. In International Conference in Methodologies and Intelligent Systems for Technology Enhanced Learning, pages 11–22. Springer, 2023.
[47] T. Rusch, P. B. Lowry, P. Mair, and H. Treiblmaier. Breaking free from the limitations of classical test theory: Developing and measuring information systems scales using item response theory. Information & Management, 54(2):189–203, 2017.
[48] B. R. Rush, D. C. Rankin, and B. J. White. The impact of item-writing flaws and item complexity on examination item difficulty and discrimination value. BMC Medical Education, 16:1–10, 2016.
[49] J. Rust and S. Golombok. Modern psychometrics: The science of psychological assessment. Routledge, 2014.
[50] D. Saito, R. Yajima, H. Washizaki, and Y. Fukazawa. Validation of rubric evaluation for programming education. Education Sciences, 11(10):656, 2021.
[51] R. Schmucker, N. Pachapurkar, S. Bala, M. Shah, and T. Mitchell. Learning to give useful hints: Assistance action evaluation and policy improvements. In Responsive and Sustainable Educational Futures, pages 383–398, Cham, 2023. Springer Nature Switzerland.
[52] R. Schmucker, M. Xia, A. Azaria, and T. Mitchell. Ruffle&Riley: Insights from designing and evaluating a large language model-based conversational tutoring system. In Artificial Intelligence in Education, pages 75–90, Cham, 2024. Springer Nature Switzerland.
[53] J. Sharpnack, K. Hao, P. Mulcaire, K. Bicknell, G. LaFlair, K. Yancey, and A. A. von Davier. BanditCAT and AutoIRT: Machine learning approaches to computerized adaptive testing and item calibration. arXiv preprint arXiv:2410.21033, 2024.
[54] J. Sharpnack, P. Mulcaire, K. Bicknell, G. LaFlair, and K. Yancey. AutoIRT: Calibrating item response theory models with automated machine learning. arXiv preprint arXiv:2409.08823, 2024.
[55] K. M. Smith, S. Geletta, and A. McArdle. The use of rubrics in the clinical evaluation of podiatric medical students: objectification of the subjective experience. Journal of the American Podiatric Medical Association, 106(1):60–67, 2016.
[56] S. Sonkar, L. Liu, D. B. Mallick, and R. Baraniuk. CLASS: A design framework for building intelligent tutoring systems based on learning science principles. In Conference on Empirical Methods in Natural Language Processing, 2023.
[57] J. Stamper, R. Xiao, and X. Hou. Enhancing LLM-based feedback: Insights from intelligent tutoring systems and the learning sciences. In International Conference on Artificial Intelligence in Education, pages 32–43. Springer, 2024.
[58] M. Tarrant, A. Knierim, S. K. Hayes, and J. Ware. The frequency of item writing flaws in multiple-choice questions used in high stakes nursing assessments. Nurse Education Today, 26(8):662–671, 2006.
[59] S. W. Watson, X. Shan, B. T. George, and M. L. Peters. Alignment of select elementary science curricula to the Next Generation Science Standards via the EQuIP rubric. Curriculum Perspectives, 41(1):17–26, 2021.
[60] S. Xu, X. Huang, C. K. Lo, G. Chen, and M. S.-y. Jong. Evaluating the performance of ChatGPT and GPT-4o in coding classroom discourse data: A study of synchronous online mathematics instruction. Computers and Education: Artificial Intelligence, 7:100325, 2024.
[61] K. P. Yancey, A. Runge, G. LaFlair, and P. Mulcaire. BERT-IRT: Accelerating item piloting with BERT embeddings and explainable IRT models. In Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024), pages 428–438, 2024.

---