The Impact of Item-Writing Flaws on Difficulty and
Discrimination in Item Response Theory
Robin Schmucker
Machine Learning Department
Carnegie Mellon University
Pittsburgh, PA, USA
rschmuck@cs.cmu.edu
Steven Moore
Human-Computer Interaction
Carnegie Mellon University
Pittsburgh, PA, USA
stevenmo@cs.cmu.edu
ABSTRACT
High-quality test items are essential for educational assess-
ments, particularly within Item Response Theory (IRT).
Traditional validation methods rely on resource-intensive pi-
lot testing to estimate item difficulty and discrimination.
More recently, Item-Writing Flaw (IWF) rubrics emerged as
a domain-general approach for evaluating test items based
on textual features. However, their relationship to IRT pa-
rameters remains underexplored. To address this gap, we
conducted a study involving over 7,000 multiple-choice ques-
tions across various STEM subjects (e.g., math and biology).
Using an automated approach, we annotated each question
with a 19-criteria IWF rubric and studied relationships to
data-driven IRT parameters. Our analysis revealed statis-
tically significant links between the number of IWFs and
IRT difficulty and discrimination parameters, particularly
in life and physical science domains. We further observed
that specific IWF criteria can impact item quality more or
less severely (e.g., negative wording vs. implausible distractors).
Overall, while IWFs are useful for predicting IRT
parameters–particularly for screening low-difficulty MCQs–
they cannot replace traditional data-driven validation meth-
ods. Our findings highlight the need for further research on
domain-general evaluation rubrics and algorithms that un-
derstand domain-specific content for robust item validation.
Keywords
item response theory, item-writing flaws, item analysis, au-
tomated qualitative coding, large language models
1. INTRODUCTION
Multiple-choice questions (MCQs) are recognized as an ef-
fective and widely used form of assessment across diverse
educational domains. Ensuring these questions are of high
quality is critical for maintaining validity, reliability, and
overall soundness of assessing student learning [36, 11]. In
both standardized testing (e.g., GRE, MCAT, SAT) and
classroom assessments, rigorous evaluation is applied to retain
only the most reliable MCQs [19]. This process allows
educators and researchers to make targeted improvements,
revising or discarding flawed items to better measure stu-
dent learning. Among the established methods for evalu-
ating MCQ quality, Item Response Theory (IRT) is often
considered the gold standard [5, 40]. By quantifying item
performance through parameters such as difficulty and dis-
crimination, IRT provides valuable insights into how stu-
dents interact with different questions.
While IRT has proven effective at capturing statistical di-
mensions of item performance, it does not fully explain why
certain questions might vary in difficulty or discrimination.
It requires substantial student response data and operates
post hoc, often identifying poor-quality questions only af-
ter students have encountered them [47]. Additionally, IRT
parameters may overlook qualitative aspects of question de-
sign, such as pedagogical soundness and specific flaws that
decrease assessment integrity. Expert review and rubric-
based evaluations help address these limitations by detect-
ing specific question design flaws that may skew assessment
outcomes [31, 8]. While researchers acknowledge that such
flaws influence item performance, a systematic examination
of how qualitative shortcomings in item design interact with
quantitative IRT measures across different domains remains
limited. Empirical evidence linking specific flaws to changes
in item discrimination and difficulty could clarify why cer-
tain questions perform poorly.
To address this gap, the present study integrates IRT analy-
sis with the standardized Item-Writing Flaws (IWF) rubric
[58]–an instrument for expert evaluation of MCQ quality.
To explore relationships between IRT- and IWF-based eval-
uations, we analyze datasets across diverse educational do-
mains: life and earth sciences, physical sciences, and mathe-
matics, encompassing the middle and high school grade lev-
els. These datasets combine 7,126 MCQs with response data
of 448,000 students within a large-scale online learning plat-
form. For each question, we compute difficulty and discrim-
ination parameters and automatically apply the 19-criterion
IWF rubric. By comparing these quantitative and qualita-
tive evaluations, we aim to demonstrate how specific design
flaws influence item performance across subject domains.
We investigate three primary research questions:
RQ1 How does the frequency of IWFs correlate with IRT
difficulty and discrimination parameters across MCQs
from different educational domains?
RQ2 Which IWF criteria are most strongly associated with
low-quality questions, as indicated by IRT difficulty
and discrimination parameters?
RQ3 To what extent can IWF-based item features be used
to (i) predict IRT difficulty and discrimination parameters
and (ii) filter out low-quality items?
arXiv:2503.10533v1 [cs.CL] 13 Mar 2025
Beyond generating insights into the connections between
IRT and IWFs, this paper introduces a general-purpose anal-
ysis methodology that leverages AI-enabled capabilities to
scale qualitative evaluations proposed in the learning sci-
ences literature [38, 32, 60, 1], alongside statistical measure-
ments derived from large-scale student data. In the present
context, this hybrid framework offers a more comprehen-
sive and rigorous assessment of MCQ quality, equipping ed-
ucators, test developers, and researchers with actionable in-
sights to design more effective and equitable assessments.
2. RELATED WORK
2.1 IRT-based Item Validation
Item validation is critical to ensure that assessments measure
intended constructs, have appropriate difficulty and discrim-
ination properties, and provide fair evaluations across di-
verse populations (e.g., gender and age groups) [21]. Within
the IRT framework, item validation involves conducting pi-
lot studies to collect sufficient student response data for reli-
able parameter estimation–an expensive and time-consuming
process [6]. To reduce the amount of student data needed
for reliable estimation, recent work proposed multi-armed
bandit algorithms as a more data-efficient approach to adap-
tively refine item parameters [53]. Additionally, to warrant
fairness and equity of assessments, differential item function-
ing (DIF) is analyzed to ensure that items do not advantage
or disadvantage any particular group of test-takers [17].
As an alternative to student data-driven validation meth-
ods, researchers have explored natural language processing
(NLP) techniques to predict IRT parameters based on an
item’s syntactic and semantic features [2]. Several studies
have applied neural networks, such as LSTMs and Trans-
formers, to analyze item text and estimate discrimination
and difficulty (e.g., [13, 9, 46]). These predictions can help
mitigate the cold-start problem, reducing the amount of stu-
dent response data needed for reliable parameter identifica-
tion [35]. In parallel with the development of the Duolingo
English Test, researchers have introduced methods to accel-
erate IRT parameter initialization, iterative calibration, and
assessment item validation [61, 54, 53].
The present study utilizes large-scale student response data
to estimate IRT parameters for 7,126 questions and employs
an automated approach that combines rule-based methods
and LLMs to annotate each question with a 19-criteria learn-
ing science rubric for Item-Writing Flaws (IWF) [37]. Unlike
prior work focused on improving IRT parameter prediction
from textual features, our study aims to enhance our un-
derstanding of relationships between IWF-based and IRT-
based item validation methods, providing insights into how
linguistic characteristics influence psychometric properties.
2.2 Learning Science Rubrics
Rubrics play a central role in education by providing a struc-
tured means of evaluating quality, whether in student sub-
missions or instructional and assessment materials [3]. When
applied by trained individuals, rubrics help ensure consis-
tency and replicability by offering standardized and inter-
pretable evaluation criteria [28]. As a result, rubrics have
been used to assess the quality of student-facing resources,
including hints, short-answer questions, and multiple-choice
questions (MCQs) [43, 24, 40]. For example, Arif et al. [5]
employed six question-level metrics–including relevance, an-
swerability, and difficulty–to evaluate the quality of LLM-
generated MCQs. However, some rubric criteria involve a
degree of subjectivity, such as relevance, which may affect
inter-rater reliability and make replication more challenging.
Factors such as language preference, prior knowledge, and
personal definitions of difficulty may lead to inconsistencies
in applying the same criteria [55]. Additionally, even rela-
tively short rubrics can be time-consuming and cumbersome
to apply across large content pools, limiting scalability [22].
Despite the challenges of rubric-based evaluations, one promi-
nent instrument that has been widely adopted for MCQ as-
sessment is the Item-Writing Flaws (IWF) rubric [22, 58, 16].
Applicable across subject areas, the IWF rubric detects flaws
such as gratuitous detail, grammatical cues, and implausible
or disproportionately long distractors. Two previous stud-
ies in the domain of medical education have shown that the
presence of IWFs correlates with psychometric properties
such as difficulty and discrimination, with flawed items in-
troducing construct-irrelevant variance that can disadvan-
tage students and reduce test validity [18, 48]. Compared
to simpler automated measures (e.g., diversity, perplexity),
the IWF rubric provides a more targeted and pedagogically
grounded assessment of MCQ quality [37]. To address the
time-intensive nature of manually applying the IWF’s 19
distinct criteria to each MCQ, recent research has focused
on automation, enabling IWF rubric application at scale
[39, 37]. This automated approach achieved an overall accu-
racy of 94% on a dataset of 271 MCQs spanning five educa-
tional domains, each annotated with a gold-standard human
application of the IWF rubric. In addition to accelerating
the evaluation process, this method enhances consistency
and detail in assessments by mitigating some of the inher-
ent subjectivity in human-applied rubrics [42].
Rubrics are widely used in education, whether for grading
assignments or evaluating educational technologies, but they
often lack quantitative evidence to demonstrate their effectiveness
[26]. In this work, we address this gap by providing
quantitative evidence that the IWF rubric criteria influence
both question difficulty and discrimination. Unlike previous
research that compared automatically identified IWF
criteria with human-applied labels, our approach relies on
an automated application that has already been validated
[39, 37]. Consequently, we apply these verified annotation
methods to thousands of real MCQs drawn from a variety
of domains, using student response data to go beyond mere
frequency counts. This enables us to offer deeper insights
into how specific IWFs differentially affect question quality.
Table 1: Dataset overview. The first three rows indicate the
number of concepts, questions, and multiple-choice questions
(MCQs). The next three rows describe the number of students,
practice sessions, and responses providing data to fit
the IRT parameters. The last two rows show the average
number of students and questions in each concept.
                             All      Life/Earth   Physical   Math
# of concepts                1,033    563          336        134
# of questions               13,158   7,212        4,331      1,649
# of MCQs                    7,126    3,792        2,206      1,128
# of students                448k     265k         169k       44k
# of sessions                1.9M     1.1M         0.6M       0.15M
# of responses               21.1M    12.6M        7.0M       1.6M
# of students per concept    1,848    1,983        1,902      1,155
# of questions per concept   12.8     12.8         12.9       12.3
3. METHODOLOGY
3.1 Study Context and Dataset
Our analyses utilize a dataset from the CK-12 Foundation,
a US-based nonprofit that provides millions of students with
free access to educational resources. CK-12 actively develops
and hosts the FlexBook 2.0 system¹, an online tutoring
platform offering courses across diverse subjects and grade
levels. Each course consists of a series of concepts, analo-
gous to a textbook chapter and typically consisting of one to
four learning objectives. Each concept is associated with a
broader lesson topic and with a practice section designed to
develop and assess students’ understanding of that concept.
For instance, in a Life Science course, a lesson might be “Ge-
netics,” with “Punnett Squares” as a concept within it. We
focus on popular concepts within middle and high school
courses, spanning topics in physical sciences (e.g., physics,
chemistry), mathematics (e.g., algebra, geometry), and life
and earth sciences, using data from 2023 and 2024.
Overall, our study uses data from 448,000 students inter-
acting with 13,158 questions to fit IRT parameter sets for
1,033 distinct concepts (Table 1). All questions were writ-
ten by human domain experts. As the Item-Writing Flaw
(IWF) rubric studied in this paper is designed specifically for
multiple-choice questions (MCQs) [58], we assess the rela-
tionships between IWFs and question-specific difficulty and
discrimination parameters based on the 7,126 MCQs within
the content pool. The following discusses the IRT parameter
estimation and IWF annotation processes in detail.
3.2 Item Response Theory
Item Response Theory (IRT) is a methodological framework
commonly used in high-stakes assessments, such as college
entrance exams (e.g., SAT and GRE) [17]. Formally, IRT
models interactions between students and a set of test items
(i.e., questions) under binary response outcomes ($\{0,1\}$).
The idea is to assign each student a latent ability parameter
that explains their response probabilities, estimated using
probabilistic inference. The relationship between student
ability and response correctness probabilities is modeled by
fitting a sigmoid function for each item, commonly referred
to as item response function (IRF).
¹ https://www.ck12.org
For each item $j$, its IRF reaches its steepest slope at a specific
point on the x-axis, representing the item’s difficulty $\delta_j$.
The steepness of the IRF reflects the item’s discrimination
property, denoted as $\alpha_j$. Given student ability $\theta_i$, along with
item difficulty and discrimination parameters, the probability
of student $i$ answering item $j$ correctly is defined as
$$P(X_{i,j} = 1 \mid \theta_i, \alpha_j, \delta_j) = \frac{1}{1 + e^{-\alpha_j(\theta_i - \delta_j)}}, \tag{1}$$
where $X_{i,j}$ indicates the binary response outcome. $X$ is the
potentially sparse item response matrix capturing all inter-
actions between students and items. Given student response
data X, the parameters of the IRT model defined by Equa-
tion 1 are fitted via maximum likelihood estimation.
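To make the role of the two item parameters concrete, the following minimal Python sketch evaluates the item response function in its standard two-parameter logistic form and the log-likelihood that maximum likelihood estimation would maximize. The function names and the triple-based response format are assumptions made for illustration; this is not the estimation routine used in the study.

```python
import math

def irf(theta, alpha, delta):
    """Two-parameter logistic item response function.

    Returns the probability of a correct response for a student with
    ability `theta` on an item with discrimination `alpha` and
    difficulty `delta`.
    """
    return 1.0 / (1.0 + math.exp(-alpha * (theta - delta)))

def log_likelihood(responses, thetas, alphas, deltas):
    """Log-likelihood of sparse response data.

    `responses` is a list of (student_index, item_index, outcome)
    triples with outcome in {0, 1}; this layout is an illustrative
    assumption that mirrors a sparse response matrix.
    """
    total = 0.0
    for i, j, x in responses:
        p = irf(thetas[i], alphas[j], deltas[j])
        total += math.log(p) if x == 1 else math.log(1.0 - p)
    return total

# When ability equals difficulty, the success probability is 0.5,
# regardless of the discrimination value.
p_at_difficulty = irf(theta=0.7, alpha=1.3, delta=0.7)
```

A larger discrimination value makes the curve steeper around the difficulty point, so small ability differences near that point translate into larger differences in success probability.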
Our study utilizes the R package MIRT [15] to estimate a
separate set of IRT parameters for each of the 1,033 con-
cepts within our dataset. Following guidance from de Ay-
ala [17], we ensure a robust IRT parameter estimation by
focusing on concepts that meet the following criteria: at
least 500 students (each submitting a minimum of 5 re-
sponses) and at least 10 questions (each receiving a min-
imum of 500 responses). While we use data from multi-
ple question types (e.g., short-answer and multiple-choice)
for the initial parameter estimation process, the subsequent
analysis of the relationship between IWFs and IRT param-
eters focuses solely on MCQs (details in Table 1).
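The inclusion criteria above can be sketched as a simple eligibility check per concept. The function name, the pair-based input format, and the single-pass counting are assumptions for illustration; the study does not specify whether filtering was applied iteratively.

```python
from collections import Counter

def concept_is_eligible(responses, min_students=500, min_resp_per_student=5,
                        min_items=10, min_resp_per_item=500):
    """Check whether one concept has enough data for IRT estimation.

    `responses` is a list of (student_id, item_id) pairs, one per
    recorded response (hypothetical format). Default thresholds follow
    the criteria stated in the text.
    """
    per_student = Counter(student for student, _ in responses)
    per_item = Counter(item for _, item in responses)
    # Count only students and items that meet their own minimum.
    n_students = sum(1 for c in per_student.values() if c >= min_resp_per_student)
    n_items = sum(1 for c in per_item.values() if c >= min_resp_per_item)
    return n_students >= min_students and n_items >= min_items
```

For quick experimentation the thresholds can be lowered, e.g. `concept_is_eligible(pairs, 3, 2, 2, 3)` on a toy list of pairs.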
3.3 Item-Writing Flaws Application
We evaluate the quality of MCQs based on the 19-criteria
IWF rubric [58] (see Table 2). This learning science rubric is
domain-agnostic, and prior research has validated its utility
in medical education, mathematics, and science domains [22,
58, 16]. Given the immense resources required for domain
experts to annotate the more than 7,000 MCQs in our dataset,
we utilize the Scalable Automatic Question Usability Evalu-
ation Toolkit (SAQUET) [37], an open-source method that
facilitates the automated application of the IWF rubric.
SAQUET has been shown to closely align with expert hu-
man application of the rubric, achieving an overall accuracy
of 94% when applied to MCQs across five subject areas [37].
Compared to human evaluators, it is more likely to classify
an MCQ as having an IWF, making it a stricter tool that
errs on the side of caution. Additionally, the study demon-
strated that SAQUET offers a more comprehensive evalu-
ation of the quality of the question than traditional auto-
mated approaches, such as perplexity or cognitive complex-
ity. The toolkit combines rule-based approaches with Large
Language Model (LLM) verifications (via GPT-4o [25]) to
determine whether an item satisfies or violates each of the
rubric’s 19 criteria [39]. For surface-level flaws or those that
pertain to wording, such as identifying the presence of “none
of the above” options or vague terms, SAQUET employs rule-based
techniques that rely on verb tense detection, keyword
matching, and other straightforward heuristics. For crite-
ria requiring domain-specific or pedagogical judgment, such
as evaluating whether the text contains gratuitous informa-
tion, the system incorporates a final verification step using
the LLM to provide a “judgment call”. This step involves
prompting the LLM (GPT-4o) to confirm or refute the sus-
pected flaw based on the item’s content.
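As a rough illustration of what keyword-based checks of this kind can look like, the sketch below flags three surface-level criteria from the answer options alone. It is a hypothetical example and not SAQUET's implementation: the patterns, the function name, and the output keys are all assumptions, and the absolute-term list contains only the two examples named in the rubric definition.

```python
import re

# Hypothetical patterns; a real toolkit would cover more variations.
NOTA_PATTERN = re.compile(r"\bnone of (the|these) (above|options|answers)\b", re.IGNORECASE)
AOTA_PATTERN = re.compile(r"\ball of (the|these) (above|options|answers)\b", re.IGNORECASE)
ABSOLUTE_TERMS = {"always", "never"}

def keyword_flags(options):
    """Return keyword-based flags for a list of answer-option strings."""
    # Collect all lowercase word tokens across the options.
    words = {w for opt in options for w in re.findall(r"[a-z']+", opt.lower())}
    return {
        "none_of_the_above": any(NOTA_PATTERN.search(o) for o in options),
        "all_of_the_above": any(AOTA_PATTERN.search(o) for o in options),
        "absolute_terms": bool(words & ABSOLUTE_TERMS),
    }
```

Criteria that depend on meaning rather than wording (e.g., gratuitous information) cannot be captured this way, which is where the LLM verification step described above comes in.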
Table 2: Definitions of the 19 criteria within the Item-Writing Flaw (IWF) rubric [58].
IWF Criterion            Definition
Ambiguous/Unclear        The question text and options should be written in clear, precise language to avoid confusion
Implausible Distractors  All incorrect answer choices should be realistic and plausible
None of the Above        Avoid using any variation of “none of the above” since it primarily tests students’ ability to identify wrong answers
Longest Option Correct   The correct answer should not be noticeably longer or contain more detail than the other options, as this can unintentionally guide students to the correct answer
Gratuitous Information   Avoid unnecessary details in the question text that do not contribute to answering the question
True/False Question      Avoid answer choices structured as a series of true or false statements
Convergence Cues         Ensure answer choices do not contain overlapping words that might hint at the correct answer
Logical Cues             Avoid clues in the stem and the correct option that can help the test-wise student to identify the correct answer
All of the Above         Avoid using any variation of “all of the above” as students can guess the correct answer just by recognizing one correct option
Fill in the Blank        Avoid missing words in the middle of a sentence, as this forces students to rely on partial information
Absolute Terms           Avoid extreme words like “always” or “never” in answer choices, as these are usually false
Word Repeats             Ensure words or phrases from the question text are not repeated only in the correct answer, as this can inadvertently reveal the right choice
Unfocused Stem           The question text should be clear and specific so that students can understand it without needing to read the answer choices
Complex or K-type        Avoid overly complex questions that require selecting from a number of possible combinations of the responses, as they may test strategy rather than knowledge
Grammatical Cues         All options should be grammatically consistent with the question text and should be parallel in style and form
Lost Sequence            Arrange options in a logical order (e.g., chronological or numerical) to improve readability and fairness
Vague Terms              Avoid the use of vague words (e.g., frequently, occasionally) in the options as their meaning can be subjective
More than One Correct    In single-answer multiple-choice questions, ensure there is a single best answer to avoid ambiguity
Negative Wording         Avoid usage of negative wording in the question text, as it can confuse students
The output of SAQUET is a labeled dataset in which each
item (i.e., MCQ) is annotated with a vector $\mathbf{x}_i \in \{0,1\}^{19}$
of binary indicators, each specifying the presence or absence of a
specific flaw as characterized by the 19-criteria rubric.
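The binary encoding described above can be sketched as follows. The identifier names and their ordering are assumptions chosen to mirror Table 2; the actual output format of SAQUET may differ.

```python
# Criterion identifiers in the order of Table 2 (naming is hypothetical).
IWF_CRITERIA = [
    "ambiguous_unclear", "implausible_distractors", "none_of_the_above",
    "longest_option_correct", "gratuitous_information", "true_false_question",
    "convergence_cues", "logical_cues", "all_of_the_above", "fill_in_the_blank",
    "absolute_terms", "word_repeats", "unfocused_stem", "complex_or_k_type",
    "grammatical_cues", "lost_sequence", "vague_terms", "more_than_one_correct",
    "negative_wording",
]

def to_flaw_vector(flagged):
    """Encode a set of flagged criterion names as a 19-dimensional 0/1 list."""
    unknown = set(flagged) - set(IWF_CRITERIA)
    if unknown:
        raise ValueError(f"unknown criteria: {sorted(unknown)}")
    return [1 if name in flagged else 0 for name in IWF_CRITERIA]

# Example item with two flaws; the number of flaws is the sum of the vector.
vector = to_flaw_vector({"none_of_the_above", "lost_sequence"})
n_flaws = sum(vector)
```

Summing the vector gives the per-item flaw count, which is the quantity a count-based analysis would use as its predictor.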
3.4 Analysis Methodology
After using student data for IRT parameter estimation and
SAQUET for IWF rubric application, we define our analysis
dataset as $D = \{(\alpha_i, \delta_i, \mathbf{x}_i)\}_{i=1}^{7126}$. Here, each MCQ $i$ in the
content pool is characterized by its discrimination parameter
$\alpha_i$, difficulty parameter $\delta_i$, and a binary vector $\mathbf{x}_i \in \{0,1\}^{19}$
indicating which flaws apply. We further define domain-specific
datasets ($D_{\text{Life/Earth}}$, $D_{\text{Physical}}$, $D_{\text{Math}}$) to study potential
differences across subject areas (Table 1). Using
these datasets, we address our research questions through
a mixed methodology that combines traditional regression
analysis with modern machine learning algorithms.
For RQ1, we employ linear regression analysis to study how
the number of flaws relates to each MCQ’s difficulty and
discrimination parameters. In particular, we fit two models
$$\delta_i = \beta_0 + \beta_1 \|\mathbf{x}_i\|_1 + \epsilon_i, \qquad \alpha_i = \gamma_0 + \gamma_1 \|\mathbf{x}_i\|_1 + \eta_i, \tag{2}$$
where $\delta_i$ and $\alpha_i$ are the difficulty and discrimination for
MCQ $i$. The predictor variable is the total number of identified
flaws $\|\mathbf{x}_i\|_1$. The coefficients $\beta_1$ and $\gamma_1$ capture the
direction and magnitude of the association between the number
of flaws and each IRT parameter. The error terms $\epsilon_i$ and
$\eta_i$ account for unexplained variance. By fitting these models
separately for the full dataset and each domain-specific sub-
set, we investigate how flaw prevalence influences difficulty
and discrimination across educational domains.
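A minimal sketch of the single-predictor regression used for RQ1 is shown below, written in plain Python with the closed-form least-squares solution. The data values are made up for illustration and are not taken from the study; the function name is likewise an assumption.

```python
def fit_simple_ols(x, y):
    """Closed-form ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Toy data (invented): flaw counts and discrimination values for four items.
flaw_counts = [0, 1, 2, 3]
discrimination = [1.30, 1.22, 1.14, 1.06]
gamma0, gamma1 = fit_simple_ols(flaw_counts, discrimination)
# gamma1 is the estimated change in discrimination per additional flaw.
```

A negative slope indicates that items with more flaws tend to have lower parameter values; confidence intervals and p-values would additionally require the residual variance, which this sketch omits.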
For RQ2, we use linear regression analysis to identify which
IWF criteria are most strongly associated with MCQ difficulty
and discrimination. Specifically, for each IWF criterion
$f \in \{1, \ldots, 19\}$ we fit two models:
$$\delta_i = \beta_{0,f} + \beta_{1,f} x_{i,f} + \epsilon_{i,f}, \qquad \alpha_i = \gamma_{0,f} + \gamma_{1,f} x_{i,f} + \eta_{i,f}, \tag{3}$$
where $x_{i,f}$ indicates the presence of flaw $f$ in MCQ $i$. The
coefficients $\beta_{1,f}$ and $\gamma_{1,f}$ quantify the relationship between
each IWF criterion and the difficulty and discrimination parameters,
respectively. The error terms $\epsilon_{i,f}$ and $\eta_{i,f}$ account
for residual variance. By estimating these models, we exam-
ine the extent to which each IWF contributes to variations
in difficulty and discrimination across the datasets.
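Because each predictor in these per-criterion models is a binary indicator, the least-squares slope has a simple interpretation: it equals the mean parameter value of items with the flaw minus the mean of items without it, and the intercept equals the mean of items without the flaw. The sketch below computes both directly; the function name and the toy values are assumptions for illustration.

```python
def binary_predictor_fit(flags, values):
    """Least-squares fit of values on a 0/1 indicator.

    For a binary predictor, the intercept is the mean of the flag == 0
    group and the slope is the difference between the two group means.
    Both groups must be non-empty.
    """
    with_flaw = [v for f, v in zip(flags, values) if f == 1]
    without_flaw = [v for f, v in zip(flags, values) if f == 0]
    intercept = sum(without_flaw) / len(without_flaw)
    slope = sum(with_flaw) / len(with_flaw) - intercept
    return intercept, slope

# Toy data (invented): two items without the flaw, two items with it.
beta0_f, beta1_f = binary_predictor_fit([0, 0, 1, 1], [0.2, 0.4, -0.5, -0.3])
```

Read this way, a coefficient of, say, -0.7 on difficulty means that flagged items are on average 0.7 units easier on the difficulty scale than unflagged items.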
For RQ3, we investigate the extent to which IWF rubric-
based evaluations, derived solely from item text, can serve
as a proxy for traditional validation methods that require
student response data to estimate IRT parameters. Specifically,
we assess the predictive power of the flaw indicator
vector $\mathbf{x}_i$ in two tasks: (i) predicting an item’s difficulty
($\delta_i$) and discrimination ($\alpha_i$); and (ii) identifying items with
low discrimination ($\alpha_i < 0.5$), low difficulty ($\delta_i < -2$), and
high difficulty ($\delta_i > 2$) [6]. To this end, we train machine
learning models to determine whether rubric-based flaw annotations
provide sufficient predictive power to support automated
item pre-screening across educational domains. We
do not train models for identifying high-discrimination questions,
as high discrimination is not an item flaw.
Table 3: Hyperparameters considered during model training.
Model                   Parameters
Regression              penalty weight ∈ {10^i : i = −4, ..., 4}, penalty: l2
Random Forest           n_estimators ∈ {50, 100, 200, 300}
                        max_depth ∈ {None, 5, 10, 20}
                        min_samples_split ∈ {2, 5, 10}
Gradient Boosting       n_estimators ∈ {50, 100, 200, 300}
                        learning_rate ∈ {0.001, 0.01, 0.1, 0.2, 0.3}
                        max_depth ∈ {3, 5, 7, 10}
Multi-layer Perceptron  hidden_layer_sizes ∈ {10, 50, 100}^{1,2}
                        activation ∈ {relu, tanh}
                        learning_rate_init ∈ {10^i : i = −4, ..., 0}
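The classification targets for the screening task can be derived from the fitted IRT parameters with a few threshold comparisons. The sketch below assumes the low-difficulty cutoff is $\delta_i < -2$, symmetric to the high-difficulty cutoff of $\delta_i > 2$; the function name and dictionary keys are hypothetical.

```python
def flag_item(alpha, delta, low_disc=0.5, low_diff=-2.0, high_diff=2.0):
    """Return screening labels for one item from its IRT parameters.

    Assumed cutoffs: discrimination below 0.5, difficulty below -2,
    and difficulty above 2.
    """
    return {
        "low_discrimination": alpha < low_disc,
        "low_difficulty": delta < low_diff,
        "high_difficulty": delta > high_diff,
    }

# An item can carry more than one label, e.g. very easy and poorly discriminating.
labels = flag_item(alpha=0.3, delta=-2.5)
```

Each label then serves as the target of its own binary classifier, with the 19 flaw indicators as input features.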
Our evaluations consider a diverse range of parametric and
non-parametric machine learning algorithms, including lin-
ear/logistic regression, random forest, gradient boosting, and
multi-layer perceptron (MLP), using implementations from
the Python package scikit-learn [41]. For the regression
tasks, we evaluate model fit using root mean squared error
(RMSE) and assess predictive power using explained variance
($R^2$) and Pearson correlation ($r$). For the classification
tasks, we measure performance using accuracy (ACC),
area under the curve (AUC), and F1-score. Given the class
imbalance–where approximately 90% of items exhibit “be-
nign” IRT parameter values–AUC and F1-score are partic-
ularly relevant, as they provide a more robust evaluation of
model performance in imbalanced classification tasks.
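The following sketch illustrates why accuracy alone is uninformative under this degree of class imbalance. It uses a made-up label vector in which 10% of items are flagged and a trivial classifier that never flags anything; the helper functions are plain-Python re-implementations written for illustration rather than the scikit-learn metrics used in the study.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

def f1_score(y_true, y_pred):
    """F1-score for the positive class (label 1); 0.0 if no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Invented example: 10 flagged items out of 100, classifier predicts "benign" for all.
y_true = [1] * 10 + [0] * 90
y_pred = [0] * 100
trivial_accuracy = accuracy(y_true, y_pred)
trivial_f1 = f1_score(y_true, y_pred)
```

The trivial classifier reaches 90% accuracy while identifying none of the flagged items, which the F1-score of zero makes visible.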
Our results report average performance metrics across a 5-
fold cross-validation. In each fold, 80% of the items in the
dataset are used for model training and grid search-based hy-
perparameter selection, and 20% are used for testing. Thus,
all results are based solely on predictions for items that were
not observed during training. Table 3 outlines the hyperpa-
rameter spaces considered for each algorithm.
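The 80/20 splitting scheme can be sketched with a small generator that partitions item indices into five folds. This only illustrates the data split; hyperparameter search on each training portion is omitted, and the function name and seed handling are assumptions. In scikit-learn, `KFold` and `GridSearchCV` provide comparable functionality.

```python
import random

def k_fold_indices(n_items, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Indices are shuffled once with a fixed seed, then dealt into k folds;
    each fold serves as the test set exactly once.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# With 100 items and k=5, every split has 80 training and 20 test items.
splits = list(k_fold_indices(100, k=5))
```

Because each item appears in exactly one test fold, averaging metrics over the folds reflects predictions on items that were unseen during training.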
4. RESULTS
Using data from 448,000 students, we fitted IRT models for
each of the 1,033 concepts. Assessing discrimination and difficulty
parameters of all 7,126 MCQs, we flagged 773 (10.8%)
for low discrimination, 789 (11.1%) for low difficulty, and
134 (1.9%) for high difficulty (Table 4). Across the domain-specific
datasets, we observed that Life/Earth Sciences and
Math showed the highest and lowest proportions of flagged
questions, respectively (low discrimination 12.1% vs. 7.3%,
low difficulty 14.2% vs. 4.3%, and high difficulty 2.1% vs.
1.6%). These findings suggest that Math MCQs within individual
concepts have more homogeneous difficulty levels
compared to Science MCQs. Overall, we find that the vast
majority of MCQs exhibit desirable IRT parameters. This
implies that our item screening models have to manage class
imbalance when trying to predict whether an item has desirable
difficulty and discrimination (RQ3).
Table 4: Analysis overview. The first section details the total
number of questions and those flagged for low discrimination
and low/high difficulty based on IRT analysis. The second
section reports the total number of IWFs identified and the average
number per question. The last section highlights the five
most common IWFs and their prevalence across domains.
                         All      Life/Earth   Physical   Math
# of questions           7,126    3,792        2,206      1,128
- low discrimination     773      459          232        82
- low difficulty         789      538          203        48
- high difficulty        134      78           38         18
# of IWFs                10,537   5,647        3,062      1,828
IWFs per question        1.479    1.489        1.388      1.621
ambiguous/unclear        31.3%    27.8%        30.0%      45.9%
fill in the blank        22.4%    29.2%        18.4%      7.8%
multiple correct         14.1%    14.3%        14.4%      12.7%
none of the above        12.5%    15.9%        10.8%      4.1%
lost sequence            10.1%    2.8%         13.6%      28.1%
In the IWF application, we found that most questions had
either no flaws or very few, with 82.5% containing at most
two (Figure 1). Among the three domains, Life/Earth Sci-
ences featured the highest proportion of flawless MCQs at
22.0%. Math exhibited the highest average number of IWFs
per question at 1.62. Still, all three domains demonstrated
a similar distribution of IWF numbers, with an overall av-
erage of 1.48 IWFs per question. Additional details are pro-
vided in Table 4, which highlights the prevalence of the five
most common IWFs within each domain. The most frequent
flaw identified across all domains was “ambiguous/unclear
language” in the question text or answer options, affecting
31.3% of all MCQs. We found “fill-in-the-blank” (fitb) and
“none-of-the-above” (nota) formulations to be more preva-
lent in the Life/Earth (29.2% fitb, 15.9% nota) and Phys-
ical Science (18.4% fitb, 10.8% nota) domains, compared
to Math (7.8% fitb, 4.1% nota). The “lost-sequence” flaw,
which indicates that answer options break chronological or
numerical order, was significantly more common in Math
MCQs at 28.1%. We continue by assessing the impact of
IWF numbers and specific IWF criteria on MCQs’ IRT difficulty
and discrimination parameters (RQ1 and RQ2).
[Figure 1: three histogram panels (Life/Earth, Physical, Math); x-axis: # of IWFs (0 to 6); y-axis: # of questions.]
Figure 1: Histograms illustrating the number of IWFs identified
per question across the three domain-specific datasets.
4.1 RQ1: IWF Correlations with IRT
We conducted regression analyses to study how the number
of IWFs relates to IRT discrimination and difficulty parameters
across aggregated and domain-specific datasets (Table 5).
First, focusing on the aggregated dataset containing
all 7,126 MCQs, we observe a significant negative relationship
between IWF frequencies and discrimination parameters
($\hat{\gamma}_1 = -0.080$, $p < 0.001$), indicating that items with
higher discrimination were less likely to contain IWFs. This
pattern was consistent across Life/Earth ($\hat{\gamma}_1 = -0.075$, $p < 0.001$)
and Physical Sciences ($\hat{\gamma}_1 = -0.139$, $p < 0.001$), suggesting
that well-discriminating items in these domains were
generally written with fewer flaws. The relationship between
IWF frequencies and difficulty parameters showed mixed results.
In Life/Earth, the domain with the most questions,
there was a significant negative association ($\hat{\beta}_1 = -0.093$,
$p < 0.001$), indicating that easier items were more prone
to contain flaws. However, in Physical Sciences and Math,
we did not find significant relationships between difficulty
parameters and IWF frequencies ($p = 0.417$ and $p = 0.239$,
respectively). This suggests that in these domains, the number
of IWFs may not be a reliable predictor of item difficulty.
[Figure 2: "Impact of IWF Criteria on IRT Discrimination/Difficulty". Coefficient plots for discrimination ($\hat{\gamma}_{1,f}$) and difficulty ($\hat{\beta}_{1,f}$) in four panels each (All, Life/Earth, Physical, Math); y-axis: the 19 IWF criteria; legend: Significant ($p < 0.05$), Not Significant ($p \geq 0.05$).]
Figure 2: Linear regression analysis examining the strength of association between each IWF criterion and IRT discrimination and
difficulty parameters across the domain-specific datasets. The figure indicates estimated coefficients, 95% confidence intervals,
and highlights statistically significant relationships ($p < 0.05$) in green.
Table 5: Linear regression analysis examining relationships
between the number of IWFs and IRT discrimination and
difficulty parameters across domains. We report estimated
coefficients, 95% confidence intervals, and corresponding p-values.
Parameter        All         Life/Earth   Physical    Math
Discrimination   -0.080      -0.075       -0.139      -0.016
                 (±0.011)    (±0.012)     (±0.020)    (±0.037)
                 p < 0.001   p < 0.001    p < 0.001   p = 0.393
Difficulty       -0.042      -0.093       0.019       0.030
                 (±0.025)    (±0.036)     (±0.045)    (±0.050)
                 p = 0.001   p < 0.001    p = 0.417   p = 0.239
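One way to read the reported intervals is that a coefficient is significant at the 5% level exactly when its 95% confidence interval excludes zero. The sketch below applies this check to the estimates and interval half-widths from Table 5, assuming the ± values denote the half-widths of the 95% intervals; the function and variable names are hypothetical.

```python
def ci_excludes_zero(estimate, half_width):
    """True if the interval [estimate - half_width, estimate + half_width] excludes 0."""
    return abs(estimate) > half_width

# (estimate, half-width) pairs copied from Table 5.
table5_discrimination = {
    "All": (-0.080, 0.011), "Life/Earth": (-0.075, 0.012),
    "Physical": (-0.139, 0.020), "Math": (-0.016, 0.037),
}
table5_difficulty = {
    "All": (-0.042, 0.025), "Life/Earth": (-0.093, 0.036),
    "Physical": (0.019, 0.045), "Math": (0.030, 0.050),
}

significant_discrimination = {d: ci_excludes_zero(*v) for d, v in table5_discrimination.items()}
significant_difficulty = {d: ci_excludes_zero(*v) for d, v in table5_difficulty.items()}
```

The outcome agrees with the reported p-values: the discrimination coefficient is non-significant only for Math, and the difficulty coefficients are non-significant for Physical Sciences and Math.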
4.2 RQ2: Identifying High-Impact IWFs
The second regression analysis aimed to identify which specific
IWF criteria are most strongly associated with question
discrimination and difficulty parameters. For each dataset
and IWF criterion $f \in \{1, \ldots, 19\}$, Figure 2 presents the
estimated discrimination ($\hat{\gamma}_{1,f}$) and difficulty coefficients ($\hat{\beta}_{1,f}$)
along with their 95% confidence intervals. Statistically significant
coefficients ($p < 0.05$) are highlighted in green.
Examining the combined dataset of 7,126 MCQs, we found
significant associations between IRT discrimination and 15
of the 19 IWF criteria, while 13 criteria were significantly
associated with difficulty. Among domain-specific datasets,
Math exhibited the highest number of significant discrimination
coefficients (12), despite having the smallest sample size
(N = 1,128), suggesting stronger associations with IWF criteria
compared to Life/Earth (9) and Physical Sciences (11).
In contrast, difficulty coefficients were more frequently significant
for Life/Earth (10) and Physical Sciences (10) than
for Math (6), highlighting differences in how IWFs impact
IRT parameters across educational domains.

Table 6: Regression task. We train models that employ IWF features to predict MCQs' discrimination and difficulty parameters. All Pearson correlation coefficients (r) are statistically significant at p < 0.001.

                 All                    Life/Earth             Physical               Math
Model            RMSE   R^2    r        RMSE   R^2    r        RMSE   R^2    r        RMSE   R^2    r
Discrimination
  Lin. Regr.     0.491  0.121  0.348    0.396  0.129  0.359    0.476  0.178  0.422    0.666  0.071  0.269
  Rnd. Forest    0.487  0.138  0.373    0.396  0.131  0.362    0.475  0.185  0.430    0.666  0.071  0.268
  Grad. Boost.   0.485  0.142  0.377    0.395  0.132  0.364    0.472  0.192  0.439    0.666  0.071  0.268
  MLP            0.486  0.141  0.375    0.396  0.130  0.361    0.474  0.187  0.433    0.666  0.072  0.273
Difficulty
  Lin. Regr.     1.128  0.115  0.338    1.161  0.189  0.435    1.099  0.102  0.319    0.914  0.017  0.141
  Rnd. Forest    1.067  0.209  0.457    1.109  0.259  0.510    1.054  0.174  0.420    0.917  0.012  0.156
  Grad. Boost.   1.064  0.213  0.462    1.111  0.257  0.507    1.062  0.161  0.405    0.914  0.018  0.136
  MLP            1.062  0.216  0.465    1.106  0.264  0.514    1.043  0.192  0.438    0.908  0.030  0.176

Shifting our focus to individual IWFs, we found that the
flaws most negatively associated with IRT discrimination
and difficulty parameters were “longest option correct”
($\hat{\gamma}_{1,f} = -0.370$, $\hat{\beta}_{1,f} = -0.691$), “more than one correct”
($\hat{\gamma}_{1,f} = -0.366$, $\hat{\beta}_{1,f} = -0.928$), and “all of the above”
($\hat{\gamma}_{1,f} = -0.322$, $\hat{\beta}_{1,f} = -0.806$). These flaws likely introduce textual
cues that inadvertently hint at the correct answer, diminishing
the quality of test items. We observed that the “lost
sequence” criterion had a positive discrimination coefficient
($\hat{\gamma}_{1,f} = 0.314$) and was also significant for two of the three
domain-specific datasets. Several IWFs were associated with
increased question difficulty, including “convergence cues”
($\hat{\beta}_{1,f} = 0.679$), “grammatical cues” ($\hat{\beta}_{1,f} = 0.526$), and
“negative wording” ($\hat{\beta}_{1,f} = 0.454$), suggesting that these flaws
may contribute to cognitive load or confusion beyond the
intended subject knowledge assessment.
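To make the per-criterion coefficients easier to interpret, the sketch below assumes that each IWF criterion enters its own simple regression as a 0/1 indicator (an assumption about the setup, labeled as such). Under that assumption, the OLS slope equals the difference between the mean parameter value of items with the flaw and items without it. The flag names and numbers are hypothetical toy values, not results from the study.

```python
def slope_for_binary_flag(flag, y):
    """OLS slope of y on a 0/1 indicator (with intercept).

    For a binary regressor this equals mean(y | flag = 1) - mean(y | flag = 0).
    """
    with_flaw = [yi for fi, yi in zip(flag, y) if fi == 1]
    without_flaw = [yi for fi, yi in zip(flag, y) if fi == 0]
    return sum(with_flaw) / len(with_flaw) - sum(without_flaw) / len(without_flaw)


# Hypothetical toy data: six MCQs annotated with two made-up IWF flags.
iwf_flags = {
    "all_of_the_above": [1, 0, 0, 1, 0, 0],
    "negative_wording": [0, 1, 0, 0, 1, 0],
}
difficulty = [-1.0, 0.8, 0.1, -0.6, 0.6, -0.1]

coefficients = {
    name: slope_for_binary_flag(flags, difficulty)
    for name, flags in iwf_flags.items()
}
# List criteria from most negative to most positive association.
for name, coef in sorted(coefficients.items(), key=lambda kv: kv[1]):
    print(f"{name}: {coef:+.3f}")
```

Read this way, a coefficient such as $\hat{\beta}_{1,f} = -0.806$ says that flagged items are, on average, that much lower on the difficulty scale than unflagged items.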
4.3 RQ3: IWF-Based IRT Predictions
Using the IWF annotations as input features, we trained
machine learning models to predict questions’ difficulty and
discrimination parameters. The performance of the resulting
models, as shown in Table 6, varied across educational
domains and predicted parameters. For the discrimination
parameter, when trained on the dataset comprising all 7,126
MCQs, the models achieved Pearson correlation coefficients
($r$) ranging from 0.348 to 0.377 and explained variance ($R^2$)
ranging from 0.121 to 0.142, indicating moderate predictive
strength. For the difficulty parameter, the non-linear models
performed best, with Pearson correlations ranging from
$r = 0.457$ (Random Forest) to $r = 0.465$ (MLP) and explained
variance from $R^2 = 0.209$ to $R^2 = 0.216$, suggesting
more effective utilization of the IWF features. Notably,
on this combined dataset, non-linear models (Random Forest,
Gradient Boosting, and MLP) consistently outperformed the
linear regression model, indicating that modeling non-linear
interactions between IWF features can improve predictive accuracy. We
observed substantial differences between the domain-specific
models. For instance, in Life/Earth sciences, the Random
Forest model achieved a Pearson correlation of $r = 0.510$
and an explained variance of $R^2 = 0.259$ for difficulty. However, in
Math, all models struggled with both discrimination (max
$R^2 = 0.072$) and difficulty predictions (max $R^2 = 0.030$).

Table 7: Classification task. We train models that employ IWF features to predict MCQs with low discrimination and low/high difficulty. To highlight class imbalance, we include a baseline assigning all MCQs to the majority class (rows labeled "Majority baseline").

                       Life/Earth             Physical
Task / Model           ACC    AUC    F1       ACC    AUC    F1
Disc. Low
  Majority baseline    0.879  0.500  0.000    0.895  0.500  0.000
  Log. Regr.           0.880  0.736  0.249    0.909  0.784  0.435
  Rnd. Forest          0.890  0.746  0.344    0.907  0.781  0.403
  Grad. Boost.         0.888  0.741  0.354    0.910  0.799  0.432
  MLP                  0.882  0.730  0.364    0.908  0.779  0.400
Diff. Low
  Majority baseline    0.858  0.500  0.000    0.908  0.500  0.000
  Log. Regr.           0.910  0.818  0.649    0.934  0.778  0.516
  Rnd. Forest          0.910  0.809  0.636    0.932  0.760  0.498
  Grad. Boost.         0.910  0.825  0.644    0.933  0.784  0.506
  MLP                  0.908  0.808  0.639    0.932  0.757  0.514
Diff. High
  Majority baseline    0.979  0.500  0.000    0.983  0.500  0.000
  Log. Regr.           0.979  0.684  0.000    0.983  0.789  0.000
  Rnd. Forest          0.979  0.706  0.000    0.983  0.618  0.000
  Grad. Boost.         0.979  0.688  0.000    0.983  0.727  0.000
  MLP                  0.979  0.681  0.021    0.983  0.675  0.000

We evaluate the utility of IWF features for predicting MCQs
with low discrimination and low/high difficulty in Life/Earth
and Physical Science datasets, where the prior regression
analysis confirmed the predictive power of IWFs. Table 7
shows that models trained on IWF features achieve AUC
scores of up to 0.746 (random forest) and 0.799 (gradient
boosting) for low discrimination in Life/Earth and Physical
Sciences, respectively, and 0.825 and 0.784 (both gradient
boosting) for low difficulty. While
the AUC scores suggest strong predictive performance, F1
scores remain relatively low for low-discrimination MCQs
(peaking at 0.364 for Life/Earth and 0.435 for Physical Sciences),
indicating challenges due to class imbalance (Table 4).
In contrast, F1 scores for low-difficulty questions are
considerably higher, with logistic regression achieving 0.649
for Life/Earth and 0.516 for Physical Sciences, suggesting
that IWFs are particularly informative for identifying
low-difficulty MCQs. However, none of the classifiers trained
to predict high-difficulty MCQs achieved higher accuracy than a baseline that
always predicts the majority class. This is likely due to class
imbalance and the fact that IWFs are not designed to assess
the knowledge required to answer domain-specific questions.

Figure 3: Precision and recall curves for predicting low-difficulty Life/Earth Science MCQs for different classifiers (logistic regression, random forest, gradient boosting, and MLP). The curves show the trade-off between precision and recall across classification thresholds (x-axis: classification threshold; y-axis: score). Using a threshold of 0.62, logistic regression achieves a precision of 0.801 and a recall of 0.472.

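The majority-class baseline rows in Table 7 (high accuracy, AUC of 0.500, F1 of 0.000) follow directly from class imbalance. The sketch below illustrates this on a toy label set with 10% positives; the helper functions are plain-Python stand-ins written for this illustration, and the data are made up.

```python
def accuracy(y_true, y_pred):
    """Fraction of items whose predicted label matches the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def f1_score(y_true, y_pred):
    """F1 of the positive class; defined here as 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy imbalanced labels: 2 positives (e.g., high-difficulty MCQs) out of 20.
y_true = [1, 1] + [0] * 18
baseline_pred = [0] * 20        # always predict the majority class
baseline_scores = [0.0] * 20    # constant score for every item

print(accuracy(y_true, baseline_pred))    # 0.9
print(f1_score(y_true, baseline_pred))    # 0.0
print(roc_auc(y_true, baseline_scores))   # 0.5
```

Because accuracy rewards the baseline for the sheer number of negatives, AUC and F1 are the more informative columns when comparing the trained classifiers against it.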
Since the IWF-based classification models demonstrated the
highest predictive performance for identifying low-difficulty
MCQs in the Life/Earth Science dataset, we conducted a
follow-up analysis to assess their potential for automated
item pre-screening. Figure 3 illustrates the trade-offs be-
tween precision and recall across different classification thresh-
olds. Recall represents the proportion of low-difficulty MCQs
correctly identified by the models, while precision reflects the
fraction of flagged MCQs that genuinely belong to the low-
difficulty category. By setting the classification threshold
to 0.62, the logistic regression model achieves a high preci-
sion of 0.801 while maintaining a moderate recall of 0.472.
This balance underscores the practical utility of IWF-based
classifiers in supporting experts in test item development
by enabling early identification of low-difficulty questions,
potentially lowering the need for student data collection.
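The threshold trade-off described above can be sketched as follows: items whose predicted probability of being low-difficulty meets the threshold are flagged, and precision and recall are computed over the flagged set. The function name, the use of `>=` for the cut-off, the convention of returning a precision of 1.0 when nothing is flagged, and the toy probabilities are all assumptions made for illustration; they are not the study's data or pipeline.

```python
def precision_recall_at(y_true, probs, threshold):
    """Precision and recall when flagging items with probability >= threshold."""
    flagged = [p >= threshold for p in probs]
    tp = sum(1 for t, f in zip(y_true, flagged) if f and t == 1)
    fp = sum(1 for t, f in zip(y_true, flagged) if f and t == 0)
    fn = sum(1 for t, f in zip(y_true, flagged) if not f and t == 1)
    # Convention (assumed): precision is 1.0 when no item is flagged.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy data: 1 = low-difficulty MCQ, with hypothetical model probabilities.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
probs = [0.9, 0.7, 0.5, 0.3, 0.65, 0.4, 0.3, 0.2, 0.1, 0.05]

for threshold in (0.25, 0.62):
    precision, recall = precision_recall_at(y_true, probs, threshold)
    print(f"threshold={threshold:.2f} precision={precision:.3f} recall={recall:.3f}")
```

Raising the threshold flags fewer items, which typically increases precision at the cost of recall; this is the behavior that makes a stricter threshold attractive for pre-screening, where false alarms waste expert review time.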
5. DISCUSSION
Our study integrated statistical and machine learning meth-
odologies with large-scale student data, capturing interac-
tions with thousands of questions. This approach provided
insights into how qualitative aspects of question design in-
fluence traditional measures of question performance derived
from item response theory (IRT) [17]. Specifically, we ex-
amined the relationships between the standard 19-criteria
IWF rubric [58] for MCQs and IRT parameters across var-
ious educational domains (e.g., math and natural sciences).
Our findings offer quantitative evidence demonstrating how
the frequency and specific types of IWFs impact question
discrimination and difficulty. Additionally, we validated the
utility of IWF evaluations as features for predicting IRT pa-
rameters and for identifying low-difficulty questions.
Across the three domains we examined, the frequency of
IWFs relates to item discrimination, significantly so in
Life/Earth and Physical Sciences though not in Math (Table 5),
while its relationship to item difficulty appears to be domain-specific.
MCQs with fewer flaws tended to show higher discrimination,
indicating that IWFs can diminish a question’s reliability.
As noted in prior work, certain IWFs may inadvertently
aid students in guessing the correct answer (e.g., “longest
option correct” or “all of the above”) [58, 22], while others
add confusion unrelated to the content itself. In either
case, such flaws can distort how accurately the question
discriminates between more- and less-knowledgeable students.
In Life/Earth Sciences, we observed that easier items
contained more flaws, suggesting that flaws which effectively
simplify questions are common in this domain; however, this
trend did not surface in Physical Sciences or Math. This
discrepancy underscores the possibility that IWFs and item
difficulty interact in a domain-dependent manner. It may
also reflect variations in how effectively the automated IWF
detection methods operate across different subject areas,
aligning with previous findings that showed stronger performance
in Humanities and Healthcare than in Chemistry [37].
Consequently, while IWF frequency appears to be a reliable
indicator of item discrimination overall, its utility in predicting
item difficulty likely hinges on both the domain and the
strengths or limitations of automated detection techniques.
Our findings indicate that most IWF criteria significantly
influence both item discrimination and difficulty, though certain
flaws exhibit particularly strong and consistent effects.
Specifically, flaws such as “longest option correct” and “all
of the above” show the most substantial negative associations
with both metrics. This is likely because they introduce
cues that enable students to guess the correct answer
without engaging with the intended knowledge. In contrast,
the “lost sequence” flaw had a positive effect on discrimination
across multiple datasets, suggesting that sequence-based
tasks may require more focused reasoning skills, thus
better distinguishing between higher- and lower-performing
students. Additionally, flaws such as “convergence cues”,
“grammatical cues”, and “negative wording” were associated
with higher item difficulty. This suggests that these flaws
may elevate cognitive load by requiring students to navigate
complex text structures rather than directly demonstrating
their domain knowledge. Consistent with previous work, the
presence of “all of the above” as an answer choice decreased
the difficulty of the question [48]. While some flaws consistently
diminish both question quality and rigor, our findings
highlight how specific IWFs exert their influence differently.
Some make questions easier to guess or introduce additional
cognitive demands that may confuse students, while others
appear to have more nuanced effects, warranting further
investigation into their role in shaping assessment validity
and fairness.
Machine learning models trained to predict question discrim-
ination and difficulty parameters based on IWF annotations
achieve moderate predictive power, with performance vary-
ing across subject domains (Table 6). Notably, prediction
accuracy was higher for Life/Earth and Physical Science
questions compared to Math, particularly for difficulty es-
timation. This aligns with our regression analysis, which
revealed stronger associations between IWFs and difficulty
parameters in the Science domains (Table 2). Across all
prediction tasks, non-linear models (e.g., Gradient Boosting
and MLP) consistently outperformed linear models, high-
lighting the need to capture complex interactions between
individual IWF features. To assess the practical utility of
IWF features for item screening, we evaluated classification
models designed to identify questions with low discrimina-
tion, low difficulty, and high difficulty. Our results suggest
that by selecting a classification threshold that balances precision
and recall, IWF-based models can assist domain experts
in identifying low-difficulty questions early. In particular,
criteria such as “all of the above” and “longest option
correct” showed strong associations with low item difficulty,
likely explaining why our classifiers performed significantly
better at predicting low-difficulty MCQs compared to high-
difficulty ones. The latter task likely requires models to
assess the specific knowledge needed to answer a question
within a given domain, underscoring the limitations of IWF
features for difficulty prediction.
By examining how qualitative question design guidelines [58]
align with robust statistical measurements derived from large
student datasets, our study contributes to ongoing research
efforts on characterizing effective instructional design prin-
ciples [29]. From a learning science perspective, rubrics
serve as distilled representations of expert knowledge used
to assess the quality of educational materials and instruc-
tion (e.g., [44, 59, 32]). Understanding the relationship be-
tween expert evaluation rubrics and student learning is cru-
cial, especially as AI-driven learning technologies increas-
ingly rely on textual descriptions of effective pedagogical
strategies [56, 52, 45, 27].
6. LIMITATIONS AND FUTURE WORK
While this study established relationships between a domain-
general IWF rubric [58] and statistical measures of question
difficulty and discrimination derived from IRT [17], several
limitations should be considered. First, our analysis was lim-
ited to science and mathematics courses within a large-scale
online tutoring platform at the middle and high school lev-
els. Future research should explore the applicability of these
findings across other subject areas, including language, hu-
manities, and social sciences. Additionally, further valida-
tion is needed in higher and professional education contexts,
particularly in medical education, where MCQ-based assess-
ments are widely used [18, 48]. Finally, beyond education,
IWFs may influence the reliability of other assessments, such
as psychological evaluations of personality traits and mental
states, where MCQs play a central role [49]. Investigating
these broader implications would enhance our understand-
ing of IWFs across diverse testing environments.
Although the IWF rubric provides a domain-general method-
ology for experts to assess the pedagogical soundness of test
items without relying on student data [58], our findings indi-
cate that its features are only moderate predictors of MCQ
difficulty and discrimination parameters (Table 6). By de-
sign, IWFs focus on broad design principles, such as en-
suring that all distractors are plausible, but do not capture
domain-specific nuances related to the knowledge required
to solve a particular test item. An item may fully adhere to
IWF guidelines yet still exhibit high or low difficulty levels
depending on the complexity of the subject knowledge it assesses.
To address these limitations, future research could
explore hybrid approaches that combine the interpretability
of IWF-based evaluations with the predictive power of deep
learning models, which estimate IRT parameters based on
semantic analyses of question text [2]. Another promising
direction is the development of enhanced evaluation rubrics
that integrate human domain expertise with data-driven in-
sights generated by machine learning algorithms, thereby
improving their predictive accuracy [30, 34, 7].
Across the courses examined in this study, the IWF anal-
ysis identified an average of 1.48 writing flaws per MCQ
(Table 4). While many of these flaws may have minimal
impact on student learning outcomes, addressing them re-
mains essential for ensuring content quality. Future work
will focus on developing AI-assisted content authoring tools
to support domain experts in MCQ generation and refine-
ment [37]. Recent advancements in LLM-enabled pipelines
for question generation and validation offer promising di-
rections [10, 33, 23]. To enhance the efficiency of question
validation, future research will explore natural language pro-
cessing and reinforcement learning algorithms to reduce the
amount of student response data required for reliable IRT
parameter identification [35, 61, 54, 53].
Lastly, we emphasize the broader utility of evaluation method-
ologies that integrate generative AI to scale qualitative as-
sessments based on learning science rubrics with statistical
measures derived from student data. This hybrid approach
can generate robust and actionable insights for improving
educational practice. Future research will extend this frame-
work to evaluate other types of educational materials, such
as hints [44, 51, 57], textbooks [59], and illustrations [4]. Ad-
ditional directions include examining the predictive validity
of rubric-based evaluations in educational domains such as
project-based learning [14, 20, 1], discourse analysis [32, 12]
and programming education [14, 50].
7. CONCLUSION
This paper explored relationships between the 19-criteria
Item-Writing Flaws (IWFs) rubric, a domain-general quali-
tative method for question validation [58], and item response
theory (IRT), a traditional, data-driven approach to assess-
ing question quality [17]. Using an automated method, we
applied the rubric to over 7,000 multiple-choice questions
spanning mathematics, physical sciences, and life/earth sci-
ence domains, analyzing how the number and types of IWFs
impact question difficulty and discrimination parameters.
Three key findings emerged. First, a higher number of IWFs
was associated with lower item difficulty and discrimination
in life/earth sciences, while the relationship was less con-
sistent in mathematics and physical sciences. Second, spe-
cific IWF criteria strongly correlated with question difficulty,
such as “longest option correct” for easier items and “con-
vergence cues” for harder ones, demonstrating how superfi-
cial textual cues can compromise an otherwise well-designed
question. Third, while models trained on IWF features did
not match the precision of IRT-based methods, they showed
promise for preliminary screening, particularly in identifying
low-difficulty questions.
These findings show the dual role of domain-agnostic and
domain-specific factors in developing high-quality test items.
On the one hand, a rubric that flags generic writing flaws
can serve as a scalable “first pass”, helping content authors
identify potential design issues before pilot testing. On the
other hand, IWF features alone are only moderate predictors
of IRT parameters, with predictive strength varying across
educational domains. This highlights that IWF-based eval-
uation cannot replace traditional student data-dependent
methods, such as those embodied in IRT. Future work could
explore hybrid approaches that integrate the interpretability
of human-readable rubrics with the flexibility of machine-
learning models capable of capturing semantic information
related to domain-specific knowledge to enhance the accu-
racy of IRT parameter predictions. This systematic align-
ment of qualitative rubrics with quantitative validation not
only helps improve item quality at scale but also ensures
that computer-assisted assessments support fair, reliable,
and pedagogically meaningful testing.
Acknowledgments
We thank Microsoft for support in the form of Azure com-
puting and access to the OpenAI API through a grant from
their Accelerate Foundation Model Academic Research Pro-
gram. We thank the CK-12 Foundation (ck12.org) for pro-
viding access to their learning materials and to data on stu-
dent responses to those materials. This research was sup-
ported in part by the AFOSR under award FA95501710218.
References
[1] G. Aher, R. Schmucker, T. Mitchell, and Z. C. Lip-
ton. Ai mentors for student projects: Spotting early
issues in computer science proposals. arXiv preprint
arXiv:2503.05782 , 2025.
[2] S. AlKhuzaey, F. Grasso, T. R. Payne, and V. Tamma.
Text-based question difficulty prediction: A systematic
review of automatic approaches. International Journal
of Artificial Intelligence in Education , 34(3):862–914,
2024.
[3] D. Allen and K. Tanner. Rubrics: Tools for making
learning goals and evaluation criteria explicit for both
teachers and learners. CBE—Life Sciences Education ,
5(3):197–203, 2006.
[4] A. Angra and S. M. Gardner. The graph rubric: De-
velopment of a teaching, learning, and research tool.
CBE—Life Sciences Education , 17(4):ar65, 2018.
[5] T. Arif, S. Asthana, and K. Collins-Thompson. Gener-
ation and assessment of multiple-choice questions from
video transcripts using large language models. In Pro-
ceedings of the Eleventh ACM Conference on Learning@
Scale , pages 530–534, 2024.
[6] F. B. Baker. The basics of item response theory . ERIC,
2001.
[7] A. Barany, N. Nasiar, C. Porter, A. F. Zambrano, A. L.
Andres, D. Bright, M. Shah, X. Liu, S. Gao, J. Zhang,
et al. Chatgpt for education research: exploring the
potential of large language models for qualitative code-
book development. In International conference on arti-
ficial intelligence in education , pages 134–149. Springer,
2024.
[8] S. P. Bates, R. K. Galloway, J. Riise, and D. Homer.
Assessing the quality of a student-generated question
repository. Physical Review Special Topics-Physics Education
Research, 10(2):020105, 2014.
[9] L. Benedetto. A quantitative study of nlp approaches to
question difficulty estimation. In International Confer-
ence on Artificial Intelligence in Education , pages 428–
434. Springer, 2023.
[10] S. Bhandari, Y. Liu, Y. Kwak, and Z. A. Pardos. Evalu-
ating the psychometric properties of chatgpt-generated
questions. Computers and Education: Artificial Intelli-
gence , 7:100284, 2024.
[11] R. J. Boland, N. A. Lester, and E. Williams. Writ-
ing multiple-choice questions. Academic Psychiatry ,
34:310–316, 2010.
[12] C. Borchers, K. Yang, J. Lin, N. Rummel, K. R.
Koedinger, and V. Aleven. Combining dialog acts and
skill modeling: What chat interactions enhance learn-
ing rates during ai-supported peer tutoring? In Pro-
ceedings of the 17th International Conference on Edu-
cational Data Mining , 2024.
[13] M. Byrd and S. Srivastava. Predicting difficulty and
discrimination of natural language questions. In Pro-
ceedings of the 60th Annual Meeting of the Association
for Computational Linguistics (Volume 2: Short Pa-
pers) , pages 119–130, 2022.
[14] V. Cateté, E. Snider, and T. Barnes. Developing a
rubric for a creative cs principles lab. In Proceedings
of the 2016 ACM Conference on Innovation and Tech-
nology in Computer Science Education , pages 290–295,
2016.
[15] R. P. Chalmers. mirt: A multidimensional item re-
sponse theory package for the r environment. Journal
of statistical Software , 48:1–29, 2012.
[16] E. Costello, J. Holland, and C. Kirwan. The future
of online testing and assessment: question quality in
moocs. International Journal of Educational Technol-
ogy in Higher Education , 15(1):1–14, 2018.
[17] R. J. De Ayala. The theory and practice of item re-
sponse theory . Guilford, New York, NY, USA, 2013.
[18] S. M. Downing. The effects of violating standard item
writing principles on tests and students: the conse-
quences of using flawed test items on achievement ex-
aminations in medical education. Advances in health
sciences education , 10:133–143, 2005.
[19] A. H. Elgadal and A. A. Mariod. Item analysis of
multiple-choice questions (mcqs): assessment tool for
quality assurance measures. Sudan Journal of Medical
Sciences , 16(3):334–346, 2021.
[20] M. Goyal, C. Gupta, and V. Gupta. A meta-analysis ap-
proach to measure the impact of project-based learning
outcome with program attainment on student learning
using fuzzy inference systems. Heliyon , 8(8), 2022.
[21] T. Haladyna. Developing and validating test items .
Routledge, 2013.
[22] T. M. Haladyna, S. M. Downing, and M. C. Rodriguez.
A review of multiple-choice item-writing guidelines for
classroom assessment. Applied measurement in educa-
tion, 15(3):309–333, 2002.
[23] J. He-Yueya, N. D. Goodman, and E. Brunskill.
Evaluating and optimizing educational content with
large language model judgments. arXiv preprint
arXiv:2403.02795 , 2024.
[24] A. Horbach, I. Aldabe, M. Bexte, O. L. de Lacalle, and
M. Maritxalar. Linguistic appropriateness and peda-
gogic usefulness of reading comprehension questions.
In Proceedings of the Twelfth Language Resources and
Evaluation Conference , pages 1753–1762, 2020.
[25] A. Hurst, A. Lerer, A. P. Goucher, A. Perelman,
A. Ramesh, A. Clark, A. Ostrow, A. Welihinda,
A. Hayes, A. Radford, et al. Gpt-4o system card. arXiv
preprint arXiv:2410.21276 , 2024.
[26] G. Janssen, V. Meier, and J. Trace. Building a bet-
ter rubric: Mixed methods rubric revision. Assessing
writing , 26:51–66, 2015.
[27] I. Jurenka, M. Kunesch, K. R. McKee, D. Gillick,
S. Zhu, S. Wiltberger, S. M. Phal, K. Hermann,
D. Kasenberg, A. Bhoopchand, et al. Towards re-
sponsible development of generative ai for educa-
tion: An evaluation-driven approach. arXiv preprint
arXiv:2407.12687 , 2024.
[28] V. Kind. Development of evidence-based, student-
learning-oriented rubrics for pre-service science teach-
ers’ pedagogical content knowledge. International Jour-
nal of Science Education , 41(7):911–943, 2019.
[29] K. R. Koedinger, J. L. Booth, and D. Klahr. Instruc-
tional complexity and the science to constrain it. Sci-
ence, 342(6161):935–937, 2013.
[30] P. W. Koh, T. Nguyen, Y. S. Tang, S. Mussmann,
E. Pierson, B. Kim, and P. Liang. Concept bottle-
neck models. In International conference on machine
learning , pages 5338–5348. PMLR, 2020.
[31] G. Kurdi, J. Leo, B. Parsia, U. Sattler, and S. Al-Emari.
A systematic review of automatic question generation
for educational purposes. International Journal of Ar-
tificial Intelligence in Education , 30:121–204, 2020.
[32] X. Liu, J. Zhang, A. Barany, M. Pankiewicz, and R. S.
Baker. Assessing the potential and limits of large lan-
guage models in qualitative coding. In International
Conference on Quantitative Ethnography , pages 89–103.
Springer, 2024.
[33] Y. Liu, S. Bhandari, and Z. A. Pardos. Leveraging
llm-respondents for item evaluation: a psychometric analysis.
arXiv preprint arXiv:2407.10899, 2024.
[34] J. M. Ludan, Q. Lyu, Y. Yang, L. Dugan, M. Yatskar,
and C. Callison-Burch. Interpretable-by-design text
classification with iteratively generated concept bottleneck.
arXiv preprint arXiv:2310.19660, 2023.
[35] A. D. McCarthy, K. P. Yancey, G. T. LaFlair, J. Egbert,
M. Liao, and B. Settles. Jump-starting item parame-
ters for adaptive language tests. In Proceedings of the
2021 conference on empirical methods in natural lan-
guage processing , pages 883–899, 2021.
[36] P. McCoubrie. Improving the fairness of multiple-
choice questions: a literature review. Medical teacher ,
26(8):709–712, 2004.
[37] S. Moore, E. Costello, H. A. Nguyen, and J. Stam-
per. An automatic question usability evaluation toolkit.
In International Conference on Artificial Intelligence in
Education , pages 31–46. Springer, 2024.
[38] S. Moore, H. A. Nguyen, N. Bier, T. Domadia, and
J. Stamper. Assessing the quality of student-generated
short answer questions using gpt-3. In European con-
ference on technology enhanced learning , pages 243–257.
Springer, 2022.
[39] S. Moore, H. A. Nguyen, T. Chen, and J. Stamper.
Assessing the quality of multiple-choice questions using
gpt-4 and rule-based methods. In European Confer-
ence on Technology Enhanced Learning , pages 229–245.
Springer, 2023.
[40] N. Mulla and P. Gharpure. Automatic question gener-
ation: a review of methodologies, datasets, evaluation
metrics, and applications. Progress in Artificial Intelli-
gence , 12(1):1–32, 2023.
[41] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel,
B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer,
R. Weiss, V. Dubourg, et al. Scikit-learn: Machine
learning in python. the Journal of machine Learning
research , 12:2825–2830, 2011.
[42] M. J. Peeters. Measuring rater judgments within learn-
ing assessments—part 2: A mixed approach to creating
rubrics. Currents in Pharmacy Teaching and Learning ,
7(5):662–668, 2015.
[43] T. W. Price, Y. Dong, R. Zhi, B. Paaßen, N. Lytle,
V. Cateté, and T. Barnes. A comparison of the quality
of data-driven programming hint generation algorithms.
International Journal of Artificial Intelligence
in Education , 29:368–395, 2019.
[44] T. W. Price, Y. Dong, R. Zhi, B. Paaßen, N. Lytle,
V. Cateté, and T. Barnes. A comparison of the quality
of data-driven programming hint generation algorithms.
International Journal of Artificial Intelligence
in Education , 29:368–395, 2019.
[45] R. Puech, J. Macina, J. Chatain, M. Sachan, and
M. Kapur. Towards the pedagogical steering of
large language models for tutoring: A case study
with modeling productive failure. arXiv preprint
arXiv:2410.03781 , 2024.
[46] D. Reyes, A. Jimenez, P. Dartnell, S. Lions, and S. Ríos.
Multiple-choice questions difficulty prediction with neural
networks. In International Conference in Methodologies
and intelligent Systems for Technology Enhanced
Learning, pages 11–22. Springer, 2023.
[47] T. Rusch, P. B. Lowry, P. Mair, and H. Treiblmaier.
Breaking free from the limitations of classical test the-
ory: Developing and measuring information systems
scales using item response theory. Information & Man-
agement , 54(2):189–203, 2017.
[48] B. R. Rush, D. C. Rankin, and B. J. White. The impact
of item-writing flaws and item complexity on exami-
nation item difficulty and discrimination value. BMC
medical education , 16:1–10, 2016.
[49] J. Rust and S. Golombok. Modern psychometrics: The
science of psychological assessment . Routledge, 2014.
[50] D. Saito, R. Yajima, H. Washizaki, and Y. Fukazawa.
Validation of rubric evaluation for programming educa-
tion. Education Sciences , 11(10):656, 2021.
[51] R. Schmucker, N. Pachapurkar, S. Bala, M. Shah, and
T. Mitchell. Learning to give useful hints: Assistance
action evaluation and policy improvements. In Respon-
sive and Sustainable Educational Futures , pages 383–
398, Cham, 2023. Springer Nature Switzerland.
[52] R. Schmucker, M. Xia, A. Azaria, and T. Mitchell.
Ruffle&riley: Insights from designing and evaluating
a large language model-based conversational tutoring
system. In Artificial Intelligence in Education , pages
75–90, Cham, 2024. Springer Nature Switzerland.
[53] J. Sharpnack, K. Hao, P. Mulcaire, K. Bicknell,
G. LaFlair, K. Yancey, and A. A. von Davier. Bandit-
cat and autoirt: Machine learning approaches to com-
puterized adaptive testing and item calibration. arXiv
preprint arXiv:2410.21033 , 2024.
[54] J. Sharpnack, P. Mulcaire, K. Bicknell, G. LaFlair, and
K. Yancey. Autoirt: Calibrating item response the-
ory models with automated machine learning. arXiv
preprint arXiv:2409.08823 , 2024.
[55] K. M. Smith, S. Geletta, and A. McArdle. The use of
rubrics in the clinical evaluation of podiatric medical
students: objectification of the subjective experience.
Journal of the American Podiatric Medical Association ,
106(1):60–67, 2016.
[56] S. Sonkar, L. Liu, D. B. Mallick, and R. Baraniuk.
Class: A design framework for building intelligent tu-
toring systems based on learning science principles. In
Conference on Empirical Methods in Natural Language
Processing , 2023.
[57] J. Stamper, R. Xiao, and X. Hou. Enhancing llm-
based feedback: Insights from intelligent tutoring sys-
tems and the learning sciences. In International Confer-
ence on Artificial Intelligence in Education , pages 32–
43. Springer, 2024.
[58] M. Tarrant, A. Knierim, S. K. Hayes, and J. Ware. The
frequency of item writing flaws in multiple-choice ques-
tions used in high stakes nursing assessments. Nurse
Education Today , 26(8):662–671, 2006.
[59] S. W. Watson, X. Shan, B. T. George, and M. L. Peters.
Alignment of select elementary science curricula to the
next generation science standards via the equip rubric.
Curriculum Perspectives, 41(1):17–26, 2021.
[60] S. Xu, X. Huang, C. K. Lo, G. Chen, and M. S.-y.
Jong. Evaluating the performance of chatgpt and gpt-
4o in coding classroom discourse data: A study of syn-
chronous online mathematics instruction. Computers
and Education: Artificial Intelligence , 7:100325, 2024.
[61] K. P. Yancey, A. Runge, G. Laflair, and P. Mulcaire.
Bert-irt: Accelerating item piloting with bert embed-
dings and explainable irt models. In Proceedings of the
19th Workshop on Innovative Use of NLP for Building
Educational Applications (BEA 2024) , pages 428–438,
2024.