Authors: Shvetank Prakash, Andrew Cheng, Jason Yik, Arya Tschand, Radhika Ghosal, Ikechukwu Uchendu, Jessica Quaye, Jeffrey Ma, Shreyas Grampurohit, Sofia Giannuzzi, Arnav Balyan, Fin Amin, Aadya Pipersenia, Yash Choudhary, Ankita Nayak, Amir Yazdanbakhsh, Vijay Janapa Reddi
QuArch: A Question-Answering Dataset for
AI Agents in Computer Architecture
Shvetank Prakash∗, Andrew Cheng∗, Jason Yik∗, Arya Tschand∗, Radhika Ghosal∗,
Ikechukwu Uchendu∗, Jessica Quaye∗, Jeffrey Ma∗, Shreyas Grampurohit§, Sofia Giannuzzi∗,
Arnav Balyan, Fin Aminγ, Aadya Pipersenia§, Yash Choudhary§, Ankita Nayakϕ,
Amir Yazdanbakhsh†, Vijay Janapa Reddi∗
∗Harvard University, §Indian Institute of Technology Bombay,
γNorth Carolina State University, ϕQualcomm AI Research, †Google DeepMind
Abstract —We introduce QuArch, a dataset of 1500 human-
validated question-answer pairs designed to evaluate and enhance
language models’ understanding of computer architecture. The
dataset covers areas including processor design, memory systems,
and performance optimization. Our analysis highlights a signifi-
cant performance gap: the best closed-source model achieves 84%
accuracy, while the top small open-source model reaches 72%.
We observe notable struggles in memory systems, interconnection
networks, and benchmarking. Fine-tuning with QuArch improves
small model accuracy by up to 8%, establishing a foundation for
advancing AI-driven computer architecture research. The dataset
and leaderboard are at https://harvard-edge.github.io/QuArch/.
I. INTRODUCTION
Generative Artificial Intelligence (GenAI) has transformed
domain-specific tools across diverse fields such as medicine,
mathematics, law, finance, and software engineering [1]. In
contrast, hardware engineering has lagged significantly in
adopting AI-driven solutions. This gap is evident in both the
limitations of current language models (LMs) and the scarcity
of specialized datasets tailored for hardware. For instance,
LMs often perform poorly on the engineering tasks in general
benchmarks like MMLU-Pro [2], highlighting the inadequacy of
existing models in understanding domain-specific intricacies. While
electronic design automation (EDA) has seen recent progress
with datasets for tasks such as register-transfer level (RTL)
generation [3], [4], security analysis [5], and verification [6],
computer architecture remains underrepresented. Without re-
sources to benchmark and advance AI models, the field is
limited in its ability to improve AI-driven solutions.
Datasets play a key role in enabling AI agents. While
general-purpose datasets provide broad knowledge, domain-
specific datasets are indispensable for developing expertise
in areas like computer architecture. These targeted datasets
enable AI models to not only demonstrate foundational under-
standing but also tackle advanced problem-solving tasks within
specific domains [7]. Mastery of domain knowledge is a pre-
requisite for sophisticated reasoning [8], [9]. In architecture,
such proficiency is essential for developing practical AI-driven
tools and agents [10]. Without a deep understanding of core
concepts—such as processor execution, memory hierarchy,
and parallelism—it becomes impossible to conduct analyses
of the complex trade-offs inherent in system design.

Example 1
Topic: Processor Architecture
Q: In ___, each processor has its own local memory system.
A: (a) symmetric multiprocessing (b) asymmetric multiprocessing
(c) core-based multiprocessing (d) clustered multiprocessing

Example 2
Topic: Storage Systems
Q: Moving compute closer to the ___ in solid state drives
(SSDs) offers higher bandwidth but introduces challenges in
managing frequent errors.
A: (a) controller (b) NAND dies (c) cache (d) DRAM

Example 3
Topic: Architectural Support
Q: ___ translates the logical address into a physical address.
A: (a) MMU (b) Translator (c) Compiler (d) Linker

Fig. 1: Example QAs from QuArch for various topics curated
from different sources. The correct answers, bolded in the
original figure, are (d), (b), and (a), respectively.
To address this challenge, we introduce QuArch (pronounced
'quark'), the first Question-Answering Dataset
specifically tailored for Computer Architecture. This dataset
addresses a critical gap in evaluating LMs’ understanding of
architectural concepts. QuArch comprises 1,500 rigorously
curated question-answer pairs, manually annotated by domain
experts. It spans both foundational computer architecture
principles and contemporary topics, such as deep learning
accelerators and quantum computing architectures.
QuArch is designed to assess an LM’s ability to retrieve
and apply domain-specific knowledge, a prerequisite for ad-
dressing advanced problem-solving challenges. The dataset
serves as a benchmark for both theoretical understanding and
practical application in computer architecture.
Leveraging QuArch, we provide the first comprehensive
evaluation of the architectural knowledge encoded in state-of-the-art
(SOTA) LMs. Our results show LM accuracies ranging
from 39% to 84%, revealing a significant 12-percentage-point
knowledge gap between the best-performing small open-source LM
and the best large closed-source LM. Additionally, our experiments
demonstrate the utility of QuArch in fine-tuning small LMs, yielding
accuracy improvements of 5.4 to 8.3 percentage points.

arXiv:2501.01892v2 [cs.AR] 6 Jan 2025
[Figure 2: pipeline diagram. Steps: (1) curate “Archipedia” from text
sources; (2) extract chunks of text; (3) an agent (GPT, Claude, Gemini)
generates QAs grounded by the text; (4) LLM-as-a-judge and human
labeling; (5) final QuArch QA dataset, which also draws on online QAs.]
Fig. 2: QuArch dataset construction pipeline.
II. RELATED WORK
Recent efforts have explored the use of GenAI in hardware
design. For example, NVIDIA’s ChipNeMo [11] introduced
foundation models tailored for chip design tasks, while other
studies have focused on developing LM-based tools for RTL
generation [4] and hardware verification [6]. Evaluation
datasets in this domain fall into two groups: those built for
specific implementation tasks, such as VerilogEval [3] for RTL
generation, and general-purpose benchmarks, such as MMLU [12],
which assess engineering knowledge broadly across disciplines.
However, none of these prior works specifically evaluates
LM understanding of computer architecture concepts.
This critical gap limits the ability to assess and advance
LM capabilities for architectural challenges. QuArch addresses
this unmet need by introducing a focused question-answering
dataset specifically designed to evaluate architectural knowl-
edge (Figure 1). The dataset combines synthetic data gen-
eration [13] with rigorous expert validation, ensuring both
comprehensive coverage and high-quality questions.
III. QUARCH
In this section, we discuss the construction and characteris-
tics of the first version of the dataset: QuArch v0.1.
A. Dataset Curation: The Archipedia Corpus
The construction of QuArch follows a systematic process,
as depicted in Figure 2. We first curated “Archipedia,”¹ a term
we use to describe a comprehensive compilation of computer
architecture knowledge that was assembled for this work.
Archipedia synthesizes five decades of information, draw-
ing from academic literature, educational materials, technical
documentation, and industry sources across the computing
landscape. This extensive corpus captures the evolution of the
field over the past 50 years, incorporating contributions from
leading institutions, researchers, and organizations globally.
Currently, the corpus exceeds 1 billion tokens in size.
Archipedia covers the full spectrum of computing systems,
from foundational topics in computer architecture to cutting-edge
technologies. It includes domains such as VLSI design
and technology, embedded systems and IoT, parallel and distributed
processing, hardware-software co-design, and design
automation. In addition, the corpus integrates specialized areas
such as computer-aided design tools, hardware security, and
quantum computing. To ensure comprehensive coverage, this
resource is further enriched with advanced lecture materials
and thus provides a diverse and balanced knowledge base.

Fig. 3: Distribution of computer architecture topics in QuArch.

¹ Data was downloaded and evaluated solely by Harvard University.
B. Dataset Generation: QA Creation
The knowledge curation phase (Steps 1 and 2 in Figure 2)
established a foundation of well-accepted architectural
concepts and principles by leveraging diverse sources. This
effort utilized the Archipedia corpus, which provided a com-
prehensive resource for generating questions for QuArch.
In the QA generation phase (Step 3), commercial LMs
were used to synthesize questions grounded in the academic
content of Archipedia to ensure technical rigor. The LMs were
tasked with creating cloze-style multiple-choice QAs [7] to
balance educational value with practical assessment.
The validation phase (Step 4) involved a multi-tiered
review process that combined human expertise with LM as-
sistance. Questions derived from undergraduate-level sources
were reviewed by an expert with graduate-level architectural
expertise, supplemented by LM validation using the “LLM-as-
a-judge” technique [14]. Advanced topics were evaluated by a
pool of eight experts, and QAs were independently validated
by three reviewers who reached consensus to ensure accuracy.
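The three-reviewer check can be illustrated with a simple unanimity filter. This is only a sketch: the text does not specify how consensus was reached, so the unanimity rule, the label strings, and the QA identifiers below are all assumptions made for illustration.

```python
def unanimous(labels):
    """True when every reviewer gave the same label (and at least one label exists)."""
    return len(labels) > 0 and len(set(labels)) == 1


def filter_by_consensus(reviews):
    """Keep the ids of QAs that all reviewers marked 'valid' (assumed rule)."""
    return [qa_id for qa_id, labels in reviews.items()
            if unanimous(labels) and labels[0] == "valid"]


# Hypothetical reviewer labels for three QAs, three reviewers each.
reviews = {
    "qa-001": ["valid", "valid", "valid"],
    "qa-002": ["valid", "invalid", "valid"],
    "qa-003": ["invalid", "invalid", "invalid"],
}
kept = filter_by_consensus(reviews)
```

In practice, disagreements such as `qa-002` would more likely be discussed and resolved than discarded outright; the filter only shows the strictest possible reading.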
To further enhance the validation process, human experts
and LM reviewers received contextual fragments of the source
text, transforming the task into a focused reading compre-
hension exercise. This approach enabled the identification and
removal of questions lacking definitive answers or those too
narrowly scoped for meaningful assessment. The validation
process, facilitated through the Label Studio platform [15],
ensured the resulting dataset effectively tests both foundational
principles and complex system trade-offs. The final dataset
(Step 5) supplemented these expert-validated questions with
additional QAs freely available from an online education
platform, enriching the dataset’s depth and coverage.
C. Dataset Coverage: Architecture Topics
QuArch v0.1 contains 1,547 question-answer pairs. It captures
the breadth of architecture in 13 core areas derived
from key themes of the past decade (Figure 3). Processor
architecture accounts for the largest proportion of questions
(32%), followed by memory systems (22%) and interconnection
networks (10%). The topic distribution was determined
through a two-stage classification process using OpenAI’s
text-embedding-3-large model [16]. In the first stage,
the embedding model generates vector representations
of topics and questions, and cosine similarity [17] is used to
identify the three most relevant topics for each question. In
the second stage, an LM selects the final topic from these
candidates for accurate categorization and scalability.

Fig. 4: QuArch accuracy ranges from 39% to 84%. Larger models
(>70B parameters) attain a maximum of 84%. The best small model
(<10B parameters) trails by 12 percentage points.
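The first stage of the topic classification in Section III-C, ranking topics by cosine similarity and keeping the top three, can be sketched as follows. The three-dimensional vectors are toy stand-ins for real embeddings, and the function names are hypothetical; a real run would obtain the vectors from the embedding model instead.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_topics(question_vec, topic_vecs, k=3):
    """Return the k topic names whose vectors are most similar to the question."""
    ranked = sorted(
        topic_vecs,
        key=lambda name: cosine_similarity(question_vec, topic_vecs[name]),
        reverse=True,
    )
    return ranked[:k]


# Toy vectors standing in for real embeddings (hypothetical values).
topic_vecs = {
    "processor architecture": [0.9, 0.1, 0.0],
    "memory systems": [0.1, 0.9, 0.1],
    "interconnection networks": [0.0, 0.2, 0.9],
    "storage systems": [0.2, 0.7, 0.3],
}
question_vec = [0.1, 0.8, 0.2]  # e.g., a question about caches

candidates = top_k_topics(question_vec, topic_vecs)
# Stage 2 (not shown): an LM picks the final topic from `candidates`.
```

Narrowing to three candidates keeps the second-stage LM prompt short while still letting the LM resolve close calls, such as memory systems versus storage systems in this toy case.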
Figure 1 illustrates the diversity and depth of architectural
concepts covered in QuArch. Foundational questions assess
core principles of processor architecture, such as “In clustered
multiprocessing, each processor has its own local memory
system.” More advanced questions probe emerging trade-offs,
exemplified by Question #2 in Figure 1, which addresses near-
storage computing: Moving compute closer to NAND dies in
solid-state drives increases bandwidth while mitigating the risk
of silent data corruption. Finally, Question #3 highlights the
dataset’s comprehensive scope by including critical system-
level concepts, such as virtual memory and address translation.
IV. RESULTS
To evaluate the state of computer architecture knowledge
embedded in LMs, we assess knowledge retrieval capabilities
of SOTA models and explore opportunities to improve them.
A. Experimental Setup
We evaluated both open-source and closed-source language
models on QuArch. The evaluation included large-scale closed-source
models such as GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro,
as well as open-source models with parameter counts ranging
from 1B to 70B from Google, Meta, and Mistral AI. Each
model was presented with questions in a multiple-choice
format, requiring the selection of the correct answer from four
options. Models were prompted to “act as computer architecture
experts” and were evaluated in a zero-shot setting, with
no additional context provided beyond the questions themselves.
This setup was designed to test the models’ baseline
understanding of architectural concepts. Accuracy (percentage
of correct answers) served as the primary evaluation metric.

Fig. 5: Performance breakdown across topics. Color intensity
indicates each topic’s relative (intra-model) performance, with
darker green showing stronger understanding and darker red
showing weaker areas. Memory systems and interconnects are
more challenging for current LMs. Benchmarking also shows
low performance but only accounts for 1% of the QAs.
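A minimal sketch of the zero-shot multiple-choice protocol and the accuracy metric from the experimental setup is given below. The exact prompt wording, the answer-parsing rule, and the `ask_model` callable are assumptions; the latter stands in for whatever API call a given model requires.

```python
def format_prompt(question, options):
    """Build a zero-shot multiple-choice prompt with lettered options."""
    letters = "abcd"
    lines = ["Act as a computer architecture expert.", f"Q: {question}"]
    lines += [f"({letters[i]}) {opt}" for i, opt in enumerate(options)]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)


def accuracy(dataset, ask_model):
    """Percentage of questions where the model's first letter matches the key."""
    correct = 0
    for item in dataset:
        prompt = format_prompt(item["question"], item["options"])
        reply = ask_model(prompt).strip().lower()
        if reply[:1] == item["answer"]:
            correct += 1
    return 100.0 * correct / len(dataset)


# Two QAs taken from Fig. 1.
dataset = [
    {"question": "___ translates the logical address into a physical address.",
     "options": ["MMU", "Translator", "Compiler", "Linker"], "answer": "a"},
    {"question": "In ___, each processor has its own local memory system.",
     "options": ["symmetric multiprocessing", "asymmetric multiprocessing",
                 "core-based multiprocessing", "clustered multiprocessing"],
     "answer": "d"},
]


def always_a(prompt):
    """Placeholder for a real model call: always answers 'a'."""
    return "a"


score = accuracy(dataset, always_a)
```

With four options per question, a constant or random answerer lands near 25% on a large set, which gives a useful floor when reading the 39% to 84% range.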
B. Understanding of Architecture Concepts
Figure 4 presents the baseline performance of LMs on
QuArch. The top-performing model achieves 84% accuracy,
reflecting a relatively strong but incomplete understanding of
architecture concepts. In particular, a substantial knowledge
gap exists: the best-performing small open-source LM (<10B
parameters) trails by 12 percentage points on the same questions. The
observed performance ceiling of 84% suggests that current
LMs still have significant room for improvement in under-
standing the fundamentals of computer architecture concepts.
These findings have important implications for the develop-
ment of agentic tools for hardware design [10]. While current
models exhibit a reasonable grasp of basic architectural con-
cepts, they may require supplementary support or verification
mechanisms when addressing complex system-level decisions.
C. Analysis by Architecture Topics
Figure 5 presents a heatmap illustrating LM performance
across various architecture topics. Each cell contains the raw
accuracy values for a specific topic and corresponding model.
To account for substantial differences in raw accuracy due to
varying model capacities, the heatmap employs color gradients
to represent performance relative to each model’s overall
accuracy. Dark green denotes the strongest performance for a
given model, while dark red highlights the weakest, offering
a clearer perspective on relative strengths and weaknesses.
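One plausible way to compute such an intra-model relative score is to subtract each model's overall accuracy from its per-topic accuracy. The text does not state the exact normalization behind the color scale, so the offset rule and the numbers below are assumptions, not values from the paper.

```python
def relative_performance(topic_accuracy, overall_accuracy):
    """Offset of each topic's accuracy from the model's overall accuracy, in points."""
    return {topic: round(acc - overall_accuracy, 1)
            for topic, acc in topic_accuracy.items()}


# Hypothetical per-topic accuracies (%) for a single model.
topic_accuracy = {
    "processor architecture": 88.0,
    "memory systems": 76.0,
    "interconnection networks": 74.0,
}
overall = 82.0

offsets = relative_performance(topic_accuracy, overall)
weakest = min(offsets, key=offsets.get)  # topic that would be colored darkest red
```

Coloring by offset rather than raw accuracy lets a small model and a large model be compared on where they are weak, not merely on how weak they are.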
Model               MMLU (%)   GPQA (%)   QuArch (%)
GPT-4o              88.7       53.6       83.0
Claude 3.5 Sonnet   88.3       59.4       84.0
TABLE I: SOTA QA benchmark accuracy versus QuArch.

The analysis reveals distinct patterns in how LMs comprehend
different architecture topics. Models exhibit their
strongest performance in topics such as EDA concepts, IP
design and manufacturing, parallel processing architecture
fundamentals, and compute workload characterization (relative
to other topics). Conversely, significant challenges are
observed in three critical areas: memory systems, intercon-
nection networks, and benchmarking and measurement. These
weaknesses are consistent with expectations, as questions in
these areas often involve nuanced system-level interactions
and intricate technical trade-offs. Although general trends are
evident, individual model performance varies considerably.
For example, Llama-3.1-70B demonstrates notable strength in
storage systems QAs, outperforming larger models such as
Claude 3.5 Sonnet and GPT-4o. Furthermore, it performs well
in benchmarking and measurement—a topic that accounts for
only 1% of the dataset (Figure 3)—unlike most other models.
These findings underscore specific gaps in the architectural
knowledge of current LMs. Such disparities likely stem from
differences in the data blends used during model training.
For instance, the stronger performance in parallel processor
architectures, such as GPUs and deep learning accelerators,
likely reflects increased community focus and the abundance
of academic text on these topics. In contrast, the weaker
performance in memory systems and interconnection networks
suggests these complex, system-level concepts warrant greater
emphasis when developing future AI-based tools for architects.
D. QuArch as an Architecture Benchmark
To validate the effectiveness of QuArch as a benchmark, we
evaluate its ability to distinguish between the capabilities of
different LMs and compare it to established QA benchmarks.
An effective benchmark should strike a balance: it should
be neither trivial to solve nor so challenging that it is
unattainable. QuArch satisfies this criterion, as even the
top-performing models, such as GPT-4o and Claude 3.5 Sonnet,
achieve accuracies of only 83–84%. This highlights substantial
room for improvement in architectural understanding.
This performance ceiling is consistent with what is ob-
served on other well-established QA benchmarks. As shown
in Table I, the same models achieve comparable performance
(88–89% [18], [19]) on general knowledge benchmarks like
MMLU [12], which include engineering-related QAs. This
suggests that QuArch’s difficulty aligns well with other technical
assessments. In contrast, GPQA [20], one of the most
challenging benchmarks available, yields lower accuracies
(54–59%) due to its hand-crafted questions that require advanced
QA skills beyond knowledge retrieval. QuArch’s positioning
between MMLU and GPQA demonstrates its value as
a meaningful and balanced measure of model capabilities. Fur-
thermore, the room for improvement, particularly in advanced
architecture topics, highlights QuArch’s potential to track
progress in LMs’ understanding of computer architecture.
However, further expansion and refinement of the dataset will
be necessary to fully realize its benchmarking potential.

Model          Original (%)   Fine-Tuned (%)   Improvement (%)
Gemma-2-2B     38.7 ± 3.0     47.0 ± 3.0       +8.3
Llama-3.2-3B   59.6 ± 1.0     65.0 ± 2.0       +5.4
TABLE II: Mean and standard deviation of test accuracy when
fine-tuning on QuArch using repeated random train-test splits.
E. QuArch as an Architecture Training Dataset
ML datasets serve dual purposes: benchmarking and train-
ing. In this section, we investigate whether QuArch can
enhance the domain-specific knowledge of LMs through fine-
tuning. To this end, we fine-tuned instruction-tuned variants
of small open-source LMs using an 80-20 train-test split. To
ensure robustness in the training evaluation, we employed
repeated random train-test splitting with five different seeds.
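The repeated random 80-20 splitting described here can be sketched as follows. The helper names, the seed values, and the placeholder scorer are assumptions; a real `evaluate` would fine-tune a model on the training portion and return its test accuracy. The sketch reports the sample standard deviation, which may differ from the convention used for Table II.

```python
import random
import statistics


def random_split(items, train_fraction, seed):
    """Shuffle with a fixed seed and cut into train and test portions."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def repeated_evaluation(items, evaluate, seeds, train_fraction=0.8):
    """Run evaluate(train, test) once per seed; return mean and std of the scores."""
    scores = [evaluate(*random_split(items, train_fraction, s)) for s in seeds]
    return statistics.mean(scores), statistics.stdev(scores)


# Placeholder data and scorer, used only to exercise the splitting logic.
items = list(range(100))


def fake_evaluate(train, test):
    """Stand-in for fine-tuning: scores the share of even ids in the test set."""
    return 100.0 * sum(1 for x in test if x % 2 == 0) / len(test)


mean_acc, std_acc = repeated_evaluation(items, fake_evaluate, seeds=[0, 1, 2, 3, 4])
```

Fixing the seed per split makes each of the five runs reproducible, while varying it across runs exposes how sensitive the reported accuracy is to which questions land in the test set.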
Table II reports mean test set accuracy improvements after
fine-tuning on QuArch across different train-test splits. The
results indicate significant performance gains, even with the
relatively small size of the dataset. On average, the instruction-tuned
variants of Llama-3.2-3B and Gemma-2-2B improved
by 5.4 and 8.3 percentage points, respectively. These substantial
gains underscore the potential of QuArch to enhance
LMs’ understanding of computer architecture. Moreover, the
results highlight the importance of developing larger, diverse
datasets to further advance AI-based solutions in this domain.
V. CONCLUSION
QuArch is the first question-answering dataset for com-
puter architecture, providing a means to evaluate domain
knowledge. Through QuArch, we uncover both the strengths
and limitations of SOTA LMs to reveal substantial room for
improvement in this domain. QuArch benchmarks knowledge
retrieval—an essential foundation for advancing the integration
of AI in computer architecture—but future datasets must build
on QuArch to evaluate more complex capabilities, including
advanced reasoning, system-level planning, and architectural
design. Realizing these goals will require large-scale collabo-
ration between academia and industry, ensuring AI tools for
architecture evolve to meet the field’s growing demands.
VI. ACKNOWLEDGEMENTS
We would like to acknowledge and thank the many stu-
dents from around the world who were instrumental in the
early development efforts of QuArch: Timothy Akert, Adolfo
Balderas, Nandini Bhattad, Rushi Chavda, Arjun Chitla, Anjali
Choudhary, Hardik Jagga, Satyapragnya Kar, Subash Katel,
Anirvan Krishna, Ujjwal Kumar, Sushant Kumar, Shrestha
Mishra, Vishnu Nand, Andrew Ogundimu, Arianna Ording,
Elsa Oreen, Aarya Pakhale, Umair Paranjpye, Daksh Parikh,
Haresh Perera, LuzZelenia Perez-McNeill, Frankie Francisco
Pinaminjarez, Arnav Raj, Debarpita Saha, Anmol Saraf, Fa-
tima Shah, Akshit Sharma, Aishani Singh, Hartej Soin, Yurun
Song, Akarsh Srivastava, Samuel Stankiewicz, Keerthana Sub-
ramanian, Kavya Subramanian, Sujay Suribhotla, Aman Tyagi,
Maneesh Vaddi, Ankit Walishetti, Judong Wang, Max Zhang,
and Junchen Zhao. We also extend our gratitude to Derek
Lockhart and James Laudon for their valuable feedback on this
paper. Finally, we would like to thank Christos Kozyrakis for
the use of a qualifying exam question that helped motivate the
need for a dataset assessing foundational domain knowledge.
REFERENCES
[1] H. Chen et al., “An overview of domain-specific foundation
model: key technologies, applications and challenges,” arXiv preprint
arXiv:2409.04267, 2024.
[2] Y. Wang et al., “MMLU-Pro: A more robust and challenging multi-task
language understanding benchmark,” Advances in NeurIPS, 2024.
[3] M. Liu et al., “VerilogEval: Evaluating large language models for Verilog
code generation,” in 2023 IEEE/ACM ICCAD. IEEE, 2023, pp. 1–8.
[4] S. Liu et al., “RTLCoder: Outperforming GPT-3.5 in design RTL generation
with our open-source dataset and lightweight solution,” in 2024 IEEE
LLM Aided Design Workshop (LAD). IEEE, 2024, pp. 1–5.
[5] H. Pearce et al., “Examining zero-shot vulnerability repair with large
language models,” in 2023 IEEE SP. IEEE, 2023.
[6] M. Cosler et al., “nl2spec: Interactively translating unstructured natural
language to temporal logics with large language models,” in CAV 2023,
2023, pp. 383–396.
[7] A. Rogers et al., “QA dataset explosion: A taxonomy of NLP resources
for question answering and reading comprehension,” ACM Computing
Surveys, vol. 55, no. 10, pp. 1–45, 2023.
[8] R. G. Duncan, “The role of domain-specific knowledge in generative
reasoning about complicated multileveled phenomena,” Cognition and
Instruction, vol. 25, no. 4, pp. 271–336, 2007.
[9] S. H. Krieger, “Domain knowledge and the teaching of creative legal
problem solving,” Clinical L. Rev., vol. 11, p. 149, 2004.
[10] S. Damani et al., “WarpDrive: An agentic workflow for ninja GPU
transformations,” 2024.
[11] M. Liu et al., “ChipNeMo: Domain-adapted LLMs for chip design,” arXiv
preprint arXiv:2311.00176, 2023.
[12] D. Hendrycks et al., “Measuring massive multitask language understanding,”
ICLR, 2021.
[13] S. Shakeri et al., “End-to-end synthetic data generation for domain
adaptation of question answering systems,” EMNLP, 2020.
[14] L. Zheng et al., “Judging LLM-as-a-judge with MT-Bench and Chatbot
Arena,” Advances in NeurIPS, vol. 36, pp. 46595–46623, 2023.
[15] M. Tkachenko et al., “Label Studio: Data labeling software,” 2020-2024.
[Online]. Available: https://github.com/HumanSignal/label-studio
[16] OpenAI, “Vector Embeddings,” https://platform.openai.com/docs/guides/embeddings/,
2024.
[17] F. Rahutomo et al., “Semantic cosine similarity,” in The 7th international
student conference on advanced science and technology ICAST, vol. 4,
no. 1. University of Seoul South Korea, 2012, p. 1.
[18] OpenAI, “Hello GPT-4o,” https://openai.com/index/hello-gpt-4o/, 2024.
[19] Anthropic, “Introducing Claude 3.5 Sonnet,”
https://www.anthropic.com/news/claude-3-5-sonnet, 2024.
[20] D. Rein et al., “GPQA: A graduate-level Google-proof Q&A benchmark,”
arXiv preprint arXiv:2311.12022, 2023.