loader
Generating audio...
Extracting PDF content...

arxiv

Paper 2501.01892

QuArch: A Question-Answering Dataset for AI Agents in Computer Architecture

Authors: Shvetank Prakash, Andrew Cheng, Jason Yik, Arya Tschand, Radhika Ghosal, Ikechukwu Uchendu, Jessica Quaye, Jeffrey Ma, Shreyas Grampurohit, Sofia Giannuzzi, Arnav Balyan, Fin Amin, Aadya Pipersenia, Yash Choudhary, Ankita Nayak, Amir Yazdanbakhsh, Vijay Janapa Reddi

Published: 2025-01-03

Abstract:

We introduce QuArch, a dataset of 1500 human-validated question-answer pairs designed to evaluate and enhance language models' understanding of computer architecture. The dataset covers areas including processor design, memory systems, and performance optimization. Our analysis highlights a significant performance gap: the best closed-source model achieves 84% accuracy, while the top small open-source model reaches 72%. We observe notable struggles in memory systems, interconnection networks, and benchmarking. Fine-tuning with QuArch improves small model accuracy by up to 8%, establishing a foundation for advancing AI-driven computer architecture research. The dataset and leaderboard are at https://harvard-edge.github.io/QuArch/.

Paper Content: on Alphaxiv
Page 1: 1 QuArch: A Question-Answering Dataset for AI Agents in Computer Architecture Shvetank Prakash∗Andrew Cheng∗Jason Yik∗Arya Tschand∗Radhika Ghosal∗ Ikechukwu Uchendu∗Jessica Quaye∗Jeffrey Ma∗Shreyas Grampurohit§Sofia Giannuzzi∗ Arnav Balyan Fin AminγAadya Pipersenia§Yash Choudhary§Ankita Nayakϕ Amir Yazdanbakhsh†Vijay Janapa Reddi∗ ∗Harvard University§Indian Institute of Technology Bombay γNorth Carolina State UniversityϕQualcomm AI Research†Google DeepMind Abstract —We introduce QuArch, a dataset of 1500 human- validated question-answer pairs designed to evaluate and enhance language models’ understanding of computer architecture. The dataset covers areas including processor design, memory systems, and performance optimization. Our analysis highlights a signifi- cant performance gap: the best closed-source model achieves 84% accuracy, while the top small open-source model reaches 72%. We observe notable struggles in memory systems, interconnection networks, and benchmarking. Fine-tuning with QuArch improves small model accuracy by up to 8%, establishing a foundation for advancing AI-driven computer architecture research. The dataset and leaderboard are at https://harvard-edge.github.io/QuArch/. I. I NTRODUCTION Generative Artificial Intelligence (GenAI) has transformed domain-specific tools across diverse fields such as medicine, mathematics, law, finance, and software engineering [1]. In contrast, hardware engineering has lagged significantly in adopting AI-driven solutions. This gap is evident in both the limitations of current language models (LMs) and the scarcity of specialized datasets tailored for hardware. For instance, engineering tasks often perform poorly on general benchmarks like MMLU-Pro [2], highlighting the inadequacy of existing models in understanding domain-specific intricacies. While electronic design automation (EDA) has seen recent progress with datasets for tasks such as register-transfer level (RTL) generation [3], [4], security analysis [5], and verification [6], computer architecture remains underrepresented. Without re- sources to benchmark and advance AI models, the field is limited in its ability to improve AI-driven solutions. Datasets play a key role in enabling AI agents. While general-purpose datasets provide broad knowledge, domain- specific datasets are indispensable for developing expertise in areas like computer architecture. These targeted datasets enable AI models to not only demonstrate foundational under- standing but also tackle advanced problem-solving tasks within specific domains [7]. Mastery of domain knowledge is a pre- requisite for sophisticated reasoning [8], [9]. In architecture, such proficiency is essential for developing practical AI-driven tools and agents [10]. Without a deep understanding of core concepts—such as processor execution, memory hierarchy, and parallelism—it becomes impossible to conduct analyses of the complex trade-offs inherent in system design.Example 1 Topic: Processor Architecture Q:In___, each processor has its own local memory system. A:(a) symmetric multiprocessing (b) asymmetric multiprocessing (c) core-based multiprocessing (d) clustered multiprocessing Example 2 Topic: Storage Systems Q:Moving compute closer to the ___ in solid state drives (SSDs) offers higher bandwidth but introduces challenges in managing frequent errors. A:(a) controller (b) NAND dies (c) cache (d) DRAM Example 3 Topic: Architectural Support Q:___ translates the logical address into a physical address. A: (a) MMU (b) Translator (c) Compiler (d) Linker Fig. 1: Example QAs from QuArch for various topics curated from different sources. The bolded answer is correct. To address this challenge, we introduce QuArch (pro- nounced ’quark’)—the first Question-Answering Dataset specifically tailored for Computer Arch itecture. This dataset addresses a critical gap in evaluating LMs understanding of architectural concepts. QuArch comprises 1,500 rigorously curated question-answer pairs, manually annotated by domain experts. It spans both foundational computer architecture principles and contemporary topics, such as deep learning accelerators and quantum computing architectures. QuArch is designed to assess an LMs’ ability to retrieve and apply domain-specific knowledge, a prerequisite for ad- dressing advanced problem-solving challenges. The dataset serves as a benchmark for both theoretical understanding and practical application in computer architecture. Leveraging QuArch, we provide the first comprehensive evaluation of the architectural knowledge encoded in state-of- the-art (S OTA) LMs. Our results show LM accuracies ranging from 39% to 84%, revealing a significant 12% knowledge gap between the best-performing small open-source LM and large closed-source LM. Additionally, our experiments demonstrate the utility of QuArch in fine-tuning small LMs, yielding performance improvements of 5.4%–8.3%.arXiv:2501.01892v2 [cs.AR] 6 Jan 2025 Page 2: 2 13. The VLIW architecture follows _____ approach to achieve parallelism. a) SISD b) MIMD c) MISD d)RAW Curate “Archipedia” Agent (GPT, Claude, Gemini) Generate QA (grounded by text) LLM-as-a-judge & Human Labeling Final QA Dataset QuAr ch In clustered multiprocessing, each processor has its own local memory system... Extract Chunks of Text 1 2 5 43Text Sources Online QAs 13. The VLIW architecture follows _____ approach to achieve parallelism. a) SISD b) MIMD c) MISD d) SIMD Fig. 2: QuArch dataset construction pipeline. II. R ELATED WORK Recent efforts have explored the use of GenAI in hardware design. For example, NVIDIA’s ChipNeMo [11] introduced foundation models tailored for chip design tasks, while other studies have focused on developing LM-based tools for RTL generation [4] and hardware verification [6]. Additionally, evaluation datasets in this domain have been created for specific implementation tasks, such as VerilogEval [3] for RTL generation, and general-purpose benchmarks like MMLU [12], which assess engineering knowledge broadly across disci- plines. However, none of these prior works specifically eval- uate LM understanding of computer architecture concepts. This critical gap limits the ability to assess and advance LM capabilities for architectural challenges. QuArch addresses this unmet need by introducing a focused question-answering dataset specifically designed to evaluate architectural knowl- edge (Figure 1). The dataset combines synthetic data gen- eration [13] with rigorous expert validation, ensuring both comprehensive coverage and high-quality questions. III. Q UARCH In this section, we discuss the construction and characteris- tics of the first version of the dataset: QuArch v0.1. A. Dataset Curation: The Archipedia Corpus The construction of QuArch follows a systematic process, as depicted in Figure 2. We first curated “Archipedia,”1a term we use to describe a comprehensive compilation of computer architecture knowledge that was assembled for this work. Archipedia synthesizes five decades of information, draw- ing from academic literature, educational materials, technical documentation, and industry sources across the computing landscape. This extensive corpus captures the evolution of the field over the past 50 years, incorporating contributions from leading institutions, researchers, and organizations globally. Currently, the corpus exceeds 1 billion tokens in size. Archipedia covers the full spectrum of computing systems, from foundational topics in computer architecture to cutting- edge technologies. It includes domains such as VLSI design 1Data was downloaded and evaluated solely by Harvard University. Fig. 3: Distribution of computer architecture topics in QuArch. and technology, embedded systems and IoT, parallel and dis- tributed processing, hardware-software co-design, and design automation. In addition, the corpus integrates specialized areas such as computer-aided design tools, hardware security, and quantum computing. To ensure comprehensive coverage, this resource is further enriched with advanced lecture materials and thus provides a diverse and balanced knowledge base. B. Dataset Generation: QA Creation The knowledge curation phase (Steps 1and 2in Fig- ure 2) established a foundation of well-accepted architectural concepts and principles by leveraging diverse sources. This effort utilized the Archipedia corpus, which provided a com- prehensive resource for generating questions for QuArch. In the QA generation phase (Step 3), commercial LMs were used to synthesize questions grounded in the academic content of Archipedia to ensure technical rigor. The LMs were tasked with creating cloze-style multiple-choice QAs [7] to balance educational value with practical assessment. The validation phase (Step 4) involved a multi-tiered review process that combined human expertise with LM as- sistance. Questions derived from undergraduate-level sources were reviewed by an expert with graduate-level architectural expertise, supplemented by LM validation using the “LLM-as- a-judge” technique [14]. Advanced topics were evaluated by a pool of eight experts, and QAs were independently validated by three reviewers who reached consensus to ensure accuracy. To further enhance the validation process, human experts and LM reviewers received contextual fragments of the source text, transforming the task into a focused reading compre- hension exercise. This approach enabled the identification and removal of questions lacking definitive answers or those too narrowly scoped for meaningful assessment. The validation process, facilitated through the Label Studio platform [15], ensured the resulting dataset effectively tests both foundational principles and complex system trade-offs. The final dataset (Step 5) supplemented these expert-validated questions with additional QAs freely available from an online education platform, enriching the dataset’s depth and coverage. C. Dataset Coverage: Architecture Topics QuArch v0.1 contains 1,547 question-answer pairs. It cap- tures the breadth of architecture in 13 core areas derived from key themes of the past decade (Figure 3). Processor Page 3: 3 Fig. 4: QuArch accuracy ranges from 39%-84%. Larger mod- els (>70B parameters) attain a max of 84%. Small model (<10B parameters) performance drops 12% in comparison. architecture accounts for the largest proportion of questions (32%), followed by memory systems (22%) and interconnec- tion networks (10%). The topic distribution was determined through a two-stage classification process using OpenAI’s text-embedding-3-large model [16]. In the first stage, the word embedding model generates vector representations of topics and questions, and cosine similarity [17] is used to identify the three most relevant topics for each question. In the second stage, an LM selects the final topic from these candidates for accurate categorization and scalability. Figure 1 illustrates the diversity and depth of architectural concepts covered in QuArch. Foundational questions assess core principles of processor architecture, such as “In clustered multiprocessing, each processor has its own local memory system.” More advanced questions probe emerging trade-offs, exemplified by Question #2 in Figure 1, which addresses near- storage computing: Moving compute closer to NAND dies in solid-state drives increases bandwidth while mitigating the risk of silent data corruption. Finally, Question #3 highlights the dataset’s comprehensive scope by including critical system- level concepts, such as virtual memory and address translation. IV. R ESULTS To evaluate the state of computer architecture knowledge embedded in LMs, we assess knowledge retrieval capabilities of S OTA models and explore opportunities to improve them. A. Experimental Setup We evaluated both open-source and closed-source language models on QuArch. The evaluation included large-scale mod- els such as GPT-4o, Claude-3.5 Sonnet, and Gemini-1.5 Pro, as well as open-source models with parameter counts ranging from 1B to 70B from Google, Meta, and Mistral AI. Each model was presented with questions in a multiple-choice format, requiring the selection of the correct answer from four options. Models were prompted to “act as computer architec- ture experts” and were evaluated in a zero-shot setting, with no additional context provided beyond the questions them- selves. This setup was designed to test the models’ baseline Fig. 5: Performance breakdown across topics. Color inten- sity indicates topic’s relative (intra-model) performance, with darker green showing stronger understanding and darker red showing weaker areas. Memory systems and interconnects are more challenging for current LMs. Benchmarking also shows low performance but only accounts for 1% of the QAs. understanding of architectural concepts. Accuracy (percentage of correct answers) served as the primary evaluation metric. B. Understanding of Architecture Concepts Figure 4 presents the baseline performance of LMs on QuArch. The top-performing model achieves 84% accuracy, reflecting a relatively strong but incomplete understanding of architecture concepts. In particular, a substantial knowledge gap exists: the best-performing small open-source LM ( <10B parameters) underperforms by 12% on the same questions. The observed performance ceiling of 84% suggests that current LMs still have significant room for improvement in under- standing the fundamentals of computer architecture concepts. These findings have important implications for the develop- ment of agentic tools for hardware design [10]. While current models exhibit a reasonable grasp of basic architectural con- cepts, they may require supplementary support or verification mechanisms when addressing complex system-level decisions. C. Analysis by Architecture Topics Figure 5 presents a heatmap illustrating LM performance across various architecture topics. Each cell contains the raw accuracy values for a specific topic and corresponding model. To account for substantial differences in raw accuracy due to varying model capacities, the heatmap employs color gradients to represent performance relative to each model’s overall accuracy . Dark green denotes the strongest performance for a given model, while dark red highlights the weakest, offering a clearer perspective on relative strengths and weaknesses. The analysis reveals distinct patterns in how LMs com- prehend different architecture topics. Models exhibit their strongest performance in topics such as EDA concepts, IP design and manufacturing, parallel processing architecture Page 4: 4 Model MMLU (%) GPQA (%) QuArch (%) GPT-4o 88.7 53.6 83.0 Claude 3.5 Sonnet 88.3 59.4 84.0 TABLE I: S OTA QA benchmark accuracy versus QuArch. fundamentals, and compute workload characterization (rela- tive to other topics). Conversely, significant challenges are observed in three critical areas: memory systems, intercon- nection networks, and benchmarking and measurement. These weaknesses are consistent with expectations, as questions in these areas often involve nuanced system-level interactions and intricate technical trade-offs. Although general trends are evident, individual model performance varies considerably. For example, Llama-3.1-70B demonstrates notable strength in storage systems QAs, outperforming larger models such as Claude 3.5 Sonnet and GPT-4o. Furthermore, it performs well in benchmarking and measurement—a topic that accounts for only 1% of the dataset (Figure 3)—unlike most other models. These findings underscore specific gaps in the architectural knowledge of current LMs. Such disparities likely stem from differences in the data blends used during model training. For instance, the stronger performance in parallel processor architectures, such as GPUs and deep learning accelerators, likely reflects increased community focus and the abundance of academic text on these topics. In contrast, the weaker performance in memory systems and interconnection networks suggests these complex, system-level concepts warrant greater emphasis when developing future AI-based tools for architects. D. QuArch as an Architecture Benchmark To validate the effectiveness of QuArch as a benchmark, we evaluate its ability to distinguish between the capabilities of different LMs and compare it to established QA benchmarks. An effective benchmark should strike a balance—it should neither be trivial to solve nor overly challenging so that it’s unattainable. QuArch satisfies this criterion, as even the top- performing models, such as GPT-4o and Claude 3.5-Sonnet, achieve accuracies of only 83–84%. This highlights substantial room for improvement in architectural understanding. This performance ceiling is consistent with what is ob- served on other well-established QA benchmarks. As shown in Table I, the same models achieve comparable performance (88–89% [18], [19]) on general knowledge benchmarks like MMLU [12], which include engineering-related QAs. This suggests that QuArch’s difficulty aligns well with other tech- nical assessments. In contrast, GPQA [20], one of the most challenging benchmarks available, achieves lower accuracies (54–59%) due to its hand-crafted questions that require ad- vanced QA skills beyond knowledge retrieval. QuArch’s posi- tioning between MMLU and GPQA demonstrates its value as a meaningful and balanced measure of model capabilities. Fur- thermore, the room for improvement, particularly in advanced architecture topics, highlights QuArch’s potential to track progress in LMs’ understanding of computer architecture. However, further expansion and refinement of the dataset will be necessary to fully realize its benchmarking potential.Model Original (%) Fine-Tuned (%) Improvement (%) Gemma-2-2B 38.7±3.0 47.0±3.0 +8.3 Llama-3.2-3B 59.6±1.0 65.0±2.0 +5.4 TABLE II: Mean and standard deviation of test accuracy when fine-tuning on QuArch using repeated random train-test splits. E. QuArch as an Architecture Training Dataset ML datasets serve dual purposes: benchmarking and train- ing. In this section, we investigate whether QuArch can enhance the domain-specific knowledge of LMs through fine- tuning. To this end, we fine-tuned instruction-tuned variants of small open-source LMs using an 80-20 train-test split. To ensure robustness in the training evaluation, we employed repeated random train-test splitting with five different seeds. Table II reports mean test set accuracy improvements after fine-tuning on QuArch across different train-test splits. The results indicate significant performance gains, even with the relatively small size of the dataset. On average, instruction- tuned variants of Gemma-2-2B and Llama-3.2-3B demon- strated improvements ranging from 5.4% to 8.3%. These sub- stantial gains underscore the potential of QuArch to enhance LMs’ understanding of computer architecture. Moreover, the results highlight the importance of developing larger, diverse datasets to further advance AI-based solutions in this domain. V. C ONCLUSION QuArch is the first question-answering dataset for com- puter architecture, providing a means to evaluate domain knowledge. Through QuArch, we uncover both the strengths and limitations of SoTA LMs to reveal substantial room for improvement in this domain. QuArch benchmarks knowledge retrieval—an essential foundation for advancing the integration of AI in computer architecture—but future datasets must build on QuArch to evaluate more complex capabilities, including advanced reasoning, system-level planning, and architectural design. Realizing these goals will require large-scale collabo- ration between academia and industry, ensuring AI tools for architecture evolve to meet the field’s growing demands. VI. A CKNOWLEDGEMENTS We would like to acknowledge and thank the many stu- dents from around the world who were instrumental in the early development efforts of QuArch: Timothy Akert, Adolfo Balderas, Nandini Bhattad, Rushi Chavda, Arjun Chitla, Anjali Choudhary, Hardik Jagga, Satyapragnya Kar, Subash Katel, Anirvan Krishna, Ujjwal Kumar, Sushant Kumar, Shrestha Mishra, Vishnu Nand, Andrew Ogundimu, Arianna Ording, Elsa Oreen, Aarya Pakhale, Umair Paranjpye, Daksh Parikh, Haresh Perera, LuzZelenia Perez-McNeill, Frankie Francisco Pinaminjarez, Arnav Raj, Debarpita Saha, Anmol Saraf, Fa- tima Shah, Akshit Sharma, Aishani Singh, Hartej Soin, Yurun Song, Akarsh Srivastava, Samuel Stankiewicz, Keerthana Sub- ramanian, Kavya Subramanian, Sujay Suribhotla, Aman Tyagi, Maneesh Vaddi, Ankit Walishetti, Judong Wang, Max Zhang, and Junchen Zhao. We also extend our gratitude to Derek Lockhart and James Laudon for their valuable feedback on this paper. Finally, we would like to thank Christos Kozyrakis for the use of a qualifying exam question that helped motivate the need for a dataset assessing foundational domain knowledge. Page 5: 5 REFERENCES [1] H. Chen et al. , “An overview of domain-specific foundation model: key technologies, applications and challenges,” arXiv preprint arXiv:2409.04267 , 2024. [2] Y . Wang et al. , “Mmlu-pro: A more robust and challenging multi-task language understanding benchmark,” Advances in NeurIPS , 2024. [3] M. Liu et al. , “Verilogeval: Evaluating large language models for verilog code generation,” in 2023 IEEE/ACM ICCAD . IEEE, 2023, pp. 1–8. [4] S. Liu et al. , “Rtlcoder: Outperforming gpt-3.5 in design rtl generation with our open-source dataset and lightweight solution,” in 2024 IEEE LLM Aided Design Workshop (LAD) . IEEE, 2024, pp. 1–5. [5] H. Pearce et al. , “Examining zero-shot vulnerability repair with large language models,” in 2023 IEEE SP . IEEE, 2023. [6] M. Cosler et al. , “nl2spec: Interactively translating unstructured natural language to temporal logics with large language models,” in CAV 2023 , 2023, p. 383–396. [7] A. Rogers et al. , “Qa dataset explosion: A taxonomy of nlp resources for question answering and reading comprehension,” ACM Computing Surveys , vol. 55, no. 10, pp. 1–45, 2023. [8] R. G. Duncan, “The role of domain-specific knowledge in generative reasoning about complicated multileveled phenomena,” Cognition and Instruction , vol. 25, no. 4, pp. 271–336, 2007. [9] S. H. Krieger, “Domain knowledge and the teaching of creative legal problem solving,” Clinical L. Rev. , vol. 11, p. 149, 2004. [10] S. Damani et al. , “Warpdrive: An agentic workflow for ninja gpu transformations,” 2024. [11] M. Liu et al. , “Chipnemo: Domain-adapted llms for chip design,” arXiv preprint arXiv:2311.00176 , 2023. [12] D. Hendrycks et al. , “Measuring massive multitask language understand- ing,” ICLR , 2021. [13] S. Shakeri et al. , “End-to-end synthetic data generation for domain adaptation of question answering systems,” EMNLP , 2020. [14] L. Zheng et al. , “Judging llm-as-a-judge with mt-bench and chatbot arena,” Advances in NeurIPS , vol. 36, pp. 46 595–46 623, 2023. [15] M. Tkachenko et al. , “Label Studio: Data labeling software,” 2020-2024. [Online]. Available: https://github.com/HumanSignal/label-studio [16] OpenAI, “Vector Embeddings,” https://platform.openai.com/docs/ guides/embeddings/, 2024. [17] F. Rahutomo et al. , “Semantic cosine similarity,” in The 7th international student conference on advanced science and technology ICAST , vol. 4, no. 1. University of Seoul South Korea, 2012, p. 1. [18] OpenAI, “Hello GPT-4o,” https://openai.com/index/hello-gpt-4o/, 2024. [19] Anthropic, “Introducing Claude 3.5 Sonnet,” https://www.anthropic.com/ news/claude-3-5-sonnet, 2024. [20] D. Rein et al. , “Gpqa: A graduate-level google-proof q&a benchmark,” arXiv preprint arXiv:2311.12022 , 2023.

---