

Paper 2503.05577

opXRD: Open Experimental Powder X-ray Diffraction Database

Authors: Daniel Hollarek, Henrik Schopmans, Jona Östreicher, Jonas Teufel, Bin Cao, Adie Alwen, Simon Schweidler, Mriganka Singh, Tim Kodalle, Hanlin Hu, Gregoire Heymans, Maged Abdelsamie, Arthur Hardiagon, Alexander Wieczorek, Siarhei Zhuk, Ruth Schwaiger, Sebastian Siol, François-Xavier Coudert, Moritz Wolf, Carolin M. Sutter-Fella, Ben Breitung, Andrea M. Hodge, Tong-yi Zhang, Pascal Friederich

Published: 2025-03-07

Abstract:

Powder X-ray diffraction (pXRD) experiments are a cornerstone for materials structure characterization. Despite their widespread application, analyzing pXRD diffractograms still presents a significant challenge to automation and a bottleneck in high-throughput discovery in self-driving labs. Machine learning promises to resolve this bottleneck by enabling automated powder diffraction analysis. A notable difficulty in applying machine learning to this domain is the lack of sufficiently sized experimental datasets, which has constrained researchers to train primarily on simulated data. However, models trained on simulated pXRD patterns showed limited generalization to experimental patterns, particularly for low-quality experimental patterns with high noise levels and elevated backgrounds. With the Open Experimental Powder X-Ray Diffraction Database (opXRD), we provide an openly available and easily accessible dataset of labeled and unlabeled experimental powder diffractograms. Labeled opXRD data can be used to evaluate the performance of models on experimental data and unlabeled opXRD data can help improve the performance of models on experimental data, e.g. through transfer learning methods. We collected 92552 diffractograms, 2179 of them labeled, from a wide spectrum of materials classes. We hope this ongoing effort can guide machine learning research toward fully automated analysis of pXRD data and thus enable future self-driving materials labs.

Paper Content:
Daniel Hollarek (1,2), Henrik Schopmans (1,2), Jona Östreicher (1,2), Jonas Teufel (1,2), Bin Cao (3), Adie Alwen (4), Simon Schweidler (2), Mriganka Singh (5), Tim Kodalle (5,6), Hanlin Hu (7), Gregoire Heymans (8), Maged Abdelsamie (9,10), Arthur Hardiagon (11), Alexander Wieczorek (12), Siarhei Zhuk (12), Ruth Schwaiger (13), Sebastian Siol (12), François-Xavier Coudert (11), Moritz Wolf (14), Carolin M. Sutter-Fella (5), Ben Breitung (2), Andrea M. Hodge (4), Tong-yi Zhang (3), and Pascal Friederich (1,2,*)

(1) Institute of Theoretical Informatics, Karlsruhe Institute of Technology (KIT), 76131 Karlsruhe, Germany. E-mail: pascal.friederich@kit.edu
(2) Institute of Nanotechnology, Karlsruhe Institute of Technology (KIT), 76131 Karlsruhe, Germany
(3) Guangzhou Municipal Key Laboratory of Materials Informatics, Advanced Materials Thrust, Hong Kong University of Science and Technology (Guangzhou) (HKUST), Guangzhou 511400, China
(4) Department of Chemical Engineering and Materials Science, University of Southern California (USC), Los Angeles CA 90089, USA
(5) Molecular Foundry Division, Lawrence Berkeley National Laboratory (LBNL), Berkeley 94720 CA, USA
(6) Advanced Light Source, Lawrence Berkeley National Laboratory, Berkeley 94720 CA, USA
(7) Hoffmann Institute of Advanced Materials, Shenzhen Polytechnic, Shenzhen 518055, China
(8) Lawrence Berkeley National Laboratory (LBNL), Chemical Sciences Division, Berkeley 94720 CA, USA
(9) Material Science and Engineering Department, King Fahd University of Petroleum and Minerals (KFUPM), Dhahran 31261, Saudi Arabia
(10) Interdisciplinary Research Center for Intelligent Manufacturing and Robotics, King Fahd University of Petroleum and Minerals (KFUPM), Dhahran 31261, Saudi Arabia
(11) Chimie ParisTech, PSL University, CNRS, Institut de Recherche de Chimie Paris, 75005 Paris, France
(12) Empa–Swiss Federal Laboratories for Materials Science and Technology (EMPA), 8600 Dübendorf, Switzerland
(13) Institute of Energy Materials and Devices, Forschungszentrum Juelich GmbH, 52425 Juelich, Germany
(14) Engler-Bunte-Institut & Institute of Catalysis Research and Technology, Karlsruhe Institute of Technology (KIT), Karlsruhe, Germany
(*) Corresponding author: pascal.friederich@kit.edu

arXiv:2503.05577v2 [cond-mat.mtrl-sci] 10 Mar 2025

1 Introduction

The advent of high-throughput experiments holds the prospect of significantly accelerating the speed of materials discovery[1].
The synthesis and characterization of novel materials are becoming increasingly efficient and automated, increasing the throughput of samples in experimentation pipelines[2–4]. After fabricating a new material, a number of analysis techniques can be used to characterize the sample. One method that can be used for phase identification, phase quantification, grain size characterization, and crystal structure determination of a new material is powder X-ray diffraction (pXRD).

When using pXRD measurements, crystal structures are typically determined through Rietveld refinement. In Rietveld refinement, an initial crystal structure model is fitted to the observed diffractogram by iteratively updating the structural model. Each update of the structural model seeks to minimize the difference between the observed diffractogram and the diffractogram simulated from the current structural model[5,6]. As Rietveld refinement is a local optimization method, the result of the refinement procedure is generally only as good as the initial structural model the process started from.

Manually performing Rietveld refinement is time-consuming and often requires expert knowledge. It is not scalable to the degree required to keep up with advances in throughput and efficiency in other steps of the experimentation pipeline. The refinement process requires the operator to determine an initial structural model from which the refinement can start as well as initial values for parameters that characterize the background[7]. The structural model is usually obtained using search-match software, which identifies crystal structures with similar powder diffraction patterns from a database of crystal structures with accompanying powder diffraction patterns. However, an initial structural model obtained from such a database is not guaranteed to lead to an accurate structure solution through Rietveld refinement, especially not for novel structures.
Additionally, attempting to refine all crystal structure parameters at once is known to lead to unphysical results[4]. Hence, parameters are refined iteratively, with each iteration only refining a limited set of parameters. Finding the correct order in which to refine structure parameters and finding the correct values for initial background parameters both present problems that add to the difficulty of the refinement process.

Machine learning has the potential to speed up the manual analysis of powder diffractograms and keep pace with an automated high-throughput experimentation environment[8,9]. Models can either be trained to predict crystal structure information directly from a diffractogram, or they can be used to automate the conventional refinement workflow. In the latter case, a model would first predict an initial crystal structure[9], which is then refined by a second model trained to perform the refinement process[10]. So far, due to an absence of labeled datasets with experimental diffractograms[11], machine learning in this domain has largely relied on diffractograms simulated from known structures[12,13] or, most recently, from generated synthetic crystals[14]. Models trained on datasets with simulated diffractograms have already shown strong performance in predicting phases[12,15,16], lattice parameters[17–20], space group[12,14,20–26], and crystallite size[17,26] from simulated diffractograms. However, the performance drops off substantially when these models are applied to data originating from experiments[11,14,20,21,23]. This discrepancy in performance arises from imperfections in experimental data that are not present in diffraction patterns modeled under ideal conditions. This is discussed in more detail below.
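The refinement procedure described earlier in this section can be illustrated with a deliberately small toy problem. The sketch below is not a Rietveld implementation: it assumes a "diffractogram" consisting of a single Gaussian peak with only two free parameters (peak position and width), and all function names (`simulate`, `misfit`, `refine_one`, `staged_refinement`) are hypothetical. It only shows the two ideas from the text: each update tries to reduce the difference between observed and simulated pattern, and parameters are refined in stages rather than all at once.

```python
import math


def simulate(two_theta, center, width):
    """Toy 'simulated diffractogram': a single Gaussian peak on a 2-theta grid."""
    return [math.exp(-0.5 * ((x - center) / width) ** 2) for x in two_theta]


def misfit(observed, simulated):
    """Sum of squared differences between observed and simulated intensities."""
    return sum((o - s) ** 2 for o, s in zip(observed, simulated))


def refine_one(params, name, observed, two_theta, step, n_iter=60):
    """Local search on a single parameter while all other parameters stay fixed."""
    best = misfit(observed, simulate(two_theta, **params))
    for _ in range(n_iter):
        improved = False
        for delta in (step, -step):
            trial = dict(params)
            trial[name] += delta
            if trial["width"] <= 0:
                continue  # keep the peak width physical
            value = misfit(observed, simulate(two_theta, **trial))
            if value < best:
                params, best, improved = trial, value, True
                break
        if not improved:
            step *= 0.5  # no better neighbour found: shrink the step
    return params, best


def staged_refinement(observed, two_theta, initial, cycles=3):
    """Refine the peak position first, then the width, and repeat a few cycles."""
    params = dict(initial)
    best = misfit(observed, simulate(two_theta, **params))
    for _ in range(cycles):
        params, best = refine_one(params, "center", observed, two_theta, step=0.2)
        params, best = refine_one(params, "width", observed, two_theta, step=0.1)
    return params, best


# Synthetic "measurement" with known parameters (assumed values for illustration).
two_theta = [20.0 + 0.05 * i for i in range(401)]
observed = simulate(two_theta, center=30.0, width=0.5)

refined, final_misfit = staged_refinement(
    observed, two_theta, initial={"center": 29.6, "width": 0.8}
)
```

Because the search only ever accepts neighbouring parameter values that lower the misfit, it is a local method, which mirrors the dependence on the initial structural model mentioned above. A real refinement involves many more parameters, including the background.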
Both labeled and unlabeled datasets of experimental powder diffractograms hold significant value for machine learning-based pXRD analysis, particularly with regard to bridging the performance gap between simulated and experimental domains. Labeled experimental data can be used to test and benchmark existing and new automated analysis approaches. This enables researchers to gauge how well a given model would perform under real-world conditions if integrated into an automated experimentation pipeline. Unlabeled experimental data enables machine learning researchers to evaluate how closely their simulations represent experimental data and modify their simulation algorithms accordingly. Unlabeled data can also find applications in transfer learning approaches to transfer model capabilities from the domain of simulated diffractograms to the domain of experimental diffractograms. While some experimental powder databases exist, their utility is limited by the fact that they are either small or not openly accessible.

In this work, we introduce an open powder X-ray diffraction (opXRD) database featuring a broad range of patterns collected from experiments. With a total of 92,552 patterns collected from 6 contributing institutions, the opXRD database exceeds the size of the previously largest database of openly accessible experimental powder diffraction data by two orders of magnitude. To the best of our knowledge, the largest database of this type is the RRUFF database, containing 1290 experimental powder diffraction patterns[27]. Larger commercial datasets such as the PDF5+[28] and the Linus Pauling File[29] exist, but their utility is limited by fees and restrictive licenses. License terms of commercial datasets, such as the PDF5+ and the Linus Pauling File, prohibit or restrict the publication of models trained on their data. In contrast, the opXRD database is both free and imposes no restrictions on how its data is used. Fig.
(1) provides an overview of machine learning workflows enabled and supported by the opXRD database.

[Figure 1: workflow diagram with the panels "Powder X-Ray Diffraction (pXRD)", "opXRD Database", "pXRD Simulation", "Model Training", and "Benchmarking/Evaluation".]

Figure 1: Experimental powder X-ray diffraction (pXRD) patterns from several contributors are collected in the opXRD database. The proposed open-access database of experimental data aims to support each step in the pXRD-related machine learning workflow by informing better physics simulations, supplying model training data, and providing a foundation for realistic performance evaluations.

Of the 92,552 patterns in the opXRD database, 2179 patterns come with at least partial structural information of the underlying sample. Of these 2179 labeled patterns, more than 900 have full structural labels including atomic coordinates. This constitutes an experimental pXRD test dataset that is larger in size, richer in labels, and broader in represented experimental setups than the RRUFF database, which only provides lattice parameters as labels[30]. However, since the majority of the opXRD database is unlabeled, we also want to further discuss the uses of unlabeled data, including its role in improving pattern simulations and its application in transfer learning approaches.

The neglected effects that lead to discrepancies between simulated patterns and patterns stemming from experiments are largely known.
Unaccounted effects may include preferred crystallite orientation, variations in grain size, crystal defects, the impact of temperature on the scattering process, internal stress, the non-monochromaticity of the X-ray source, and X-ray-induced fluorescence[21,31,32]. Additionally, varying experimental setups produce distinct powder diffraction patterns for the same sample. Features that may vary between experimental setups include the shape of diffraction peaks, the wavelength and polarization of the employed X-ray source, and the detector geometry[21,31,32]. The recorded scattering angles may also be slightly shifted if the sample is displaced from its intended position[21,33]. As these and more neglected effects are integrated into the simulation process, real powder diffraction data can be used to evaluate how closely simulated data matches up with real data. While direct comparisons are only possible on labeled patterns, comparing the strength and prevalence of features between simulated and real data can nevertheless provide information about the fidelity of the simulation. Taking into account all neglected effects without making approximations will incur significant computational costs that will lower the size of the generated training data. A more efficient approach could be to use real experimental data to identify the effects that have the largest impact in practice and model them heuristically.

The second way in which unlabeled experimental data can serve to bridge the performance gap between simulated and experimental domains is through transfer learning. The objective of transfer learning is to transfer the capabilities of a model learned on a source domain in which labeled data is abundant to a target domain in which labeled data is sparse[34]. In this context, the source domain is simulated powder diffraction patterns and the target domain is experimental powder diffraction patterns.
Many approaches to transfer learning have been proposed, particularly in the domain of image classification[35,36]. These existing techniques can be adapted to facilitate transfer learning in the context of pXRD patterns. Seddiki et al. have already successfully applied transfer learning in the domain of mass spectrometry to boost the accuracy of mass spectrum classification models[37]. Since both mass spectrometry data and pXRD data are one-dimensional, this work demonstrates the merit of transfer learning in a setting similar to pXRD.

The opXRD database is intended as a growing, community-driven initiative. The database we present here is the first version, but we hope to further increase the database size through active engagement with the pXRD community. Our primary objective is to minimize the effort and thus the barrier to contributing experimental data to the opXRD database. Thus, we developed a program that helps to find and share data from pXRD lab computers. Users can select their most common pXRD file types, the program lists all files of that type, and users can select or deselect certain folders or files for sharing. Selected contributions will be uploaded to opXRD, processed to a common file format, and, if wanted, published on Zenodo on behalf of the contributors, before becoming part of the opXRD database. If labels are available, they can be shared with opXRD as well. Further details can be found on the opXRD website ( https://xrd.aimat.science/ ). An overview of this process is given in Fig. (2) below.

Figure 2: Overview of the data collection pipeline. Datasets are submitted using an online submission form, optionally with the help of our submission helper software. After post-processing and data homogenization, we offer the creation of a Zenodo entry for each user submission and subsequently include the submission in the opXRD database.
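The file discovery step of the submission helper described above might look roughly as follows. This is only a sketch under assumptions and not the actual opXRD helper software: the extension list in `PXRD_EXTENSIONS`, the function names `find_pxrd_files` and `apply_selection`, and the demo folder layout are all hypothetical.

```python
import tempfile
from pathlib import Path

# Hypothetical set of pXRD file extensions a user might select (assumption).
PXRD_EXTENSIONS = {".xy", ".xrdml", ".raw"}


def find_pxrd_files(root, extensions=PXRD_EXTENSIONS):
    """List all files below `root` whose extension is among the selected types."""
    root = Path(root)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in extensions
    )


def apply_selection(files, excluded_folders=()):
    """Drop every file that lies inside one of the deselected folders."""
    excluded = [Path(folder) for folder in excluded_folders]
    return [f for f in files if not any(ex in f.parents for ex in excluded)]


# Small demonstration on a temporary directory tree.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "projectA").mkdir()
    (base / "private").mkdir()
    (base / "projectA" / "scan1.xy").write_text("10.0 5.0\n")
    (base / "projectA" / "notes.txt").write_text("not a pattern\n")
    (base / "private" / "scan2.XRDML").write_text("<xrdMeasurements/>\n")

    found = find_pxrd_files(base)
    shared = apply_selection(found, excluded_folders=[base / "private"])
    found_names = [p.name for p in found]
    shared_names = [p.name for p in shared]
```

In this demo, both pattern files are found, but only the one outside the deselected `private` folder remains in the selection to be shared.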
As argued by Aranda and Kroon-Batenburg et al.[38,39], sharing raw powder diffraction data is not only in the interest of furthering machine learning research but is also in line with open science principles. It furthers the ability of other researchers to reproduce published work and, in turn, adds to the credibility of the publisher of the data. Compared to publishing data individually, publishing data on the opXRD database has the added benefit of contributing to a large, homogeneous dataset with a standardized interface. This makes the data more easily accessible to other researchers and provides more value to researchers seeking large quantities of data. However, further data annotation with metadata is required to fully satisfy the FAIR data principles.

The opXRD database contains pXRD patterns from single and multiphase materials from a wide variety of material classes, including high-entropy materials, perovskites, and commercial catalysts. Some of the XRD data was collected on thin films rather than on true powder samples, which may influence the quality of the data with regard to full structure resolution. Additionally, some of the data was collected in grazing-angle geometry rather than in the usual Bragg-Brentano geometry employed in powder diffraction.

The broad range of available experimental samples contained in the opXRD v1.0 database makes it possible to apply state-of-the-art ML approaches to the domain of pXRD analysis. We hope that the opXRD database can drive ML research in this field towards more advanced automated analysis workflows that can accelerate materials science research through ready application in high-throughput experimentation pipelines. Details of the experiments of research groups contributing to the opXRD database are discussed in Section (3). A detailed description of how to acquire and use opXRD data is given in Section (4), and Section (5) describes how further data can be contributed.
Review of machine learning-based pXRD analysis

To showcase the need for datasets such as the one presented in this publication, we now discuss some recent approaches that apply machine learning methods to classification and regression tasks for powder diffractograms.

In 2020, Lee et al. trained a deep convolutional neural network (CNN) on simulated diffractograms based on structures from the ICSD, which is able to classify occurring phases in diffractograms of a specific compound pool[40]. In 2022, they furthermore developed models based on fully convolutional neural networks and transformer encoders that predict the crystal system, the space group, and other structural properties, such as the band gap[41]. With their best model for the crystal system prediction on ICSD structures, they achieved a test accuracy of 92.2 %. In 2017, Park et al. reached a test accuracy of roughly 81 % for a CNN that classifies space groups of simulated single-phase diffractograms[12]. A regression analysis on lattice parameters within a broader framework encompassing all material classes was conducted by Chitturi et al.[18] in 2021. They developed a distinct CNN for each crystal system, utilizing a merged dataset from both the ICSD and the Cambridge Structural Database, and managed to achieve a mean absolute percentage error of about 10 % for the lattice lengths, although they encountered difficulties in accurately predicting angles.

In 2024, Zhang et al. introduced a convolutional self-attention neural network trained on simulated patterns to classify crystal types[20]. Their model was tested on 23,073 unary, binary, and ternary inorganic crystal structures sourced from the COD. The study observed a noticeable performance drop when the pre-trained model was applied to real experimental patterns as opposed to simulated data.
However, their recent work[21] proposes using convolutional peak descriptors that consider the detector’s geometry, which reduces the performance gap in their benchmark tests.

Neural networks trained purely on experimental diffractograms can perform well when the range of samples is narrow and the data is collected only on a single machine[13,42]. However, in a more general setting with a wide range of investigated samples and employed diffractometers, training neural networks purely on experimental diffractograms becomes infeasible. This is because of the limited availability of labeled experimental diffractograms relative to the scope of the task. However, in 2023, Salgado et al.[43] showed that adding a fraction of experimental patterns to a simulated training dataset improves the performance on unseen experimental patterns. They used 50 % of the experimental patterns contained in the RRUFF database and added those to their large simulated training set. They then tested their model’s performance on the other half of the RRUFF database and achieved an increase in the 230-way space group classification accuracy of 11 percentage points compared to the same model trained only on simulated patterns.

In 2024, Schuetzke et al. trained a classifier to determine whether a diffractogram stems from an amorphous, single-phase, or multi-phase sample[44]. Due to the lack of experimental pXRD patterns, they built a pipeline to augment simulated diffractograms of a reference structure by, among other things, slightly varying the underlying crystal lattice. For spinel structures, they reported an accuracy of 100 %, and they also showed that their approach can be transferred to other datasets.

In 2023, Schopmans et al. presented an approach to generate synthetic crystal structures and their corresponding pXRD patterns on the fly during the training process[14]. This approach overcomes the issue of a limited dataset size, which limits the depth of neural networks that can be trained.
However, the accuracy dropped substantially when we applied our space group classification model to experimental patterns from the RRUFF database. Augmenting our simulated patterns with background, noise, and impurities helps to bring simulated diffractograms closer to experimental ones, making models trained on them more performant on experimental diffractograms. However, this augmentation process could be improved by incorporating background and noise statistics from a broader experimental pXRD database, such as the one presented in this publication.

It becomes apparent that the more general the task is, the more challenging the transfer to experimental data becomes. For example, the space group classification task across all material systems is very general. Therefore, transferring it to the application on experimental diffraction patterns is difficult[14,23,41]. On the other hand, there are some successful approaches that also work well on experimental data, but those are mostly methods that do phase determination in a limited compound space, making the task less complex[40,44].

The current volume of experimental pXRD patterns is insufficient to effectively train ML models, highlighting an urgent need for a comprehensive experimental pXRD database. The most advanced ML models currently are trained on approximately $10^5$–$10^6$ simulated diffractograms[14,43]. This is, to the best of our knowledge, two orders of magnitude larger than the largest currently curated experimental dataset, the PDF-5+ with approximately $2 \cdot 10^4$ experimental patterns. It is even one order of magnitude larger than the approximately $10^5$ unlabeled diffractograms in the initial version of the opXRD dataset we present here.

To make ML-based pXRD data identification practical for experimental use and automate structure prediction despite lacking experimental training data, two key approaches are essential.
First, developing more sophisticated simulation methods to better approximate experimental patterns[21] by using statistics from experimental diffractograms. Second, creating an experimental database that enables transfer learning to bridge the gap between simulated and real-world data. For both of these steps, the development of opXRD is particularly significant, as it will provide a comprehensive experimental benchmark for the community, allowing fair comparison of baseline models and accurate evaluation of their applicability in real experimental situations.

2 Existing datasets

To contextualize opXRD within the current environment of experimental powder diffraction data, the list below provides an overview of the largest crystal structure databases that offer access to experimental powder diffraction data. For an overview of these databases refer to Tab. (1) below.

Table 1: Overview of experimental powder diffraction databases: The column “O.A.” indicates whether or not the database is open-access. The availability of the chemical composition, spacegroups, lattice parameters, and atomic coordinates of the underlying samples are indicated by the columns “Comp.”, “Spg.”, “Lattice” and “Atom coords.”, respectively.

Name | No. patterns | O.A. | Comp. | Spg. | Lattice | Atom coords. | Year est.
Linus Pauling File | 21,700 | ✕ | ✔ | ✔ | ✔ | ✔ | 2002
Powder Diffraction File¹ | 20,800 | ✕ | ✔ | ✔ | ✔ | ✔ (52%) | 1941
RRUFF | 1290 | ✔ | ✔ | ✔ | ✔ | ✕ | 2006
Crystallography Open Database | 1052 | ✔ | ✔ | ✔ | ✔ (85%) | ✔ (85%) | 2003
PowBase | 169 | ✔ | ✔ | ✕ | ✕ | ✕ | 1999

Linus Pauling File:[45] The Linus Pauling File is a largely commercial crystal structure database published and maintained by the Pauling File project[29]. It is currently distributed as Pearson's Crystal Data[46] and the Materials Platform for Data Science (MPDS)[47]. The database, first published in 2002, currently contains more than 534,000 crystal structures[47] and 21,700 corresponding experimental powder diffraction patterns[46].
This makes the Pauling File, to the best of our knowledge, the largest collection of experimental powder diffraction data available to researchers. As of November 2024, Pearson's Crystal Data is available to researchers through a purchase of a one-year license starting at a price point of 2200 €[48]. The MPDS is partially open, with the open portion of the MPDS data accessible through a web interface[47]. API access to the full MPDS can be purchased through a one-year license starting at 9500 €[49]. We asked the Pauling File project whether the experimental powder diffraction data is accessible through the MPDS API. The Pauling File project responded that this data is not currently provided through the API, but could be offered in the future at the request of customers.

¹ The PDF lists the Materials Platform for Data Science (MPDS) as a database source. Since the MPDS is hosted by the Pauling File project, there is likely significant overlap in the experimental patterns available in the PDF and the Linus Pauling File.

Powder Diffraction File:[50] The Powder Diffraction File (PDF), published and maintained by the International Centre for Diffraction Data (ICDD), is a large collection of materials with accompanying powder diffraction data first published in 1941[28]. According to the ICDD, the latest release of the PDF, the PDF5+, contains over a million materials with accompanying powder diffraction data. However, since most of these powder diffraction patterns are simulated, we asked the ICDD about the number of experimental diffraction patterns in the PDF5+. We were told that 20,800 of the patterns in the PDF5+ stemmed from experiments and that 10,954 of these patterns were accompanied by the atomic coordinates of the underlying structures. Since the PDF5+ lists the MPDS as a database source, there is likely a significant overlap in the experimental patterns found in the PDF5+ and those found in the Pauling File.
Currently, the PDF5+ is available to researchers through a purchase of a one-year license starting at a price point of $6265. However, the ICDD does not allow researchers to train machine learning models on PDF5+ data, regardless of whether the resulting models are published[51].

RRUFF:[52] The RRUFF Mineral Database, first published in 2006, provides detailed information on minerals, including their chemical compositions, crystallography, and spectroscopic data[27]. Managed by the University of Arizona, it was created to serve as a public repository for mineral identification and research. It contains 1290 powder diffraction patterns stemming from experiments, each labeled with the lattice parameters and composition of the underlying structures. The RRUFF data is openly accessible on its official website[52].

Crystallography Open Database:[53] The Crystallography Open Database (COD) is an open-access collection of crystal structures founded in 2003[54]. It currently provides over 500,000 crystal structures. Of these files, 1052 contain the experimental powder diffraction data that was used to determine the underlying crystal structures of the investigated samples. Hence, the experimental powder diffraction data contained in the COD is mostly labeled with the full crystal structure information. The data is openly accessible in the form of .cif files on the official COD website[53].

PowBase:[55] PowBase is a database of 169 mostly unlabeled experimental powder diffraction patterns collected and maintained by crystallography researcher Armel Le Bail starting in 1999. PowBase is an initiative suggested in the Structure Determination by Powder Diffractometry (SDPD) mailing list, which was co-maintained by Le Bail. The COD is another community initiative that grew out of this mailing list. As of March 2025, all 169 patterns are still freely available for download on the official website[55].
There is also publicly available powder diffraction data uploaded to datasets on Zenodo. However, this data is split into disparate entries that typically only contain the work of a single research project. Additionally, extracting powder diffraction data at scale is hindered by the fact that the data is often given in plain text files in non-standardized formats, which are difficult to parse automatically. We are currently planning a systematic large-scale extraction of powder diffraction data from databases like Zenodo with the help of a large language model. This data will be included in a future release of the opXRD database.

While not strictly speaking a powder diffraction database, the High-Throughput Experimental Materials Database (HTEM) by the National Renewable Energy Laboratory (NREL) is a valuable source of X-ray diffraction data[56]. Currently, the HTEM database contains 65,779 thin-film samples with corresponding X-ray diffraction data[57]. Each database entry includes the elemental composition of the underlying sample but does not provide any information on its structure. HTEM data is open-access and can be downloaded through an API provided by NREL.

Aside from the databases mentioned above, we have also investigated several other crystal structure resources in search of experimental powder diffraction data. Crystal structure resources that were investigated but not found to contain any appreciable amount of publicly available experimental powder diffraction data include the Inorganic Crystal Structure Database[58], the Cambridge Structural Database[59], the Materials Project database[60], the Crystallographic and Crystallochemical Database[61], the Bilbao Incommensurate Crystal Structure Database[62], the Mineralogy Database[63], the IUCr Raw data letters[64], the U.S. Naval Research Laboratory Crystal Lattice-Structures[65], the Athena Mineral database[66] and the Protein data bank[67].
The lack of experimental powder diffraction data in these databases is to be expected as most structure solutions are achieved through single-crystal diffraction.

3 opXRD database

In collaboration with several other research institutions, we have collected a database of 92,552 experimental patterns of which 2179 are at least partially labeled with structural information of the underlying material. The following research institutions contributed data to the opXRD database: The French National Centre for Scientific Research (CNRS), Hong Kong University of Science and Technology (Guangzhou) (HKUST), University of Southern California (USC), Lawrence Berkeley National Laboratory (LBNL), Empa–Swiss Federal Laboratories for Materials Science and Technology (EMPA) and the Karlsruhe Institute of Technology (KIT). Tab. (2) provides an overview of the contributions of each institution. We filtered the submitted datasets to exclude patterns with invalid features such as only one unique recorded angle, negative angles, fewer than 50 recorded angles in total, or all intensities being zero.

Table 2: Overview of the contributions to the opXRD database: The availability of the chemical composition, spacegroups, lattice parameters, and atomic coordinates of the underlying samples is indicated by the columns “Comp.”, “Spg.”, “Lattice” and “Atom coords.” respectively.

| Institution | No. patterns | Comp. | Spg. | Lattice | Atom coords. | Research Project |
|---|---|---|---|---|---|---|
| CNRS | 1052 | ✔ | ✔ (85%) | ✔ | ✔ (85%) | Diffraction data extracted from the COD |
| USC | 338 | ✔ | ✔ | ✔ (90%) | ✕ | Study of CuNi and CuAl alloys |
| HKUST(GZ) | 520 | ✔ (4%) | ✔ (4%) | ✔ (4%) | ✔ (4%) | Phase identification dataset |
| EMPA | 770 | ✔ | ✔ (63%) | ✕ | ✕ | Metal halide perovskites, Zn-V-N libraries |
| INT | 19,796 | ✕ | ✕ | ✕ | ✕ | Compilation of various projects |
| IKFT | 64 | ✕ | ✕ | ✕ | ✕ | Commercial catalysts, metals, metal oxides |
| LBNL | 70,012 | ✕ | ✕ | ✕ | ✕ | Perovskite precursors, Mn-Sb-O system |

The variance of the data was analyzed using principal component analysis (PCA).
PCA can be applied to datasets $X \subset \mathbb{R}^N$ to reduce the number of components needed to describe points $p \in X$ up to some tolerance in lost accuracy. In the context of PCA, the cumulative explained variance ratio is a measure of how much of the variance in the dataset $X$ can be explained using a given number of components. For a rigorous definition of PCA and the explained variance ratio, we refer to the literature[68].

Here, PCA was performed on datasets of X-ray diffraction patterns. These datasets $X$ are subsets of $\mathbb{R}^N$ with $N = 512$ since each pattern $p \in X$ was standardized to have 512 intensity values spread out evenly from 0° to 180° using zero padding and interpolation with cubic splines. Hence the maximal number of components that could be needed to describe a dataset of diffraction data in this context is $N = 512$. However, the maximal number of components is even lower for datasets that contain fewer than 512 patterns. In this case, the maximal number of components is equal to the number of patterns in the dataset since each pattern can add at most one degree of freedom to the dataset $X \subset \mathbb{R}^N$. Hence the maximum number of components $N_{\max}$ of a pattern dataset $X$ is given as follows:

$$N_{\max} = \min(N_{\text{values}}, N_{\text{patterns}}). \qquad (1)$$

Here $N_{\text{values}} = 512$ is the number of recorded intensity values per pattern and $N_{\text{patterns}}$ is the number of patterns in the dataset $X$. Fig. (3) below shows the cumulative explained variance ratio over the fraction of the maximal number of components $N_{\max}$ as defined above. In this figure, a faster convergence of the cumulative variance ratio towards one indicates that the patterns in the dataset are relatively similar.

The degree of variation between the patterns is different for each contribution. For example, the CNRS and the HKUST contributions each are collections that encompass many research projects over a large period of time and thus exhibit a high degree of variability between individual patterns. In contrast, the contributions by USC and LBNL contain many very similar patterns.
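The normalization of the component axis by Eq. (1) can be illustrated with a short sketch. It assumes that per-component explained variance ratios are already available from some PCA implementation; the function names `n_max` and `cumulative_curve` are hypothetical and are not taken from the opxrd library.

```python
from itertools import accumulate

N_VALUES = 512  # intensity values per standardized pattern, as stated in the text


def n_max(n_patterns, n_values=N_VALUES):
    """Maximum number of PCA components according to Eq. (1)."""
    return min(n_values, n_patterns)


def cumulative_curve(explained_variance_ratio, n_patterns, n_values=N_VALUES):
    """Return (fraction of N_max, cumulative explained variance ratio) pairs.

    `explained_variance_ratio` is assumed to be sorted from the largest
    to the smallest component, as PCA implementations usually return it.
    """
    limit = n_max(n_patterns, n_values)
    ratios = list(explained_variance_ratio)[:limit]
    cumulative = list(accumulate(ratios))
    fractions = [(k + 1) / limit for k in range(len(ratios))]
    return list(zip(fractions, cumulative))


# Toy example: a dataset of 4 patterns whose first 3 components carry all variance.
curve = cumulative_curve([0.5, 0.25, 0.25], n_patterns=4)
```

Plotting the second entry of each pair against the first gives a curve of the kind described for Fig. (3); dividing by `n_max` is what makes small datasets (for example 64 patterns) comparable to large ones.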
The patterns in the USC dataset are similar because the underlying samples are all variations of CuNi and CuAl alloys. The patterns submitted by LBNL are similar because they stem from in situ recordings where several hundred or several thousand patterns were collected over time per sample while the sample was undergoing physical conversion processes.

Figure 3: Explained variance ratio over the fraction of the maximum number of components for each dataset contributed to the opXRD database. Here the maximal number of components refers to $N_{\max}$ as defined in equation (1). Datasets contributed by the same institution are labeled alphabetically in the order in which they are described in the texts towards the end of this section.

Fig. (4) provides an overview of the distributions of pattern and structure properties in the opXRD database. Nearly all patterns have an angular resolution smaller than Δ(2θ) = 0.1°. Here the angular resolution is defined as the range of recorded angles divided by the number of recorded intensity values along that range. For most patterns, the lowest recorded angle is smaller than 30° and the highest recorded angle is smaller than 120°. The start-to-end angle distribution reveals that all diffractograms start in a narrow window between 0° and approximately 50°, while they end between 50° and 150°, with the majority of patterns going from 0° to approximately 70°. Unlike the synthetic data used in most ML approaches, which covers the full angle range at a fixed resolution, the opXRD dataset has a strongly varying angle range and resolution. Hence, working with this data requires additional pre-processing methods such as padding and interpolation, or more flexible ML models beyond standard CNNs.

In the following, we will describe the datasets contributed by each of the collaborating research groups and institutions. Each paragraph includes a description of the investigated materials and how X-ray diffraction data was collected.
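As a minimal sketch of the pre-processing mentioned above, the following computes the angular resolution as defined in the text and resamples a pattern onto a fixed grid with zero padding. Several points are assumptions made for brevity: linear interpolation is used in place of cubic splines, the grid includes both end points, and the function names are illustrative rather than part of any published library.

```python
from bisect import bisect_right


def angular_resolution(two_theta):
    """Range of recorded angles divided by the number of recorded values."""
    return (max(two_theta) - min(two_theta)) / len(two_theta)


def standardize(two_theta, intensities, n_values=512, start=0.0, stop=180.0):
    """Resample a pattern onto `n_values` evenly spaced angles in [start, stop].

    Grid points outside the recorded angle range are zero padded; points
    inside are linearly interpolated (a simplification of cubic splines).
    """
    pairs = sorted(zip(two_theta, intensities))
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    step = (stop - start) / (n_values - 1)
    resampled = []
    for i in range(n_values):
        x = start + i * step
        if x < xs[0] or x > xs[-1]:
            resampled.append(0.0)  # zero padding outside the measured range
            continue
        j = bisect_right(xs, x)  # index of first recorded angle greater than x
        if j == len(xs):
            resampled.append(float(ys[-1]))  # x coincides with the last angle
            continue
        x0, x1 = xs[j - 1], xs[j]
        y0, y1 = ys[j - 1], ys[j]
        resampled.append(y0 + (y1 - y0) * (x - x0) / (x1 - x0))
    return resampled


# Toy pattern with three points, resampled onto a coarse 10° grid.
angles = [10.0, 20.0, 30.0]
counts = [1.0, 2.0, 3.0]
resolution = angular_resolution(angles)
coarse = standardize(angles, counts, n_values=19)
```

With the default arguments the output has 512 values between 0° and 180°, so patterns with different start angles, end angles, and step sizes end up with the same shape.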
If applicable, the presence of thin-film samples or atypical diffraction geometries is indicated. Most data was collected using Cu radiation sources, which have a Kα1 wavelength of λ = 1.54056 Å and a Kα2 wavelength of λ = 1.54439 Å.

Institut de Recherche de Chimie Paris, CNRS

Experimental pXRD data was extracted from the Crystallography Open Database (COD)[69,70]. The COD is, to our knowledge, the largest open-access collection of experimental crystal structures of organic, inorganic, and metal-organic compounds and minerals, containing more than 500,000 entries. The data in the COD are placed in the public domain and licensed under the CC0 License. Of the entire COD database, 5432 structures contained at least one tag from the CIF_POW dictionary, i.e., a tag relating to powder diffraction studies. These 5432 structures only account for 1% of the total COD database, but this is to be expected since most crystal structures are resolved from single-crystal diffraction. Of these 5432 files, most contained only metadata related to the powder diffraction experiment, but did not include the raw data of the pattern itself. We could extract raw experimental pXRD patterns from 1052 files in total, after curation of a small number of files with clearly invalid data. The pXRD data from the COD database are of high quality, with a median resolution of Δ(2θ) = 0.013° and an average number of 9190 points measured per pattern. They span a wide chemical space, including organic, inorganic, and hybrid structures, and 75 different elements of the periodic table.

Figure 4: Histograms detailing the distribution of pattern and structure properties in the opXRD database: a) distribution of spacegroups present in labeled data; b) distribution of angular resolution in all data; c) distribution of smallest and largest recorded 2θ values for all data.

Guangzhou Municipal Key Laboratory of Materials Informatics, HKUST(GZ)

Two datasets were contributed to the opXRD database.
The first dataset (HKUST-A) is a selected subset of a small-scale experimental powder X-ray database developed over the past two years, called the X-Ray Phase Identification Public Experimental Dataset (XRed) (https://github.com/WPEM/XRED). The primary goal of XRed is to support the advancement of intelligent phase identification technology by providing a foundation for data collection in future large-scale machine learning applications. XRed primarily focuses on metal and metal-oxide particles, with data collected using diffractometers such as the Empyrean 3.0, Aeris, and Bruker D8 Advance, all employing Cu X-ray sources. The dataset HKUST-A contains 21 pXRD patterns, each labeled with a corresponding CIF file that documents the refined structure. Data are categorized by elemental systems and include original experimental files, spanning single-phase to five-phase mixtures, as well as mixtures designed for various research tasks.

In addition to XRed, the opXRD database integrates an experimental dataset composed of powder diffraction data sourced from open-access publications and collaborating institutions (HKUST-B). These institutions have provided the data with full authorization for research purposes. Compared to XRed, this dataset offers broader chemical element coverage, encompassing ionic, atomic, and metallic crystals. It is also larger, containing 499 entries. However, unlike XRed, these data entries are not accompanied by CIF files.

Laboratory for Surface Science and Coating Technologies, Empa

Combinatorial Zn–V–N libraries were synthesized using radio-frequency co-sputtering of Zn and V in a mixed Ar and N2 plasma. An orthogonal deposition temperature and composition gradient was created, resulting in a deposition temperature of 220 °C for samples 1–9 and 114 °C for samples 37–45.
The composition for each sample was determined using X-ray fluorescence (XRF) spectroscopy, which was further calibrated through Rutherford backscattering spectroscopy (RBS) based on selected samples. The newly identified and isolated semiconductor Zn2VN3 was found to exhibit a cation-disordered wurtzite structure, as verified by additional GI-XRD and SAED measurements[71].

Tin halide perovskites were deposited using single-step spin-coating as reported elsewhere[72]. Methylammonium lead iodide libraries with varying degrees of residual PbI2 were deposited using a two-step procedure involving both thermal evaporation of PbI2 and subsequent spin-coating of a methylammonium solution. The relative phase fractions were quantified using supplementary azimuthal angle scans coupled with structural factors and geometrical factors as reported elsewhere[73]. Fully inorganic lead perovskite libraries were prepared using thermal co-evaporation of lead and cesium halide salts. All metal halide perovskite libraries were measured within a custom-made X-ray-transparent inert-gas dome, resulting in the presence of minor additional features within the θ = 19–31° range. For all combinatorial libraries where any phases are specified, the complete set of phases is reported in the metadata.

XRD data was measured using a Bruker D8 Discover equipped with a Cu radiation source in a Bragg-Brentano geometry. For the reported data sets the instrument was equipped with a Goebel mirror, effectively removing the Cu Kβ radiation. The data set originates from the combinatorial exploration of the Zn–V–N compositional space, as well as data gathered from multiple research activities on more established metal halide perovskite semiconductors. All data was collected from thin films deposited on borosilicate glass.
The Zn–V–N films showed some preferential out-of-plane orientation, while for the perovskites the preferential orientation was minimal, resulting in the presence of all reflections.

Institute of Nanotechnology, KIT

X-ray diffraction data was collected from a wide range of research projects conducted at the Institute of Nanotechnology over the past 10 years. A major part of the research focused on high-entropy materials, which involved incorporating many different elements into single-phase structures, leading to peak shifts or phase separations. Most of those multi-component complex materials appeared in various structures, including rock-salt, spinel, fluorite, perovskite, and delafossite. The samples were prepared either in powder or in bulk form; therefore, powder XRD was performed on samples with adjusted height. The samples were prepared using various synthesis techniques, mostly solid-state or wet chemical syntheses, to obtain the desired structures. Consequently, particle size and crystallinity varied significantly. The sample set also includes samples that were not successfully measured or where phases could not be identified.

The X-ray diffraction data were collected on a Bruker D8 Advance using a Cu radiation source or a STOE Stadi P diffractometer equipped with a Ga-jet X-ray source. The samples were initially recorded for various research projects over the last ten years and were measured with different step sizes, times per step, and over different angle ranges, but all using Cu Kα or Ga Kβ radiation. The samples mostly contained transition metal oxides, sulfides, and fluorides. To improve statistics, the samples were rotated during the entire measurement. Some air-sensitive samples were measured using a transparent polymer dome for protection. This dome led to increased background noise over the first 20° and slightly decreased pattern resolution.
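Because patterns in the database were recorded with different wavelengths, peak positions are only directly comparable after conversion to a common scale. The sketch below uses Bragg's law, λ = 2d sin θ, to convert a 2θ value to a lattice spacing d and to the 2θ value expected at another wavelength. The two wavelengths in the usage lines are the Cu Kα1 value and the 10 keV synchrotron value quoted in this paper; the Ga Kβ wavelength is not stated in the text and would have to be supplied by the user. The function names are hypothetical.

```python
import math

CU_KALPHA1 = 1.54056         # Å, Cu Kα1 wavelength as quoted in the text
SYNCHROTRON_10KEV = 1.23984  # Å, wavelength quoted for the 10 keV beamline data


def two_theta_to_d(two_theta_deg, wavelength):
    """Lattice spacing d (Å) from Bragg's law; requires two_theta_deg > 0."""
    theta = math.radians(two_theta_deg / 2.0)
    return wavelength / (2.0 * math.sin(theta))


def convert_two_theta(two_theta_deg, source_wavelength, target_wavelength):
    """2θ at which the same reflection appears for another wavelength.

    Returns None if the reflection is not reachable at the target
    wavelength, i.e. if Bragg's law would require sin(θ) > 1.
    """
    d = two_theta_to_d(two_theta_deg, source_wavelength)
    sin_theta = target_wavelength / (2.0 * d)
    if sin_theta > 1.0:
        return None
    return 2.0 * math.degrees(math.asin(sin_theta))


# A reflection at 2θ = 60° with Cu Kα1 has d equal to the wavelength itself.
d_spacing = two_theta_to_d(60.0, CU_KALPHA1)

# Express a synchrotron peak at 2θ = 30° on the Cu Kα1 scale.
cu_equivalent = convert_two_theta(30.0, SYNCHROTRON_10KEV, CU_KALPHA1)
```

Converting to d (or to a Cu-equivalent 2θ) before resampling keeps peaks of the same phase aligned across instruments, which is one possible way to handle mixed-source data; the paper does not state that this conversion was applied.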
Institute of Catalysis Research and Technology, KIT

A variety of samples were analyzed including commercial catalysts, bulk reference materials, porous metal oxide particles, and nanoparticles. The latter were synthesized via the surfactant-free benzyl alcohol route[74,75]. The cobalt oxide (CoO or Co3O4) and cerium oxide (CeO2) nanoparticles were in the size range of 4–16 nm according to the Scherrer equation. A series of porous Al2O3 materials, which were prepared by calcination of boehmite (AlOOH) at various temperatures, represents crystalline samples with limited long-range structure and various contributions of Al2O3 polymorphs.

X-ray diffraction (XRD) was conducted with an X’Pert Pro MPD (Panalytical) in Bragg-Brentano geometry using a Cu X-ray source. The patterns were acquired in the 2θ range of 5–80° with a step size of 0.016711° or 0.033420° and a total acquisition time of 40 to 120 min. This study has been carried out with the support of Angelina Barthelmeß, Elisabeth Herzinger, and Henning Hinrichs.

Molecular Foundry Division & Advanced Light Source & Chemical Sciences Division, LBNL

In total, four different datasets were collected. The first dataset (LBNL-A) was collected from spin-coating and annealing triple-cation metal-halide perovskite precursor solutions with the composition Cs0.05(MA0.23FA0.77)Pb1.1(I0.77Br0.23)3 onto various substrates. Here, MA stands for Methylammonium and FA stands for Formamidinium. The substrates onto which these solutions were coated include glass, which is amorphous, and GaAs wafers, which are single crystalline. Other substrates were stacks of glass/indium tin oxide, stacks of GaAs/CIGS, and stacks of glass/CIGS. Here, CIGS stands for a stack of Mo, Cu(In,Ga)Se2, CdS and ZnO. Some of the substrates were additionally covered with a self-assembling monolayer of MeO-2PACz. The GaAs substrates were prepared by Dr.
Jiro Nishinaga from the National Institute of Advanced Industrial Science and Technology (AIST) in Japan[76] and the glass/CIGS substrates by Dr. Christian Kaufmann and his team at Helmholtz-Zentrum Berlin (HZB) in Germany[77]. Data collection was performed in situ during thin-film deposition using a custom-made spin-coating and annealing stage[78].

A second dataset (LBNL-B) was collected from spin-coating metal-halide perovskite precursor solutions with varying compositions of MAPb(I1–xBrx)3 onto glass substrates. Here, MA = Methylammonium and x = 0, 0.33, 0.5, 0.67, 1. The substrates were preheated to different temperatures including 30 °C, 50 °C, 70 °C, and 90 °C, and the spin-coating process was performed at a constant temperature on the preheated substrates. For both datasets, diffraction data were continuously measured during spin-coating, chemical induction of crystallization, and annealing of the samples, at 100 °C and 110 °C respectively. The diffraction data was recorded with a frequency of about 0.56 1/s and 0.54 1/s. Each in situ measurement consisted of about 500 to 1000 individual diffractograms. Depending on the substrate, each series of diffractograms shows an evolution from substrate only to a combination of polycrystalline perovskite, PbI2 and substrate via several intermediate phases.

For these two datasets, experimental XRD data were collected at beamline 12.3.2 of the Advanced Light Source, the synchrotron at Lawrence Berkeley National Laboratory. The data were collected using a photon energy of 10 keV (λ = 1.23984 Å), selected using a Si(111) monochromator. Measurements were taken in grazing incidence geometry, i.e. using a beam incidence angle of 1°. Two-dimensional diffraction images were recorded using a Dectris Pilatus 1M area detector at an angle between 34° and 36° with a sample-to-detector distance of roughly 190 mm. The two-dimensional data were calibrated using an Al2O3 calibration standard and integrated along the azimuthal angle.
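The wavelength quoted for the synchrotron measurements follows from the photon energy through λ = hc/E, and the length of an in situ series follows from the frame count and recording frequency. The short check below is a sketch under stated assumptions: the value of hc (12.398419843 keV·Å) and the function names are not taken from the paper, and a constant recording frequency is assumed.

```python
HC_KEV_ANGSTROM = 12.398419843  # h*c in keV·Å (assumed constant, not from the paper)


def energy_to_wavelength(energy_kev):
    """Photon wavelength in Å for a photon energy given in keV."""
    return HC_KEV_ANGSTROM / energy_kev


def series_duration_s(n_frames, frames_per_second):
    """Duration of an in situ series, assuming a constant recording frequency."""
    return n_frames / frames_per_second


# 10 keV photons, as used for LBNL-A and LBNL-B.
wavelength = energy_to_wavelength(10.0)

# 500 to 1000 diffractograms at about 0.56 frames per second.
shortest = series_duration_s(500, 0.56)
longest = series_duration_s(1000, 0.56)
```

At roughly 0.56 frames per second, a series of 500 to 1000 diffractograms corresponds to about 15 to 30 minutes of measurement, which is consistent with many near-identical consecutive patterns per sample.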
A third dataset (LBNL-C) was collected by observing the phase evolution of an Mn-Sb-O system with varying annealing temperatures. The temperatures used to analyze the crystal structure of the Mn-Sb-O system were chosen depending on the number of phase transitions appearing in a given temperature range. Few changes in the crystal structure appeared between room temperature and 300 °C, and phase transitions appeared from 300 °C until 850 °C. No phase transition appeared when cooling down. Therefore, the crystal structure was measured every 100 °C between room temperature and 300 °C; every 50 °C between 300 °C and 850 °C; and every 200 °C when cooling down. The heating and cooling rates were fixed for all the experiments at 50 °C/min and the holding time was fixed to 2 min.

This data was collected using the in situ Rigaku-SmartLab 3kW diffractometer. This tool operates with SmartLab Studio II software, which can measure the X-ray diffraction during the annealing process. This enables directly showing all the phase transitions when annealing in various atmospheres such as O2, Ar, and NH3. Phase transitions are analyzed with the in situ XRD tool up to 850 °C in this work. Most of the in situ experiments were performed in an air-like environment of 20% O2 and 80% Ar (Ar flow: 50 sccm, O2 flow: 10 sccm). For a 100% Ar environment, an Ar flow of 60 sccm was used. The Bragg-Brentano (BB) mode is preferred in terms of geometry because it is better suited to the analysis of scarce phases such as MnSb2O6 rutile. The angular step used in the recording was 0.01° and the scanning rate was 10 °/min.

A fourth dataset (LBNL-D) was collected from a two-step spin-coating process using metal-organic frameworks (MOFs) in perovskite precursor solutions, deposited onto glass substrates. In the first step, a nanoscale thiol-functionalized UiO-66-type Zr-based MOF (UiO-66-(SH)2) was added to the PbI2 precursor.
This was followed by the deposition of an organic mixture solution containing FAI, MACl, and MABr in the second step. The incorporation of MOFs aids in suppressing perovskite vacancy defects, thereby enhancing device stability and efficiency. To further investigate the influence of UiO-66-(SH)2 on perovskite thin-film formation during the annealing process, a time-resolved GIWAXS experiment was conducted. The measurements were performed using a setup similar to that of LBNL-A and B.

4 Usage

The opXRD database is hosted on Zenodo (https://zenodo.org/records/14254270) and can be downloaded by any user without any barriers or restrictions. In addition to the Zenodo release, we also provide a Python library “opxrd” to easily download and interface with the dataset. The instructions for how to install this library can be found in its repository, located at https://github.com/aimat-lab/opxrd. The opxrd library includes options for data-loading, standardization, plotting, and the conversion to PyTorch tensors. We provide a Jupyter Notebook (https://colab.research.google.com/github/aimat-lab/opXRD/blob/main/opxrd/usage.ipynb) that showcases these functionalities in more detail. This notebook also illustrates how to interface with the opXRD database through Python.

5 Summary and Outlook

With the opXRD database, a curation of 92,552 experimental powder X-ray diffraction patterns, 2179 of them at least partially labeled, from a wide range of different materials systems, we provide the largest currently available source of experimental XRD patterns. With this, we address the need for experimental data that arises when developing algorithms and analysis tools for pXRD data, both based on machine learning and classical approaches. The data can be used for the actual method development and for testing.
Our dataset is a valuable and so far missing resource to drive further developments in the automated analysis of XRD data. Rather than a finished project, the opXRD database is an ongoing effort to collect experimental powder XRD data. We invite everyone who is working in the area of experimental powder XRD to submit their data to the dataset, in order to further improve the utility of the dataset and thus aid further developments in this field. Our submission page (https://xrd.aimat.science/) and submission helper software will be kept available to collect more data. We will keep updating and maintaining the dataset with new incoming submissions.

Data availability

The opXRD database is available on Zenodo at https://zenodo.org/records/14254270. It is published under the Creative Commons Attribution 4.0 International license. It can be downloaded by any user without any barriers or restrictions. For further details, please refer to Section (4).

Conflicts of interest

There are no conflicts of interest to declare.

Acknowledgements

H.S. acknowledges financial support by the German Research Foundation (DFG) through the Research Training Group 2450 “Tailored Scale-Bridging Approaches to Computational Nanoscience”. P.F. and D.H. acknowledge support by the Federal Ministry of Education and Research (BMBF) under Grant No. 01DM21001B (German-Canadian Materials Acceleration Center). J.Oe. and P.F. acknowledge financial support from the Helmholtz Foundation Model Initiative within Project “SOL-AI”. Part of this work was funded under the France 2030 framework by Agence Nationale de la Recherche (project ANR-22-PEXD-0009 of PEPR DIADEM). Work at the Molecular Foundry was supported by the Office of Science, Office of Basic Energy Sciences, of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231. Work at the Advanced Light Source (ALS) was done at beamline 12.3.2. The ALS is a DOE Office of Science User Facility under contract no. DE-AC02-05CH11231.
The development of the online phase identification platform is supported by the Guangzhou-HKUST(GZ) Joint Funding Program (No. 2023A03J0003). Work by the USC group was supported by the National Science Foundation (NSF) grant numbers DMR-2227178 and OISE-2106597. M.W. acknowledges funding by the Helmholtz Research Program “Materials and Technologies for the Energy Transition (MTET), Topic 3: Chemical Energy Carriers”. Work by the Empa group was supported by the Strategic Focus Area–Advanced Manufacturing (SFA–AM) through the project Advancing manufacturability of hybrid organic–inorganic semiconductors for large area optoelectronics (AMYS) as well as the Empa internal research call 2020. We thank BW-Cloud, funded by the Ministry of Science, Research and Arts Baden-Württemberg, for providing cloud server infrastructure.

References

[1] Yihao Liu, Ziheng Hu, Zhiguang Suo, Lianzhe Hu, Lingyan Feng, Xiuqing Gong, Yi Liu, and Jincang Zhang. High-throughput experiments facilitate materials innovation: A review. Science China Technological Sciences, 62:521–545, 2019. doi:10.1007/S11431-018-9369-9.
[2] B. P. MacLeod, F. G. L. Parlane, T. D. Morrissey, F. Häse, L. M. Roch, K. E. Dettelbach, R. Moreira, L. P. E. Yunker, M. B. Rooney, J. R. Deeth, V. Lai, G. J. Ng, H. Situ, R. H. Zhang, M. S. Elliott, T. H. Haley, D. J. Dvorak, A. Aspuru-Guzik, J. E. Hein, and C. P. Berlinguette. Self-driving laboratory for accelerated discovery of thin-film materials. Science Advances, 6(20):eaaz8867, 2020. doi:10.1126/sciadv.aaz8867.
[3] A. Ludwig. Discovery of new materials using combinatorial synthesis and high-throughput characterization of thin-film materials libraries combined with computational methods. npj Computational Materials, 5:1–7, 2019. doi:10.1038/s41524-019-0205-0.
[4] Yoshihiko Ozaki, Yuta Suzuki, T. Hawai, Kotaro Saito, Masaki Onishi, and K. Ono. Automated crystal structure analysis based on blackbox optimisation. npj Computational Materials, 6:1–7, 2020.
doi:10.1038/s41524-020-0330-9.
[5] Robert E. Dinnebier, Andreas Leineweber, and John S. O. Evans. Rietveld Refinement: Practical Powder Diffraction Pattern Analysis using TOPAS. De Gruyter, 2019. ISBN 978-3-11-045621-9.
[6] Diego Alberto Flores Cano, Anais Roxana Chino Quispe, Renzo Rueda Vellasmin, Joao Andre Ocampo Anticona, J. González, and J. A. Ramos Guivar. Fifty years of rietveld refinement: Methodology and guidelines in superconductors and functional magnetic nanoadsorbents. Revista de Investigación de Física, 2021. doi:10.15381/rif.v24i3.21028.
[7] L. B. McCusker, R. B. Von Dreele, D. E. Cox, D. Louër, and P. Scardi. Rietveld refinement guidelines. Journal of Applied Crystallography, 32(1):36–50, 2 1999. doi:10.1107/s0021889898009856.
[8] Ankit Agrawal and A. Choudhary. Deep materials informatics: Applications of deep learning in materials science. MRS Communications, 9:779–792, 2019. doi:10.1557/MRC.2019.73.
[9] Vasile-Adrian Surdu and Romuald Győrgy. X-ray diffraction data analysis by machine learning methods—a review. Applied Sciences, 2023. doi:10.3390/app13179992.
[10] Zhenjie Feng, Q. Hou, Y. Zheng, W. Ren, Junyi Ge, Tao Li, Cheng Cheng, Wencong Lu, S. Cao, Jincang Zhang, and Tong-Yi Zhang. Method of artificial intelligence algorithm to improve the automation level of rietveld refinement. Computational Materials Science, 2019. doi:10.1016/J.COMMATSCI.2018.10.006.
[11] Hong Wang, Yunchao Xie, Dawei Li, Heng Deng, Yun-Zhi Zhao, Ming Xin, and Jian Lin. Rapid identification of x-ray diffraction patterns based on very limited data by interpretable convolutional neural networks. Journal of chemical information and modeling, 2020. doi:10.1021/acs.jcim.0c00020.
[12] W. Park, Jiyong Chung, Jaeyoung Jung, Keemin Sohn, S. Singh, M. Pyo, N. Shin, and K. Sohn. Classification of crystal structure using a convolutional neural network. IUCrJ, 4:486–494, 2017. doi:10.1107/S205225251700714X.
[13] Byung Do Lee, Jin-Woong Lee, Junuk Ahn, Seonghwan Kim, W.
Park, and K. Sohn. A deep learning approach to powder x-ray diffraction pattern analysis: Addressing generalizability and perturbation issues simultaneously. Advanced Intelligent Systems, 5:2300140, 2023. doi:10.1002/aisy.202300140.
[14] Henrik Schopmans, Patrick Reiser, and Pascal Friederich. Neural networks trained on synthetically generated crystals can extract structural information from icsd powder x-ray diffractograms. Digital Discovery, 2(5):1414–1424, 2023. ISSN 2635-098X. doi:10.1039/d3dd00071k.
[15] Di Chen, Yiwei Bai, Sebastian Ament, Wenting Zhao, Dan Guevarra, Lan Zhou, Bart Selman, R. Bruce van Dover, John M. Gregoire, and Carla P. Gomes. Automating crystal-structure phase mapping by combining deep learning with constraint reasoning. Nature Machine Intelligence, 3(9):812–822, September 2021. ISSN 2522-5839. doi:10.1038/s42256-021-00384-1.
[16] Ming-Chiang Chang, Sebastian Ament, Maximilian Amsler, Duncan R. Sutherland, Lan Zhou, John M. Gregoire, Carla P. Gomes, R. Bruce van Dover, and Michael O. Thompson. Probabilistic Phase Labeling and Lattice Refinement for Autonomous Material Research. (arXiv:2308.07897), August 2023. doi:10.48550/arXiv.2308.07897.
[17] H. Dong, K. Butler, D. Matras, S. W. T. Price, Y. Odarchenko, Rahul Khatry, Andrew Thompson, V. Middelkoop, S. Jacques, A. Beale, and A. Vamvakeros. A deep convolutional neural network for real-time full profile analysis of big powder diffraction data. npj Computational Materials, 7:1–9, 2021. doi:10.1038/s41524-021-00542-4.
[18] Sathya R. Chitturi, Daniel Ratner, Richard C. Walroth, Vivek Thampy, Evan J. Reed, Mike Dunne, Christopher J. Tassone, and Kevin H. Stone. Automated prediction of lattice parameters from x-ray powder diffraction patterns. Journal of Applied Crystallography, 54:1799–1810, 2021.
[19] S. Habershon, E. Cheung, K. Harris, and R. Johnston. Powder diffraction indexing as a pattern recognition problem: A new approach for unit cell determination based on an artificial neural network.
Journal of Physical Chemistry A, 108:711–716, 2004. doi:10.1021/JP0310596.
[20] Shouyang Zhang, Bin Cao, Tianhao Su, Yue Wu, Zhenjie Feng, Jie Xiong, and Tong-Yi Zhang. Crystallographic phase identifier of a convolutional self-attention neural network (cpicann) on powder diffraction patterns. IUCrJ, 11(Pt 4):634, 2024.
[21] Bin Cao, Yang Liu, Zinan Zheng, Ruifeng Tan, Jia Li, and Tong-yi Zhang. Simxrd-4m: Big simulated x-ray diffraction data and crystal symmetry classification benchmark. arXiv preprint arXiv:2406.15469, 2024.
[22] Felipe Oviedo, Zekun Ren, Shijing Sun, C. Settens, Zhe Liu, N. T. P. Hartono, Savitha Ramasamy, Brian L. DeCost, S. Tian, Giuseppe Romano, A. Gilad Kusne, and T. Buonassisi. Fast and interpretable classification of small x-ray diffraction datasets using data augmentation and deep neural networks. npj Computational Materials, 5:1–9, 2018. doi:10.1038/s41524-019-0196-x.
[23] Pascal M. Vecsei, Kenny Choo, Johan Chang, and T. Neupert. Neural network based classification of crystal symmetries from x-ray diffraction patterns. Physical Review B, 2018. doi:10.1103/PhysRevB.99.245120.
[24] A. N. Zaloga, V. V. Stanovov, O. E. Bezrukova, P. S. Dubinin, and I. S. Yakimov. Crystal symmetry classification from powder x-ray diffraction patterns using a convolutional neural network. Materials Today Communications, 25:101662, 2020. doi:10.1016/j.mtcomm.2020.101662.
[25] Yuta Suzuki, H. Hino, T. Hawai, Kotaro Saito, M. Kotsugi, and K. Ono. Symmetry prediction and knowledge discovery from x-ray diffraction patterns using an interpretable machine learning approach. Scientific Reports, 10:21790, 2020. doi:10.1038/s41598-020-77474-4.
[26] Abhik Chakraborty and Raksha Sharma. A deep crystal structure identification system for x-ray diffraction patterns. The Visual Computer, 38:1275–1282, 2021. doi:10.1007/s00371-021-02165-8.
[27] Barbara Lafuente, R. T. Downs, H. Yang, and N. Stone. 1. The power of databases: The RRUFF project.
De Gruyter, 11 2015. doi:10.1515/9783110417104-003.
[28] S. Gates-Rector and T. Blanton. The powder diffraction file: a quality materials characterization database. Powder Diffraction, 34:352–360, 2019. doi:10.1017/S0885715619000812.
[29] Pierre Villars, Karin Cenzual, Roman Gladyshevskii, and Shuichi Iwata. PAULING FILE - towards a holistic view. Chemistry of Metals and Alloys, 11(3/4):43–76, 1 2018. doi:10.30970/cma11.0382.
[30] Thomas Armbruster and Rosa Micaela Danisi, editors. Highlights in Mineralogical Crystallography. De Gruyter, 2015. ISBN 9783110417104.
[31] Yoshio Waseda, Eiichiro Matsubara, and Kozo Shinoda. X-Ray Diffraction Crystallography. Springer, 2011. doi:10.1007/978-3-642-16635-8.
[32] Vitalij Pecharsky and Peter Zavalij. Fundamentals of Powder Diffraction and Structural Characterization of Materials. Springer, 2023. doi:10.1007/b106242.
[33] Benjamin S. Hulbert and Waltraud M. Kriven. Specimen-displacement correction for powder X-ray diffraction in Debye–Scherrer geometry with a flat area detector. Journal of applied crystallography, 56(1):160–166, 2 2023. doi:10.1107/s1600576722011360.
[34] Fuzhen Zhuang, Zhiyuan Qi, Keyu Duan, Dongbo Xi, Yongchun Zhu, Hengshu Zhu, Hui Xiong, and Qing He. A comprehensive survey on transfer learning. Proceedings of the IEEE, 109:43–76, 2021. doi:10.1109/jproc.2020.3004555.
[35] Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2414–2423, 2016. doi:10.1109/CVPR.2016.265.
[36] Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ICML’15, page 1180–1189. JMLR.org, 2015.
[37] Khawla Seddiki, Philippe Saudemont, Frédéric Precioso, Nina Ogrinc, Maxence Wisztorski, Michel Salzet, Isabelle Fournier, and Arnaud Droit. Cumulative learning enables convolu- tional neural network representations for small mass spectrometry data classification. Nature Communications , 11, 2020. doi:10.1038/s41467-020-19354-z. [38] M. Aranda. Sharing powder diffraction raw data: challenges and benefits. Journal of Applied Crystallography , 2018. doi:10.1107/S160057671801556X. [39] Loes M. J. Kroon-Batenburg, Matthew P. Lightfoot, Natalie T. Johnson, and John R. Helliwell. Raw diffraction data and reproducibility. Structural Dynamics , 11, 2024. doi:10.1063/4.0000232. [40] Jin-Woong Lee, Woon Bae Park, Jin Hee Lee, Satendra Pal Singh, and Kee-Sun Sohn. A deep-learning technique for phase identification in multiphase inorganic compounds using synthetic xrd powder patterns. Nature Communications , 11(1):86, Jan 2020. ISSN 2041-1723. doi:10.1038/s41467-019-13749-3. [41] Byung Do Lee, Jin-Woong Lee, Woon Bae Park, Joonseo Park, Min-Young Cho, Satendra Pal Singh, Myoungho Pyo, and Kee-Sun Sohn. Powder x-ray diffraction pattern is all you need for machine-learning-based symmetry identification and property prediction. Advanced Intelligent Systems , 4(7):2200042, 2022. doi:https://doi.org/10.1002/aisy.202200042. [42] Jason R. Hattrick-Simpers, Brian DeCost, A. Gilad Kusne, Howie Joress, Winnie Wong-Ng, Debra L. Kaiser, Andriy Zakutayev, Caleb Phillips, Shijing Sun, Janak Thapa, Heshan Yu, Ichiro Takeuchi, and Tonio Buonassisi. An Open Combinatorial Diffraction Dataset Including Consensus Human and Machine Learning Labels with Quantified Uncertainty for Training New Machine Learning Models. Integrating materials and manufacturing innovation , 10(2): 311–318, 6 2021. doi:10.1007/s40192-021-00213-8. 16 Page 17: [43] Jerardo E. Salgado, Samuel Lerman, Zhaotong Du, Chenliang Xu, and Niaz Abdolrahim. 
Automated classification of big x-ray diffraction data using deep learning models. npj Com- putational Materials , 9(1):214, Dec 2023. ISSN 2057-3960. doi:10.1038/s41524-023-01164-8. [44] Jan Schuetzke, Simon Schweidler, Friedrich R. Muenke, Andre Orth, Anurag D. Khan- delwal, Ben Breitung, Jasmin Aghassi-Hagmann, and Markus Reischl. Accelerat- ing materials discovery: Automated identification of prospects from x-ray diffraction data in fast screening experiments. Advanced Intelligent Systems , 6(3):2300501, 2024. doi:https://doi.org/10.1002/aisy.202300501. [45] Pauling File project. Linus pauling file product descriptions. https://web.archive. org/web/20240221221553/https://paulingfile.com/index.php?p=products#PAULING% 20FILE%20products , 2024. [Accessed: 27.11.24]. [46] ASM International. Pearson’s crystal data product description. https://web. archive.org/web/20240617123612/https://www.crystalimpact.com/pcd/ , 2024. [Ac- cessed: 27.11.2024]. [47] P. Villars. Mpds access link. https://mpds.io/#start , 2024. [Accessed: 04.12.2024]. [48] Crystal Impact. Pearson’s crystal data product offering. https://shop-crystalimpact.de/ en/p/pearson-s-crystal-data-one-year-single-license , 2024. [Accessed: 10.12.20224]. [49] Pauling File project. Mpds api product description. https://mpds.io/#products , 2024. [Accessed: 10.12.20224]. [50] ICDD. Pdf5 product description. https://www.icdd.com/pdf-5/ , 2024. [Accessed: 27.11.2024]. [51] ICDD. Pdf5+ license. https://www.icdd.com/licensing-process/ #1528471154226-933e5cc6-8da7 , 2025. [Accessed: 07.03.2025]. [52] University of Arizona Department of Geosciences. Rruff access link. https://web.archive. org/web/20241007175010/https://rruff.info/about/about_general.php , 2024. [Ac- cessed: 27.11.24]. [53] COD maintainers. Crystallography open database. https://www.crystallography.net/ cod/, 2024. [Accessed: 27.11.24]. [54] Saulius Gražulis, Daniel Chateigner, Robert T. Downs, A. F. T. 
Yokochi, Miguel Quirós, Luca Lutterotti, Elena Manakova, Justas Butkus, Peter Moeck, and Armel Le Bail. Crystal- lography open database – an open-access collection of crystal structures. Journal of Applied Crystallography , 42(4):726–729, May 2009. ISSN 0021-8898. doi:10.1107/s0021889809016690. [55] Armel Le Bail. Powbase. http://www.cristal.org/powbase/index.html , 2025. [Accessed: 07.03.25]. [56] Andriy Zakutayev, Nick Wunder, Marcus Schwarting, John D. Perkins, Robert White, Kristin Munch, William Tumas, and Caleb Phillips. An open experimental database for exploring inorganic materials. Scientific Data , 5(1), 4 2018. doi:10.1038/sdata.2018.53. [57] National Renewable Energy Laborator. High-throughput experimental database statistics. https://htem.nrel.gov/stats , 2025. [Accessed: 07.03.2025]. [58] FIZ Karlsruhe. Icsd access link. https://icsd.products.fiz-karlsruhe.de/ , 2024. [Ac- cessed: 27.11.24]. [59] Cambridge Crystallographic Data Centre. Cambridge structural database access link. https: //www.ccdc.cam.ac.uk/structures/ , 2024. [Accessed: 27.11.24]. [60] Materials Project. Materials project database website access link. https://next-gen. materialsproject.org/ , 2024. [Accessed: 27.11.24]. [61] Russian Academy of Sciences Institute of Experimental Mineralogy. Crystallographic and crystallochemical database website access link. https://database.iem.ac.ru/mincryst/ index.php , 2024. [Accessed: 27.11.24]. 17 Page 18: [62] University of the Basque Country. Bilbao incommensurate crystal structure database access link. https://www.cryst.ehu.eus/bincstrdb/search/ , 2024. [Accessed: 27.11.24]. [63] David Barthelmy. Mineralogy database access link. https://webmineral.com/ , 2024. [Ac- cessed: 27.11.24]. [64] International Union of Crystallography (IUCr). Iucr raw data letters access link. https: //iucrdata.iucr.org/x/index.html , 2024. [Accessed: 27.11.24]. [65] U.S. Naval Research Laboratory. Crystal lattice-structures access link. https://www. 
atomic-scale-physics.de/lattice/ , 2024. [Accessed: 27.11.24]. [66] Pierre Perroud. Athena mineral database access link. https://athena.unige.ch/athena/ mineral/mineral.html , 2024. [Accessed: 27.11.24]. [67] Research Collaboratory for Structural Bioinformatics. Protein data bank access link. https: //www.rcsb.org/ , 2024. [Accessed: 27.11.24]. [68] Ian T. Jolliffe and Jorge Cadima. Principal component analysis: a review and recent de- velopments. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences , 374:20150202, 2016. doi:10.1098/rsta.2015.0202. [69] S. Gražulis, D. Chateigner, R. Downs, A. Yokochi, M. Quirós, L. Lutterotti, E. Man- akova, J. Butkus, P. Moeck, and A. L. Bail. Crystallography open database – an open- access collection of crystal structures. Journal of Applied Crystallography , 42:726–729, 2009. doi:10.1107/S0021889809016690. [70] Antanas Vaitkus, Andrius Merkys, Thomas Sander, Miguel Quirós, Paul A. Thiessen, Evan E. Bolton, and Saulius Gražulis. A workflow for deriving chemical entities from crystallographic data and its application to the Crystallography Open Database. Journal of Cheminformatics , 15(1):123, Dec 2023. doi:10.1186/s13321-023-00780-2. [71] Siarhei Zhuk, Andrey A. Kistanov, Simon C. Boehme, Noémie Ott, Fabio La Mattina, Michael Stiefel, Maksym V. Kovalenko, and Sebastian Siol. Synthesis and characteriza- tion of the ternary nitride semiconductor zn2vn3: Theoretical prediction, combinatorial screening, and epitaxial stabilization. Chemistry of Materials , 33(23):9306–9316, 2021. doi:10.1021/acs.chemmater.1c03025. [72] Alexander Wieczorek, Huagui Lai, Johnpaul Pious, Fan Fu, and Sebastian Siol. Resolving oxidation states and x–site composition of sn perovskites through auger parameter analysis in xps. Advanced Materials Interfaces , 10(7):2201828, 2023. doi:https://doi.org/10.1002/admi.202201828. [73] Alexander Wieczorek, Austin G. 
Kuba, Jan Sommerhäuser, Luis Nicklaus Caceres, Chris- tian M. Wolff, and Sebastian Siol. Advancing high-throughput combinatorial aging studies of hybrid perovskite thin films via precise automated characterization methods and machine learning assisted analysis. J. Mater. Chem. A , 12:7025–7035, 2024. doi:10.1039/D3TA07274F. [74] Moritz Wolf, Stephen J. Roberts, Wijnand Marquart, Ezra J. Olivier, Niels T. J. Luchters, Emma K. Gibson, C. Richard A. Catlow, Jan. H. Neethling, Nico Fischer, and Michael Claeys. Synthesis, characterisation and water–gas shift activity of nano-particulate mixed-metal (Al, Ti) cobalt oxides. Dalton Transactions , 48(36):13858–13868, 1 2019. doi:10.1039/c9dt01634a. [75] Moritz Wolf, Nico Fischer, and Michael Claeys. Surfactant-free synthesis of monodisperse cobalt oxide nanoparticles of tunable size and oxidation state developed by factorial design. Materials Chemistry and Physics ,213:305–312,2018. doi:10.1016/j.matchemphys.2018.04.021. [76] Jiro Nishinaga, Takehiko Nagai, Takeyoshi Sugaya, Hajime Shibata, and Shigeru Niki. Single- crystal Cu(In,Ga)Se2solar cells grown on GaAs substrates. Applied Physics Express , 11(8): 082302, 7 2018. doi:10.7567/apex.11.082302. [77] M. D. Heinemann, R. Mainz, F. Österle, H. Rodriguez-Alvarez, D. Greiner, C. A. Kaufmann, and T. Unold. Evolution of opto-electronic properties during film formation of complex semi- conductors. Scientific Reports , 7(1), 4 2017. doi:10.1038/srep45463. 18 Page 19: [78] Tze-BinSong, ZhenghaoYuan, MegumiMori, FaizanMotiwala, GideonSegev, EloïseMasque- lier, Camelia V. Stan, Jonathan L. Slack, Nobumichi Tamura, and Carolin M. Sutter-Fella. Revealing the dynamics of hybrid metal halide perovskite formation via multimodal in situ probes.Advanced Functional Materials , 30(6), 12 2019. doi:10.1002/adfm.201908337. 
Supporting information for opXRD: Open Experimental Powder X-ray Diffraction Database

Daniel Hollarek, Henrik Schopmans, Jona Östreicher, Jonas Teufel, Bin Cao, Adie Alwen, Simon Schweidler, Mriganka Singh, Tim Kodalle, Hanlin Hu, Gregoire Heymans, Maged Abdelsamie, Alexander Wieczorek, Siarhei Zhuk, Arthur Hardiagon, Ruth Schwaiger, François-Xavier Coudert, Moritz Wolf, Sebastian Siol, Carolin M. Sutter-Fella, Ben Breitung, Andrea M. Hodge, Tong-yi Zhang, and Pascal Friederich*

*Corresponding author: pascal.friederich@kit.edu

S1 Description of opXRD files on Zenodo

The database comes in two zip archives, “opxrd.zip” and “opxrd_in_situ.zip”. The latter contains the in-situ data with highly correlated patterns recorded through time series measurements. Within the .zip archives, patterns are saved as .json files grouped in folders indicating the contributing institution. If an institution contributed data from several projects, the contributed data is further divided into folders indicating the research project. These research project folders are labeled alphabetically in the order they are introduced in Section 3. Each .json file contains a pattern recorded from an X-ray diffraction experiment. If available, the composition and structure of the investigated sample and the experimental conditions are also included in this file. Patterns belonging to time series measurements have filenames that indicate the measurement series they belong to and their position in that series.

S2 opXRD Python library usage

The opXRD Python library allows the dataset to be accessed through one simple command: OpXRD.load(root_dirpath). If the database is available locally under root_dirpath, this command loads it from that location. Otherwise, the database is automatically downloaded to root_dirpath.
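The load-or-download behaviour described in S2, combined with the folder layout from S1, can be illustrated with a minimal sketch. It does not call the opXRD library itself, and it makes no claim about how OpXRD.load is implemented or what it returns. The names `load_or_download` and `fake_download`, the folder names `INST_X` and `INST_Y`, and the JSON key `placeholder` are all hypothetical and exist only for illustration. "Locally available" is assumed here to mean that at least one .json file exists under the root directory.

```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path
from typing import Callable


def load_or_download(root_dirpath: str,
                     download: Callable[[Path], None]) -> dict[str, list[dict]]:
    """Return the parsed .json files grouped by top-level (institution) folder.

    Uses the local copy under root_dirpath if one exists; otherwise calls
    `download` (a hypothetical stand-in for the real fetch) to populate it.
    """
    root = Path(root_dirpath)
    # Assumption: the data counts as "available" if any .json file is present.
    has_local_copy = root.is_dir() and any(root.rglob("*.json"))
    if not has_local_copy:
        root.mkdir(parents=True, exist_ok=True)
        download(root)

    grouped: dict[str, list[dict]] = defaultdict(list)
    for path in sorted(root.rglob("*.json")):
        # First path component below the root is the institution folder (S1).
        institution = path.relative_to(root).parts[0]
        with path.open(encoding="utf-8") as handle:
            grouped[institution].append(json.load(handle))
    return dict(grouped)


def fake_download(target: Path) -> None:
    """Write two toy files in an institution/project/file.json layout."""
    for rel in ("INST_X/A/pattern_0.json", "INST_Y/pattern_0.json"):
        file_path = target / rel
        file_path.parent.mkdir(parents=True, exist_ok=True)
        # Toy content only; the real file schema is not reproduced here.
        file_path.write_text(json.dumps({"placeholder": rel}), encoding="utf-8")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        patterns = load_or_download(tmp, fake_download)
        print({name: len(items) for name, items in patterns.items()})
```

Passing the downloader in as a callable keeps the sketch runnable offline. A second call with the same directory skips the download because the files are already present, which matches the behaviour S2 describes.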
S3 Combined pattern plots

Figure 1 shows 50 randomly selected X-ray diffraction patterns from each of the research projects contributed to the opXRD database.

Figure 1: 50 randomly chosen X-ray diffraction patterns from each contributed dataset. The figure shows data from the following datasets: a) EMPA, b) LBNL-A, c) LBNL-B, d) LBNL-C, e) USC, f) INT, g) HKUST-A, h) HKUST-B, i) CNRS, j) IKFT.

arXiv:2503.05577v2 [cond-mat.mtrl-sci] 10 Mar 2025

---