Authors: Daniel Hollarek, Henrik Schopmans, Jona Östreicher, Jonas Teufel, Bin Cao, Adie Alwen, Simon Schweidler, Mriganka Singh, Tim Kodalle, Hanlin Hu, Gregoire Heymans, Maged Abdelsamie, Arthur Hardiagon, Alexander Wieczorek, Siarhei Zhuk, Ruth Schwaiger, Sebastian Siol, François-Xavier Coudert, Moritz Wolf, Carolin M. Sutter-Fella, Ben Breitung, Andrea M. Hodge, Tong-yi Zhang, Pascal Friederich
opXRD: Open Experimental Powder X-ray Diffraction
Database
Daniel Hollarek1,2, Henrik Schopmans1,2, Jona Östreicher1,2, Jonas Teufel1,2, Bin Cao3, Adie Alwen4,
Simon Schweidler2, Mriganka Singh5, Tim Kodalle5,6, Hanlin Hu7, Gregoire Heymans8, Maged
Abdelsamie9,10, Arthur Hardiagon11, Alexander Wieczorek12, Siarhei Zhuk12, Ruth Schwaiger13,
Sebastian Siol12, François-Xavier Coudert11, Moritz Wolf14, Carolin M. Sutter-Fella5, Ben Breitung2,
Andrea M. Hodge4, Tong-yi Zhang3, and Pascal Friederich1,2,*
1Institute of Theoretical Informatics, Karlsruhe Institute of Technology (KIT), 76131 Karlsruhe,
Germany. E-mail: pascal.friederich@kit.edu
2Institute of Nanotechnology, Karlsruhe Institute of Technology (KIT), 76131 Karlsruhe, Germany
3Guangzhou Municipal Key Laboratory of Materials Informatics, Advanced Materials Thrust, Hong
Kong University of Science and Technology (Guangzhou) (HKUST), Guangzhou 511400, China
4Department of Chemical Engineering and Materials Science, University of Southern California (USC),
Los Angeles CA 90089, USA
5Molecular Foundry Division, Lawrence Berkeley National Laboratory (LBNL), Berkeley 94720 CA, USA
6Advanced Light Source, Lawrence Berkeley National Laboratory, Berkeley 94720 CA, USA
7Hoffmann Institute of Advanced Materials, Shenzhen Polytechnic, Shenzhen 518055, China
8Lawrence Berkeley National Laboratory (LBNL), Chemical Sciences Division, Berkeley 94720 CA, USA
9Material Science and Engineering Department, King Fahd University of Petroleum and Minerals
(KFUPM), Dhahran 31261, Saudi Arabia
10Interdisciplinary Research Center for Intelligent Manufacturing and Robotics, King Fahd University of
Petroleum and Minerals (KFUPM), Dhahran 31261, Saudi Arabia
11Chimie ParisTech, PSL University, CNRS, Institut de Recherche de Chimie Paris, 75005 Paris, France
12Empa–Swiss Federal Laboratories for Materials Science and Technology (EMPA), 8600 Dübendorf,
Switzerland
13Institute of Energy Materials and Devices, Forschungszentrum Juelich GmbH, 52425 Juelich, Germany
14Engler-Bunte-Institut & Institute of Catalysis Research and Technology, Karlsruhe Institute of
Technology (KIT), Karlsruhe, Germany
*Corresponding author: pascal.friederich@kit.edu
Abstract
Powder X-ray diffraction (pXRD) experiments are a cornerstone for materials structure
characterization. Despite their widespread application, analyzing pXRD diffractograms still
presents a significant challenge to automation and a bottleneck in high-throughput discovery in
self-driving labs. Machine learning promises to resolve this bottleneck by enabling automated
powder diffraction analysis. A notable difficulty in applying machine learning to this domain is
the lack of sufficiently sized experimental datasets, which has constrained researchers to train
primarily on simulated data. However, models trained on simulated pXRD patterns showed
limited generalization to experimental patterns, particularly for low-quality experimental pat-
terns with high noise levels and elevated backgrounds. With the Open Experimental Powder
X-Ray Diffraction Database (opXRD), we provide an openly available and easily accessible
dataset of labeled and unlabeled experimental powder diffractograms. Labeled opXRD data
can be used to evaluate the performance of models on experimental data and unlabeled opXRD
data can help improve the performance of models on experimental data, e.g. through transfer
learning methods. We collected 92,552 diffractograms, 2179 of them labeled, from a wide spec-
trum of materials classes. We hope this ongoing effort can guide machine learning research
toward fully automated analysis of pXRD data and thus enable future self-driving materials
labs.
arXiv:2503.05577v2 [cond-mat.mtrl-sci] 10 Mar 2025
1 Introduction
The advent of high-throughput experiments holds the prospect of significantly accelerating the
speed of materials discovery[1]. The synthesis and characterization of novel materials are becoming
increasingly efficient and automated, increasing the throughput of samples in experimentation
pipelines[2–4].
After fabricating a new material, a number of analysis techniques can be used to characterize
the sample. One method that can be used for phase identification, phase quantification, grain
size characterization, and to determine the crystal structure of a new material is powder X-ray
diffraction (pXRD). When using pXRD measurements, crystal structures are typically determined
through Rietveld refinement. In Rietveld refinement, an initial crystal structure model is fitted
to the observed diffractogram by iteratively updating the structural model. Each update of the
structural model seeks to minimize the difference between the observed diffractogram and the
diffractogram simulated from the current structural model[5,6]. As Rietveld refinement is a local
optimization method, the result of the refinement procedure is generally only as good as the initial
structural model the process started from.
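The dependence of a local refinement on its starting point can be illustrated with a deliberately simplified sketch. It is not Rietveld refinement: the "structural model" is reduced to a single hypothetical parameter (the position of one Gaussian peak), and the optimizer is a plain greedy search. All function names and numerical values are assumptions made for illustration only.

```python
import math

def gaussian_pattern(two_theta, center, width=0.5):
    """Toy diffractogram: a single Gaussian peak at `center` (degrees 2-theta)."""
    return [math.exp(-0.5 * ((x - center) / width) ** 2) for x in two_theta]

def misfit(observed, simulated):
    """Sum of squared intensity differences between two patterns."""
    return sum((o - s) ** 2 for o, s in zip(observed, simulated))

def refine_peak_position(two_theta, observed, start, step=0.2, min_step=1e-4):
    """Greedy local search over one parameter; the step is halved when stuck."""
    center = start
    best = misfit(observed, gaussian_pattern(two_theta, center))
    while step >= min_step:
        moved = False
        for candidate in (center - step, center + step):
            value = misfit(observed, gaussian_pattern(two_theta, candidate))
            # Accept only a clear improvement, so rounding noise is ignored.
            if value < best - 1e-9:
                center, best, moved = candidate, value, True
                break
        if not moved:
            step /= 2
    return center

# "Observed" pattern with its peak at 30 degrees on a 10-80 degree grid.
grid = [10 + 0.05 * i for i in range(1401)]
observed = gaussian_pattern(grid, 30.0)

near = refine_peak_position(grid, observed, start=30.4)  # overlaps the true peak
far = refine_peak_position(grid, observed, start=45.0)   # no overlap with the peak
```

In this toy setting the start at 30.4° moves to the true peak position, whereas the start at 45° does not move at all: the simulated and observed peaks do not overlap, so no small step lowers the misfit. This mirrors, in a very reduced form, why the quality of the initial structural model matters.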
Manually performing Rietveld refinement is time-consuming and often requires expert knowledge.
It is not scalable to the degree required to keep up with advances in throughput and efficiency
in other steps of the experimentation pipeline. The refinement process requires the operator to
determine an initial structural model from which the refinement can start, as well as initial
values for parameters that characterize the background[7]. The structural model is usually obtained
using search-match software, which identifies crystal structures with similar powder diffraction
patterns from a database of crystal structures with accompanying powder diffraction patterns.
However, an initial structural model obtained from such a database is not guaranteed to lead to
an accurate structure solution through Rietveld refinement, especially not for novel structures.
Additionally, attempting to refine all crystal structure parameters at once is known to lead to
unphysical results[4]. Hence parameters are refined iteratively, with each iteration only refining a
limited set of parameters. Finding the correct order in which to refine structure parameters and
finding the correct values for initial background parameters both present problems that add to the
difficulty of the refinement process.
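The idea behind search-match can be sketched as a similarity ranking against a reference database. The sketch below assumes that all patterns are sampled on the same angular grid and uses cosine similarity as the score; actual search-match software typically relies on more elaborate peak-based figures of merit, so this is an illustrative simplification, and the function names and reference entries are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two intensity vectors on the same grid."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero pattern matches nothing
    return dot / (norm_a * norm_b)

def search_match(query, reference_patterns, top_k=3):
    """Return the `top_k` reference entries most similar to `query`."""
    scored = [
        (cosine_similarity(query, pattern), name)
        for name, pattern in reference_patterns.items()
    ]
    scored.sort(reverse=True)
    return [(name, score) for score, name in scored[:top_k]]

# Hypothetical four-point "patterns" standing in for database entries.
references = {
    "phase_A": [0.0, 1.0, 0.0, 0.0],
    "phase_B": [0.0, 0.0, 1.0, 0.0],
    "phase_C": [0.0, 1.0, 1.0, 0.0],
}
candidates = search_match([0.0, 0.9, 0.1, 0.0], references)
```

The top-ranked candidate would then serve as the initial structural model for refinement, which is only useful if a sufficiently similar structure is already present in the reference database.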
Machine learning has the potential to speed up the manual analysis of powder diffractograms and
keep pace with an automated high-throughput experimentation environment[8,9]. Models can be
either trained to predict crystal structure information directly given a diffractogram, or they can
be used to automate the conventional refinement workflow. In the latter case, a model would first
predict an initial crystal structure[9], which is then refined by a second model trained to perform
the refinement process[10]. So far, due to an absence of labeled datasets with experimental
diffractograms[11], machine learning in this domain has largely relied on diffractograms simulated from
known structures[12,13] or, most recently, from generated synthetic crystals[14]. Models trained
on datasets with simulated diffractograms have already shown strong performance in predicting
phases[12,15,16], lattice parameters[17–20], spacegroup[12,14,20–26], and crystallite size[17,26] from
simulated diffractograms. However, the performance substantially drops off when these models are
applied to data originating from experiments[11,14,20,21,23]. This discrepancy in performance arises
due to imperfections in experimental data which are not present in diffraction patterns modeled
under ideal conditions. This is discussed in more detail below.
Both labeled and unlabeled datasets of experimental powder diffractograms hold significant value
for machine learning-based pXRD analysis, particularly with regard to bridging the performance
gap between simulated and experimental domains. Labeled experimental data can be used to
test and benchmark existing and new automated analysis approaches. This enables researchers
to gauge how well a given model would perform under real-world conditions if integrated into
an automated experimentation pipeline. Unlabeled experimental data enables machine learning
researchers to evaluate how closely their simulations represent experimental data and modify their
simulation algorithms accordingly. Unlabeled data can also find applications in transfer learning
approaches to transfer model capabilities from the domain of simulated diffractograms to the
domain of experimental diffractograms. While some experimental powder databases exist, their
utility is limited by the fact that they are either small or not openly accessible.
In this work, we introduce an open powder X-ray diffraction (opXRD) database featuring a broad
range of patterns collected from experiments. With a total of 92,552 patterns collected from 6
contributing institutions, the opXRD database exceeds the size of the previously largest database
of openly accessible experimental powder diffraction data by two orders of magnitude. To the
best of our knowledge, the largest database of this type is the RRUFF database, containing 1290
experimental powder diffraction patterns[27]. Larger commercial datasets such as the PDF5+[28]
and the Linus Pauling File[29] exist, but their utility is limited by fees and restrictive licenses.
License terms of commercial datasets, such as the PDF5+ and the Linus Pauling File, prohibit or
restrict the publication of models trained on their data. In contrast, the opXRD database is both
free and imposes no restrictions on how its data is used. Fig. (1) provides an overview of machine
learning workflows enabled and supported by the opXRD database.
[Figure 1 panels: Powder X-ray diffraction (sample, X-ray, measured diffractogram); opXRD database (92,552 patterns from 6 contributing institutions); pXRD simulation (crystal structures from ICSD, synthetic crystals, ...; simulated diffractogram); model training (joint learning using real and simulated patterns, transfer learning, improving physics simulations); benchmarking/evaluation (models: U-Network, convolutional network, transformer; tasks: space group classification, lattice parameter regression; model comparison on a realistic test distribution).]
Figure 1: Experimental powder X-ray diffraction (pXRD) patterns from several contributors are
collected in the opXRD database. The proposed open-access database of experimental data aims
to support each step in the pXRD-related machine learning workflow by informing better physics
simulations, supplying model training data, and providing a foundation for realistic performance
evaluations.
Of the 92,552 patterns in the opXRD database, 2179 patterns come with at least partial structural
information of the underlying sample. Of these 2179 labeled patterns, more than 900 have
full structural labels including atomic coordinates. This constitutes an experimental pXRD test
dataset that is larger in size, richer in labels, and broader in represented experimental setups than
the RRUFF database, which only provides lattice parameters as labels[30]. However, since the
majority of the opXRD database is unlabeled we also want to further discuss the uses of unlabeled
data, including its role in improving pattern simulations and its application in transfer learning
approaches.
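As a quick check of the sizes quoted above, the ratio between opXRD (92,552 patterns) and RRUFF (1290 patterns), and the labeled fraction of opXRD, can be worked out directly. The rounded values below are our own arithmetic on these figures:

```latex
\frac{92{,}552}{1290} \approx 71.7, \qquad
\log_{10}(71.7) \approx 1.86, \qquad
\frac{2179}{92{,}552} \approx 2.4\,\%
```

The size ratio therefore corresponds to roughly two orders of magnitude, while about 97.6 % of the patterns are unlabeled.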
The neglected effects that lead to discrepancies between simulated patterns and patterns stem-
ming from experiments are largely known. Unaccounted effects may include preferred crystallite
orientation, variations in grain size, crystal defects, the impact of temperature on the scattering
process, internal stress, the non-monochromaticity of the X-ray source, and X-ray-induced fluo-
rescence[21,31,32]. Additionally, varying experimental setups produce distinct powder diffraction
patterns on the same sample. Features that may vary between experimental setups include the
shape of diffraction peaks, the wavelength and polarization of the employed X-ray source, and
the detector geometry[21,31,32]. The recorded scattering angles may also be slightly shifted if the
sample is displaced from its intended position[21,33]. As these and more neglected effects are in-
tegrated into the simulation process, real powder diffraction data can be used to evaluate how
closely simulated data matches up with real data. While direct comparisons are only possible on
labeled patterns, comparing the strength and prevalence of features between simulated and real
data can nevertheless provide information about the fidelity of the simulation. Taking into account
all neglected effects without making approximations will incur significant computational costs that
will lower the size of the generated training data. A more efficient approach could be to use real
experimental data to identify the effects that have the largest impact in practice and model them
heuristically.
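A heuristic of this kind can be sketched as follows. The example adds a smoothly decaying background and Gaussian noise to an idealized simulated pattern; the exponential background shape, the parameter names, and the default values are illustrative assumptions rather than quantities derived from opXRD data.

```python
import math
import random

def augment_pattern(two_theta, intensities, background_height=0.05,
                    background_decay=40.0, noise_sigma=0.01, seed=0):
    """Add a decaying background and Gaussian noise to a simulated pattern.

    All default parameters are hypothetical; in practice they would be
    estimated from the statistics of experimental diffractograms.
    """
    rng = random.Random(seed)
    augmented = []
    for angle, intensity in zip(two_theta, intensities):
        # Background that is highest at low angles and decays smoothly.
        background = background_height * math.exp(-angle / background_decay)
        noise = rng.gauss(0.0, noise_sigma)
        # Measured intensities cannot be negative, so clip at zero.
        augmented.append(max(0.0, intensity + background + noise))
    return augmented

# Idealized single-peak pattern on a coarse grid.
angles = [10.0 + 0.5 * i for i in range(141)]
clean = [math.exp(-0.5 * ((a - 30.0) / 0.3) ** 2) for a in angles]
realistic = augment_pattern(angles, clean, seed=42)
```

Fitting such parameters to the background and noise levels observed in unlabeled experimental patterns is one way in which a database like opXRD could inform the simulation without modeling every physical effect explicitly.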
The second way in which unlabeled experimental data can serve to bridge the performance gap
between simulated and experimental domains is through transfer learning. The objective of transfer
learning is to transfer the capabilities of a model learned on a source domain in which labeled data
is abundant to a target domain in which labeled data is sparse[34]. In this context, the source
domain is simulated powder diffraction patterns and the target domain is experimental powder
diffraction patterns. Many approaches to transfer learning have been proposed, particularly in the
domain of image classification[35,36]. These existing techniques can be adapted to facilitate transfer
learning in the context of pXRD patterns. Seddiki et al. have already successfully applied transfer
learning in the domain of mass spectrometry to boost the accuracy of mass spectrum classification
models[37]. Since both mass spectrometry data and pXRD data are one-dimensional, this work
demonstrates the merit of transfer learning in a setting similar to pXRD.
The opXRD database is intended as a growing, community-driven initiative. The database we
present here is the first version, but we hope to further increase the database size through active
engagement with the pXRD community. Our primary objective is to minimize the effort and thus
the barrier to contributing experimental data to the opXRD database. Thus, we developed a
program that helps to find and share data from pXRD lab computers. Users can select their most
common pXRD file types, the program lists all files of that type, and users can select or deselect
certain folders or files for sharing. Selected contributions will be uploaded to opXRD, processed to
a common file format, and (if desired) published on Zenodo on behalf of the contributors, before
becoming part of the opXRD database. If labels are available, they can be shared with opXRD as
well. Further details can be found on the opXRD website ( https://xrd.aimat.science/ ). An
overview of this process is given in Fig. (2) below.
Figure 2: Overview of the data collection pipeline. Datasets are submitted using an online sub-
mission form, optionally with the help of our submission helper software. After post-processing
and data homogenization, we offer the creation of a Zenodo entry for each user submission and
subsequently include the submission in the opXRD database.
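The file discovery step of such a helper can be sketched as below. This is not the actual opXRD submission software; the function name, the example file extensions, and the folder-exclusion mechanism are assumptions chosen to mirror the workflow described above (select file types, list matching files, deselect folders).

```python
from pathlib import Path

def find_pattern_files(root, extensions, excluded_dirs=()):
    """List files under `root` with one of the given extensions.

    `excluded_dirs` are folder paths relative to `root` that the user has
    deselected; files anywhere below them are skipped.
    """
    root = Path(root).resolve()
    wanted = {ext.lower() for ext in extensions}
    excluded = {(root / d).resolve() for d in excluded_dirs}
    matches = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix.lower() not in wanted:
            continue
        if any(parent in excluded for parent in path.parents):
            continue
        matches.append(path)
    return matches

# Hypothetical usage: collect .xy and .xrdml files, skipping a private folder.
# files = find_pattern_files("/data/xrd", [".xy", ".xrdml"], excluded_dirs=["private"])
```

Matching on the lowercased suffix keeps the search insensitive to how instrument software capitalizes file extensions.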
As argued by Aranda and Kroon-Batenburg et al.[38,39], sharing raw powder diffraction data is not
only in the interest of furthering machine learning research but is also in line with open science
principles. It furthers the ability of other researchers to reproduce published work and, in turn, adds
to the credibility of the publisher of the data. Compared to publishing data individually, publishing
data on the opXRD database has the added benefit of contributing to a large, homogeneous dataset
with a standardized interface. This makes the data more easily accessible to other researchers
and provides more value to researchers seeking large quantities of data. However, further data
annotation with metadata is required to fully fulfill the FAIR data principles.
The opXRD database contains pXRD patterns from single and multiphase materials from a wide
variety of material classes, including high-entropy materials, perovskites, and commercial catalysts.
Some of the XRD data was collected on thin films rather than on true powder samples, which may
influence the quality of the data with regard to full structure resolution. Additionally, some of the
data was collected in grazing-angle geometry rather than in the usual Bragg-Brentano geometry
employed in powder diffraction. The broad range of available experimental samples contained
in the opXRD v1.0 database makes it possible to apply state-of-the-art ML approaches to the
domain of pXRD analysis. We hope that the opXRD database can drive ML research in this
field towards more advanced automated analysis workflows that can accelerate materials science
research through ready application in high-throughput experimentation pipelines. Details of the
experiments of research groups contributing to the opXRD database are discussed in Section (3). A
detailed description of how to acquire and use opXRD data is given in Section (4), and Section (5)
describes how further data can be contributed.
Review of machine learning-based pXRD analysis
To showcase the need for datasets such as the one presented in this publication, we now discuss
some recent approaches that apply machine learning methods to classification and regression tasks
for powder diffractograms.
In 2020, Lee et al. trained a deep convolutional neural network (CNN) on simulated diffractograms
based on structures from the ICSD. The network is able to classify the phases occurring in
diffractograms of a specific compound pool[40]. In 2022, they furthermore developed models based on
fully convolutional neural networks and transformer encoders that predict the crystal system, the
spacegroup, and other structural properties, such as the band gap[41]. With their best model for
the crystal system prediction on ICSD structures, they achieved a test accuracy of 92.2 %. In 2017,
Park et al. reached a test accuracy of roughly 81 % for a CNN that classifies space groups of
simulated single-phase diffractograms[12].
A regression analysis on lattice parameters within a broader framework encompassing all material
classes was conducted by Chitturi et al.[18] in 2021. They developed a distinct CNN for each
crystal system, utilizing a merged dataset from both the ICSD and the Cambridge Structural
Database, and managed to achieve a mean absolute percentage error of about 10 % for the lattice
lengths, although they encountered difficulties in accurately predicting angles. In 2024, Zhang
et al. introduced a convolutional self-attention neural network trained on simulated patterns to
classify crystal types[20]. Their model was tested on 23,073 unary, binary, and ternary inorganic
crystal structures sourced from the COD. The study observed a noticeable performance drop when
the pre-trained model was applied to real experimental patterns as opposed to simulated data.
However, their recent work[21] proposes using convolutional peak descriptors that consider the
detector’s geometry, which reduces the performance gap in their benchmark tests.
Neural networks trained purely on experimental diffractograms can perform well when the range
of samples is narrow and the data is collected only on a single machine[13,42]. However, in a more
general setting with a wide range of investigated samples and employed diffractometers, training
neural networks purely on experimental diffractograms becomes infeasible. This is because of
the limited availability of labeled experimental diffractograms relative to the scope of the task.
However, in 2023, Salgado et al.[43] showed that adding a fraction of experimental patterns to a
simulated training dataset improves the performance on unseen experimental patterns. They used
50 % of the experimental patterns contained in the RRUFF database and added those to their
large simulated training set. Then they tested their model’s performance on the other half of the
RRUFF database and achieved a performance increase in the 230-way spacegroup classification
accuracy of 11 percentage points compared to the same model only trained on simulated patterns.
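The data split underlying such an experiment can be sketched as follows. This is a generic illustration of mixing a held-in share of experimental patterns into a simulated training set, not the code used by Salgado et al.; the function name and the placeholder records are hypothetical.

```python
import random

def mix_experimental_into_training(simulated, experimental,
                                   experimental_fraction=0.5, seed=0):
    """Split experimental data and add one part to the simulated training set.

    Returns (train, test): `train` holds all simulated patterns plus the
    chosen fraction of experimental ones; `test` holds the remaining
    experimental patterns, which the model never sees during training.
    """
    rng = random.Random(seed)
    shuffled = list(experimental)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * experimental_fraction)
    train = list(simulated) + shuffled[:cut]
    test = shuffled[cut:]
    return train, test

# Placeholder records standing in for (pattern, label) pairs.
simulated = [("sim", i) for i in range(1000)]
experimental = [("exp", i) for i in range(100)]
train, test = mix_experimental_into_training(simulated, experimental)
```

Keeping the test half strictly disjoint from the training half is what makes the reported gain a statement about unseen experimental patterns.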
In 2024, Schuetzke et al. trained a classifier to determine whether a diffractogram stems from an amorphous,
single-phase, or multi-phase sample[44]. Due to the lack of experimental pXRD patterns, they built a
pipeline to augment simulated diffractograms of a reference structure by, among other things,
slightly varying the underlying crystal lattice. For spinel structures, they reported an accuracy of
100 % and also showed that their approach can be transferred to other datasets.
In 2023, Schopmans et al. presented an approach to generate synthetic crystal structures and their
corresponding pXRD patterns on the fly during the training process[14]. This approach circumvents
the issue of a limited dataset size, which limits the depth of neural networks that can be trained.
However, the accuracy dropped substantially when we applied our space group classification model
to experimental patterns from the RRUFF database. Augmenting our simulated patterns with
background, noise, and impurities helps to bring simulated diffractograms closer to experimental
ones, making models trained on them more performant on experimental diffractograms. However,
this augmentation process could be improved by incorporating background and noise statistics
from a broader experimental pXRD database, such as the one presented in this publication.
It becomes apparent that the more general the task is, the more challenging the transfer to exper-
imental data becomes. For example, the space group classification task across all material systems
is very general. Therefore, transferring it to the application on experimental diffraction patterns
is difficult[14,23,41]. On the other hand, there are some successful approaches that also work well
on experimental data, but those are mostly methods that do phase determination in a limited
compound space, making the task less complex[40,44].
The current volume of experimental pXRD patterns is insufficient to effectively train ML models,
highlighting an urgent need for a comprehensive experimental pXRD database. The most advanced
ML models currently are trained on approximately $10^5$ to $10^6$ simulated diffractograms[14,43]. This
is, to the best of our knowledge, up to two orders of magnitude larger than the largest currently curated
experimental dataset, the PDF5+ with approximately $2 \cdot 10^4$ experimental patterns. It is even
up to one order of magnitude larger than the approximately $10^5$ unlabeled diffractograms in the initial
version of the opXRD dataset we present here.
To make ML-based pXRD data identification practical for experimental use and automate structure
prediction despite lacking experimental training data, two key approaches are essential. First,
developing more sophisticated simulation methods to better approximate experimental patterns[21]
by using statistics from experimental diffractograms. Second, creating an experimental database
that enables transfer learning to bridge the gap between simulated and real-world data. For
both of these steps, the development of opXRD is particularly significant, as it will provide a
comprehensive experimental benchmark for the community, allowing fair comparison of baseline
models and accurate evaluation of their applicability in real experimental situations.
2 Existing datasets
To contextualize opXRD within the current environment of experimental powder diffraction data,
the list below provides an overview of the largest crystal structure databases that offer access to
experimental powder diffraction data. For an overview of these databases refer to Tab. (1) below.
Table 1: Overview of experimental powder diffraction databases: The column “O.A.” indicates whether or not the
database is open-access. The availability of the chemical composition, spacegroups, lattice parameters, and atomic
coordinates of the underlying samples are indicated by the columns “Comp.”, “Spg.”, “Lattice” and “Atom coords.”,
respectively.
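| Name | No. patterns | O.A. | Comp. | Spg. | Lattice | Atom coords. | Year est. |
|---|---|---|---|---|---|---|---|
| Linus Pauling File | 21,700 | ✕ | ✔ | ✔ | ✔ | ✔ | 2002 |
| Powder Diffraction File¹ | 20,800 | ✕ | ✔ | ✔ | ✔ | ✔ (52%) | 1941 |
| RRUFF | 1290 | ✔ | ✔ | ✔ | ✔ | ✕ | 2006 |
| Crystallography Open Database | 1052 | ✔ | ✔ | ✔ | ✔ (85%) | ✔ (85%) | 2003 |
| PowBase | 169 | ✔ | ✔ | ✕ | ✕ | ✕ | 1999 |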
¹ The PDF lists the Materials Platform for Data Science (MPDS) as a database source. Since the MPDS is hosted
by the Pauling File project, there is likely significant overlap in the experimental patterns available in the PDF and
the Linus Pauling File.
Linus Pauling File[45]: The Linus Pauling File is a largely commercial crystal structure database
published and maintained by the Pauling File project[29]. It is currently distributed as Pearson's
Crystal Data[46] and the Materials Platform for Data Science (MPDS)[47]. The database, first
published in 2002, currently contains more than 534,000 crystal structures[47] and 21,700 corresponding
experimental powder diffraction patterns[46]. This makes the Pauling File, to the best of our
knowledge, the largest collection of experimental powder diffraction data available to researchers. As of
November 2024, Pearson's Crystal Data is available to researchers through a purchase of a one-year
license starting at a price point of 2200 €[48]. The MPDS is partially open, with the open portion
of the MPDS data accessible through a web interface[47]. API access to the full MPDS can be
purchased through a one-year license starting at 9500 €[49]. We asked the Pauling File project
whether the experimental powder diffraction data is accessible through the MPDS API. The Pauling
File project responded that this data is not currently provided through the API, but could be
offered in the future at the request of customers.
Powder Diffraction File[50]: The Powder Diffraction File (PDF), published and maintained
by the International Centre for Diffraction Data (ICDD), is a large collection of materials with
accompanying powder diffraction data first published in 1941[28]. According to the ICDD the
latest release of the PDF, the PDF5+, contains over a million materials with accompanying powder
diffraction data. However, since most of these powder diffraction patterns are simulated, we asked
the ICDD about the number of experimental diffraction patterns in the PDF5+. We were told that
20,800 of the patterns in the PDF5+ stemmed from experiments and that 10,954 of these patterns
were accompanied by the atomic coordinates of the underlying structures. Since the PDF5+ lists
the MPDS as a database source, there is likely a significant overlap in the experimental patterns
found in the PDF5+ and those found in the Pauling file. Currently, the PDF5+ is available to
researchers through a purchase of a one-year license starting at a price point of $6265. However,
the ICDD does not allow researchers to train machine learning models on PDF5+ data, regardless
of whether the resulting models are published[51].
RRUFF[52]: The RRUFF Mineral Database, first published in 2006, provides detailed information
on minerals, including their chemical compositions, crystallography, and spectroscopic data[27].
Managed by the University of Arizona, it was created to serve as a public repository for mineral
identification and research. It contains 1290 powder diffraction patterns stemming from experiments,
each labeled with the lattice parameters and composition of the underlying structures. The
RRUFF data is openly accessible on its official website[52].
Crystallography Open Database[53]: The Crystallography Open Database (COD) is an open-access
collection of crystal structures founded in 2003[54]. It currently provides over 500,000 crystal
structures. Of these entries, 1052 contain the experimental powder diffraction data that was used to
determine the underlying crystal structures of the investigated samples. Hence, the experimental
powder diffraction data contained in the COD is mostly labeled with the full crystal structure
information. The data is openly accessible in the form of .cif files on the official COD website[53].
PowBase[55]: PowBase is a database of 169 mostly unlabeled experimental powder diffraction patterns
collected and maintained by crystallography researcher Armel Le Bail starting in 1999. PowBase
is an initiative suggested in the Structure Determination by Powder Diffractometry (SDPD)
mailing list which was co-maintained by Le Bail. The COD is another community initiative that
grew out of this mailing list. As of March 2025, all 169 patterns are still freely available for
download on the official website[55].
There is also publicly available powder diffraction data uploaded to datasets on Zenodo. However,
this data is split into disparate entries that typically only contain the work of a single research
project. Additionally, extracting powder diffraction data at scale is hindered by the fact that the
data is often given in plain text files in non-standardized formats, which are difficult to parse
automatically. We are currently planning a systematic large-scale extraction of powder diffraction
data from databases like Zenodo with the help of a large language model. This data will be included
in a future release of the opXRD database.
While not strictly speaking a powder diffraction database, the High-Throughput Experimental
Materials Database (HTEM) by the National Renewable Energy Laboratory (NREL) is a valuable
source of X-ray diffraction data[56]. Currently, the HTEM database contains 65,779 thin-film
samples with corresponding X-ray diffraction data[57]. Each database entry includes the elemental
composition of the underlying sample but does not provide any information on its structure. HTEM
data is open-access and can be downloaded through an API provided by NREL.
Aside from the databases mentioned above, we have also investigated several other crystal struc-
ture resources in search of experimental powder diffraction data. Crystal structure resources that
were investigated but not found to contain any appreciable amount of publicly available experimen-
tal powder diffraction data include the Inorganic Crystal Structure Database[58], the Cambridge
Structural Database[59], the Materials Project database[60], the Crystallographic and Crystallo-
chemical Database[61], the Bilbao Incommensurate Crystal Structure Database[62], the Mineralogy
Database[63], the IUCr Raw data letters[64], the U.S. Naval Research Laboratory Crystal
Lattice-Structures[65], the Athena Mineral database[66], and the Protein Data Bank[67]. The lack of
experimental powder diffraction data in these databases is to be expected as most structure solutions
are achieved through single-crystal diffraction.
3 opXRD database
In collaboration with several other research institutions, we have collected a database of 92,552
experimental patterns of which 2179 are at least partially labeled with structural information of the
underlying material. The following research institutions contributed data to the opXRD database:
The French National Centre for Scientific Research (CNRS), Hong Kong University of Science and
Technology (Guangzhou) (HKUST), University of Southern California (USC), Lawrence Berkeley
National Laboratory (LBNL), Empa–Swiss Federal Laboratories for Materials Science and Tech-
nology (EMPA) and the Karlsruhe Institute of Technology (KIT). Tab. (2) provides an overview
of the contributions of each institution. We filtered the submitted datasets to exclude patterns
with invalid features such as only one unique recorded angle, negative angles, fewer than 50 recorded
angles in total, or all intensities being zero.
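The four exclusion criteria above can be expressed as a short validity check. The following is a minimal sketch, not the filtering code used for the database: the function name `is_valid_pattern`, its arguments, and the additional length check are assumptions made for illustration.

```python
def is_valid_pattern(two_theta, intensities, min_points=50):
    """Return True if a pattern passes the validity checks named in the text.

    two_theta and intensities are equal-length sequences of numbers.
    """
    if len(two_theta) != len(intensities):
        return False  # extra sanity check (an assumption, not stated in the text)
    if len(two_theta) < min_points:
        return False  # fewer than 50 recorded angles in total
    if len(set(two_theta)) <= 1:
        return False  # only one unique recorded angle
    if any(angle < 0 for angle in two_theta):
        return False  # negative angles
    if all(value == 0 for value in intensities):
        return False  # all intensities are zero
    return True


# Hypothetical example pattern: 100 points between 10 and about 20 degrees.
example_angles = [10 + 0.1 * i for i in range(100)]
example_counts = [float(i % 7) for i in range(100)]
print(is_valid_pattern(example_angles, example_counts))
```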
Table 2: Overview of the contributions to the opXRD database. The availability of the chemical composition,
spacegroups, lattice parameters, and atomic coordinates of the underlying samples is indicated by the columns
“Comp.”, “Spg.”, “Lattice” and “Atom coords.” respectively. INT and IKFT denote the Institute of Nanotechnology
and the Institute of Catalysis Research and Technology at KIT.

| Institution | No. patterns | Comp.  | Spg.    | Lattice | Atom coords. | Research project                            |
|-------------|--------------|--------|---------|---------|--------------|---------------------------------------------|
| CNRS        | 1052         | ✔      | ✔ (85%) | ✔       | ✔ (85%)      | Diffraction data extracted from the COD     |
| USC         | 338          | ✔      | ✔       | ✔ (90%) | ✕            | Study of CuNi and CuAl alloys               |
| HKUST(GZ)   | 520          | ✔ (4%) | ✔ (4%)  | ✔ (4%)  | ✔ (4%)       | Phase identification dataset                |
| EMPA        | 770          | ✔      | ✔ (63%) | ✕       | ✕            | Metal halide perovskites, Zn-V-N libraries  |
| INT         | 19,796       | ✕      | ✕       | ✕       | ✕            | Compilation of various projects             |
| IKFT        | 64           | ✕      | ✕       | ✕       | ✕            | Commercial catalysts, metals, metal oxides  |
| LBNL        | 70,012       | ✕      | ✕       | ✕       | ✕            | Perovskite precursors, Mn-Sb-O system       |
The variance of the data was analyzed using principal component analysis (PCA). PCA can be
applied to datasets X ⊂ R^N to reduce the number of components needed to describe points p ∈ X
up to some tolerance in lost accuracy. In the context of PCA, the cumulative explained variance
ratio is a measure of how much of the variance in the dataset X can be explained using a given
number of components. For a rigorous definition of PCA and the explained variance ratio, we refer
to the literature[68]. Here, PCA was performed on datasets of X-ray diffraction patterns. These
datasets X are subsets of R^N with N = 512, since each pattern p ∈ X was standardized to have 512
intensity values spread out evenly from 0° to 180° using zero padding and interpolation with cubic
splines. Hence, the maximal number of components that could be needed to describe a dataset of
diffraction data in this context is N = 512. However, the maximal number of components is even
lower for datasets that contain fewer than 512 patterns. In this case, the maximal number of
components is equal to the number of patterns in the dataset, since each pattern can add at most
one degree of freedom to the dataset X ⊂ R^N. Hence, the maximum number of components N_max of
a pattern dataset X is given as follows:
N_max = min(N_values, N_patterns). (1)
Here, N_values = 512 is the number of recorded intensity values per pattern and N_patterns is the
number of patterns in the dataset X. Fig. (3) below shows the cumulative explained variance
ratio over the fraction of the maximal number of components N_max as defined above. In this figure,
a faster convergence of the cumulative variance ratio towards one indicates that the patterns in the
dataset are relatively similar. The degree of variation between the patterns differs between contributions.
For example, the CNRS and the HKUST contributions are each collections that encompass many
research projects over a long period of time and thus exhibit a high degree of variability between
individual patterns. In contrast, the contributions by USC and LBNL contain many very similar
patterns. The patterns in the USC dataset are similar because the underlying samples are all
variations of CuNi and CuAl alloys. The patterns submitted by LBNL are similar because they
stem from in-situ recordings where several hundred or several thousand patterns were collected
over time per sample while they were undergoing physical conversion processes.
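To make equation (1) and the axes of the explained-variance curve concrete, the following minimal sketch computes the maximal number of components, the cumulative explained variance ratio, and the fraction of maximal components. It assumes that per-component variances have already been obtained from some PCA implementation; the function names and the example numbers are illustrative and not part of the opXRD tooling.

```python
from itertools import accumulate


def max_components(n_values, n_patterns):
    """Equation (1): N_max = min(N_values, N_patterns)."""
    return min(n_values, n_patterns)


def cumulative_explained_variance(component_variances):
    """Cumulative explained variance ratio for components sorted by variance."""
    ordered = sorted(component_variances, reverse=True)
    total = sum(ordered)
    if total <= 0:
        raise ValueError("total variance must be positive")
    return [partial / total for partial in accumulate(ordered)]


def component_fractions(n_components, n_max):
    """Horizontal axis of the curve: k / N_max for k = 1 .. n_components."""
    return [k / n_max for k in range(1, n_components + 1)]


# Illustrative numbers: a dataset with 338 patterns of 512 intensity values each.
n_max = max_components(n_values=512, n_patterns=338)
# Hypothetical per-component variances, as any PCA routine would return them.
ratios = cumulative_explained_variance([6.0, 3.0, 1.0])
fractions = component_fractions(len(ratios), n_max)
print(n_max, ratios, fractions)
```

A curve that reaches a ratio close to one at a small fraction corresponds to a dataset of very similar patterns.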
Figure 3: Explained variance ratio over the fraction of the maximum number of components for each
dataset contributed to the opXRD database. Here, the maximal number of components refers to N_max as
defined in equation (1). Datasets contributed by the same institution are labeled alphabetically in
the order in which they are described in the text towards the end of this section.
Fig. (4) provides an overview of the distributions of pattern and structure properties in the opXRD
database. Nearly all patterns have an angular resolution smaller than ∆(2θ) = 0.1°. Here, the
angular resolution is defined as the range of recorded angles divided by the number of recorded
intensity values along that range. For most patterns, the lowest recorded angle is smaller than 30°
and the highest recorded angle is smaller than 120°. The start-to-end angle distribution reveals
that all diffractograms start in a narrow window between 0° and approximately 50°, while they
end between 50° and 150°, with the majority of patterns going from 0° to approximately 70°.
Unlike most ML approaches using synthetic data over the full angle range with fixed resolution,
the opXRD dataset has a strongly varying angle range and resolution. Hence, working with this
data requires additional pre-processing methods such as padding and interpolation, or more flexible
ML models beyond standard CNNs.
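As an illustration of the pre-processing mentioned above, the following sketch computes the angular resolution as defined in this section and resamples a pattern onto a fixed grid with zero padding outside the measured range. It is a simplified stand-in rather than the opxrd implementation: it uses linear interpolation instead of cubic splines to stay within the Python standard library, assumes strictly increasing angles, and assumes that the grid includes both endpoints. All function names are hypothetical.

```python
from bisect import bisect_right


def angular_resolution(two_theta):
    """Range of recorded angles divided by the number of recorded values."""
    return (max(two_theta) - min(two_theta)) / len(two_theta)


def standardize(two_theta, intensities, n_points=512, start=0.0, end=180.0):
    """Resample a pattern onto an evenly spaced grid from start to end.

    Grid points outside the measured range are set to zero (zero padding).
    Linear interpolation is used here for brevity; two_theta must be
    strictly increasing.
    """
    step = (end - start) / (n_points - 1)
    grid = [start + i * step for i in range(n_points)]
    resampled = []
    for angle in grid:
        if angle < two_theta[0] or angle > two_theta[-1]:
            resampled.append(0.0)  # zero padding outside the measured range
            continue
        j = bisect_right(two_theta, angle)
        if j == len(two_theta):  # grid point coincides with the last angle
            resampled.append(float(intensities[-1]))
            continue
        x0, x1 = two_theta[j - 1], two_theta[j]
        y0, y1 = intensities[j - 1], intensities[j]
        resampled.append(y0 + (y1 - y0) * (angle - x0) / (x1 - x0))
    return grid, resampled


# Hypothetical pattern recorded from 10 to 70 degrees in steps of 0.05 degrees.
example_angles = [10.0 + 0.05 * i for i in range(1201)]
example_counts = [2.0] * 1201
grid, resampled = standardize(example_angles, example_counts)
print(angular_resolution(example_angles), len(resampled))
```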
In the following, we will describe the datasets contributed by each of the collaborating research
groups and institutions. Each paragraph includes a description of the investigated materials and
how X-ray diffraction data was collected. If applicable, the presence of thin-film samples or atypical
diffraction geometries is indicated. Most data was collected using Cu radiation sources, which have
a Kα1 wavelength of λ = 1.54056 Å and a Kα2 wavelength of λ = 1.54439 Å.
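The two Cu wavelengths quoted above translate into slightly different peak positions for the same lattice spacing. The following sketch applies Bragg's law (first order) to illustrate this. The function names and the example reflection at 2θ = 40° are assumptions chosen for illustration.

```python
import math

CU_K_ALPHA1 = 1.54056  # wavelength in Å, as quoted in the text
CU_K_ALPHA2 = 1.54439  # wavelength in Å, as quoted in the text


def d_spacing(two_theta_deg, wavelength):
    """Bragg's law, first order: d = λ / (2 sin θ)."""
    theta = math.radians(two_theta_deg / 2)
    return wavelength / (2 * math.sin(theta))


def two_theta(d, wavelength):
    """Inverse relation: 2θ in degrees for a lattice spacing d in Å."""
    ratio = wavelength / (2 * d)
    if not 0 < ratio <= 1:
        raise ValueError("reflection is not reachable at this wavelength")
    return 2 * math.degrees(math.asin(ratio))


# Hypothetical reflection observed at 2θ = 40° with Cu Kα1 radiation.
d = d_spacing(40.0, CU_K_ALPHA1)
# Position of the same reflection for Kα2, and the resulting peak splitting.
splitting = two_theta(d, CU_K_ALPHA2) - 40.0
print(round(d, 4), round(splitting, 3))
```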
Institut de Recherche de Chimie Paris, CNRS
Experimental pXRD data was extracted from the Crystallography Open Database (COD)[69,70].
The COD is, to our knowledge, the largest open-access collection of experimental crystal structures
of organic, inorganic, and metal-organic compounds and minerals, containing more than 500,000
entries. The data in the COD are placed in the public domain and licensed under the CC0 License.
Of the entire COD database, 5432 structures contained at least one tag from the CIF_POW
dictionary, i.e., a tag relating to powder diffraction studies. These 5432 structures only account
for 1% of the total COD database, but this is to be expected since most crystal structures are
resolved from single-crystal diffraction. Of these 5432 files, most contained only metadata related
to the powder diffraction experiment, but did not include the raw data of the pattern itself. We
could extract raw experimental pXRD patterns from 1052 files in total, after curation of a small
number of files with clearly invalid data.
The pXRD data from the COD database are of high quality, with a median resolution of ∆(2θ) =
0.013° and an average of 9190 points measured per pattern. They span a wide chemical
space, including organic, inorganic, and hybrid structures, and 75 different elements of the periodic
table.
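The distinction between files that only carry powder metadata and files that contain the pattern itself can be sketched as a simple tag scan. This is not the extraction procedure used for the CNRS contribution, which is not detailed here. It assumes that powder-dictionary data names start with `_pd_` and that the presence of one of a few intensity items signals raw or processed pattern data. The function names and sample CIF snippets are hypothetical.

```python
import re

# Data names from the powder CIF dictionary start with "_pd_".
PD_TAG = re.compile(r"^\s*(_pd_\S+)", re.MULTILINE)

# Items assumed here to indicate that pattern intensities are present.
INTENSITY_TAGS = {
    "_pd_meas_intensity_total",
    "_pd_meas_counts_total",
    "_pd_proc_intensity_total",
    "_pd_proc_intensity_net",
}


def powder_tags(cif_text):
    """Collect all powder-dictionary data names that occur in a CIF file."""
    return {tag.lower() for tag in PD_TAG.findall(cif_text)}


def classify(cif_text):
    """Classify a CIF file by the kind of powder information it carries."""
    tags = powder_tags(cif_text)
    if not tags:
        return "no powder tags"
    if tags & INTENSITY_TAGS:
        return "pattern data"
    return "metadata only"


# Hypothetical CIF fragments for illustration.
metadata_only = "data_x\n_pd_meas_scan_method step\n_cell_length_a 5.43\n"
with_pattern = (
    "data_y\nloop_\n_pd_meas_2theta_scan\n_pd_meas_intensity_total\n"
    "10.00 120\n10.02 131\n"
)
print(classify(metadata_only), "|", classify(with_pattern))
```

With the numbers reported above, 1052 of the 5432 tagged files, or roughly 19%, yielded an extractable pattern.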
Figure 4: Histograms detailing the distribution of pattern and structure properties in the opXRD
database: a) distribution of spacegroups present in labeled data; b) distribution of angular
resolution in all data; c) distribution of smallest and largest recorded 2θ values for all data.
Guangzhou Municipal Key Laboratory of Materials Informatics, HKUST(GZ)
Two datasets were contributed to the opXRD database. The first dataset (HKUST-A) is a
selected subset of a small-scale experimental powder X-ray database developed over the past
two years, called the X-Ray Phase Identification Public Experimental Dataset (XRed)
(https://github.com/WPEM/XRED). The primary goal of XRed is to support the advancement of
intelligent phase identification technology by providing a foundation for data collection in future
large-scale machine learning applications. XRed primarily focuses on metal and metal-oxide par-
ticles, with data collected using diffractometers such as the Empyrean 3.0, Aeris, and Bruker D8
Advance, all employing Cu X-ray sources. The dataset HKUST-A contains 21 pXRD patterns each
labeled with a corresponding CIF file that documents the refined structure. Data are categorized
by elemental systems and include original experimental files, spanning single-phase to five-phase
mixtures, as well as mixtures designed for various research tasks.
In addition to XRed, the opXRD database integrates an experimental dataset composed of powder
diffraction data sourced from open-access publications and collaborating institutions (HKUST-B).
These institutions have provided the data with full authorization for research purposes. Compared
to XRed, this dataset offers broader chemical element coverage, encompassing ionic, atomic, and
metallic crystals. It is also larger, containing 499 entries. However, unlike XRed, these data entries
are not accompanied by CIF files.
Laboratory for Surface Science and Coating Technologies, Empa
Combinatorial Zn–V–N libraries were synthesized using radio-frequency co-sputtering of Zn and V
in a mixed Ar and N2 plasma. An orthogonal deposition temperature and composition gradient
was created, resulting in a deposition temperature of 220 °C for samples 1 – 9 and 114 °C for
samples 37 – 45. The composition for each sample was determined using X-ray fluorescence (XRF)
spectroscopy which was further calibrated through Rutherford backscattering spectroscopy (RBS)
based on selected samples. The newly identified and isolated semiconductor Zn2VN3 was found
to exhibit a cation-disordered wurtzite structure, as verified by additional GI-XRD and SAED
measurements[71].
Tin halide perovskites were deposited using single-step spin-coating as reported elsewhere[72].
Methylammonium lead iodide libraries with varying degrees of residual PbI2 were deposited using
a two-step procedure involving both thermal evaporation of PbI2 and subsequent spin-coating of
a methylammonium solution. The relative phase fractions were quantified using supplementary
azimuthal angle scans coupled with structural factors and geometrical factors as reported
elsewhere[73]. Fully inorganic lead perovskite libraries were prepared using thermal co-evaporation of
lead and cesium halide salts. All metal halide perovskite libraries were measured within a
custom-made X-ray transparent inert-gas dome, resulting in the presence of minor additional features
within the θ = 19–31° range. For all combinatorial libraries where any phases are specified, the
complete set of phases is reported in the metadata.
XRD data was measured using a Bruker D8 Discover equipped with a Cu radiation source in a
Bragg-Brentano geometry. For the reported data sets the instrument was equipped with a Goebel
mirror effectively removing the Cu Kβ radiation. The data set originates from the combinatorial
exploration of the Zn–V–N compositional space, as well as data gathered from multiple research
activities on more established metal halide perovskite semiconductors. All data was collected from
thin films deposited on borosilicate glass. The Zn–V–N films showed some preferential out-of-
plane orientation, while for the perovskites the preferential orientation was minimal, resulting in
the presence of all reflections.
Institute of Nanotechnology, KIT
X-ray diffraction data was collected from a wide range of research projects conducted at the
Institute of Nanotechnology over the past 10 years. A major part of the research focused on high-entropy
materials, which involved incorporating many different elements into single-phase structures,
leading to peak shifts or phase separations. Most of those multi-component complex materials appeared
in various structures, including rock-salt, spinel, fluorite, perovskite, and delafossite. The samples
were prepared either in powder or in bulk form; therefore, powder XRD was performed on samples
with adjusted height. The samples were prepared using various synthesis techniques, mostly solid-
state or wet chemical syntheses, to obtain the desired structures. Consequently, particle size and
crystallinity varied significantly. The sample set also includes samples that were not successfully
measured or where phases could not be identified.
The X-ray diffraction data were collected on a Bruker D8 Advance using a Cu radiation source or
a STOE Stadi P diffractometer equipped with a Ga-jet X-ray source. The samples were initially
recorded for various research projects over the last ten years and were measured with different step
sizes, times per step, and over different angle ranges, but all using Cu Kα or Ga Kβ radiation. The
samples mostly contained transition metal oxides, sulfides, and fluorides. To improve statistics, the
samples were rotated during the entire measurement. Some air-sensitive samples were measured
using a transparent polymer dome for protection. This dome led to increased background noise
over the first 20° and slightly decreased pattern resolution.
Institute of Catalysis Research and Technology, KIT
A variety of samples were analyzed, including commercial catalysts, bulk reference materials, porous
metal oxide particles, and nanoparticles. The latter were synthesized via the surfactant-free benzyl
alcohol route[74,75]. The cobalt oxide (CoO or Co3O4) and cerium oxide (CeO2) nanoparticles
were in the size range of 4–16 nm according to the Scherrer equation. A series of porous Al2O3
materials, which were prepared by calcination of boehmite (AlOOH) at various temperatures,
represents crystalline samples with limited long-range structure and various contributions of Al2O3
polymorphs.
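For readers unfamiliar with the Scherrer equation mentioned above, the following sketch evaluates D = Kλ / (β cos θ), where β is the peak width (FWHM) in radians and θ is half the diffraction angle. The shape factor K = 0.9, the neglect of instrumental broadening, and the example peak are assumptions made for illustration. They are not the parameters used for the reported 4–16 nm size range.

```python
import math


def scherrer_size_nm(fwhm_deg, two_theta_deg, wavelength_angstrom=1.54056,
                     shape_factor=0.9):
    """Crystallite size from the Scherrer equation, D = K λ / (β cos θ).

    fwhm_deg is the peak width (FWHM) on the 2θ axis in degrees. Instrumental
    broadening is ignored in this simplified sketch.
    """
    beta = math.radians(fwhm_deg)            # peak width in radians
    theta = math.radians(two_theta_deg / 2)  # Bragg angle
    size_angstrom = shape_factor * wavelength_angstrom / (beta * math.cos(theta))
    return size_angstrom / 10                # convert Å to nm


# Hypothetical peak: FWHM of 1.0° at 2θ = 28.5°, measured with Cu Kα1 radiation.
print(round(scherrer_size_nm(1.0, 28.5), 1))
```

Broader peaks correspond to smaller crystallites: doubling the FWHM halves the estimated size.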
X-ray diffraction (XRD) was conducted with an X’Pert Pro MPD (Panalytical) in Bragg-Brentano
geometry using a Cu X-ray source. The patterns were acquired in the 2θ range of 5–80° with
a step size of 0.016711° or 0.033420° and a total acquisition time of 40 to 120 min. This study
has been carried out with the support of Angelina Barthelmeß, Elisabeth Herzinger, and Henning
Hinrichs.
Molecular Foundry Division & Advanced Light Source & Chemical Sciences Division,
LBNL
In total, four different datasets were collected. The first dataset (LBNL-A) was collected from
spin-coating and annealing triple-cation metal-halide perovskite precursor solutions with the
composition Cs0.05(MA0.23FA0.77)Pb1.1(I0.77Br0.23)3 onto various substrates. Here, MA stands for
Methylammonium and FA stands for Formamidinium. The substrates onto which these solutions
were coated include glass, which is amorphous, and GaAs wafers, which are single crystalline.
Other substrates were stacks of glass/indium tin oxide, stacks of GaAs/CIGS, and stacks of
glass/CIGS. Here, CIGS stands for a stack of Mo, Cu(In,Ga)Se2, CdS, and ZnO. Some of the substrates
were additionally covered with a self-assembling monolayer of MeO-2PACz. The GaAs substrates
were prepared by Dr. Jiro Nishinaga from the National Institute of Advanced Industrial Science
and Technology (AIST) in Japan[76] and the glass/CIGS substrates by Dr. Christian Kaufmann
and his team at Helmholtz-Zentrum Berlin (HZB) in Germany[77]. Data collection was performed
in situ during thin-film deposition using a custom-made spin-coating and annealing stage[78].
A second dataset (LBNL-B) was collected from spin-coating metal-halide perovskite precursor
solutions with varying compositions of MAPb(I1–xBrx)3 onto glass substrates. Here,
MA = Methylammonium and x = 0, 0.33, 0.5, 0.67, 1. The substrates were preheated to different
temperatures including 30 °C, 50 °C, 70 °C, and 90 °C, and the spin-coating process was performed
at a constant temperature on the preheated substrates. For both datasets, diffraction data were
continuously measured during spin-coating, chemical induction of crystallization, and annealing of
the samples, at 100 °C and 110 °C, respectively. The diffraction data was recorded with a frequency of
about 0.56 1/s and 0.54 1/s. Each in situ measurement consisted of about 500 to 1000 individual
diffractograms. Depending on the substrate, each series of diffractograms shows an evolution
from substrate only to a combination of polycrystalline perovskite, PbI2, and substrate via several
intermediate phases.
For these two datasets, experimental XRD data were collected at beamline 12.3.2 of the Advanced
Light Source, the synchrotron at Lawrence Berkeley National Laboratory. The data were
collected using a photon energy of 10 keV (λ = 1.23984 Å), selected using a Si(111) monochromator.
Measurements were taken in grazing incidence geometry, i.e., using a beam incidence angle of
1°. Two-dimensional diffraction images were recorded using a Dectris Pilatus 1M area detector at
an angle between 34° and 36° with a sample-to-detector distance of roughly 190 mm. The
two-dimensional data were calibrated using an Al2O3 calibration standard and integrated along the
azimuthal angle.
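The wavelength quoted for the 10 keV beam follows from λ = hc/E. Because these synchrotron patterns were recorded at a different wavelength than the Cu-source data, their peaks appear at different angles for the same lattice spacing. The sketch below shows both relations. Whether and how the patterns in the database are rescaled between wavelengths is not specified here, so the conversion function is purely illustrative and its name is hypothetical.

```python
import math

HC_KEV_ANGSTROM = 12.398419843  # Planck constant times speed of light, in keV·Å


def wavelength_from_energy(energy_kev):
    """Photon wavelength in Å from photon energy in keV (λ = hc / E)."""
    return HC_KEV_ANGSTROM / energy_kev


def convert_two_theta(two_theta_deg, wavelength_from, wavelength_to):
    """Peak position at another wavelength for the same lattice spacing.

    Uses Bragg's law: sin(θ_to) = sin(θ_from) * λ_to / λ_from. Returns None
    if the reflection cannot be reached at the target wavelength.
    """
    sine = math.sin(math.radians(two_theta_deg / 2)) * wavelength_to / wavelength_from
    if sine > 1:
        return None
    return 2 * math.degrees(math.asin(sine))


synchrotron_wavelength = wavelength_from_energy(10.0)
# Hypothetical peak at 2θ = 20° in the synchrotron data, expressed for Cu Kα1.
cu_position = convert_two_theta(20.0, synchrotron_wavelength, 1.54056)
print(round(synchrotron_wavelength, 5), round(cu_position, 2))
```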
A third dataset (LBNL-C) was collected by observing the phase evolution of an Mn-Sb-O system
with varying annealing temperatures. The temperatures used to analyze the crystal structure of
the Mn-Sb-O system were chosen depending on the number of phase transitions appearing for a
certain temperature range. Few changes in the crystal structure appeared between room temperature
and 300 °C, whereas phase transitions appeared from 300 °C to 850 °C. No phase transition appeared
when cooling down. Therefore, the crystal structure was measured every 100 °C between room
temperature and 300 °C; every 50 °C between 300 °C and 850 °C; and every 200 °C when cooling
down. The heating and cooling rates were fixed for all the experiments at 50 °C/min and the
holding time was fixed to 2 min.
This data was collected using the in situ Rigaku SmartLab 3 kW diffractometer. This tool operates
with SmartLab Studio II software, which can measure the X-ray diffraction during the annealing
process. This enables directly showing all the phase transitions when annealing in various
atmospheres such as O2, Ar, and NH3. Phase transitions are analyzed with the in situ XRD tool up
to 850 °C in this work. Most of the in situ experiments were performed under an air-like 20%
O2 and 80% Ar environment (Ar flow: 50 sccm, O2 flow: 10 sccm). When a 100% Ar
environment was used, the Ar flow was set to 60 sccm. The Bragg-Brentano (BB) mode is preferred
in terms of geometry because it is better suited to the analysis of scarce phases such as MnSb2O6
rutile. The angular step used in the recording was 0.01° and the scanning rate was 10°/min.
A fourth dataset (LBNL-D) was collected from a two-step spin-coating process using metal-organic
frameworks (MOFs) in perovskite precursor solutions, deposited onto glass substrates. In the first
step, a nanoscale thiol-functionalized UiO-66-type Zr-based MOF (UiO-66-(SH)2) was added to
the PbI2 precursor. This was followed by the deposition of an organic mixture solution containing
FAI, MACl, and MABr in the second step. The incorporation of MOFs aids in suppressing
perovskite vacancy defects, thereby enhancing device stability and efficiency. To further investigate
the influence of UiO-66-(SH)2 on perovskite thin-film formation during the annealing process, a
time-resolved GIWAXS experiment was conducted. The measurements were performed using a
setup similar to that of LBNL-A and B.
4 Usage
The opXRD database is hosted on Zenodo (https://zenodo.org/records/14254270) and can be
downloaded by any user without any barriers or restrictions.
In addition to the opXRD dataset on Zenodo, we also provide a Python library
“opxrd” to easily download and interface with the dataset. Installation instructions
can be found in the repository associated with the library, which is located at
https://github.com/aimat-lab/opxrd. The opxrd library includes
options for data loading, standardization, plotting, and conversion to PyTorch tensors. We
provide a Jupyter Notebook
(https://colab.research.google.com/github/aimat-lab/opXRD/blob/main/opxrd/usage.ipynb)
that showcases these functionalities in more detail. This
notebook also illustrates how to interface with the opXRD database through Python.
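For users who prefer not to install the opxrd library, the files of the Zenodo record can in principle be located through the Zenodo REST API. The sketch below does not use or describe the opxrd library's own interface. It assumes that the record metadata returned by the API contains a `files` list whose entries have a `key` (file name) and a `links.self` download URL. This layout and the helper names are assumptions and should be checked against the current Zenodo API documentation.

```python
import json
from urllib.request import urlopen

RECORD_ID = "14254270"  # opXRD record on Zenodo, as given in the text


def record_api_url(record_id):
    """URL of the record metadata in the Zenodo REST API."""
    return f"https://zenodo.org/api/records/{record_id}"


def file_links(record):
    """Extract (file name, download URL) pairs from record metadata.

    Assumes a "files" list with "key" and "links" -> "self" entries.
    """
    return [(entry["key"], entry["links"]["self"]) for entry in record.get("files", [])]


def fetch_record(record_id=RECORD_ID):
    """Download and parse the record metadata (requires network access)."""
    with urlopen(record_api_url(record_id)) as response:
        return json.load(response)


if __name__ == "__main__":
    for name, url in file_links(fetch_record()):
        print(name, url)
```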
5 Summary and Outlook
With the opXRD database, a curated collection of 92,552 experimental powder X-ray diffraction
patterns, 2179 of which are at least partially labeled, from a wide range of different materials systems,
we provide the largest currently available source of experimental XRD patterns. With this, we
address the need for experimental data that arises when developing algorithms and analysis tools
for pXRD data, both based on machine learning and classical approaches. The data can be used
for the actual method development and for testing. Our dataset is a valuable and so far missing
resource to drive further developments in the automated analysis of XRD data.
Rather than a finished project, the opXRD database is an ongoing effort to collect experimental
powder XRD data. We invite everyone who is working in the area of experimental powder XRD to
submit their data to the dataset, in order to further improve its utility and thus aid further
developments in this field. Our submission page (https://xrd.aimat.science/) and submission
helper software will be kept available to collect more data. We will keep updating and maintaining
the dataset with new incoming submissions.
Data availability
The opXRD database is available on Zenodo at https://zenodo.org/records/14254270. It is
published under the Creative Commons Attribution 4.0 International license. It can be downloaded
by any user without any barriers or restrictions. For further details, please refer to Section (4).
Conflicts of interest
There are no conflicts of interest to declare.
Acknowledgements
H.S. acknowledges financial support by the German Research Foundation (DFG) through the Re-
search Training Group 2450 “Tailored Scale-Bridging Approaches to Computational Nanoscience”.
P.F. and D.H. acknowledge support by the Federal Ministry of Education and Research (BMBF)
under Grant No. 01DM21001B (German-Canadian Materials Acceleration Center). J.Oe. and
P.F. acknowledge financial support from the Helmholtz Foundation Model Initiative within Project
"SOL-AI". Part of this work was funded under the France 2030 framework by Agence Nationale de
la Recherche (project ANR-22-PEXD-0009 of PEPR DIADEM). Work at the Molecular Foundry
was supported by the Office of Science, Office of Basic Energy Sciences, of the U.S. Department
of Energy under Contract No. DE-AC02-05CH11231. Work at the Advanced Light Source (ALS)
was done at beamline 12.3.2. The ALS is a DOE Office of Science User Facility under contract no.
DE-AC02-05CH11231. The development of the online phase identification platform is supported
by the Guangzhou-HKUST(GZ) Joint Funding Program (No. 2023A03J0003). Work by the USC
group was supported by the National Science Foundation (NSF) grant numbers DMR-2227178 and
OISE-2106597. M.W. acknowledges funding by the Helmholtz Research Program “Materials and
Technologies for the Energy Transition (MTET), Topic 3: Chemical Energy Carriers”. Work by
the Empa group was supported by the Strategic Focus Area–Advanced Manufacturing (SFA–AM)
through the project Advancing manufacturability of hybrid organic–inorganic semiconductors for
large area optoelectronics (AMYS) as well as the Empa internal research call 2020. We thank BW-
Cloud, funded by the Ministry of Science, Research and Arts Baden-Württemberg, for providing
cloud server infrastructure.
References
[1] Yihao Liu, Ziheng Hu, Zhiguang Suo, Lianzhe Hu, Lingyan Feng, Xiuqing Gong, Yi Liu,
and Jincang Zhang. High-throughput experiments facilitate materials innovation: A review.
Science China Technological Sciences , 62:521–545, 2019. doi:10.1007/S11431-018-9369-9.
[2] B. P. MacLeod, F. G. L. Parlane, T. D. Morrissey, F. Häse, L. M. Roch, K. E. Dettelbach,
R. Moreira, L. P. E. Yunker, M. B. Rooney, J. R. Deeth, V. Lai, G. J. Ng, H. Situ, R. H.
Zhang, M. S. Elliott, T. H. Haley, D. J. Dvorak, A. Aspuru-Guzik, J. E. Hein, and C. P.
Berlinguette. Self-driving laboratory for accelerated discovery of thin-film materials. Science
Advances , 6(20):eaaz8867, 2020. doi:10.1126/sciadv.aaz8867.
[3] A. Ludwig. Discovery of new materials using combinatorial synthesis and high-throughput
characterization of thin-film materials libraries combined with computational methods. npj
Computational Materials , 5:1–7, 2019. doi:10.1038/s41524-019-0205-0.
[4] Yoshihiko Ozaki, Yuta Suzuki, T. Hawai, Kotaro Saito, Masaki Onishi, and K. Ono.
Automated crystal structure analysis based on blackbox optimisation. npj Computational Materials,
6:1–7, 2020. doi:10.1038/s41524-020-0330-9.
[5] Robert E. Dinnebier, Andreas Leineweber, and John S. O. Evans. Rietveld Refinement:
Practical Powder Diffraction Pattern Analysis using TOPAS. De Gruyter, 2019.
ISBN 978-3-11-045621-9.
[6] Diego Alberto Flores Cano, Anais Roxana Chino Quispe, Renzo Rueda Vellasmin, Joao
Andre Ocampo Anticona, J. González, and J. A. Ramos Guivar. Fifty years of Rietveld refinement:
Methodology and guidelines in superconductors and functional magnetic nanoadsorbents.
Revista de Investigación de Física, 2021. doi:10.15381/rif.v24i3.21028.
[7] L. B. McCusker, R. B. Von Dreele, D. E. Cox, D. Louër, and P. Scardi. Ri-
etveld refinement guidelines. Journal of Applied Crystallography , 32(1):36–50, 2 1999.
doi:10.1107/s0021889898009856.
[8] Ankit Agrawal and A. Choudhary. Deep materials informatics: Applications of deep learning
in materials science. MRS Communications , 9:779–792, 2019. doi:10.1557/MRC.2019.73.
[9] Vasile-Adrian Surdu and Romuald Győrgy. X-ray diffraction data analysis by machine learning
methods—a review. Applied Sciences, 2023. doi:10.3390/app13179992.
[10] Zhenjie Feng, Q. Hou, Y. Zheng, W. Ren, Junyi Ge, Tao Li, Cheng Cheng, Wencong Lu,
S. Cao, Jincang Zhang, and Tong-Yi Zhang. Method of artificial intelligence algorithm to
improve the automation level of rietveld refinement. Computational Materials Science , 2019.
doi:10.1016/J.COMMATSCI.2018.10.006.
[11] Hong Wang, Yunchao Xie, Dawei Li, Heng Deng, Yun-Zhi Zhao, Ming Xin, and Jian
Lin. Rapid identification of x-ray diffraction patterns based on very limited data by inter-
pretable convolutional neural networks. Journal of chemical information and modeling , 2020.
doi:10.1021/acs.jcim.0c00020.
[12] W. Park, Jiyong Chung, Jaeyoung Jung, Keemin Sohn, S. Singh, M. Pyo, N. Shin, and
K. Sohn. Classification of crystal structure using a convolutional neural network. IUCrJ, 4:
486 – 494, 2017. doi:10.1107/S205225251700714X.
[13] Byung Do Lee, Jin-Woong Lee, Junuk Ahn, Seonghwan Kim, W. Park, and K. Sohn. A
deep learning approach to powder x-ray diffraction pattern analysis: Addressing generaliz-
ability and perturbation issues simultaneously. Advanced Intelligent Systems , 5:2300140, 2023.
doi:10.1002/aisy.202300140.
[14] Henrik Schopmans, Patrick Reiser, and Pascal Friederich. Neural networks trained on syn-
thetically generated crystals can extract structural information from icsd powder x-ray diffrac-
tograms. Digital Discovery , 2(5):1414–1424, 2023. ISSN 2635-098X. doi:10.1039/d3dd00071k.
[15] Di Chen, Yiwei Bai, Sebastian Ament, Wenting Zhao, Dan Guevarra, Lan Zhou, Bart Selman,
R. Bruce van Dover, John M. Gregoire, and Carla P. Gomes. Automating crystal-structure
phase mapping by combining deep learning with constraint reasoning. Nature Machine Intel-
ligence, 3(9):812–822, September 2021. ISSN 2522-5839. doi:10.1038/s42256-021-00384-1.
[16] Ming-Chiang Chang, Sebastian Ament, Maximilian Amsler, Duncan R. Sutherland, Lan
Zhou, John M. Gregoire, Carla P. Gomes, R. Bruce van Dover, and Michael O. Thompson.
Probabilistic Phase Labeling and Lattice Refinement for Autonomous Material Research.
(arXiv:2308.07897), August 2023. doi:10.48550/arXiv.2308.07897.
[17] H. Dong, K. Butler, D. Matras, S. W. T. Price, Y. Odarchenko, Rahul Khatry, Andrew Thompson,
V. Middelkoop, S. Jacques, A. Beale, and A. Vamvakeros. A deep convolutional neural
network for real-time full profile analysis of big powder diffraction data. npj Computational
Materials, 7:1–9, 2021. doi:10.1038/s41524-021-00542-4.
[18] Sathya R. Chitturi, Daniel Ratner, Richard C. Walroth, Vivek Thampy, Evan J. Reed, Mike
Dunne, Christopher J. Tassone, and Kevin H. Stone. Automated prediction of lattice param-
eters from x-ray powder diffraction patterns. Journal of Applied Crystallography , 54:1799 –
1810, 2021.
[19] S. Habershon, E. Cheung, K. Harris, and R. Johnston. Powder diffraction indexing as a pattern
recognition problem: A new approach for unit cell determination based on an artificial neural
network. Journal of Physical Chemistry A, 108:711–716, 2004. doi:10.1021/JP0310596.
[20] Shouyang Zhang, Bin Cao, Tianhao Su, Yue Wu, Zhenjie Feng, Jie Xiong, and Tong-Yi Zhang.
Crystallographic phase identifier of a convolutional self-attention neural network (cpicann) on
powder diffraction patterns. IUCrJ, 11(Pt 4):634, 2024.
[21] Bin Cao, Yang Liu, Zinan Zheng, Ruifeng Tan, Jia Li, and Tong-yi Zhang. Simxrd-4m:
Big simulated x-ray diffraction data and crystal symmetry classification benchmark. arXiv
preprint arXiv:2406.15469 , 2024.
[22] Felipe Oviedo, Zekun Ren, Shijing Sun, C. Settens, Zhe Liu, N. T. P. Hartono, Savitha
Ramasamy, Brian L. DeCost, S. Tian, Giuseppe Romano, A. Gilad Kusne, and T. Buonassisi.
Fast and interpretable classification of small x-ray diffraction datasets using data augmentation
and deep neural networks. npj Computational Materials, 5:1–9, 2018.
doi:10.1038/s41524-019-0196-x.
[23] Pascal M. Vecsei, Kenny Choo, Johan Chang, and T. Neupert. Neural network based clas-
sification of crystal symmetries from x-ray diffraction patterns. Physical Review B , 2018.
doi:10.1103/PhysRevB.99.245120.
[24] A. N. Zaloga, V. V. Stanovov, O. E. Bezrukova, P. S. Dubinin, and I. S. Yaki-
mov. Crystal symmetry classification from powder x-ray diffraction patterns using
a convolutional neural network. Materials Today Communications , 25:101662, 2020.
doi:10.1016/j.mtcomm.2020.101662.
[25] Yuta Suzuki, H. Hino, T. Hawai, Kotaro Saito, M. Kotsugi, and K. Ono. Symmetry predic-
tion and knowledge discovery from x-ray diffraction patterns using an interpretable machine
learning approach. Scientific Reports , 10:21790, 2020. doi:10.1038/s41598-020-77474-4.
[26] Abhik Chakraborty and Raksha Sharma. A deep crystal structure identification system for
x-ray diffraction patterns. The Visual Computer, 38:1275–1282, 2021.
doi:10.1007/s00371-021-02165-8.
[27] Barbara Lafuente, R. T. Downs, H. Yang, and N. Stone. 1. The power of databases: The
RRUFF project . De Gruyter, 11 2015. doi:10.1515/9783110417104-003.
[28] S. Gates-Rector and T. Blanton. The powder diffraction file: a quality materials characteri-
zation database. Powder Diffraction , 34:352 – 360, 2019. doi:10.1017/S0885715619000812.
[29] Pierre Villars, Karin Cenzual, Roman Gladyshevskii, and Shuichi Iwata. PAULING
FILE - towards a holistic view. Chemistry of Metals and Alloys , 11(3/4):43–76, 1 2018.
doi:10.30970/cma11.0382.
[30] Thomas Armbruster and Rosa Micaela Danisi, editors. Highlights in Mineralogical Crystal-
lography. De Gruyter, 2015. ISBN 9783110417104.
[31] Yoshio Waseda, Eiichiro Matsubara, and Kozo Shinoda. X-Ray Diffraction Crystallography .
Springer, 2011. doi:10.1007/978-3-642-16635-8.
[32] Vitalij Pecharsky and Peter Zavalij. Fundamentals of Powder Diffraction and Structural Char-
acterization of Materials . Springer, 2023. doi:10.1007/b106242.
[33] Benjamin S. Hulbert and Waltraud M. Kriven. Specimen-displacement correction for powder
X-ray diffraction in Debye–Scherrer geometry with a flat area detector. Journal of applied
crystallography , 56(1):160–166, 2 2023. doi:10.1107/s1600576722011360.
[34] Fuzhen Zhuang, Zhiyuan Qi, Keyu Duan, Dongbo Xi, Yongchun Zhu, Hengshu Zhu, Hui
Xiong, and Qing He. A comprehensive survey on transfer learning. Proceedings of the IEEE ,
109:43–76, 2021. doi:10.1109/jproc.2020.3004555.
[35] Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. Image style transfer using convo-
lutional neural networks. In 2016 IEEE Conference on Computer Vision and Pattern Recog-
nition (CVPR) , pages 2414–2423, 2016. doi:10.1109/CVPR.2016.265.
[36] Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation.
In Proceedings of the 32nd International Conference on International Conference on Machine
Learning - Volume 37 , ICML’15, page 1180–1189. JMLR.org, 2015.
[37] Khawla Seddiki, Philippe Saudemont, Frédéric Precioso, Nina Ogrinc, Maxence Wisztorski,
Michel Salzet, Isabelle Fournier, and Arnaud Droit. Cumulative learning enables convolu-
tional neural network representations for small mass spectrometry data classification. Nature
Communications , 11, 2020. doi:10.1038/s41467-020-19354-z.
[38] M. Aranda. Sharing powder diffraction raw data: challenges and benefits. Journal of Applied
Crystallography , 2018. doi:10.1107/S160057671801556X.
[39] Loes M. J. Kroon-Batenburg, Matthew P. Lightfoot, Natalie T. Johnson, and John R.
Helliwell. Raw diffraction data and reproducibility. Structural Dynamics , 11, 2024.
doi:10.1063/4.0000232.
[40] Jin-Woong Lee, Woon Bae Park, Jin Hee Lee, Satendra Pal Singh, and Kee-Sun Sohn. A
deep-learning technique for phase identification in multiphase inorganic compounds using
synthetic xrd powder patterns. Nature Communications , 11(1):86, Jan 2020. ISSN 2041-1723.
doi:10.1038/s41467-019-13749-3.
[41] Byung Do Lee, Jin-Woong Lee, Woon Bae Park, Joonseo Park, Min-Young Cho, Satendra
Pal Singh, Myoungho Pyo, and Kee-Sun Sohn. Powder x-ray diffraction pattern is all you
need for machine-learning-based symmetry identification and property prediction. Advanced
Intelligent Systems , 4(7):2200042, 2022. doi:10.1002/aisy.202200042.
[42] Jason R. Hattrick-Simpers, Brian DeCost, A. Gilad Kusne, Howie Joress, Winnie Wong-Ng,
Debra L. Kaiser, Andriy Zakutayev, Caleb Phillips, Shijing Sun, Janak Thapa, Heshan Yu,
Ichiro Takeuchi, and Tonio Buonassisi. An Open Combinatorial Diffraction Dataset Including
Consensus Human and Machine Learning Labels with Quantified Uncertainty for Training
New Machine Learning Models. Integrating materials and manufacturing innovation , 10(2):
311–318, 6 2021. doi:10.1007/s40192-021-00213-8.
[43] Jerardo E. Salgado, Samuel Lerman, Zhaotong Du, Chenliang Xu, and Niaz Abdolrahim.
Automated classification of big x-ray diffraction data using deep learning models. npj Com-
putational Materials , 9(1):214, Dec 2023. ISSN 2057-3960. doi:10.1038/s41524-023-01164-8.
[44] Jan Schuetzke, Simon Schweidler, Friedrich R. Muenke, Andre Orth, Anurag D. Khan-
delwal, Ben Breitung, Jasmin Aghassi-Hagmann, and Markus Reischl. Accelerat-
ing materials discovery: Automated identification of prospects from x-ray diffraction
data in fast screening experiments. Advanced Intelligent Systems , 6(3):2300501, 2024.
doi:10.1002/aisy.202300501.
[45] Pauling File project. Linus pauling file product descriptions. https://web.archive.
org/web/20240221221553/https://paulingfile.com/index.php?p=products#PAULING%
20FILE%20products , 2024. [Accessed: 27.11.24].
[46] ASM International. Pearson’s crystal data product description. https://web.
archive.org/web/20240617123612/https://www.crystalimpact.com/pcd/ , 2024. [Ac-
cessed: 27.11.2024].
[47] P. Villars. Mpds access link. https://mpds.io/#start , 2024. [Accessed: 04.12.2024].
[48] Crystal Impact. Pearson’s crystal data product offering. https://shop-crystalimpact.de/
en/p/pearson-s-crystal-data-one-year-single-license , 2024. [Accessed: 10.12.2024].
[49] Pauling File project. Mpds api product description. https://mpds.io/#products , 2024.
[Accessed: 10.12.2024].
[50] ICDD. Pdf5 product description. https://www.icdd.com/pdf-5/ , 2024. [Accessed:
27.11.2024].
[51] ICDD. Pdf5+ license. https://www.icdd.com/licensing-process/
#1528471154226-933e5cc6-8da7 , 2025. [Accessed: 07.03.2025].
[52] University of Arizona Department of Geosciences. Rruff access link. https://web.archive.
org/web/20241007175010/https://rruff.info/about/about_general.php , 2024. [Ac-
cessed: 27.11.24].
[53] COD maintainers. Crystallography open database. https://www.crystallography.net/
cod/, 2024. [Accessed: 27.11.24].
[54] Saulius Gražulis, Daniel Chateigner, Robert T. Downs, A. F. T. Yokochi, Miguel Quirós,
Luca Lutterotti, Elena Manakova, Justas Butkus, Peter Moeck, and Armel Le Bail. Crystal-
lography open database – an open-access collection of crystal structures. Journal of Applied
Crystallography , 42(4):726–729, May 2009. ISSN 0021-8898. doi:10.1107/s0021889809016690.
[55] Armel Le Bail. Powbase. http://www.cristal.org/powbase/index.html , 2025. [Accessed:
07.03.25].
[56] Andriy Zakutayev, Nick Wunder, Marcus Schwarting, John D. Perkins, Robert White, Kristin
Munch, William Tumas, and Caleb Phillips. An open experimental database for exploring
inorganic materials. Scientific Data , 5(1), 4 2018. doi:10.1038/sdata.2018.53.
[57] National Renewable Energy Laboratory. High-throughput experimental database statistics.
https://htem.nrel.gov/stats , 2025. [Accessed: 07.03.2025].
[58] FIZ Karlsruhe. Icsd access link. https://icsd.products.fiz-karlsruhe.de/ , 2024. [Ac-
cessed: 27.11.24].
[59] Cambridge Crystallographic Data Centre. Cambridge structural database access link. https:
//www.ccdc.cam.ac.uk/structures/ , 2024. [Accessed: 27.11.24].
[60] Materials Project. Materials project database website access link. https://next-gen.
materialsproject.org/ , 2024. [Accessed: 27.11.24].
[61] Russian Academy of Sciences Institute of Experimental Mineralogy. Crystallographic and
crystallochemical database website access link. https://database.iem.ac.ru/mincryst/
index.php , 2024. [Accessed: 27.11.24].
[62] University of the Basque Country. Bilbao incommensurate crystal structure database access
link. https://www.cryst.ehu.eus/bincstrdb/search/ , 2024. [Accessed: 27.11.24].
[63] David Barthelmy. Mineralogy database access link. https://webmineral.com/ , 2024. [Ac-
cessed: 27.11.24].
[64] International Union of Crystallography (IUCr). Iucr raw data letters access link. https:
//iucrdata.iucr.org/x/index.html , 2024. [Accessed: 27.11.24].
[65] U.S. Naval Research Laboratory. Crystal lattice-structures access link. https://www.
atomic-scale-physics.de/lattice/ , 2024. [Accessed: 27.11.24].
[66] Pierre Perroud. Athena mineral database access link. https://athena.unige.ch/athena/
mineral/mineral.html , 2024. [Accessed: 27.11.24].
[67] Research Collaboratory for Structural Bioinformatics. Protein data bank access link. https:
//www.rcsb.org/ , 2024. [Accessed: 27.11.24].
[68] Ian T. Jolliffe and Jorge Cadima. Principal component analysis: a review and recent de-
velopments. Philosophical Transactions of the Royal Society A: Mathematical, Physical and
Engineering Sciences , 374:20150202, 2016. doi:10.1098/rsta.2015.0202.
[69] S. Gražulis, D. Chateigner, R. Downs, A. Yokochi, M. Quirós, L. Lutterotti, E. Man-
akova, J. Butkus, P. Moeck, and A. L. Bail. Crystallography open database – an open-
access collection of crystal structures. Journal of Applied Crystallography , 42:726–729, 2009.
doi:10.1107/S0021889809016690.
[70] Antanas Vaitkus, Andrius Merkys, Thomas Sander, Miguel Quirós, Paul A. Thiessen, Evan E.
Bolton, and Saulius Gražulis. A workflow for deriving chemical entities from crystallographic
data and its application to the Crystallography Open Database. Journal of Cheminformatics ,
15(1):123, Dec 2023. doi:10.1186/s13321-023-00780-2.
[71] Siarhei Zhuk, Andrey A. Kistanov, Simon C. Boehme, Noémie Ott, Fabio La Mattina,
Michael Stiefel, Maksym V. Kovalenko, and Sebastian Siol. Synthesis and characteriza-
tion of the ternary nitride semiconductor zn2vn3: Theoretical prediction, combinatorial
screening, and epitaxial stabilization. Chemistry of Materials , 33(23):9306–9316, 2021.
doi:10.1021/acs.chemmater.1c03025.
[72] Alexander Wieczorek, Huagui Lai, Johnpaul Pious, Fan Fu, and Sebastian Siol.
Resolving oxidation states and x–site composition of sn perovskites through auger
parameter analysis in xps. Advanced Materials Interfaces , 10(7):2201828, 2023.
doi:10.1002/admi.202201828.
[73] Alexander Wieczorek, Austin G. Kuba, Jan Sommerhäuser, Luis Nicklaus Caceres, Chris-
tian M. Wolff, and Sebastian Siol. Advancing high-throughput combinatorial aging studies
of hybrid perovskite thin films via precise automated characterization methods and machine
learning assisted analysis. J. Mater. Chem. A , 12:7025–7035, 2024. doi:10.1039/D3TA07274F.
[74] Moritz Wolf, Stephen J. Roberts, Wijnand Marquart, Ezra J. Olivier, Niels T. J. Luchters,
Emma K. Gibson, C. Richard A. Catlow, Jan. H. Neethling, Nico Fischer, and Michael Claeys.
Synthesis, characterisation and water–gas shift activity of nano-particulate mixed-metal (Al,
Ti) cobalt oxides. Dalton Transactions , 48(36):13858–13868, 1 2019. doi:10.1039/c9dt01634a.
[75] Moritz Wolf, Nico Fischer, and Michael Claeys. Surfactant-free synthesis of monodisperse
cobalt oxide nanoparticles of tunable size and oxidation state developed by factorial design.
Materials Chemistry and Physics , 213:305–312, 2018. doi:10.1016/j.matchemphys.2018.04.021.
[76] Jiro Nishinaga, Takehiko Nagai, Takeyoshi Sugaya, Hajime Shibata, and Shigeru Niki.
Single-crystal Cu(In,Ga)Se2 solar cells grown on GaAs substrates. Applied Physics Express , 11(8):
082302, 7 2018. doi:10.7567/apex.11.082302.
[77] M. D. Heinemann, R. Mainz, F. Österle, H. Rodriguez-Alvarez, D. Greiner, C. A. Kaufmann,
and T. Unold. Evolution of opto-electronic properties during film formation of complex semi-
conductors. Scientific Reports , 7(1), 4 2017. doi:10.1038/srep45463.
[78] Tze-Bin Song, Zhenghao Yuan, Megumi Mori, Faizan Motiwala, Gideon Segev, Eloïse Masquelier,
Camelia V. Stan, Jonathan L. Slack, Nobumichi Tamura, and Carolin M. Sutter-Fella.
Revealing the dynamics of hybrid metal halide perovskite formation via multimodal in situ
probes. Advanced Functional Materials , 30(6), 12 2019. doi:10.1002/adfm.201908337.
Supporting information for opXRD: Open Experimental
Powder X-ray Diffraction Database
Daniel Hollarek, Henrik Schopmans, Jona Östreicher, Jonas Teufel, Bin Cao, Adie Alwen, Simon
Schweidler, Mriganka Singh, Tim Kodalle, Hanlin Hu, Gregoire Heymans, Maged Abdelsamie,
Alexander Wieczorek, Siarhei Zhuk, Arthur Hardiagon, Ruth Schwaiger, François-Xavier Coudert,
Moritz Wolf, Sebastian Siol, Carolin M. Sutter-Fella, Ben Breitung, Andrea M. Hodge, Tong-yi
Zhang, and Pascal Friederich*
*Corresponding author: pascal.friederich@kit.edu
S1 Description of opXRD files on Zenodo
The database comes in two zip archives, “opxrd.zip” and “opxrd_in_situ.zip”. The latter contains
the in-situ data with highly correlated patterns recorded through time series measurements. Within
the .zip archives patterns are saved as .json files grouped in folders indicating the contributing
institution. If an institution contributed data from several projects, the contributed data is further
divided into folders indicating the research project. These research project folders are labeled
alphabetically in the order they are introduced in Section 3. Each .json file contains a pattern
recorded from an X-ray diffraction experiment. If available, the composition and structure of the
investigated sample and experiment conditions are also included in this file. Patterns belonging
to time series measurements are labeled with filenames that indicate the measurement series they
belong to and their order in that series.
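The folder layout described above can be traversed with the Python standard library alone. The following is a minimal sketch, assuming one of the archives has already been extracted to a local directory; the function names index_patterns and read_pattern are hypothetical and not part of the opXRD library, and no assumption is made about the field names inside each .json file.

```python
import json
from collections import defaultdict
from pathlib import Path


def index_patterns(extracted_root):
    """Group every .json pattern file by its top-level (institution) folder.

    Project subfolders, where present, are searched recursively and their
    files are listed under the institution they belong to.
    """
    root = Path(extracted_root)
    index = defaultdict(list)
    for path in sorted(root.rglob("*.json")):
        relative = path.relative_to(root)
        # Files lying directly in the root have no institution folder; skip them.
        if len(relative.parts) < 2:
            continue
        index[relative.parts[0]].append(path)
    return dict(index)


def read_pattern(path):
    """Parse a single pattern file and return its content as Python objects."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

Calling index_patterns on the extracted directory returns a dictionary whose keys are the institution folder names, which gives a quick overview of how many patterns each contributor provided.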
S2 opXRD Python library usage
The opXRD Python library allows the dataset to be accessed through one simple command:
OpXRD.load(root_dirpath). If the database is locally available under root_dirpath, this command
loads the database from this location. If the database is not available locally at this location, it
is automatically downloaded to root_dirpath.
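The load-or-download behaviour described above follows a common caching pattern. The sketch below illustrates that pattern in isolation and is not the library's implementation: load_or_fetch and its fetch argument are hypothetical names, and the actual download step is left to the caller.

```python
from pathlib import Path


def load_or_fetch(root_dirpath, fetch):
    """Return root_dirpath, calling fetch(path) first if no local copy exists.

    `fetch` is a hypothetical callable standing in for the download step;
    it receives the target directory and is expected to populate it.
    """
    root = Path(root_dirpath)
    # A directory that exists and is non-empty is treated as a local copy.
    has_local_copy = root.is_dir() and any(root.iterdir())
    if not has_local_copy:
        root.mkdir(parents=True, exist_ok=True)
        fetch(root)
    return root
```

With this structure, a second call with the same root_dirpath skips the download, which matches the behaviour stated for OpXRD.load(root_dirpath).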
arXiv:2503.05577v2 [cond-mat.mtrl-sci] 10 Mar 2025
S3 Combined pattern plots
Figure 1 shows 50 randomly selected samples of the X-ray diffraction patterns found in each of
the research projects contributed to the opXRD database.
Figure 1: 50 randomly chosen X-ray diffraction patterns from each contributed dataset. The figure
shows data from the following datasets: a) EMPA, b) LBNL-A, c) LBNL-B, d) LBNL-C, e) USC,
f) INT, g) HKUST-A, h) HKUST-B, i) CNRS, j) IKFT.