+ Statement of need
+ In the field of materials informatics, where materials science
+ intersects with machine learning, benchmarks play a crucial role in
+ assessing model performance and enabling fair comparisons among
+ various tools and models. Typically, these benchmarks focus on
+ evaluating the accuracy of predictive models for materials properties,
+ utilizing well-established metrics such as mean absolute error and
+ root-mean-square error to measure performance against actual
+ measurements. A standard practice involves splitting the data into two
+ parts, with one serving as training data for model development and the
+ other as test data for assessing performance
+ (Dunn
+ et al., 2020).
+ However, benchmarking generative models, which aim to create
+ entirely new data rather than focusing solely on predictive accuracy,
+ presents unique challenges. While significant progress has been made
+ in standardizing benchmarks for tasks like image generation and
+ molecule synthesis, the field of crystal structure generative modeling
+ lacks this level of standardization (this is separate from machine
+ learning interatomic potentials, which have the robust and
+ comprehensive
+ matbench-discovery
+ (Riebesell
+ et al., 2024) and
+ JARVIS
+ Leaderboard benchmarking frameworks
+ (Choudhary
+ et al., 2023)). Molecular generative modeling benefits from
+ widely adopted benchmark platforms such as Guacamol
+ (Brown
+ et al., 2019) and Moses
+ (Polykovskiy
+ et al., 2020), which offer easy installation, usage guidelines,
+ and leaderboards for tracking progress. In contrast, existing
+ evaluations in crystal structure generative modeling, as seen in CDVAE
+ (Xie
+ et al., 2022), FTCP
+ (Ren
+ et al., 2022), PGCGM
+ (Zhao
+ et al., 2023), CubicGAN
+ (Zhao
+ et al., 2021), and CrysTens
+ (Alverson
+ et al., 2022), are not standardized, can be difficult to install and
+ apply to new models and datasets, and lack publicly accessible
+ leaderboards. While these evaluations are valuable
+ within their respective scopes, there is a clear need for a dedicated
+ benchmarking platform to promote standardization and facilitate robust
+ comparisons.
+ In this work, we introduce
+ matbench-genmetrics, a materials benchmarking
+ platform for crystal structure generative models. We use concepts from
+ molecular generative modeling benchmarking to create a set of
+ evaluation metrics—validity, coverage, novelty, and uniqueness—which
+ are broadly defined as follows:
+
+ - Validity: a measure of how well the generated materials match the
+   distribution of the training dataset
+ - Coverage: the ability to successfully predict known materials which
+   have been held out
+ - Novelty: generated structures which closely match examples in the
+   training set are penalized
+ - Uniqueness: the number of repeat structures within the generated set
+
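+ Both the novelty and uniqueness metrics reduce to structure matching.
+ As a rough illustration (a minimal sketch using pymatgen's
+ StructureMatcher with illustrative tolerance settings, not the exact
+ matbench-genmetrics implementation), they can be framed as follows:
+
+ ```python
+ from itertools import combinations
+
+ from pymatgen.analysis.structure_matcher import StructureMatcher
+
+ # Illustrative tolerances; the package's defaults may differ.
+ matcher = StructureMatcher(ltol=0.3, stol=0.5, angle_tol=10.0)
+
+
+ def uniqueness(generated):
+     """Fraction of generated-structure pairs that are not duplicates."""
+     pairs = list(combinations(generated, 2))
+     if not pairs:
+         return 1.0
+     n_duplicate = sum(matcher.fit(a, b) for a, b in pairs)
+     return 1.0 - n_duplicate / len(pairs)
+
+
+ def novelty(generated, training):
+     """Fraction of generated structures with no match in the training set."""
+     n_matched = sum(
+         any(matcher.fit(gen, ref) for ref in training) for gen in generated
+     )
+     return 1.0 - n_matched / len(generated)
+ ```
+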
+ matbench-genmetrics comprises two
+ namespace packages. The first namespace package is
+ matbench_genmetrics.core, which provides the
+ following features:
+
+ - GenMatcher: A class for calculating matches between two sets of
+   structures
+ - GenMetrics: A class for calculating validity, coverage, novelty, and
+   uniqueness metrics
+ - MPTSMetrics: A class for loading mp_time_split data, calculating
+   time-series cross-validation metrics, and saving results
+ - Fixed benchmark classes for 10, 100, 1000, and 10000 generated
+   structures
+
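+ A typical benchmark run loops over the predefined cross-validation
+ folds, generates structures for each fold, and records the four
+ metrics. The sketch below is based on the package's documented
+ interface; exact import paths and method names may differ between
+ versions, and the stand-in "generator" simply reuses training
+ structures, which a real model would of course replace:
+
+ ```python
+ from matbench_genmetrics.core.metrics import MPTSMetrics10
+
+ # Benchmark against 10 generated structures per fold; dummy=True loads a
+ # small subset of the data for quick experimentation.
+ mptm = MPTSMetrics10(dummy=True)
+
+ for fold in mptm.folds:
+     train_val_inputs = mptm.get_train_and_val_data(fold)
+
+     # Stand-in for a generative model: reuse 10 training structures.
+     # Copying training data like this would score zero on novelty.
+     gen_structures = list(train_val_inputs)[:10]
+
+     mptm.evaluate_and_record(fold, gen_structures)
+
+ print(mptm.recorded_metrics)
+ ```
+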
+ Additionally, we introduce the
+ matbench_genmetrics.mp_time_split namespace
+ package as a complement to
+ matbench_genmetrics.core. It provides a
+ standardized dataset and cross-validation splits for evaluating the
+ four metrics described above. Time-based splits have been used in
+ materials informatics model validation, such as predicting future
+ thermoelectric materials via word embeddings
+ (Tshitoyan
+ et al., 2019), searching for efficient solar photoabsorption
+ materials through multi-fidelity optimization
+ (Palizhati
+ et al., 2022), and predicting future materials stability trends
+ via network models
+ (Aykol
+ et al., 2019). Recently, Hu et al.
+ (Zhao
+ et al., 2023) used what they call a rediscovery metric,
+ referred to here as a coverage metric in line with molecular
+ benchmarking terminology, to evaluate crystal structure generative
+ models. While time-series splitting wasn’t used, they showed that
+ after generating millions of structures, only a small percentage of
+ held-out structures had matches. These results highlight the
+ difficulty (and robustness) of coverage tasks. By leveraging timeline
+ metadata from the Materials Project database
+ (Jain
+ et al., 2013) and creating a standard time-series splitting of
+ data, matbench_genmetrics.mp_time_split enables
+ rigorous evaluation of future discovery performance.
+ The matbench_genmetrics.mp_time_split
+ namespace package provides the following features:
+
+ - downloading and storing snapshots of Materials Project crystal
+   structures via pymatgen (Ong et al., 2013)
+ - modification of pymatgen search criteria to fetch custom datasets
+ - utilities for post-processing Materials Project entries
+ - convenience methods to access the snapshot dataset
+ - predefined scikit-learn TimeSeriesSplit cross-validation splits
+   (Pedregosa et al., 2011)
+
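+ A minimal sketch of loading the snapshot dataset and iterating over
+ the predefined time-series folds is shown below; the module path,
+ keyword names, and the "energy_above_hull" target follow the package
+ documentation at the time of writing and may differ between versions:
+
+ ```python
+ from matbench_genmetrics.mp_time_split.splitter import MPTimeSplit
+
+ # Load the snapshot of Materials Project structures and expose
+ # predefined TimeSeriesSplit folds ordered chronologically.
+ mpt = MPTimeSplit(target="energy_above_hull")
+ mpt.load(dummy=True)  # dummy=True loads a small subset for quick testing
+
+ for fold in mpt.folds:
+     train_inputs, val_inputs, train_outputs, val_outputs = (
+         mpt.get_train_and_val_data(fold)
+     )
+     print(f"fold {fold}: {len(train_inputs)} train, {len(val_inputs)} val")
+ ```
+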
+ In future work, metrics will serve as multi-criteria filters to
+ prevent manipulation. Standalone metrics can be “hacked” by generating
+ nonsensical structures for novelty or including training structures to
+ inflate validity scores. To address this, multiple criteria will be
+ considered simultaneously for each generated structure, such as
+ novelty, uniqueness, and filtering rules like non-overlapping atoms,
+ stoichiometry, or checkCIF criteria
+ (Spek,
+ 2020). Additional filters based on machine learning models can
+ be applied for properties like negative formation energy, energy above
+ hull, ICSD classification, and coordination number. Applying
+ machine-learning-based structural relaxation using M3GNet
+ (Chen
+ & Ong, 2022) (e.g., as in CrysTens
+ (Alverson
+ et al., 2022)) before filtering is also of interest.
+ Contributions related to multi-criteria filtering, enhanced validity
+ filters, and implementing a benchmark submission system and public
+ leaderboard are welcome.
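+
+ As a hypothetical illustration of such multi-criteria filtering (not
+ part of the current package), a generated structure could be required
+ to pass a simple overlapping-atom check and a novelty check before
+ being counted, with the distance threshold chosen for illustration
+ only:
+
+ ```python
+ import numpy as np
+ from pymatgen.analysis.structure_matcher import StructureMatcher
+
+ matcher = StructureMatcher()
+
+
+ def has_no_overlapping_atoms(structure, min_dist=0.5):
+     """Reject structures with any two sites closer than min_dist angstroms."""
+     dists = structure.distance_matrix
+     # Ignore the zero self-distances on the diagonal.
+     off_diagonal = dists[~np.eye(len(structure), dtype=bool)]
+     return off_diagonal.size == 0 or off_diagonal.min() >= min_dist
+
+
+ def passes_filters(structure, training_structures):
+     """Require the structure to be both physically sensible and novel."""
+     is_novel = not any(matcher.fit(structure, ref) for ref in training_structures)
+     return has_no_overlapping_atoms(structure) and is_novel
+ ```
+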
+ We believe that the matbench-genmetrics
+ ecosystem is a robust and easy-to-use benchmarking platform that will
+ help propel novel materials discovery and targeted crystal structure
+ inverse design. We hope that practitioners of crystal structure
+ generative modeling will adopt
+ matbench-genmetrics, contribute improvements
+ and ideas, and submit their results to the planned public
+ leaderboard.
+
+