# Quantifying the performance of machine learning models in materials discovery [code repository]

This repository contains data and processing scripts to reproduce the work presented in the article: *Quantifying the performance of machine learning models in materials discovery*, Borg et al., arXiv:2210.13587 [cond-mat.mtrl-sci] (2022). DOI: 10.48550/arXiv.2210.13587.

## Simulated Sequential Learning (SL) Quickstart

1. Install the required packages (the example below uses Anaconda):

   ```sh
   conda create -n [ENV_NAME] pip numpy
   conda activate [ENV_NAME]
   pip install -r requirements.txt
   ```
2. Set up the configuration files:
   - To perform a simulated SL run, create an SL configuration file (e.g. `test.yaml`) and a dataset configuration file (e.g. `matbench_expt_gap_test.yaml`). These files define the parameters for parsing a dataset and configuring the SL run; a minimal sketch of a dataset configuration is shown after this list.
   - This repo is currently set up to create datasets from Matbench and Starrydata2 for design challenges that connect chemical compositions (i.e. chemical formulas) to a real-valued physical property.
     - Matbench: the latest Matbench dataset will be queried and returned.
     - Starrydata2: uses data queried in August 2021. Processing is defined in Starrydata processing.
3. Run `1-execute_sl_workflow.ipynb`. The path(s) to the configuration file(s) can be set in cell 2.

4. Run `2-quickplot.ipynb`. Quickplot takes a single SL run as input (i.e. for one target range) and generates a figure with six subplots:
   - (a) discovery yield as a function of iteration
   - (b) model error as a function of iteration
   - (c) discovery probability as a function of iteration
   - (d-f) discovery acceleration factor for n = 1, 3, and 5 target materials
5. Scripts to generate the figures shown in the manuscript are stored in `simulated_SL`, with a separate script for each figure.
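
A minimal, hypothetical sketch of a dataset configuration file (key names follow the parameter reference below; the values are illustrative assumptions, not a file shipped with the repo):

```yaml
# matbench_expt_gap_test.yaml -- hypothetical dataset configuration sketch.
# Key names follow the "Configuration file parameters" reference below;
# the values are illustrative assumptions, not the repo's actual schema.
dataset: matbench_expt_gap   # input dataset; processing defined in load_datasets.py
output: "gap expt"           # output property (must be a column in the dataset)
categoricals: null           # no categorical features for this task
```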


## Configuration file parameters

- Dataset parameters:
  - `dataset` (str): name of the input dataset to be processed (processing steps are defined in `load_datasets.py`)
  - `output` (str): output property (must be a column in the dataset)
  - `categoricals` (str, null): categorical features


- Starrydata-specific parameters:
  - `comp_class` (str, null): selects a subset of records based on composition (e.g. '111-type') using logic we have predefined here.
  - `material_family` (str, null): Starrydata-generated label for the material family.
  - `filtered` (True/False): filters Starrydata records based on physically relevant property values (e.g. keeping records where ZT < 2).
  - `sample_form` (str, null): filters Starrydata records based on sample form (e.g. 'bulk').


- SL parameters:
  - `n_sample` (int): number of datapoints to sample/downselect from the raw data. Set to 0 to use the full dataset.
  - `n_training` (int): number of training rows used to start the SL process.
  - `iterations` (int): number of SL iterations to perform.
  - `trials` (int): number of trials (i.e. independent SL processes) to perform.
  - `batch` (int): number of candidates to select at each SL iteration.
  - `discovery_break_number` (int): number of candidates to find before halting the SL process. If set to 0, SL continues for the full number of `iterations`.
  - `poi` (str, null): point of interest; index of a point that is forced into the training set. Typically set to null.
  - `holdout_fraction` (float): fraction of the dataset to hold out (test).
  - `targets` (list): min and max of the target range, e.g. [90, 100] targets 10th-decile materials.
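
As a hypothetical sketch, an SL configuration combining the parameters above might look like this (the values are illustrative assumptions, not defaults from the repo):

```yaml
# test.yaml -- hypothetical SL configuration sketch.
# Key names come from the parameter list above; the values are
# illustrative assumptions, not defaults shipped with the repo.
n_sample: 0                  # 0 = use the full dataset
n_training: 10               # rows in the initial training set
iterations: 50               # SL iterations per trial
trials: 20                   # independent SL processes to run
batch: 1                     # candidates selected per SL iteration
discovery_break_number: 0    # 0 = run all iterations without halting early
poi: null                    # no forced "point of interest"
holdout_fraction: 0.2        # fraction of the dataset held out for testing
targets: [90, 100]           # target range; [90, 100] targets 10th-decile materials
```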