Python version: 3.8
Install the required packages:
pip install -r requirements.txt
Python version: 3.8
Install the required packages denoising_diffusion_pytorch, rdkit, deepchem and transformers:
pip install rdkit deepchem transformers
cd diffusion1D/model
pip install -e .
Python version: 3.8
Install the required packages diffusionLM, transformers (customized) and others:
pip install mpi4py
pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu113
pip install -e diffusionLM/improved-diffusion/
pip install -e diffusionLM/transformers/
pip install spacy==3.2.6
pip install datasets==2.0.0
pip install huggingface_hub==0.16.4
pip install wandb deepchem torchsummary
Prepare the training data as a .csv file with two columns, separated by a tab ("\t"):
- 1st column: "mol_smiles" (SMILES code for the monomer)
- 2nd column: "conductivity" ("1" is high conductivity, "0" is low conductivity)
- The datasets are stored in .json format; see `diffusionLM/datasets` for examples.
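
As a rough illustration of the two-column format described above, the sketch below writes and reads a tab-separated file with Python's standard `csv` module. It assumes the file has a header row carrying the column names `mol_smiles` and `conductivity`; the file name `example_train.csv` and the SMILES strings are placeholders, not data from this repository.

```python
import csv

# Placeholder rows for illustration only; not taken from the PolyGen datasets.
EXAMPLE_ROWS = [
    {"mol_smiles": "CCO", "conductivity": "1"},   # "1" = high conductivity
    {"mol_smiles": "CCOC", "conductivity": "0"},  # "0" = low conductivity
]

FIELDNAMES = ["mol_smiles", "conductivity"]


def write_training_csv(path, rows):
    """Write rows as a tab-separated file with a header row (assumed layout)."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDNAMES, delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)


def read_training_csv(path):
    """Read the file back and check that every label is "0" or "1"."""
    with open(path, newline="") as handle:
        rows = list(csv.DictReader(handle, delimiter="\t"))
    for row in rows:
        if row["conductivity"] not in {"0", "1"}:
            raise ValueError(f"unexpected conductivity label: {row['conductivity']!r}")
    return rows


if __name__ == "__main__":
    write_training_csv("example_train.csv", EXAMPLE_ROWS)
    for row in read_training_csv("example_train.csv"):
        print(row["mol_smiles"], row["conductivity"])
```

The notebooks may load the data differently (for example with pandas); this sketch only shows the expected column layout.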
- data preprocessing (data_config)
- build the model (model_config)
- train the model (train_config)
- generate candidates (generate_config)
- evaluation (6 metrics: validity, novelty, uniqueness, synthesizability, similarity and diversity)
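
To make three of the evaluation metrics concrete, here is a minimal sketch that uses commonly used definitions. These are assumptions, not necessarily the ones implemented in the notebooks: validity is the fraction of generated strings that pass a validity check, uniqueness is the fraction of distinct strings among the valid ones, and novelty is the fraction of distinct valid strings absent from the training set. The function name `generation_metrics` and the `is_valid` callback are hypothetical. With RDKit, a typical check would be `Chem.MolFromSmiles(s) is not None`.

```python
from typing import Callable, Dict, Iterable


def generation_metrics(
    generated: Iterable[str],
    training: Iterable[str],
    is_valid: Callable[[str], bool],
) -> Dict[str, float]:
    """Compute validity, uniqueness and novelty under the assumed definitions."""
    generated = list(generated)
    if not generated:
        raise ValueError("no generated samples to evaluate")

    valid = [s for s in generated if is_valid(s)]
    unique = set(valid)
    # Plain string comparison: both sets should hold canonical SMILES,
    # otherwise the same molecule written two ways counts as novel.
    novel = unique - set(training)

    return {
        "validity": len(valid) / len(generated),
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }


if __name__ == "__main__":
    # Toy validity check (balanced parentheses), used only to keep the demo
    # free of external dependencies; it is not a real SMILES validator.
    def toy_check(s: str) -> bool:
        return s.count("(") == s.count(")")

    print(generation_metrics(["CCO", "CCO", "C(", "CCN"], ["CCN"], toy_check))
```

Synthesizability, similarity and diversity need a cheminformatics toolkit and are not sketched here.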
The demos are shown in `minGPT_pipeline.ipynb`, `diffusion1D_pipeline.ipynb`, and `diffusionLM_pipeline.ipynb`.
- For `minGPT_pipeline.ipynb` and `diffusion1D_pipeline.ipynb`, all the steps in the pipeline can be executed in the notebook.
- For `diffusionLM_pipeline.ipynb`, the notebook generates the bash scripts for training and generation. The scripts will be stored under `diffusionLM/improved-diffusion`.
To run the training:
cd diffusionLM/improved-diffusion
bash train_conditional.sh  # or: bash train_unconditional.sh
The model checkpoints will be stored in `diffusionLM/improved-diffusion/diffusion_models`.
To run the generation:
cd diffusionLM/improved-diffusion
bash generate_conditional.sh  # or: bash generate_unconditional.sh
The generated output will be stored in `diffusionLM/improved-diffusion/generation_outputs`.
The checkpoints of the pretrained minGPT model at different epochs can be obtained here: https://drive.google.com/drive/folders/1M1VjgUnFDospbmVSnr17JdUcUa-_4O79?usp=sharing. Please put the checkpoint files under `minGPT/ckpts/`.
The checkpoints of the pretrained diffusion1D model at different epochs can be obtained here: https://drive.google.com/drive/folders/1kFnKtnmuQLTNDZ7BJG2ZhoJKGWoXlI--?usp=sharing. Please put the checkpoint files under `diffusion1D/ckpts/`.
The checkpoints of the pretrained diffusionLM model at different epochs can be obtained here: https://drive.google.com/drive/folders/1ndLNhRZu8TL2Ni7VL8Q9GRAeX9fFVOq0?usp=sharing. Please put the whole checkpoint folder and its files under `diffusionLM/improved-diffusion/diffusion_models/`.
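
After downloading, a quick check like the one below can confirm that the three checkpoint directories named above exist and are not empty. This is only a sketch: the helper name `missing_checkpoint_dirs` is hypothetical, and it assumes the script is run from (or pointed at) the repository root. It does not verify the checkpoint files themselves.

```python
from pathlib import Path

# Checkpoint locations as listed in this README, relative to the repo root.
CHECKPOINT_DIRS = [
    "minGPT/ckpts",
    "diffusion1D/ckpts",
    "diffusionLM/improved-diffusion/diffusion_models",
]


def missing_checkpoint_dirs(repo_root="."):
    """Return the checkpoint directories that are absent or empty."""
    root = Path(repo_root)
    missing = []
    for relative in CHECKPOINT_DIRS:
        directory = root / relative
        if not directory.is_dir() or not any(directory.iterdir()):
            missing.append(relative)
    return missing


if __name__ == "__main__":
    missing = missing_checkpoint_dirs(".")
    if missing:
        print("Missing or empty checkpoint directories:")
        for relative in missing:
            print("  " + relative)
    else:
        print("All checkpoint directories contain files.")
```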
The GitHub repositories referenced for this code are:
https://github.com/karpathy/minGPT
https://github.com/lucidrains/denoising-diffusion-pytorch
https://github.com/XiangLi1999/Diffusion-LM
In this work, we copied the minGPT model from the original repository by Karpathy at https://github.com/karpathy/minGPT at commit 37baab7 (Jan 8, 2023). This unchanged copy is saved in https://github.com/TRI-AMDD/PolyGen/tree/main/minGPT/model.
If you use PolyGen, please cite the following:
@article{lei2023self,
title={A self-improvable Polymer Discovery Framework Based on Conditional Generative Model},
author={Lei, Xiangyun and Ye, Weike and Yang, Zhenze and Schweigert, Daniel and Kwon, Ha-Kyung and Khajeh, Arash},
journal={arXiv preprint arXiv:2312.04013},
year={2023}
}
@article{yang2023novo,
title={De novo design of polymer electrolytes with high conductivity using gpt-based and diffusion-based generative models},
author={Yang, Zhenze and Ye, Weike and Lei, Xiangyun and Schweigert, Daniel and Kwon, Ha-Kyung and Khajeh, Arash},
journal={arXiv preprint arXiv:2312.06470},
year={2023}
}