ChromoGen is an adapted version of the MolGen project aimed at producing novel chromophores. This was accomplished by training XGBoost models on a small dataset of chromophores and their quantum yields, absorption maxima, and emission maxima. These models are then used in a reward function that takes a molecule generated by the AI and computes a score based on the difference between the properties predicted by the XGBoost models and the desired properties set in the code. Below I have outlined how to install and use the code in its current state, along with a breakdown of what each file does, in case you would like to modify the project, as I have, to generate your own novel molecules with custom reward functions.
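As a rough sketch of the scoring idea (not the exact code in reward_fn.py; the property names and target values here are placeholders), the reward grows as the predicted properties approach the targets:

```python
# Schematic only: reward a molecule by how close its predicted properties
# are to the desired targets (property names and numbers are placeholders).
def score(predicted, desired):
    return -sum(abs(predicted[p] - desired[p]) for p in desired)

print(score({"quantum_yield": 0.7, "emission_max": 510.0},
            {"quantum_yield": 0.9, "emission_max": 520.0}))   # -10.2
```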
Code and configuration:
- main.py: The primary script for running the project. It orchestrates the entire workflow, from data preprocessing and model training through evaluation, and ties together dataset handling, model training, and evaluation.
- environment.yml: Conda environment file for dependency management.
- datasets: Scripts for the various dataset types, such as bs1_dataset.py and scaffold_dataset.py.
- models: Model definitions, including bert.py and transformer.py.
- tokenizers: Tokenizer implementations, e.g., BPETokenizer.py.
- train: Training and evaluation scripts, such as train.py and evaluate.py.
- utils: Utility scripts, including mol_utils.py and reward_fn.py. The utils.py script provides an argument parser for command-line configuration and various utility functions.
Data folders:
- gdb: Folder for the gdb13 dataset subsets.
- models: Contains pre-trained models and configurations.
- pIC50: CSV files with specific datasets.
- results: Directory for output results.
- tokenizers: Tokenizer configurations and models.
The command I use, which can be found in the ChromoGen.sh script, is:
python main.py --load_pretrained --pretrained_path ./data/models/gpt_pre_rl_gdb13.pt --do_eval --dataset_path ./data/gdb/gdb13/gdb13.smi --tokenizer Char --tokenizer_path ./data/tokenizers/gdb13ScaffoldCharTokenizer.json --rl_epochs 250 --rl_size 250000 --batch_size 256
Note, however, that this command is designed for a very powerful system. Here are some more example commands:
Goal: Basic training with a small dataset on a machine with limited memory and no GPU.
python chromogen.py --batch_size 128 --epochs 5 --learning_rate 0.005 --do_train --tokenizer Char --rl_batch_size 200 --rl_epochs 50 --rl_max_len 100 --rl_size 10000 --reward_fns ChromoGen --no_batched_rl --device cpu
Goal: Advanced training with a larger dataset on a high-end machine with GPU.
python chromogen.py --batch_size 1024 --epochs 10 --learning_rate 0.001 --do_train --load_pretrained --pretrained_path './path/to/pretrained/model.pt' --tokenizer BPE --rl_batch_size 1000 --rl_epochs 200 --rl_max_len 150 --rl_size 50000 --reward_fns ChromoGen --device cuda
Goal: Fine-tuning a pre-trained model on a standard workstation.
python chromogen.py --batch_size 256 --epochs 3 --learning_rate 0.001 --load_pretrained --pretrained_path './path/to/pretrained/model.pt' --do_train --tokenizer Char --rl_batch_size 500 --rl_epochs 100 --rl_max_len 150 --rl_size 25000 --reward_fns ChromoGen --do_eval --eval_steps 5 --device cuda
Goal: Evaluating the model's performance periodically during training.
python chromogen.py --do_train --do_eval --eval_steps 10 --batch_size 512 --epochs 5 --learning_rate 0.002 --rl_batch_size 500 --rl_epochs 100 --rl_size 25000 --reward_fns ChromoGen --device cuda
Goal: Using a custom reward function with an external property predictor.
python chromogen.py --batch_size 512 --epochs 7 --learning_rate 0.001 --do_train --tokenizer Char --rl_batch_size 500 --rl_epochs 120 --rl_max_len 150 --rl_size 30000 --reward_fns ['ChromoGen', 'QED'] --predictor_paths ['./path/to/predictor1.model', './path/to/predictor2.model'] --device cuda
Goal: Training the Property Predictor model specifically.
python chromogen.py --train_predictor --predictor_batch_size 64 --predictor_epochs 15 --predictor_dataset_path './path/to/dataset.csv' --predictor_save_path './path/to/save/predictor_model.pt' --device cpu
These examples show how to run the project with different configurations, tailored to the specific needs and resources of the user. Many more configurations are possible, and I would highly recommend reading through the full breakdown of options below.
One issue I have run into when transferring results produced by the code is that the file paths can sometimes be too long. I would therefore recommend running the code close to the root directory of whatever drive you store it on; otherwise, some graphs may fail to generate. The results, aside from the model files, are fairly small in terms of data and can be found in the /data/results folder.
For starters, you'll need to download the code. The easiest way (in my opinion) is to download and extract the zip by clicking the '<> Code' button above and then the 'Download ZIP' button.
To run the code, you will need to create an environment with all the dependencies. The easiest way to do this is to create a conda environment from the provided environment.yml. Note that you will more than likely need to find the correct versions of CUDA, PyTorch, and torchvision for your system; the provided environment.yml can be used as a starting point, and if you notice version-related errors, you can adjust as needed. Feel free to submit issues if you are looking for help setting up.
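For reference, creating and activating the environment typically looks like the following (the environment name is whatever is set in environment.yml; 'chromogen' below is only a placeholder):
conda env create -f environment.yml
conda activate chromogen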
You will then need to obtain one of the gdb13 subsets provided by the Reymond group here: https://gdb.unibe.ch/downloads/. Choose a subset according to the system you are using: if you are not running on industrial-grade computational hardware, I would recommend the random 1 million subset, while the largest subset (AB) takes between 64 GB and 128 GB of RAM to use without running out of memory. Place the dataset you choose in the data/gdb/gdb13/ folder, and either specify the filename when you call the code or simply rename the file to 'gdb13.smi'.
The reward_fn.py script provides a framework for defining and implementing custom reward functions. Key features include:
- Integration with cheminformatics and machine learning libraries such as RDKit and XGBoost.
- Functions like get_map4 for generating molecular fingerprints from SMILES strings.
- A structured approach to defining reward functions, allowing flexibility and extensibility.
To implement a custom reward function, define a new class in reward_fn.py following the existing structure and examples.
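As a purely illustrative sketch (the class name, constructor arguments, and the get_map4 signature below are assumptions, not the actual API; follow the real classes in reward_fn.py for the required interface, where get_map4 is already available in the same module), a custom reward built on an XGBoost property predictor might look roughly like this:

```python
import numpy as np
import xgboost as xgb
from rdkit import Chem

class MyChromophoreReward:
    """Hypothetical reward: penalize the gap between a predicted property and a target."""

    def __init__(self, predictor_path, target_value):
        self.model = xgb.XGBRegressor()
        self.model.load_model(predictor_path)   # XGBoost model trained on chromophore data
        self.target = target_value

    def __call__(self, smiles):
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:                          # heavily penalize invalid SMILES
            return -10.0
        fp = get_map4(smiles)                    # fingerprint helper in reward_fn.py (signature assumed)
        predicted = float(self.model.predict(np.asarray([fp]))[0])
        return -abs(predicted - self.target)     # closer to the target -> higher reward
```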
The project can be configured and run using various command-line arguments defined in utils.py. Here is a detailed breakdown of all the arguments that can be passed:
- --batch_size
  - Description: Sets the batch size for the language modeling task.
  - Type: Integer
  - Default Value: 512
  - Usage Context: Increase or decrease based on the available memory and desired training speed. Larger batch sizes can speed up training but require more memory.
- --epochs
  - Description: Specifies the number of training epochs for the language modeling task.
  - Type: Integer
  - Default Value: 3
  - Usage Context: Increase to train the model for more iterations, which might improve performance but will take longer.
- --learning_rate
  - Description: Sets the learning rate for the model's optimization.
  - Type: Float
  - Default Value: 0.001 (1e-3)
  - Usage Context: Adjust to fine-tune model training. Lower values can lead to more stable but slower convergence.
- --load_pretrained
  - Description: Option to load a pre-trained model instead of training a new one.
  - Action: Store True
  - Default Value: False (not specified)
  - Usage Context: Use when you want to leverage a pre-trained model to save time or improve performance.
- --do_train
  - Description: Flag to train a model with the language modeling task.
  - Action: Store True
  - Default Value: False (not specified)
  - Usage Context: Set this flag when you need to train the model from scratch or fine-tune it.
- --pretrained_path
  - Description: File path to the pre-trained model.
  - Type: String
  - Default Value: './data/models/gpt_pre_rl_gdb13.pt'
  - Usage Context: Change this to point to a different pre-trained model file as needed.
- --tokenizer
  - Description: Specifies the type of tokenizer to use.
  - Type: String
  - Default Value: 'Char'
  - Choices: ['Char', 'BPE']
  - Usage Context: Choose 'BPE' for the Byte Pair Encoding tokenizer if preferred, which might be more efficient for certain languages.
- --rl_batch_size
  - Description: Sets the number of episodes to compute a batch for policy gradient during RL training.
  - Type: Integer
  - Default Value: 500
  - Usage Context: Modify based on computational resources and the desired balance between exploration and learning efficiency.
- --rl_epochs
  - Description: Number of epochs to run for the policy gradient stage in RL.
  - Type: Integer
  - Default Value: 100
  - Usage Context: Increase for more thorough training at the expense of longer training time.
- --discount_factor
  - Description: Discount factor for future rewards in RL.
  - Type: Float
  - Default Value: 0.99
  - Usage Context: Adjust to change the emphasis between immediate and future rewards. Higher values prioritize future rewards.
- --rl_max_len
  - Description: Maximum size of molecule the model can generate during the RL stage.
  - Type: Integer
  - Default Value: 150
  - Usage Context: Set according to the maximum expected molecular size. Increasing this may allow generation of larger molecules but could increase complexity.
- --rl_size
  - Description: Number of molecules to generate at each evaluation step during RL.
  - Type: Integer
  - Default Value: 25000
  - Usage Context: Adjust based on computational resources and how exhaustive you want the evaluation to be.
- --reward_fns
  - Description: Reward functions used during the RL stage.
  - Type: String List
  - Default Value: ['ChromoGen']
  - Choices: ['QED', 'Sim', 'Anti Cancer', 'LIDI', 'Docking', 'ChromoGen']
  - Usage Context: Select different reward functions based on the desired properties of the generated molecules.
- --do_eval
  - Description: Flag to evaluate the model during the RL stage.
  - Action: Store True
  - Default Value: False (not specified)
  - Usage Context: Set this flag when you want to periodically evaluate the model's performance during training.
- --eval_steps
  - Description: Frequency of evaluation steps during the RL stage.
  - Type: Integer
  - Default Value: 10
  - Usage Context: Decrease to evaluate more frequently, which can provide more granular feedback but may slow down overall training.
- --rl_temprature
  - Description: Temperature parameter during the RL stage.
  - Type: Float
  - Default Value: 1
  - Usage Context: Adjust to control the randomness in the policy distribution. Higher values result in more exploration.
- --multipliers
  - Description: Multipliers for the Property Predictor reward function.
  - Type: String List
  - Default Value: ["lambda x: x"]
  - Usage Context: Modify to change the influence of different properties on the reward function (see the sketch after this list).
- --no_batched_rl
  - Description: Option to train the RL model without generating batches.
  - Action: Store True
  - Default Value: False (not specified)
  - Usage Context: Use this flag to train the RL model in a non-batched manner, which may be necessary for certain computational setups.
- --predictor_paths
  - Description: File paths for the Property Predictor reward function.
  - Type: String List
  - Default Value: [None]
  - Usage Context: Specify the paths to the predictor models if using custom reward functions based on external property predictors.
- --save_path
  - Description: Path where the results will be saved.
  - Type: String
  - Default Value: './data/results/[current_datetime]'
  - Usage Context: Change to specify a different directory or naming convention for saving results.
- --eval_size
  - Description: Number of molecules to generate during the final evaluation.
  - Type: Integer
  - Default Value: 25000
  - Usage Context: Adjust based on how comprehensive the final evaluation should be.
- --eval_max_len
  - Description: Maximum size of molecule the model can generate during the final evaluation stage.
  - Type: Integer
  - Default Value: 150
  - Usage Context: Set based on the desired complexity or size of the molecules in the final evaluation.
- --temprature
  - Description: Softmax temperature during the final evaluation.
  - Type: Float
  - Default Value: 1
  - Usage Context: Adjust to control the randomness in the final evaluation phase.
- --n_embd
  - Description: Model embedding size.
  - Type: Integer
  - Default Value: 512
  - Usage Context: Increase for potentially richer representations, with a trade-off in computational demand.
- --d_model
  - Description: Size of the feedforward network model.
  - Type: Integer
  - Default Value: 1024
  - Usage Context: Modify based on the desired complexity of the model's internal representations.
- --n_layers
  - Description: Number of LSTM/decoder layers.
  - Type: Integer
  - Default Value: 4
  - Usage Context: Adjust to increase or decrease model depth, affecting its ability to learn complex patterns.
- --num_heads
  - Description: Number of attention heads in the transformer model.
  - Type: Integer
  - Default Value: 8
  - Usage Context: More heads allow the model to attend to different parts of the input simultaneously, potentially improving learning.
- --block_size
  - Description: Maximum length of token sequence for the model.
  - Type: Integer
  - Default Value: 512
  - Usage Context: Increase if dealing with longer sequences, but be mindful of memory constraints.
- --proj_size
  - Description: Projection size for the attention mechanism.
  - Type: Integer
  - Default Value: 256
  - Usage Context: Tweak to adjust the dimensionality of attention outputs.
- --attn_dropout_rate
  - Description: Dropout rate for attention layers.
  - Type: Float
  - Default Value: 0.1
  - Usage Context: Adjust to prevent overfitting, especially with larger datasets or models.
- --proj_dropout_rate
  - Description: Dropout rate for projection layers.
  - Type: Float
  - Default Value: 0.1
  - Usage Context: Similar to attention dropout, useful for regularization.
- --resid_dropout_rate
  - Description: Dropout rate for residual layers.
  - Type: Float
  - Default Value: 0.1
  - Usage Context: Controls regularization in residual connections, useful for larger networks.
- --predictor_dataset_path
  - Description: File path for the dataset used by the Property Predictor.
  - Type: String
  - Default Value: './data/csvs/bs1.csv'
  - Usage Context: Change to point to a different dataset as needed for custom predictions.
- --predictor_tokenizer_path
  - Description: Path to the tokenizer used by the Property Predictor.
  - Type: String
  - Default Value: './data/tokenizers/predictor_tokenizer.json'
  - Usage Context: Update to use a different tokenizer according to the dataset characteristics.
- --predictor_save_path
  - Description: Where to save the trained Property Predictor model.
  - Type: String
  - Default Value: './data/models/predictor_model.pt'
  - Usage Context: Modify to specify a different saving location or naming convention for the Property Predictor model.
- --train_predictor
  - Description: Flag to train the Property Predictor.
  - Action: Store True
  - Default Value: False (not specified)
  - Usage Context: Set this flag if training a Property Predictor model is required.
- --predictor_batch_size
  - Description: Batch size for training the Property Predictor.
  - Type: Integer
  - Default Value: 32
  - Usage Context: Adjust based on available memory and desired training speed.
- --predictor_epochs
  - Description: Number of training epochs for the Property Predictor.
  - Type: Integer
  - Default Value: 10
  - Usage Context: Increase to train the Property Predictor for more iterations, potentially improving its performance.
- --predictor_n_embd, --predictor_d_model, --predictor_n_layers, --predictor_num_heads, --predictor_block_size, --predictor_proj_size, --predictor_attn_dropout_rate, --predictor_proj_dropout_rate, --predictor_resid_dropout_rate
  - Description: These arguments mirror the earlier n_embd, d_model, n_layers, num_heads, block_size, proj_size, attn_dropout_rate, proj_dropout_rate, and resid_dropout_rate, but apply specifically to the Property Predictor model.
  - Usage Context: Adjust these parameters to customize the architecture and training of the Property Predictor model.
- --dataset_path
  - Description: Path to the dataset for language modeling.
  - Type: String
  - Default Value: './data/gdb/gdb13/gdb13.smi'
  - Usage Context: Change to use a different dataset for training the language model.
- --tokenizer_path
  - Description: Path to the tokenizer.
  - Type: String
  - Default Value: './data/tokenizers/gdb13ScaffoldCharTokenizer.json'
  - Usage Context: Update to use a different tokenizer if necessary, based on the specific dataset or desired tokenization method.
- --device
  - Description: Specifies the computing device to be used ('cuda' for GPU, 'cpu' for CPU).
  - Type: String
  - Default Value: 'cuda'
  - Usage Context: Set to 'cpu' if a GPU is not available or not desired for computation.
- --model
  - Description: Choice of model architecture.
  - Type: Enum (defined by ModelOpt)
  - Default Value: ModelOpt.GPT
  - Usage Context: Select a different model architecture if required, based on the available options in ModelOpt.
- --use_scaffold
  - Description: Whether to use a scaffold in the model.
  - Action: Store True
  - Default Value: False (not specified)
  - Usage Context: Set this flag if incorporating a scaffold is needed for the model's approach to molecular generation.
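Since the --multipliers values are passed as strings (the default is "lambda x: x"), here is a rough, hypothetical sketch of how such strings could be turned into callables and applied to predictor outputs; the exact handling inside reward_fn.py may differ, and the values below are placeholders:

```python
# Hypothetical illustration only: convert the string multipliers into callables
# and weight each predicted property before combining them into a reward.
multipliers = ["lambda x: x", "lambda x: 2 * x"]   # one multiplier per property predictor
fns = [eval(m) for m in multipliers]               # "lambda x: x" -> a callable

predicted = [0.85, 450.0]                          # placeholder predictor outputs
reward = sum(fn(value) for fn, value in zip(fns, predicted))
print(reward)                                      # 0.85 + 900.0 = 900.85
```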
Thanks to my research team, including my PI, Dr. Alice R. Walker, as well as Dr. Mark A. Hix, for assisting me.
@article{mazuz2023molecule,
  title     = {Molecule generation using transformers and policy gradient reinforcement learning},
  author    = {Mazuz, Eyal and Shtar, Guy and Shapira, Bracha and Rokach, Lior},
  journal   = {Scientific Reports},
  volume    = {13},
  number    = {1},
  pages     = {8799},
  year      = {2023},
  publisher = {Nature Publishing Group UK London}
}
970 Million Druglike Small Molecules for Virtual Screening in the Chemical Universe Database GDB-13. Blum L. C.; Reymond J.-L. J. Am. Chem. Soc., 2009, 131, 8732-8733.