ChromoGen

Overview

ChromoGen is an adaptation of the MolGen project aimed at producing novel chromophores. This was accomplished by training XGBoost models on a small dataset of chromophores and their quantum yields, absorption maxima, and emission maxima. These models are then used in a reward function that takes each molecule generated by the model and computes a score based on the difference between the properties predicted by the XGBoost models and the desired properties set in the code. Below, I have outlined how to install and use the code in its current state, along with a breakdown of what each file does, in case you would like to modify the project as I have to generate your own novel molecules with custom reward functions.
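As a rough sketch of that scoring idea (the actual logic lives in src/utils/reward_fn.py and may differ; the property names and target values below are placeholders):

def chromophore_score(predicted, desired):
    # Reward grows as the XGBoost predictions approach the desired property values.
    error = sum(abs(predicted[name] - desired[name]) for name in desired)
    return -error

# Placeholder targets: emission max of 550 nm and quantum yield of 0.9
desired = {"emission_max": 550.0, "quantum_yield": 0.9}
predicted = {"emission_max": 512.0, "quantum_yield": 0.75}
print(chromophore_score(predicted, desired))  # -38.15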

Project Structure

Main Scripts and Configuration

  • main.py: The primary script for running the project. It orchestrates the entire workflow, from data preprocessing and dataset handling through model training to evaluation.
  • environment.yml: Conda environment file for dependency management.

Source Code (src directory)

  • datasets: Scripts for various dataset types like bs1_dataset.py, scaffold_dataset.py.
  • models: Model definitions including bert.py, transformer.py.
  • tokenizers: Tokenizer implementations, e.g., BPETokenizer.py.
  • train: Training and evaluation scripts like train.py, evaluate.py.
  • utils: Utility scripts including mol_utils.py, reward_fn.py. The utils.py script provides an argument parser for command-line configuration and various utility functions.

Data (data directory)

  • gdb: Folder for the gdb13 dataset subsets.
  • models: Contains pre-trained models and configurations.
  • pIC50: CSV files with specific datasets.
  • results: Directory for output results.
  • tokenizers: Tokenizer configurations and models.

Example Commands

The command I use, which can be found in the ChromoGen.sh script:

python main.py --load_pretrained --pretrained_path ./data/models/gpt_pre_rl_gdb13.pt --do_eval --dataset_path ./data/gdb/gdb13/gdb13.smi --tokenizer Char --tokenizer_path ./data/tokenizers/gdb13ScaffoldCharTokenizer.json --rl_epochs 250 --rl_size 250000 --batch_size 256 

Note, however, that this command is designed for a very powerful system. Here are some more example commands:

Scenario 1: Basic Training on Limited Hardware

Goal: Basic training with a small dataset on a machine with limited memory and no GPU.

python main.py --batch_size 128 --epochs 5 --learning_rate 0.005 --do_train --tokenizer Char --rl_batch_size 200 --rl_epochs 50 --rl_max_len 100 --rl_size 10000 --reward_fns ChromoGen --no_batched_rl --device cpu

Scenario 2: Advanced Training with GPU

Goal: Advanced training with a larger dataset on a high-end machine with GPU.

python main.py --batch_size 1024 --epochs 10 --learning_rate 0.001 --do_train --load_pretrained --pretrained_path './path/to/pretrained/model.pt' --tokenizer BPE --rl_batch_size 1000 --rl_epochs 200 --rl_max_len 150 --rl_size 50000 --reward_fns ChromoGen --device cuda

Scenario 3: Fine-Tuning a Pre-Trained Model

Goal: Fine-tuning a pre-trained model on a standard workstation.

python main.py --batch_size 256 --epochs 3 --learning_rate 0.001 --load_pretrained --pretrained_path './path/to/pretrained/model.pt' --do_train --tokenizer Char --rl_batch_size 500 --rl_epochs 100 --rl_max_len 150 --rl_size 25000 --reward_fns ChromoGen --do_eval --eval_steps 5 --device cuda

Scenario 4: Evaluating Model Performance

Goal: Evaluating the model's performance periodically during training.

python main.py --do_train --do_eval --eval_steps 10 --batch_size 512 --epochs 5 --learning_rate 0.002 --rl_batch_size 500 --rl_epochs 100 --rl_size 25000 --reward_fns ChromoGen --device cuda

Scenario 5: Custom Reward Function with Property Predictor

Goal: Using a custom reward function with an external property predictor.

python main.py --batch_size 512 --epochs 7 --learning_rate 0.001 --do_train --tokenizer Char --rl_batch_size 500 --rl_epochs 120 --rl_max_len 150 --rl_size 30000 --reward_fns ['ChromoGen', 'QED'] --predictor_paths ['./path/to/predictor1.model', './path/to/predictor2.model'] --device cuda

Scenario 6: Training Property Predictor

Goal: Training the Property Predictor model specifically.

python main.py --train_predictor --predictor_batch_size 64 --predictor_epochs 15 --predictor_dataset_path './path/to/dataset.csv' --predictor_save_path './path/to/save/predictor_model.pt' --device cpu

These examples show how to run the project with different configurations, tailored to the specific needs and resources of the user. Many more configurations are possible, and I would highly recommend reading up on all the options available.

Data Management

One issue I have run into when transferring results produced by the code is that the file paths can sometimes be too long. I would therefore recommend running the code close to the root directory of whatever drive you store it on; otherwise, some graphs may fail to generate. The results, aside from the model files, are fairly small and can be found in the data/results folder.

Setup and Installation

For starters, you'll need to download the code. The easiest way (in my opinion) is to click the '<> Code' button above, then 'Download ZIP', and extract the archive.

To run the code, you will need to create an environment with all the dependencies. The easiest way to do this is to create a conda environment from the provided environment.yml. Note that you will more than likely need to find the correct versions of CUDA, PyTorch, and torchvision for your system; the provided environment.yml can be used as a starting point, and if you run into version-related errors, adjust it as needed. Feel free to submit issues if you are looking for help setting up.
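For example (the environment name below is a placeholder; use whichever name is defined in environment.yml):

conda env create -f environment.yml
conda activate chromogen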

You will then need to obtain one of the GDB-13 subsets provided by the Reymond group here: https://gdb.unibe.ch/downloads/. Choose a subset according to the system you are using: if you are not running on industrial-grade computational hardware, I would recommend the random 1 million subset, while the largest subset (AB) takes between 64 GB and 128 GB of RAM to use without running out of memory. Place the dataset you obtain in the data/gdb/gdb13/ folder and specify the filename when you call the code, or simply rename it to 'gdb13.smi'.
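For example, if you keep a downloaded subset under a different filename (the name below is just a placeholder), you can point the code at it directly:

python main.py --do_train --dataset_path ./data/gdb/gdb13/your_gdb13_subset.smi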

Implementing Custom Reward Functions

The reward_fn.py script provides a framework for defining and implementing custom reward functions. Key features include:

  • Integration with cheminformatics and machine learning libraries like RDKit and XGBoost.
  • Functions like get_map4 for generating molecular fingerprints from SMILES strings.
  • A structured approach for defining reward functions, allowing flexibility and extensibility.

To implement a custom reward function, users can define a new class in reward_fn.py following the existing structure and examples.
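For example, a minimal custom reward might look like the sketch below. The callable interface (a SMILES string in, a float out) and the class name are assumptions on my part, so check the existing classes in reward_fn.py for the exact structure the RL loop expects; a MAP4 fingerprint plus an XGBoost predictor could take the place of the simple RDKit descriptor used here.

from rdkit import Chem
from rdkit.Chem import Descriptors

class TargetWeightReward:
    """Illustrative reward: highest for molecules near a chosen molecular weight."""

    def __init__(self, target_mw=350.0):
        self.target_mw = target_mw

    def __call__(self, smiles):
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            return 0.0  # invalid SMILES earn no reward
        # Reward decays linearly with distance from the target molecular weight
        return max(0.0, 1.0 - abs(Descriptors.MolWt(mol) - self.target_mw) / self.target_mw)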

Detailed Usage Guide

The project can be configured and run using various command-line arguments defined in utils.py. Here is a detailed breakdown of all the arguments that can be passed:

  1. --batch_size
    Description: Sets the batch size for the language modeling task.
    Type: Integer
    Default Value: 512
    Usage Context: Increase or decrease based on the available memory and desired training speed. Larger batch sizes can speed up training but require more memory.

  2. --epochs
    Description: Specifies the number of training epochs for the language modeling task.
    Type: Integer
    Default Value: 3
    Usage Context: Increase to train the model for more iterations, which might improve performance but will take longer.

  3. --learning_rate
    Description: Sets the learning rate for the model's optimization.
    Type: Float
    Default Value: 0.001 (1e-3)
    Usage Context: Adjust to fine-tune model training. Lower values can lead to more stable but slower convergence.

  4. --load_pretrained
    Description: Option to load a pre-trained model instead of training a new one.
    Action: Store True
    Default Value: False (not specified)
    Usage Context: Use when you want to leverage a pre-trained model to save time or improve performance.

  5. --do_train
    Description: Flag to train a model with the language modeling task.
    Action: Store True
    Default Value: False (not specified)
    Usage Context: Set this flag when you need to train the model from scratch or fine-tune it.

  6. --pretrained_path
    Description: File path to the pre-trained model.
    Type: String
    Default Value: './data/models/gpt_pre_rl_gdb13.pt'
    Usage Context: Change this to point to a different pre-trained model file as needed.

  7. --tokenizer
    Description: Specifies the type of tokenizer to use.
    Type: String
    Default Value: 'Char'
    Choices: ['Char', 'BPE']
    Usage Context: Choose 'BPE' for the Byte Pair Encoding tokenizer if preferred, which might be more efficient for certain datasets.

  8. --rl_batch_size
    Description: Sets the number of episodes used to compute a batch for the policy gradient during RL training.
    Type: Integer
    Default Value: 500
    Usage Context: Modify based on computational resources and the desired balance between exploration and learning efficiency.

  9. --rl_epochs
    Description: Number of epochs to run for the policy gradient stage in RL.
    Type: Integer
    Default Value: 100
    Usage Context: Increase for more thorough training at the expense of longer training time.

  10. --discount_factor
    Description: Discount factor for future rewards in RL.
    Type: Float
    Default Value: 0.99
    Usage Context: Adjust to change the emphasis between immediate and future rewards. Higher values prioritize future rewards.

  11. --rl_max_len
    Description: Maximum size of molecule the model can generate during the RL stage.
    Type: Integer
    Default Value: 150
    Usage Context: Set according to the maximum expected molecular size. Increasing this may allow generation of larger molecules but could increase complexity.

  12. --rl_size
    Description: Number of molecules to generate at each evaluation step during RL.
    Type: Integer
    Default Value: 25000
    Usage Context: Adjust based on computational resources and how exhaustive you want the evaluation to be.

  13. --reward_fns
    Description: Reward functions used during the RL stage.
    Type: String List
    Default Value: ['ChromoGen']
    Choices: ['QED', 'Sim', 'Anti Cancer', 'LIDI', 'Docking', 'ChromoGen']
    Usage Context: Select different reward functions based on the desired properties of the generated molecules.

  14. --do_eval
    Description: Flag to evaluate the model during the RL stage.
    Action: Store True
    Default Value: False (not specified)
    Usage Context: Set this flag when you want to periodically evaluate the model's performance during training.

  15. --eval_steps
    Description: Frequency of evaluation steps during the RL stage.
    Type: Integer
    Default Value: 10
    Usage Context: Decrease to evaluate more frequently, which can provide more granular feedback but may slow down overall training.

  16. --rl_temprature
    Description: Softmax sampling temperature during the RL stage.
    Type: Float
    Default Value: 1
    Usage Context: Adjust to control the randomness in the policy distribution. Higher values result in more exploration.

  17. --multipliers
    Description: Multipliers for the Property Predictor reward function.
    Type: String List
    Default Value: ["lambda x: x"]
    Usage Context: Modify to change the influence of different properties on the reward function. A small illustration is given after this argument list.

  18. --no_batched_rl
    Description: Option to train the RL model without generating batches.
    Action: Store True
    Default Value: False (not specified)
    Usage Context: Use this flag to train the RL model in a non-batched manner, which may be necessary for certain computational setups.

  19. --predictor_paths
    Description: File paths for the Property Predictor reward function.
    Type: String List
    Default Value: [None]
    Usage Context: Specify the paths to the predictor models if using custom reward functions based on external property predictors.

  20. --save_path
    Description: Path where the results will be saved.
    Type: String
    Default Value: './data/results/[current_datetime]'
    Usage Context: Change to specify a different directory or naming convention for saving results.

  21. --eval_size
    Description: Number of molecules to generate during the final evaluation.
    Type: Integer
    Default Value: 25000
    Usage Context: Adjust based on how comprehensive the final evaluation should be.

  22. --eval_max_len
    Description: Maximum size of molecule the model can generate during the final evaluation stage.
    Type: Integer
    Default Value: 150
    Usage Context: Set based on the desired complexity or size of the molecules in the final evaluation.

  23. --temprature
    Description: Softmax temperature during the final evaluation.
    Type: Float
    Default Value: 1
    Usage Context: Adjust to control the randomness in the final evaluation phase.

  24. --n_embd
    Description: Model embedding size.
    Type: Integer
    Default Value: 512
    Usage Context: Increase for potentially richer representations, with a trade-off in computational demand.

  25. --d_model
    Description: Size of the feedforward network model.
    Type: Integer
    Default Value: 1024
    Usage Context: Modify based on the desired complexity of the model's internal representations.

  26. --n_layers
    Description: Number of LSTM/decoder layers.
    Type: Integer
    Default Value: 4
    Usage Context: Adjust to increase or decrease model depth, affecting its ability to learn complex patterns.

  27. --num_heads
    Description: Number of attention heads in the transformer model.
    Type: Integer
    Default Value: 8
    Usage Context: More heads allow the model to attend to different parts of the input simultaneously, potentially improving learning.

  28. --block_size
    Description: Maximum length of token sequence for the model.
    Type: Integer
    Default Value: 512
    Usage Context: Increase if dealing with longer sequences, but be mindful of memory constraints.

  29. --proj_size
    Description: Projection size for the attention mechanism.
    Type: Integer
    Default Value: 256
    Usage Context: Tweak to adjust the dimensionality of attention outputs.

  30. --attn_dropout_rate
    Description: Dropout rate for attention layers.
    Type: Float
    Default Value: 0.1
    Usage Context: Adjust to prevent overfitting, especially with larger datasets or models.

  31. --proj_dropout_rate
    Description: Dropout rate for projection layers.
    Type: Float
    Default Value: 0.1
    Usage Context: Similar to attention dropout, useful for regularization.

  32. --resid_dropout_rate
    Description: Dropout rate for residual layers.
    Type: Float
    Default Value: 0.1
    Usage Context: Controls regularization in residual connections, useful for larger networks.

  33. --predictor_dataset_path
    Description: File path for the dataset used by the Property Predictor.
    Type: String
    Default Value: './data/csvs/bs1.csv'
    Usage Context: Change to point to a different dataset as needed for custom predictions.

  34. --predictor_tokenizer_path
    Description: Path to the tokenizer used by the Property Predictor.
    Type: String
    Default Value: './data/tokenizers/predictor_tokenizer.json'
    Usage Context: Update to use a different tokenizer according to the dataset characteristics.

  35. --predictor_save_path
    Description: Where to save the trained Property Predictor model.
    Type: String
    Default Value: './data/models/predictor_model.pt'
    Usage Context: Modify to specify a different saving location or naming convention for the Property Predictor model.

  36. --train_predictor
    Description: Flag to train the Property Predictor.
    Action: Store True
    Default Value: False (not specified)
    Usage Context: Set this flag if training a Property Predictor model is required.

  37. --predictor_batch_size
    Description: Batch size for training the Property Predictor.
    Type: Integer
    Default Value: 32
    Usage Context: Adjust based on available memory and desired training speed.

  38. --predictor_epochs
    Description: Number of training epochs for the Property Predictor.
    Type: Integer
    Default Value: 10
    Usage Context: Increase to train the Property Predictor for more iterations, potentially improving its performance.

  39. --predictor_n_embd, --predictor_d_model, --predictor_n_layers, --predictor_num_heads, --predictor_block_size, --predictor_proj_size, --predictor_attn_dropout_rate, --predictor_proj_dropout_rate, --predictor_resid_dropout_rate
    Description: These arguments mirror n_embd, d_model, n_layers, num_heads, block_size, proj_size, attn_dropout_rate, proj_dropout_rate, and resid_dropout_rate above, but apply specifically to the Property Predictor model.
    Usage Context: Adjust these parameters to customize the architecture and training of the Property Predictor model.

  40. --dataset_path
    Description: Path to the dataset for language modeling.
    Type: String
    Default Value: './data/gdb/gdb13/gdb13.smi'
    Usage Context: Change to use a different dataset for training the language model.

  41. --tokenizer_path
    Description: Path to the tokenizer.
    Type: String
    Default Value: './data/tokenizers/gdb13ScaffoldCharTokenizer.json'
    Usage Context: Update to use a different tokenizer if necessary, based on the specific dataset or desired tokenization method.

  42. --device
    Description: Specifies the computing device to be used ('cuda' for GPU, 'cpu' for CPU).
    Type: String
    Default Value: 'cuda'
    Usage Context: Set to 'cpu' if GPU is not available or desired for computation.

  43. --model
    Description: Choice of model architecture.
    Type: Enum (defined by ModelOpt)
    Default Value: ModelOpt.GPT
    Usage Context: Select a different model architecture if required, based on the available options in ModelOpt.

  44. --use_scaffold
    Description: Whether to use scaffold in the model.
    Action: Store True
    Default Value: False (not specified)
    Usage Context: Set this flag if incorporating scaffold is needed for the model's approach to molecular generation.
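As a small illustration of the --multipliers idea from argument 17 (this assumes each multiplier string is evaluated into a Python callable and applied to a predicted property; the target value and the exact mechanics below are my own placeholders, not necessarily what the code does):

# Hypothetical multiplier: reward peaks when the predicted emission max is near 500 nm
multiplier = eval("lambda x: max(0.0, 1.0 - abs(x - 500) / 100)")
predicted_emission_max = 523.0  # placeholder XGBoost prediction
print(multiplier(predicted_emission_max))  # 0.77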

References and Acknowledgements

Thanks to my research team, including my PI, Dr. Alice R. Walker, and Dr. Mark A. Hix, for their assistance.

@article{mazuz2023molecule,
  title={Molecule generation using transformers and policy gradient reinforcement learning},
  author={Mazuz, Eyal and Shtar, Guy and Shapira, Bracha and Rokach, Lior},
  journal={Scientific Reports},
  volume={13},
  number={1},
  pages={8799},
  year={2023},
  publisher={Nature Publishing Group UK London}
}

970 Million Druglike Small Molecules for Virtual Screening in the Chemical Universe Database GDB-13. Blum L. C.; Reymond J.-L. J. Am. Chem. Soc., 2009, 131, 8732-8733.
