Covid-19 Drug Discovery using Genetic Algorithm and Deep Learning

Forkwell Coronavirus Hack: Drug Discovery

This is a submission to the Forkwell Coronavirus Hack competition, hosted by Forkwell, under the Drug Discovery category.

The goal of this category is to create a novel small molecule, or find an existing drug on the market, that is able to stop or interfere with the coronavirus lifecycle. One approach is therefore to find drugs or ligands that are able to bind to the coronavirus main protease, 6LU7.

Several studies and experiments have been conducted and recorded in the DrugBank paper (see Reference), which gives us our evaluation targets.

Below is a sample of existing drugs that have been tested for binding with the coronavirus:

Drugs/Ligands	Binding Score
Remdesivir	-7.4
Umifenovir	-6.1
Favipiravir	-5.6
Lopinavir	-6.6
Ritonavir	-6.2
Galidesivir	-5.6
Triazavirin	-5.9
Chloroquine	-5.6
Darunavir	-7.2
TMC-310911	-8.9
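
To make the table easier to compare against newly generated ligands, the scores can be ranked in a few lines. This is only an illustrative sketch: it assumes, following the usual convention of docking tools such as VINA, that a more negative score means stronger predicted binding. The dictionary simply restates the table, and the helper name `rank_by_binding` is ours.

```python
# Binding scores copied from the table above (drug name -> score).
binding_scores = {
    "Remdesivir": -7.4,
    "Umifenovir": -6.1,
    "Favipiravir": -5.6,
    "Lopinavir": -6.6,
    "Ritonavir": -6.2,
    "Galidesivir": -5.6,
    "Triazavirin": -5.9,
    "Chloroquine": -5.6,
    "Darunavir": -7.2,
    "TMC-310911": -8.9,
}


def rank_by_binding(scores):
    """Return (name, score) pairs sorted from most negative to least negative.

    Assumption: a more negative score means stronger predicted binding.
    """
    return sorted(scores.items(), key=lambda item: item[1])


ranked = rank_by_binding(binding_scores)
for name, score in ranked[:3]:
    print(f"{name}: {score}")
```

Under that assumption, a generated ligand is only interesting if it ranks near or above the top entries of this list.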

Acknowledgement

Our team would like to thank all the mentors from the Forkwell Coronavirus Hack for giving us the chance to work on this project and to contribute to the response to the coronavirus outbreak.

This work builds on the repository Deep_Learning_Coronavirus_Cure created by Matt O'Connor, who is also one of our mentors in this hackathon.

Next, we would like to thank jhjensen2 for his repository Graph-based genetic algorithm (GB-GA). The details of his work can be found in his paper entitled "A graph-based genetic algorithm and generative model/Monte Carlo tree search for the exploration of chemical space" (see Reference).

Team Details

Team Name: TaoFuFa

Quek Yao Jing Skyquek-github
(Janson) Liew Kok Foo Janson-L-github
Tang Li Ho Skyquek-github ------------------ Li Ho
Kwong Tung Nan Skyquek-github -------------Kwong

Requirements

The requirements are identical to those of the original repository, Deep_Learning_Coronavirus_Cure.

Changes to Original files

In this repository we introduce a new concept that we call the local Genetic Algorithm (local-GA), an optimization method from evolutionary computing. We keep changes to the original files to a minimum, as the repository by Matt is well maintained and is a good starting point.

Our local-GA uses cross-over and mutation to search for molecules, guided by the fitness function.

We apply local-GA at 2 points: before Transfer Learning, and before we export the generated molecules to an sdf file.

The following is an overview of our local-GA:

Population: The number of original molecules

The initial population depends on the molecules we compute before passing them to the local-GA. We call the local-GA at 2 points: before transfer learning and before exporting the sdf files. The first population is therefore the 70 molecules selected based on score, similarity, logP, weights and random generation. For the second local-GA, the population is the set of valid molecules among the 5000 molecules generated after transfer learning.

Mating Pool: The number of molecules we pass from generation to generation

We select the number of molecules we need from the population into the mating pool. The selection criterion is the fitness function. This number is also the number of molecules returned after every generation.
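
The mating pool selection described above can be sketched as follows. This is a minimal illustration, not the code used in this repository: the function name `select_mating_pool`, the molecule identifiers and the fitness values are all hypothetical, and we assume that a higher fitness value is better.

```python
from typing import Callable, List, Sequence, TypeVar

M = TypeVar("M")


def select_mating_pool(
    population: Sequence[M],
    fitness: Callable[[M], float],
    pool_size: int,
) -> List[M]:
    """Keep the `pool_size` fittest members of the population.

    Assumption: a higher fitness value means a better molecule.
    """
    if pool_size < 0:
        raise ValueError("pool_size must be non-negative")
    ranked = sorted(population, key=fitness, reverse=True)
    return ranked[:pool_size]


# Hypothetical population: (molecule id, precomputed fitness) pairs.
population = [("mol_a", 0.2), ("mol_b", 0.9), ("mol_c", 0.5), ("mol_d", 0.7)]
pool = select_mating_pool(population, fitness=lambda m: m[1], pool_size=2)
print(pool)  # [('mol_b', 0.9), ('mol_d', 0.7)]
```

Passing the fitness as a callable keeps the selection step independent of how the fitness is actually computed.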

Cross-Over: Explain how the chemical cross-over work...................TANG LI HO

LI HO, update here.

Mutation: Explain how the chemical mutation work.......................TANG LI HO

LI HO, update here.

Fitness Function: The fitness function of a molecule is based on its logP value. According to the article excerpted below, the logP of a drug intended for oral administration should be lower than 5, and is best in the range 1.35 - 1.8.

The fitness function is the evaluation criterion in every single generation.

LogP is used in the pharmaceutical/biotech industries to understand the behavior of drug molecules in the body. Drug candidates are often screened according to logP, among other criteria, to help guide drug selection and analog optimization. This is because lipophilicity is a major determining factor in a compound’s absorption, distribution in the body, penetration across vital membranes and biological barriers, metabolism and excretion (ADME properties). According to ‘Lipinski’s Rule of 5’ (developed at Pfizer) the logP of a compound intended for oral administration should be <5. A more lipophilic compound:

• Will have low aqueous solubility, compromising bioavailability. If an adequate concentration of a drug cannot be reached or maintained, even the most potent in-vitro substance cannot be an effective drug.

• May be sequestered by fatty tissue and therefore difficult to excrete; in turn leading to accumulation that will impact the systemic toxicity of the substance.

• May not be ideal for penetration through certain barriers. A drug targeting the central nervous system (CNS) should ideally have a logP value around 2; for oral and intestinal absorption the ideal value is 1.35–1.8, while a drug intended for sub-lingual absorption should have a logP value >5.

Not only does logP help predict the likely transport of a compound around the body, it also affects formulation, dosing, drug clearance, and toxicity. Though it is not the only determining factor in these issues, it plays a critical role in helping scientists limit the liabilities of new drug candidates.
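
One way to turn the logP guidance above into a fitness score is sketched below. This is an assumption about the shape of the function, not the exact scoring used in this repository: it returns 0 inside the 1.35 - 1.8 window and a negative penalty that grows with the distance from it. The logP value itself is taken as an input; in practice it would come from a cheminformatics toolkit (for example RDKit's `Crippen.MolLogP`, which is also an assumption about tooling).

```python
LOGP_LOW = 1.35        # lower bound of the target window for oral absorption
LOGP_HIGH = 1.8        # upper bound of the target window for oral absorption
LOGP_ORAL_LIMIT = 5.0  # Lipinski limit for oral administration


def logp_fitness(logp: float) -> float:
    """Return 0.0 inside the target window, else minus the distance to it."""
    if logp < LOGP_LOW:
        return logp - LOGP_LOW
    if logp > LOGP_HIGH:
        return LOGP_HIGH - logp
    return 0.0


def within_oral_limit(logp: float) -> bool:
    """Check the logP < 5 condition for orally administered compounds."""
    return logp < LOGP_ORAL_LIMIT


# Hypothetical logP values, for illustration only.
for value in (0.5, 1.5, 3.0, 6.2):
    print(value, round(logp_fitness(value), 2), within_oral_limit(value))
```

With this shape, molecules inside the window all tie at the maximum fitness, so a second criterion would be needed to rank them against each other.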

Approaches

Original Approaches:

Global-Generation 0:

1) Train LSTM-CHEM on the ChEMBL database
2) From LSTM-CHEM, generate 10k SMILES
3) Check the validity of the generated SMILES
4) Compute the Tanimoto similarity and select only 1000
5) Give an ID to each of the 1000 SMILES, and manually add the SMILES of the HIV drugs and other drugs
6) Save everything in the master table and manually run it through PyRX to get the binding affinity
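
The Tanimoto similarity used in the steps above is the size of the intersection of two fingerprints divided by the size of their union. The sketch below shows the formula on plain Python sets of "on" bit indices. The fingerprints are made up for illustration; in a real pipeline they would come from a cheminformatics toolkit, and returning 0.0 for two empty fingerprints is a convention chosen here.

```python
from typing import Set


def tanimoto(fp_a: Set[int], fp_b: Set[int]) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(fp_a | fp_b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(fp_a & fp_b) / union


# Made-up fingerprints: 2 shared bits out of 5 distinct bits in total.
fp_a = {1, 4, 7, 9}
fp_b = {1, 4, 8}
print(tanimoto(fp_a, fp_b))  # 0.4
```

A value of 1.0 means the two fingerprints are identical, and 0.0 means they share no bits.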

While each Global-Generation < n,

7) From the master table (loaded from the previous Global-Generation), select:

35 based on score
5 based on similarity
5 based on weight
5 based on random mutation

8) With these 55 molecules, we perform Transfer Learning. We then generate 5k SMILES, perform the validity and similarity checks, and generate the master table.
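
The quota-based selection from the master table can be sketched as below. Several details are assumptions made for illustration and are not taken from the original code: the column names (`id`, `score`, `similarity`, `weight`), the ranking direction for each criterion (most negative score, highest similarity, lowest weight), the rule that a molecule picked by an earlier criterion is skipped by later ones, and the synthetic master table itself.

```python
import random


def select_candidates(table, quotas, seed=0):
    """Pick rows per criterion, skipping rows already picked earlier."""
    rng = random.Random(seed)
    chosen = []
    seen = set()

    def take(rows, count):
        # Walk the ranked rows and keep the first `count` unseen ones.
        taken = 0
        for row in rows:
            if taken >= count:
                break
            if row["id"] not in seen:
                seen.add(row["id"])
                chosen.append(row)
                taken += 1

    # Assumed ranking directions; adjust to the real selection rules.
    take(sorted(table, key=lambda r: r["score"]), quotas["score"])
    take(sorted(table, key=lambda r: r["similarity"], reverse=True),
         quotas["similarity"])
    take(sorted(table, key=lambda r: r["weight"]), quotas["weight"])
    shuffled = list(table)
    rng.shuffle(shuffled)
    take(shuffled, quotas["random"])
    return chosen


# Synthetic master table with hypothetical values, for illustration only.
demo_rng = random.Random(42)
master_table = [
    {
        "id": i,
        "score": demo_rng.uniform(-9.0, -4.0),
        "similarity": demo_rng.random(),
        "weight": demo_rng.uniform(150.0, 600.0),
    }
    for i in range(200)
]

# Quotas as listed above.
quotas = {"score": 35, "similarity": 5, "weight": 5, "random": 5}
selected = select_candidates(master_table, quotas)
print(len(selected))
```

Keeping the quotas in a dictionary makes it easy to experiment with a different split between the criteria.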

Modified Approaches:

Global-Generation 0:

1) Train LSTM-CHEM on the ChEMBL database
2) From LSTM-CHEM, generate 10k SMILES
3) Check the validity of the generated SMILES
4) Compute the Tanimoto similarity and select only 1000
5) Give an ID to each of the 1000 SMILES, and manually add the SMILES of the HIV drugs and other drugs
6) Save everything in the master table and manually run it through PyRX to get the binding affinity

While each Global-Generation < n,

7) From the master table (loaded from the previous Global-Generation), select:

35 based on score
10 based on similarity
10 based on logP
10 based on weights
5 based on random generation

7.1) Pass the molecules selected in 7 to the local-GA to obtain 10 molecules with logP in the range 1.35 - 1.8

7.2) Combine all the molecules from 7 and 7.1 and pass them to 8

8) Using these 90 molecules, perform Transfer Learning and generate 5k SMILES.
9) Validate the 5k SMILES to make sure each one is a valid molecule.
10) After that, generate another 50 molecules with logP in the range 1.35 - 1.8 using local-GA.
11) Validate the 50 molecules generated by local-GA and combine them with the molecules from 9.
12) Export to sdf and evaluate with PyRX.

Note

There are a few ideas for improvement that we considered:

Replace the LSTM network with a Generative Adversarial Network (GAN). After discussion we concluded that this is not necessary, as the LSTM is good enough for this project, while a GAN is computationally expensive and requires much more training time.
Use a neural network to predict the binding affinity during evaluation. On reflection, a neural network only gives an estimate of the affinity, and relying on it is risky because its predictions contain errors.

Challenge

The first challenge we faced was limited computing power. VINA is a CPU-only tool, with no option to use a GPU to accelerate the docking process. Calculating the binding affinity of 1500 ligands takes roughly 18 hours with an exhaustiveness of 8, on a Windows machine with a Core i7-6700K. This severely limits what we can do in our algorithm design, as we need to reserve a lot of time for calculating the binding affinity of each ligand with the main protease of the coronavirus.
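
To put these numbers in perspective, 18 hours for 1500 ligands is about 43 seconds per ligand. The sketch below extrapolates that rate to larger libraries, such as the roughly 450,000 drugs mentioned under Future work. It assumes that docking time scales linearly with the number of ligands on the same single machine and settings, which is a simplification.

```python
SECONDS_PER_HOUR = 3600

# Figures reported above: 1500 ligands in roughly 18 hours.
MEASURED_LIGANDS = 1500
MEASURED_HOURS = 18


def seconds_per_ligand(hours: float, n_ligands: int) -> float:
    """Average docking time per ligand, in seconds."""
    return hours * SECONDS_PER_HOUR / n_ligands


def estimated_hours(n_ligands: int, per_ligand_seconds: float) -> float:
    """Estimated total docking time, assuming linear scaling (a simplification)."""
    return n_ligands * per_ligand_seconds / SECONDS_PER_HOUR


rate = seconds_per_ligand(MEASURED_HOURS, MEASURED_LIGANDS)
hours_full_library = estimated_hours(450_000, rate)
print(f"{rate:.1f} s per ligand")
print(f"{hours_full_library:.0f} hours, about {hours_full_library / 24:.0f} days")
```

At this rate, docking 450,000 ligands would take around 5,400 hours, or about 225 days, on one such machine, which is why faster docking appears in the future work.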

Future work

Increase the number of generations
Change local-GA parameters
Change the base network to Generative Adversarial Network (GAN)
Compute the binding affinity for all drugs on the market (~450,000) and use the results to generate gen0
GPU-based docking for faster evaluation

Reference

https://drugbank.s3-us-west-2.amazonaws.com/assets/blog/COVID-19_Web.pdf
https://www.acdlabs.com/download/app/physchem/making_sense.pdf
https://github.com/mattroconnor/deep_learning_coronavirus_cure
https://github.com/isayev/ReLeaSE
https://github.com/sirimullalab/dlscore
https://gitlab.com/cheminfIBB/pafnucy
https://github.com/jensengroup/GB-GA
https://github.com/jensengroup/GB-GM
https://chemrxiv.org/articles/Graph-based_Genetic_Algorithm_and_Generative_Model_Monte_Carlo_Tree_Search_for_the_Exploration_of_Chemical_Space/7240751
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6198856/
https://arxiv.org/abs/1703.10603
