MolSetRep is a Python library that provides encoders and machine learning models for molecular set representation learning. The included models are ready to be used with PyTorch Lightning.
Overview of set-based and set-enhanced models. All implemented models consist of three parts: an encoding or embedding layer, a set representation layer and, finally, a readout / MLP layer. (a) The simplest molecular set representation model, MSR1, takes molecules as input and encodes each atom as a 133-dimensional binary vector $\vec{a}_i$, collecting these vectors into molecular sets $A_i$. These sets with differing cardinalities are passed into a RepSet set representation layer and read out by a regression or classification MLP. (b) The dual molecular set representation model, MSR2, encodes the atoms and bonds of molecules into two distinct sets $A_i$ and $B_i$ and passes them to two separate RepSet layers whose outputs $A_{out}$ and
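The three-part structure described above (encoding, set representation, readout) can be illustrated with a small pure-Python sketch. This is not the library's implementation: the pooling below is a simplified, permutation-invariant stand-in in the spirit of a RepSet layer, the function names (`make_hidden_sets`, `set_representation`, `readout`) are hypothetical, the hidden sets are random rather than learned, and the toy vectors have 4 dimensions instead of 133.

```python
import random


def relu(x):
    return x if x > 0.0 else 0.0


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def make_hidden_sets(n_hidden_sets, n_elements, dim, seed=0):
    """Random stand-ins for the hidden sets a real layer would learn."""
    rng = random.Random(seed)
    return [
        [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n_elements)]
        for _ in range(n_hidden_sets)
    ]


def set_representation(atom_set, hidden_sets):
    """Map a set of atom vectors of any cardinality to a fixed-size vector."""
    rep = []
    for hidden in hidden_sets:
        # For each atom, keep its best ReLU similarity to this hidden set,
        # then sum over atoms, so the result ignores the atom order.
        rep.append(sum(max(relu(dot(a, w)) for w in hidden) for a in atom_set))
    return rep


def readout(rep, weights, bias):
    """A single linear unit standing in for the readout MLP."""
    return dot(rep, weights) + bias


# Two toy "molecules" with different numbers of atoms (binary atom vectors).
mol_a = [[1, 0, 0, 1], [0, 1, 0, 0], [0, 0, 1, 1]]
mol_b = [[1, 1, 0, 0], [0, 0, 0, 1]]

hidden_sets = make_hidden_sets(n_hidden_sets=3, n_elements=2, dim=4)
rep_a = set_representation(mol_a, hidden_sets)
rep_b = set_representation(mol_b, hidden_sets)
prediction = readout(rep_a, weights=[0.5, -0.2, 0.1], bias=0.0)
```

Both molecules end up with a representation of length 3 (one entry per hidden set) regardless of how many atoms they contain, which is what lets a fixed-size readout follow the set layer.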
pip install molsetrep
The code has been tested on Windows 11, Ubuntu 22.04, and macOS 13. Please let us know if you experience any issues on other operating systems or versions. All required dependencies are resolved during the installation from pip; the required package versions are configured in setup.cfg. The installation should take less than 1 minute, but may take longer if you are using a proxy.
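To confirm that the installation succeeded, the installed version can be queried with the standard library alone. This is a minimal sketch: the helper name `installed_version` is hypothetical, and the distribution name `molsetrep` is taken from the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints a version string if molsetrep is installed, otherwise None.
print(installed_version("molsetrep"))
```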
The following models / architectures and associated encoders are available. If you prefer not to use PyTorch Lightning, you can also use the torch modules directly.
- LightningSRClassifier
  - Wraps SRClassifier
  - Takes molecules encoded by SingleSetEncoder as input
- LightningSRRegressor
  - Wraps SRRegressor
  - Takes molecules encoded by SingleSetEncoder as input
- LightningDualSRClassifier
  - Wraps DualSRClassifier
  - Takes molecules encoded by DualSetEncoder as input
- LightningDualSRRegressor
  - Wraps DualSRRegressor
  - Takes molecules encoded by DualSetEncoder as input
- LightningSRGNNClassifier
  - Wraps SRGNNClassifier
  - Takes molecules encoded by GraphEncoder as input
- LightningSRGNNRegressor
  - Wraps SRGNNRegressor
  - Takes molecules encoded by GraphEncoder as input
- LightningDualSRClassifier
  - Wraps DualSRClassifier
  - Takes molecules encoded by RXNSetEncoder as input
- LightningDualSRRegressor
  - Wraps DualSRRegressor
  - Takes molecules encoded by RXNSetEncoder as input
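The pairings listed above can be summarised as a small lookup table. The sketch below uses plain strings only and does not import molsetrep; the class and encoder names are copied from the list, while the `MODELS` dictionary and the `wrappers_for_encoder` helper are hypothetical conveniences for illustration.

```python
# Lightning wrapper -> (wrapped torch module, encoders whose output it accepts)
MODELS = {
    "LightningSRClassifier": ("SRClassifier", ("SingleSetEncoder",)),
    "LightningSRRegressor": ("SRRegressor", ("SingleSetEncoder",)),
    "LightningDualSRClassifier": ("DualSRClassifier", ("DualSetEncoder", "RXNSetEncoder")),
    "LightningDualSRRegressor": ("DualSRRegressor", ("DualSetEncoder", "RXNSetEncoder")),
    "LightningSRGNNClassifier": ("SRGNNClassifier", ("GraphEncoder",)),
    "LightningSRGNNRegressor": ("SRGNNRegressor", ("GraphEncoder",)),
}


def wrappers_for_encoder(encoder):
    """Return the Lightning wrappers listed as accepting the given encoder."""
    return sorted(
        name for name, (_, encoders) in MODELS.items() if encoder in encoders
    )


print(wrappers_for_encoder("RXNSetEncoder"))
```

Keeping the torch module name next to each wrapper also shows which class to reach for when working without PyTorch Lightning.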
An example of molecular set representation learning for molecular property prediction using single sets, dual sets, and set-enhanced GNNs can be found in the notebook example/property_prediction.ipynb.
An example of molecular set representation learning for protein-ligand binding affinity prediction using dual sets can be found in the notebook example/property_prediction.ipynb. For this example, make sure you have downloaded the PDBbind database (or any other data set you want to use) and prepared it using the script scripts/preprocess_pdbbind.py.
An example of molecular set representation learning for reaction yield prediction using dual sets can be found in the notebook example/property_prediction.ipynb.
The shell scripts in the folder evaluation can be used to reproduce the data reported in the manuscript. However, the results may vary depending on the hardware used.
For protein-ligand binding affinity prediction, make sure you have downloaded the PDBbind database (or any other data set you may want to use) and prepared it using the script scripts/preprocess_pdbbind.py.
@article{boulougouri_vandergheynst_probst_2023,
title={Molecular set representation learning},
DOI={10.26434/chemrxiv-2023-fk7kf},
journal={ChemRxiv},
publisher={Cambridge Open Engage},
author={Boulougouri, Maria and Vandergheynst, Pierre and Probst, Daniel},
year={2023}
}