Merge branch 'pluskal-lab:main' into main
anton-bushuiev authored Dec 27, 2024
2 parents 00260bf + 11d94f3 commit da041eb
Showing 12 changed files with 189 additions and 3,570 deletions.
111 changes: 88 additions & 23 deletions README.md
@@ -1,52 +1,106 @@
# MassSpecGym: A benchmark for the discovery and identification of molecules

<p>
<a href="https://huggingface.co/datasets/roman-bushuiev/MassSpecGym"><img alt="Code style: black" src="https://huggingface.co/datasets/huggingface/badges/resolve/main/dataset-on-hf-md-dark.svg" height="22px"></a>
<a href="https://huggingface.co/datasets/roman-bushuiev/MassSpecGym"><img alt="Dataset on Hugging Face" src="https://huggingface.co/datasets/huggingface/badges/resolve/main/dataset-on-hf-md-dark.svg" height="22px"></a>
<a href="https://doi.org/10.48550/arXiv.2410.23326"><img alt="arXiv badge" src="https://img.shields.io/badge/arXiv-2410.23326-b31b1b.svg" height="22px"></a>
<a href="https://pypi.org/project/massspecgym"><img alt="PyPI version" src="https://img.shields.io/pypi/v/massspecgym" height="22px"></a>
<a href="https://github.com/pytorch/pytorch"> <img src="https://img.shields.io/badge/PyTorch-%23EE4C2C.svg?style=for-the-badge&logo=PyTorch&logoColor=white" height="22px"></a>
<a href="https://github.com/Lightning-AI/pytorch-lightning"> <img src="https://img.shields.io/badge/-Lightning-792ee5?logo=pytorchlightning&logoColor=white" height="22px"></a>
<a href="https://github.com/psf/black"><img alt="Code style: black" src="https://img.shields.io/badge/code%20style-black-000000.svg" height="22px"></a>
<a href="https://paperswithcode.com/sota/de-novo-molecule-generation-from-ms-ms?p=massspecgym-a-benchmark-for-the-discovery-and"><img alt="PWC" src="https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/massspecgym-a-benchmark-for-the-discovery-and/de-novo-molecule-generation-from-ms-ms" height="22px"></a>
</p>

<p align="center">
<img src="assets/MassSpecGym_abstract.svg" width="80%"/>
<img src="https://raw.githubusercontent.com/pluskal-lab/MassSpecGym/5d7d58af99947988f947eeb5bd5c6a472c2938b7/assets/MassSpecGym_abstract.svg" width="80%"/>
</p>

MassSpecGym provides three challenges for benchmarking the discovery and identification of new molecules from MS/MS spectra. The provided challenges abstract the process of scientific discovery from biological and environmental samples into well-defined machine learning problems.
MassSpecGym provides three challenges for benchmarking the discovery and identification of new molecules from MS/MS spectra:

- 💥 ***De novo* molecule generation** (MS/MS spectrum → molecular structure)
    - **Bonus chemical formulae challenge** (MS/MS spectrum + chemical formula → molecular structure)
- 💥 **Molecule retrieval** (MS/MS spectrum → ranked list of candidate molecular structures)
    - **Bonus chemical formulae challenge** (MS/MS spectrum → ranked list of candidate molecular structures with ground-truth chemical formulae)
- 💥 **Spectrum simulation** (molecular structure → MS/MS spectrum)
    - **Bonus chemical formulae challenge** (molecular structure → MS/MS spectrum; evaluated on the retrieval of molecular structures with ground-truth chemical formulae)

The provided challenges abstract the process of scientific discovery from biological and environmental samples into well-defined machine learning problems with pre-defined datasets, data splits, and evaluation metrics.

<!-- [![Dataset on Hugging Face](https://huggingface.co/datasets/huggingface/badges/resolve/main/dataset-on-hf-md-dark.svg)](https://huggingface.co/datasets/roman-bushuiev/MassSpecGym) -->

📣 The paper will be available soon!
📚 Please see more details in our [NeurIPS 2024 Spotlight paper](https://arxiv.org/abs/2410.23326).

## Installation
## 📦 Installation

Installation steps:
Installation is available via [pip](https://pypi.org/project/massspecgym):

```bash
conda create -n massspecgym python=3.11
pip install massspecgym
```

If you use conda, we recommend creating and activating a new environment before installing MassSpecGym:

```bash
conda create -n massspecgym python==3.11
conda activate massspecgym
git clone https://github.com/pluskal-lab/MassSpecGym.git; cd MassSpecGym
pip install -e .[dev,notebooks]
```

For AMD GPUs, you may need to install PyTorch for ROCm:
If you are planning to run Jupyter notebooks provided in the repository or contribute to the project, we recommend installing the optional dependencies:

```bash
pip install -U torch==2.3.0 --index-url https://download.pytorch.org/whl/rocm6.0
pip install "massspecgym[notebooks,dev]"
```
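
To confirm that the installation succeeded, one option is to query the installed version with the Python standard library. This is a minimal sketch: the helper name `installed_version` is ours and is not part of MassSpecGym.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("massspecgym")
    if found is None:
        print("massspecgym is not installed in this environment")
    else:
        print(f"massspecgym {found} is installed")
```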

📣 Easier installation via `pip` will be available soon!
<!-- For AMD GPUs, you may need to install PyTorch for ROCm:
## MassSpecGym infrastructure
```bash
pip install -U torch==2.3.0 --index-url https://download.pytorch.org/whl/rocm6.0
``` -->

## 🍩 Getting started with MassSpecGym

<p align="center">
<img src="assets/MassSpecGym_infrastructure.svg" width="80%"/>
<img src="https://raw.githubusercontent.com/pluskal-lab/MassSpecGym/5d7d58af99947988f947eeb5bd5c6a472c2938b7/assets/MassSpecGym_infrastructure.svg" width="80%"/>
</p>

## Train and evaluate your model 🚀
MassSpecGym’s infrastructure consists of predefined components that serve as building blocks for the implementation and evaluation of new models.

First of all, the MassSpecGym dataset is available as a [Hugging Face dataset](https://huggingface.co/datasets/roman-bushuiev/MassSpecGym) and can be downloaded within the code into a pandas DataFrame as follows.

```python
from massspecgym.utils import load_massspecgym
df = load_massspecgym()
```

Second, MassSpecGym provides [a set of transforms](https://github.com/pluskal-lab/MassSpecGym/blob/main/massspecgym/data/transforms.py) for spectra and molecules, which can be used to preprocess data for machine learning models. These transforms can be used in conjunction with the `MassSpecDataset` class (or its subclasses), resulting in a PyTorch `Dataset` object that implicitly applies the specified transforms to each data point. Note that `MassSpecDataset` also automatically downloads the dataset from the Hugging Face repository as needed.

```python
from massspecgym.data import MassSpecDataset
from massspecgym.data.transforms import SpecTokenizer, MolFingerprinter

# Transforms are applied to each spectrum and molecule when an item is accessed
dataset = MassSpecDataset(
    spec_transform=SpecTokenizer(n_peaks=60),
    mol_transform=MolFingerprinter(),
)
```
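
The idea of a dataset that applies transforms on access can be sketched in plain Python. This is an illustration only and not how `MassSpecDataset`, `SpecTokenizer`, or `MolFingerprinter` are implemented: `ToyDataset`, `keep_top_peaks`, and `count_characters` are hypothetical stand-ins, and reading `n_peaks` as "keep the most intense peaks" is our assumption.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


def keep_top_peaks(n_peaks: int) -> Callable[[Sequence[Peak]], List[Peak]]:
    """Build a transform that keeps the n most intense peaks, zero-padded to n."""
    def transform(peaks: Sequence[Peak]) -> List[Peak]:
        top = sorted(peaks, key=lambda p: p[1], reverse=True)[:n_peaks]
        top.sort(key=lambda p: p[0])  # restore ascending m/z order
        return top + [(0.0, 0.0)] * (n_peaks - len(top))
    return transform


def count_characters(smiles: str) -> Dict[str, int]:
    """Toy molecule transform: count the characters C, N, and O in a string.

    This is not a chemically meaningful fingerprint.
    """
    return {char: smiles.count(char) for char in "CNO"}


class ToyDataset:
    """Hypothetical dataset that applies both transforms on item access."""

    def __init__(self, records, spec_transform, mol_transform):
        self.records = records
        self.spec_transform = spec_transform
        self.mol_transform = mol_transform

    def __len__(self) -> int:
        return len(self.records)

    def __getitem__(self, index: int) -> dict:
        peaks, smiles = self.records[index]
        return {
            "spec": self.spec_transform(peaks),
            "mol": self.mol_transform(smiles),
        }


dataset = ToyDataset(
    records=[([(50.0, 0.1), (77.0, 1.0), (105.0, 0.4)], "CCO")],
    spec_transform=keep_top_peaks(n_peaks=2),
    mol_transform=count_characters,
)
item = dataset[0]  # transforms run here, not at construction time
```

Because the transforms run inside `__getitem__`, swapping in a different preprocessing step only requires passing a different callable to the constructor.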

Third, MassSpecGym provides a `MassSpecDataModule`, a PyTorch Lightning [LightningDataModule](https://lightning.ai/docs/pytorch/stable/data/datamodule.html) that automatically handles data splitting into training, validation, and testing folds, as well as loading data into batches.

```python
from massspecgym.data import MassSpecDataModule

data_module = MassSpecDataModule(
    dataset=dataset,
    batch_size=32
)
```
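
Conceptually, a data module of this kind partitions examples by a predefined fold label and serves each partition in batches. The sketch below shows that idea with the standard library only. The `fold` key, the fold names, and the helpers `split_by_fold` and `batches` are assumptions made for illustration and do not reflect the internals of `MassSpecDataModule`.

```python
from collections import defaultdict
from typing import Dict, Iterator, List


def split_by_fold(examples: List[dict]) -> Dict[str, List[dict]]:
    """Group examples by their (assumed) 'fold' label, preserving order."""
    folds = defaultdict(list)
    for example in examples:
        folds[example["fold"]].append(example)
    return dict(folds)


def batches(items: List[dict], batch_size: int) -> Iterator[List[dict]]:
    """Yield consecutive batches; the last one may be smaller."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Toy examples: five for training, two for validation, two for testing
labels = ["train"] * 5 + ["val"] * 2 + ["test"] * 2
examples = [{"id": i, "fold": fold} for i, fold in enumerate(labels)]

folds = split_by_fold(examples)
train_batches = list(batches(folds["train"], batch_size=2))
```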

Finally, MassSpecGym defines evaluation metrics by implementing abstract subclasses of `LightningModule` for each of the MassSpecGym challenges: [`DeNovoMassSpecGymModel`](https://github.com/pluskal-lab/MassSpecGym/blob/df2ff567ed5ad60244b4106a180aaebc3c787b7e/massspecgym/models/de_novo/base.py#L14), [`RetrievalMassSpecGymModel`](https://github.com/pluskal-lab/MassSpecGym/blob/df2ff567ed5ad60244b4106a180aaebc3c787b7e/massspecgym/models/retrieval/base.py#L14), and [`SimulationMassSpecGymModel`](https://github.com/pluskal-lab/MassSpecGym/blob/df2ff567ed5ad60244b4106a180aaebc3c787b7e/massspecgym/models/simulation/base.py#L12). To implement a custom model, you should inherit from the appropriate abstract class and implement the `forward` and `step` methods. This procedure is described in the next section. If you are looking for more examples, please see the [`massspecgym/models`](https://github.com/pluskal-lab/MassSpecGym/tree/df2ff567ed5ad60244b4106a180aaebc3c787b7e/massspecgym/models) folder.

## 🚀 Train and evaluate your model

MassSpecGym allows you to implement, train, validate, and test your model with a few lines of code. Built on top of PyTorch Lightning, MassSpecGym abstracts data preparation and splitting while eliminating boilerplate code for training and evaluation loops. To train and evaluate your model, you only need to implement your custom architecture and prediction logic.

Below is an example of how to implement a simple model based on [DeepSets](https://arxiv.org/abs/1703.06114) for the molecule retrieval task. The model is trained to predict the fingerprint of a molecule from its spectrum and then retrieves the most similar molecules from a set of candidates based on fingerprint similarity. For more examples, please see `notebooks/demo.ipynb`.
Below is an example of how to implement a simple model based on [DeepSets](https://arxiv.org/abs/1703.06114) for the molecule retrieval task. The model is trained to predict the fingerprint of a molecule from its spectrum and then retrieves the most similar molecules from a set of candidates based on fingerprint similarity. For more examples, please see [`notebooks/demo.ipynb`](https://github.com/pluskal-lab/MassSpecGym/blob/df2ff567ed5ad60244b4106a180aaebc3c787b7e/notebooks/demo.ipynb).
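
The retrieval step described above, ranking candidates by the similarity between a predicted fingerprint and each candidate's fingerprint, can be sketched without any deep learning framework. This is a minimal illustration that assumes cosine similarity and short made-up fingerprints. The helper names and candidate labels are hypothetical, and the similarity measure used in the MassSpecGym example may differ.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def rank_candidates(
    predicted: Sequence[float],
    candidates: Dict[str, Sequence[float]],
) -> List[str]:
    """Return candidate names sorted from most to least similar."""
    return sorted(
        candidates,
        key=lambda name: cosine_similarity(predicted, candidates[name]),
        reverse=True,
    )


# Made-up 4-bit fingerprints, for illustration only
predicted = [0.9, 0.1, 0.8, 0.0]
candidates = {
    "candidate_a": [1, 0, 1, 0],
    "candidate_b": [1, 1, 0, 0],
    "candidate_c": [0, 0, 0, 1],
}
ranking = rank_candidates(predicted, candidates)
```

A retrieval metric such as hit rate at k then reduces to checking whether the true molecule appears among the first k entries of `ranking`.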

1. Import necessary modules:

@@ -150,17 +204,28 @@ trainer = Trainer(accelerator="cpu", devices=1, max_epochs=5)
trainer.fit(model, datamodule=data_module)
```

4. Test your model (the test API will be available soon):
4. Test your model:

```python
# Test
trainer.test(model, datamodule=data_module)
```

## TODO
## 🏅 Submit your results to the leaderboard

- [x] Croissant.
- [ ] Testing API.
- [ ] Optimize de novo evaluation metrics to run in parallel by workers initialized in the corresponding pl.Module constructor
- [ ] Link to documentation.
- [ ] Link to Papers With Code leaderboard (requires url to paper).
The MassSpecGym leaderboard is available on the [Papers with Code website](https://paperswithcode.com/dataset/massspecgym). To submit your results, please see the [following tutorial](https://github.com/paperswithcode/tutorials/blob/main/add_results.md).

## 🔗 References

If you use MassSpecGym in your work, please cite the following paper:

```bibtex
@article{bushuiev2024massspecgym,
title={MassSpecGym: A benchmark for the discovery and identification of molecules},
author={Roman Bushuiev and Anton Bushuiev and Niek F. de Jonge and Adamo Young and Fleming Kretschmer and Raman Samusevich and Janne Heirman and Fei Wang and Luke Zhang and Kai Dührkop and Marcus Ludwig and Nils A. Haupt and Apurva Kalia and Corinna Brungs and Robin Schmid and Russell Greiner and Bo Wang and David S. Wishart and Li-Ping Liu and Juho Rousu and Wout Bittremieux and Hannes Rost and Tytus D. Mak and Soha Hassoun and Florian Huber and Justin J. J. van der Hooft and Michael A. Stravs and Sebastian Böcker and Josef Sivic and Tomáš Pluskal},
year={2024},
eprint={2410.23326},
url={https://arxiv.org/abs/2410.23326},
doi={10.48550/arXiv.2410.23326}
}
```
Binary file added assets/MassSpecGym_abstract.png