SaprotHub v1 (will be deprecated in future)

Catalog

0: Preliminary
1: Installation
- 1.1: Switch your runtime type to GPU
2: Train and Share your model
- 2.1: Train your model
- 2.2: Upload your model
3: Use your model to predict

0.1: Task Overview

Different models are designed for different tasks, so it's essential to understand which type your task belongs to.

📍To view the full list of tasks supported by ColabSaprot, please refer to task_list.md.

Task type

Here are the task types and their descriptions, so you can identify your task type from your task description and objectives.

For Classification and Regression prediction task:

Protein-level Classification Task
Protein-level Regression Task
Residue-level Classification Task
Protein-protein Classification Task
Protein-protein Regression Task

For Zero-shot prediction task:

Mutational effect prediction
Inverse folding prediction

Classification and Regression prediction task

Train a model based on SaProt and use it to make predictions.

Task Type	Task Description	Example
Protein-level Classification	Classify protein sequences.	- Fold Class Prediction - Localization Prediction - Function Prediction
Protein-level Regression	Predict the value of some property of a protein sequence.	- Thermal Stability Prediction - Fluorescence Intensity Prediction - Binding Affinity Prediction
Residue-level Classification	Classify the amino acids in a protein sequence.	- Secondary Structure Prediction - Binding Site Prediction - Active Site Prediction
Protein-protein Classification	Predict whether the two proteins interact.	- Protein-Protein Interaction (PPI) Prediction - Interaction Type Classification - Disease-Associated Interaction Prediction
Protein-protein Regression	Predict the strength of the interaction between the two proteins.	- Interaction Strength Prediction - Binding Free Energy Calculation - Interaction Affinity Prediction

Zero-shot prediction task

Directly use SaProt (650M) to make predictions.

Task Type	Task Description	Example
Mutational Effect Prediction	Predict the mutational effect based on the wild type sequence and mutation information.	- Enzyme Activity Prediction - Virus Fitness Prediction - Driver Mutation Prediction
Inverse Folding Prediction	Predict the residue sequence given the structure backbone.	- Enzyme Function Optimization - Protein Stability Enhancement - Protein Folding Prediction

0.2: Dataset Overview

You can use your private data to train and predict. Below are the various data formats corresponding to different data types.

What is SA(Structure-aware) Sequence

We combine the residue and structure tokens at each residue site to create a Structure-aware sequence (SA sequence), merging both residue and structural information.

The structure tokens are generated by encoding the 3D structure of proteins using Foldseek.

Here you can convert your data into SA Sequence format.
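As a rough illustration of the interleaving described above, the sketch below pairs each residue letter with one structure token. The helper names `build_sa_sequence` and `split_sa_sequence` are hypothetical and not part of SaprotHub. The structure tokens are assumed to have been produced by Foldseek already, and the upper-case/lower-case convention is inferred from the `MdEvEvTv...` example used elsewhere on this page.

```python
def build_sa_sequence(aa_seq: str, struct_tokens: str) -> str:
    """Interleave residues with per-residue structure tokens (hypothetical helper)."""
    if len(aa_seq) != len(struct_tokens):
        raise ValueError("need exactly one structure token per residue")
    # Upper-case residue followed by lower-case structure token at each site.
    return "".join(aa.upper() + tok.lower() for aa, tok in zip(aa_seq, struct_tokens))


def split_sa_sequence(sa_seq: str) -> tuple[str, str]:
    """Recover the residue string and the structure-token string."""
    if len(sa_seq) % 2 != 0:
        raise ValueError("SA sequence must have an even number of characters")
    return sa_seq[0::2], sa_seq[1::2]


sa = build_sa_sequence("MEETMKLATM", "dvvvppppap")
print(sa)  # MdEvEvTvMpKpLpApTaMp
```

An SA sequence is therefore twice as long, in characters, as the amino acid sequence it was built from.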

Data Type

Single AA Sequence
Single SA Sequence
Single UniProt ID
Single PDB/CIF Structure
Multiple AA Sequences
Multiple SA Sequences
Multiple UniProt IDs
Multiple PDB/CIF Structures
SaprotHub Dataset

For tasks that require two protein sequences as input (pair classification & pair regression):

A pair of AA Sequences
A pair of SA Sequences
A pair of UniProt IDs
A pair of PDB/CIF Structures
Multiple pairs of AA Sequences
Multiple pairs of SA Sequences
Multiple pairs of UniProt IDs
Multiple pairs of PDB/CIF Structures

How to find a SaprotHub Dataset

Go to Official SaProtHub Repository to find some datasets.
Copy the Dataset ID for future use.

Scripts for dataset preparation

	Link
Get Structure-Aware Sequence	here
Convert .fa file to .csv dataset (data type:`Multiple AA sequences`)	here
Randomly split your dataset	here
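The linked scripts are not reproduced here. As a minimal sketch of what the second and third rows describe, the following uses only the Python standard library. The function names, the 80/10/10 split ratios and the fixed seed are assumptions for illustration, not the behaviour of the official scripts.

```python
import csv
import io
import random


def fasta_to_sequences(fasta_text: str) -> list[str]:
    """Collect one sequence string per FASTA record."""
    sequences, current = [], []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header line closes the previous record, if any.
            if current:
                sequences.append("".join(current))
            current = []
        else:
            current.append(line)
    if current:
        sequences.append("".join(current))
    return sequences


def assign_stages(n: int, seed: int = 0, ratios=(0.8, 0.1, 0.1)) -> list[str]:
    """Randomly label n samples as train/valid/test (ratios are an assumption)."""
    n_train = int(n * ratios[0])
    n_valid = int(n * ratios[1])
    stages = ["train"] * n_train + ["valid"] * n_valid + ["test"] * (n - n_train - n_valid)
    random.Random(seed).shuffle(stages)
    return stages


fasta = ">p1\nMEETMK\nLATM\n>p2\nMKTAYIAK\n"
seqs = fasta_to_sequences(fasta)

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["sequence"])  # one-column layout of `Multiple AA Sequences`
writer.writerows([s] for s in seqs)
print(buffer.getvalue())
```

For a training dataset, the output of `assign_stages` could be written to a `stage` column next to `sequence` and `label`.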

0.3: Model Overview

Model type

Official pretrained SaProt (35M)
Official pretrained SaProt (650M)
Trained by yourself on ColabSaprot
Shared by peers on SaprotHub
Saved in your local computer
Multi-models on SaprotHub

Model type	Used for	Description	Input
`Official pretrained SaProt (35M)`	Training	Train a protein language model based on SaProt(35M) with your dataset	-
`Official pretrained SaProt (650M)`	Training	Train a protein language model based on SaProt(650M) with your dataset	-
`Trained by yourself on ColabSaprot`	Continual training, Prediction	After you have finished training a model on ColabSaprot, select this option to use it for continual training or prediction	Select the model from the dropdown menu
`Shared by peers on SaprotHub`	Continual training, Prediction	Use models shared on SaprotHub for continual training or prediction	Enter the model ID
`Saved in your local computer`	Continual training, Prediction	Use a model saved on your local computer (the .zip file saved when training finished) for continual training or prediction	Upload the .zip file
`Multi-models on SaprotHub`	Prediction	Ensemble multiple models shared on SaprotHub for prediction. Each sample will be predicted using multiple models. Note that: For classification tasks, voting will be used to determine the final predicted category; for regression tasks, the predicted values from each model will be averaged.	Enter the model IDs
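The two ensembling rules in the last row can be sketched as follows. This is an illustration of majority voting and averaging in general, not ColabSaprot's own code. The function names are hypothetical, and the tie-breaking rule (the first label seen wins) is an assumption, since the page does not say how ties are resolved.

```python
from collections import Counter
from statistics import mean


def ensemble_classification(predictions: list[int]) -> int:
    """Majority vote over the class indices predicted by each model."""
    # Counter.most_common keeps first-seen order among equal counts, so a tie
    # falls back to the earliest model's prediction (an assumption).
    return Counter(predictions).most_common(1)[0][0]


def ensemble_regression(predictions: list[float]) -> float:
    """Average the values predicted by each model."""
    return mean(predictions)


print(ensemble_classification([1, 0, 1]))        # 1
print(ensemble_regression([0.25, 0.5, 0.75]))    # 0.5
```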

How to find a model on SaprotHub

Go to Official SaProtHub Repository to find a model that meets your requirements.
Copy the Model ID for future use.

0.4: Contribute to SaprotHub

Join SaprotHub Organization

Before contributing to SaprotHub, you need to join the SaprotHub Huggingface Organization to gain write access to the subset of repos within the Organization that you have created.

Contribute to SaprotHub

You have two ways to contribute to SaprotHub:

1. Transfer your model to SaprotHub (Recommended)

Once you have uploaded the model to your Huggingface repository using ColabSaprot, you can directly transfer your model to SaprotHub.

2. Create a new model repository and upload model files

You can manually create a new model repository on SaprotHub, and then upload the model files to this repository.

1.1: Switch your runtime type to GPU

⚠️IMPORTANT⚠️

Before installing SaProt, please SWITCH YOUR RUNTIME TYPE TO GPU!!!

Current runtime type

You can check the current runtime type in the upper right corner of the page.

If the current runtime type is CPU, you need to switch it to GPU (either the free T4 or the paid A100) for a better training experience.
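The runtime type can also be checked from a code cell. The sketch below is one possible check and is not part of the ColabSaprot notebook: it assumes that a GPU runtime exposes a working `nvidia-smi` command, and the function name `gpu_visible` is hypothetical.

```python
import subprocess


def gpu_visible() -> bool:
    """Return True if `nvidia-smi` runs successfully, i.e. a GPU driver responds."""
    try:
        result = subprocess.run(["nvidia-smi"], capture_output=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        # The command is missing, cannot be executed, or did not answer in time.
        return False
    return result.returncode == 0


print("GPU runtime detected" if gpu_visible() else "No GPU detected: switch the runtime type")
```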

Switch runtime type

Please follow the steps below to switch the runtime to GPU:

Click the dropdown button
Select option "Change runtime type"
Select a GPU
Click "Save" button
Each time you switch the runtime, all code blocks need to be re-executed.

2.1: Train your model

Video

Task type

Click here for detailed information on each task type.

Protein-level Classification
Protein-level Regression
Residue-level Classification
Protein-protein Classification
Protein-protein Regression

Base model

Click here for detailed information on each model type.

Official pretrained SaProt (35M)
Official pretrained SaProt (650M)
Trained by yourself on ColabSaprot
Shared by peers on SaprotHub
Saved in your local computer

Training dataset

Example Dataset

Example datasets are available in this folder and this path /SaprotHub/upload_files.

Dataset Format

Dataset should be a .csv file with three required columns: sequence, label and stage

The content of column sequence depends on your data type. See the table
The content of column label depends on your task type. See the table
The column stage indicates whether the sample is used for training, validation, or testing.

⚠️IMPORTANT⚠️:

Ensure your dataset includes samples for all three stages. The values are: train, valid, test.
Due to GPU memory limits, protein sequences used for training (where the stage column is train) will be truncated to the first 1024 amino acids, while sequences for validation and testing will remain uncut.
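Before uploading, the requirements above can be checked locally. The following is a minimal sketch using the Python standard library. The function name `check_dataset` is hypothetical, and the length check assumes the data type `Multiple AA Sequences`, where one character corresponds to one amino acid.

```python
import csv
import io

REQUIRED_COLUMNS = {"sequence", "label", "stage"}
ALLOWED_STAGES = {"train", "valid", "test"}
MAX_TRAIN_LENGTH = 1024  # truncation length stated above


def check_dataset(csv_text: str) -> list[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        return [f"missing columns: {sorted(missing)}"]
    problems = []
    seen_stages = set()
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        stage = row["stage"]
        if stage not in ALLOWED_STAGES:
            problems.append(f"row {row_number}: unknown stage {stage!r}")
            continue
        seen_stages.add(stage)
        # Assumes AA sequences: one character per amino acid.
        if stage == "train" and len(row["sequence"]) > MAX_TRAIN_LENGTH:
            problems.append(f"row {row_number}: training sequence will be truncated")
    for stage in sorted(ALLOWED_STAGES - seen_stages):
        problems.append(f"no samples with stage {stage!r}")
    return problems


example = "sequence,label,stage\nMEETMKLATM,0,train\nMKTAYIAK,1,valid\n"
print(check_dataset(example))  # ["no samples with stage 'test'"]
```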

Data type	Interface	Input	Example
`Multiple AA Sequences`	An upload button	`file`: the .csv file containing three columns: `sequence`, `label` and `stage`
`Multiple SA Sequences`	An upload button	`file`: the .csv file containing three columns: `sequence`, `label` and `stage`
`Multiple UniProt IDs`	An upload button	`file`: the .csv file containing three columns: `sequence`, `label` and `stage`
`Multiple PDB/CIF Structures`	Two upload buttons	`file`: a .csv file containing five columns: `sequence`, `type`, `chain`, `label` and `stage` `structure files`: a .zip file containing all the structure files	`type`: Indicate whether the structure file is a real PDB structure or an AlphaFold 2 predicted structure. For AF2 (AlphaFold 2) structures, we will apply pLDDT masking. The value must be either "PDB" or "AF2". `chain`: For real PDB structures, since multiple chains may exist in one .pdb file, it is necessary to specify which chain is used. For AF2 structures, the chain is assumed to be A by default.
`SaprotHub Dataset`	An input box	`Dataset ID`: SaprotHub Dataset ID	Find more datasets on SaprotHub

Format of column `label`

Example of column label for different task types (the data type in these examples is Multiple SA sequences)

Task type	Label	Description
Protein-level classification	Category index starting from zero	- The task has 2 protein sequence categories: 0, 1. - Each protein sequence has a corresponding category index.
Protein-level regression	Numerical values	- Each protein sequence has a corresponding numerical label to represent the value of some property.
Residue-level classification	A list of category indices for each amino acid	- The task has 3 amino acid categories: 0, 1, 2. - Each amino acid has a corresponding category index.

Dataset for protein-protein tasks

Training config

Regular config

Parameter	Description
`batch_size`	`batch_size` depends on the number of training samples. "Adaptive" (the default) chooses the batch size automatically according to the size of your data. If your training set is large enough, you can use 32, 64, 128, 256, ...; for smaller training sets, use 8, 4 or 2. Note that: You cannot use a larger batch size on the default Colab T4 GPU. We strongly suggest subscribing to Colab Pro for an A100 GPU.
`max_epochs`	`max_epochs` is the maximum number of training epochs. A larger value requires more training time. The best model is saved as training progresses. You can adjust `max_epochs` to control training duration. Note that: The maximum running time of Colab is 12 hours for unsubscribed users or 24 hours for Colab Pro+ users.
`learning_rate`	`learning_rate` affects the convergence speed of the model. Through experimentation, we have found that `5.0e-4` is a good default value for base model `Official pretrained SaProt (650M)` and `1.0e-3` for `Official pretrained SaProt (35M)`.

Advanced Config

For users with some machine learning background who want to further customize the training process, we offer some advanced settings. Simply expand the code cell and modify the values of the variables to take effect.

Parameter	Description
`GPU_batch_size`	The `GPU_batch_size` determines the number of samples in a batch on a single GPU. Note that: You need to modify both `GPU_batch_size` and `accumulate_grad_batches` simultaneously and the `batch_size` selected in the dropdown menu will be overridden.
`accumulate_grad_batches`	Due to hardware limitations, we may not be able to use enough samples in a single batch to perform a gradient update for the model. Therefore, we can adjust the `accumulate_grad_batches` parameter, which controls how many batches of samples the model will use for a single gradient update. Note that: You need to modify both `GPU_batch_size` and `accumulate_grad_batches` simultaneously and the `batch_size` selected in the dropdown menu will be overridden.
`num_workers`	`num_workers` specifies the number of threads or processes used for parallel data loading and processing.
`seed`	The `seed` can control the pseudorandom number sequence.
`r`	`r` represents the rank of the low-rank decomposition in LoRA.
`lora_dropout`	`lora_dropout` specifies the dropout rate applied during LoRA training to prevent overfitting.
`lora_alpha`	`lora_alpha` is a scaling factor used to balance the contribution of the low-rank components in LoRA.
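The link between `GPU_batch_size` and `accumulate_grad_batches` can be made concrete. With gradient accumulation on a single GPU, the number of samples contributing to one gradient update is the product of the two values. The sketch below only does this arithmetic. The function name `plan_accumulation` is hypothetical, and how ColabSaprot's "Adaptive" option picks its values is not shown here.

```python
def plan_accumulation(target_batch_size: int, gpu_batch_size: int) -> tuple[int, int]:
    """Return (accumulate_grad_batches, effective_batch_size) for one GPU."""
    if target_batch_size < 1 or gpu_batch_size < 1:
        raise ValueError("batch sizes must be positive")
    # Ceiling division, so the effective batch is never smaller than the target.
    accumulate_grad_batches = -(-target_batch_size // gpu_batch_size)
    return accumulate_grad_batches, accumulate_grad_batches * gpu_batch_size


print(plan_accumulation(64, 4))   # (16, 64)
print(plan_accumulation(30, 8))   # (4, 32)
```

For example, if only 4 samples fit in GPU memory at once, accumulating 16 batches gives the same effective batch size as a single batch of 64.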

Interrupt Training to Avoid Overfitting

After each validation (default validation interval is half an epoch), the model with the highest performance on the validation set will be automatically saved.

⚠️Therefore, avoid interrupting the training before the first validation (or before the curve plots are generated), as this will prevent the model from being created.

During the training process, if you observe that the current model is at risk of overfitting by looking at the curve, you can interrupt the training at any time.

After interruption, the program will automatically test the model's generalization performance on the test set.
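The best-model rule described above can be illustrated with a toy loop. This is a sketch under stated assumptions, not the ColabSaprot training code: `validation_scores` stands in for the metric measured at each validation, a higher score is assumed to be better, and "saving" is reduced to remembering the step.

```python
def train_with_best_checkpoint(validation_scores):
    """Simulate training and keep the best validation score seen so far."""
    best_step, best_score = None, None
    try:
        for step, score in enumerate(validation_scores, start=1):
            # A real run would train the model between validations.
            if best_score is None or score > best_score:
                best_step, best_score = step, score  # "save" the checkpoint
    except KeyboardInterrupt:
        # An interrupt keeps whatever checkpoint was saved before it.
        pass
    return best_step, best_score


print(train_with_best_checkpoint([0.61, 0.70, 0.74, 0.72, 0.69]))  # (3, 0.74)
```

If the loop stops before the first validation, nothing has been saved yet and the function returns `(None, None)`, which mirrors the warning above.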

You can interrupt the training in the following ways:

Use the shortcut key: Command/Ctrl + M + I
In the top menu of the Colab interface, select Runtime -> Interrupt execution.

Instruction

Complete the configs and then click the run button
Complete additional input and then click the “Start Training” button
Monitor the training process by the progress bar and the plots
Check test result and save the model

2.2: Upload your model

You can upload the model to your Huggingface repository and then contribute it to SaprotHub.

Config

You need to add some description for your model:

name: The name of your model.
description: The description of your model (which task is your model used for).
label_meanings:
- For classification models, please provide detailed information about the meanings of all labels.
- For regression models, please provide the numerical range of the values.

Example

For classification models:

Parameter	Value
name	Subcellular_Localization
description	This model is used for the Subcellular Localization Classification Task. It takes a protein sequence as input and outputs which of the 10 categories the protein belongs to.
label_meanings	Nucleus, Cytoplasm, Extracellular, Mitochondrion, Cell.membrane, Endoplasmic.reticulum, Plastid, Golgi.apparatus, Lysosome/Vacuole, Peroxisome

For regression models:

Parameter	Value
name	Thermostability
description	This model is used for the Thermostability Regression Task. It takes a protein sequence as input and outputs the thermostability of the protein.
label_meanings	Label corresponds to the protein melting temperature (Tm) normalized using the Min-Max normalization method.

You can also edit the README.md to provide more information in the model card, such as Dataset description, Performance and so on.

Instruction

Click the run button, find your token and log in to Huggingface
Complete model card config and then click run button to upload
Check your model repo

3.1: Classification Regression Prediction

Video

Task type

Click here for detailed information on each task type.

Protein-level Classification
Protein-level Regression
Residue-level Classification
Protein-protein Classification
Protein-protein Regression

Model

Click here for detailed information on each model type.

Trained by yourself on ColabSaprot
Shared by peers on SaprotHub
Saved in your local computer
Multi-models on SaprotHub

Dataset

Example Dataset

Example datasets are available in this folder and this path /SaprotHub/upload_files.

Dataset Format

Data type	Interface	Input	Example
`Single AA Sequence`	An input box	`sequence`: the amino acid sequence	`sequence`: MEETMKLATM
`Single SA Sequence`	An input box	`sequence`: the structure-aware sequence	`sequence`: MdEvEvTvMpKpLpApTaMp
`Single UniProt ID`	An input box	`sequence`: the UniProt ID	`sequence`: O95905
`Single PDB/CIF structure`	Two input boxes and an upload button	`type`: Indicate whether the structure file is a real PDB structure or an AlphaFold 2 predicted structure. For AF2 (AlphaFold 2) structures, we will apply pLDDT masking. The value must be either "PDB" or "AF2". `chain`: For real PDB structures, since multiple chains may exist in one .pdb file, it is necessary to specify which chain is used. For AF2 structures, the chain is assumed to be A by default. `structure file`: the .pdb/.cif structure file	`type`: AF2 `chain`: A `structure file`: O95905.pdb
`Multiple AA Sequences`	An upload button	`file`: the .csv file containing one column: `sequence`
`Multiple SA Sequences`	An upload button	`file`: the .csv file containing one column: `sequence`
`Multiple UniProt IDs`	An upload button	`file`: the .csv file containing one column: `sequence`
`Multiple PDB/CIF Structures`	Two upload buttons	`file`: a .csv file containing three columns: `sequence`, `type` and `chain` `structure files`: a .zip file containing all the structure files
`SaprotHub Dataset`	An input box	`Dataset ID`: SaprotHub Dataset ID	Find more datasets on SaprotHub

Dataset for protein-protein tasks

Instruction

Complete the configs and then click the run button
Complete additional input and then click the “Make Prediction” button
Check and download prediction result

3.2: Mutational Effect Prediction

Mutation Task

Single-site or Multi-site mutagenesis
Saturation mutagenesis

Model

Default model is Official pretrained SaProt (650M).

Mutation information

Here is the detail about the representation of mutation information:

mode	mutation information
Single-site mutagenesis	H87Y
Multi-site mutagenesis	H87Y:V162M:P179L:P179R

For Single-site mutagenesis, we use a term like "H87Y" to denote the mutation, where the first letter represents the original amino acid, the number in the middle represents the mutation site (indexed starting from 1), and the last letter represents the mutated amino acid.
For Multi-site mutagenesis, we use a colon ":" to connect the single-site mutations, such as "H87Y:V162M:P179L:P179R".
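The notation can be parsed mechanically. The sketch below splits a mutation string into (original residue, site, mutated residue) terms and applies them to an amino acid sequence. The function names are hypothetical and this is not ColabSaprot's parser. Each term is checked against the original sequence, and if a site appears more than once the last term wins, which is an assumption.

```python
import re

MUTATION_PATTERN = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def parse_mutations(mutation_info: str) -> list[tuple[str, int, str]]:
    """Split 'H87Y:V162M' into [('H', 87, 'Y'), ('V', 162, 'M')]."""
    parsed = []
    for term in mutation_info.split(":"):
        match = MUTATION_PATTERN.match(term)
        if match is None:
            raise ValueError(f"cannot parse mutation {term!r}")
        wild_type, site, mutant = match.groups()
        parsed.append((wild_type, int(site), mutant))
    return parsed


def apply_mutations(aa_seq: str, mutation_info: str) -> str:
    """Apply the mutations to an amino acid sequence (sites are 1-indexed)."""
    residues = list(aa_seq)
    for wild_type, site, mutant in parse_mutations(mutation_info):
        if not 1 <= site <= len(aa_seq):
            raise ValueError(f"site {site} is outside the sequence")
        # Compare against the original sequence, not the partially mutated one.
        if aa_seq[site - 1] != wild_type:
            raise ValueError(f"expected {wild_type} at site {site}, found {aa_seq[site - 1]}")
        residues[site - 1] = mutant
    return "".join(residues)


print(parse_mutations("H87Y:V162M:P179L"))
print(apply_mutations("MEETMKLA", "M1H:E2L:E3Q:T4A:M5P:K6Y:L7V:A8P"))  # HLQAPYVP
```

Checking the original residue at each site catches off-by-one errors in the site index before any prediction is run.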

Mutation dataset

Example Dataset

Example datasets are available in this folder and this path /SaprotHub/upload_files.

Dataset Format

For `Saturation mutagenesis` task

The mutation dataset is the same as the dataset used for classification/regression prediction tasks.
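Saturation mutagenesis conventionally means substituting every position with each of the other 19 standard amino acids, which is why no mutation column is needed. The sketch below enumerates those substitutions in the same "H87Y" notation. The helper name and the enumeration order are assumptions, not a description of how ColabSaprot generates them.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def saturation_mutations(aa_seq: str) -> list[str]:
    """List every single-site substitution, e.g. 'M1A', using 1-indexed sites."""
    return [
        f"{wild_type}{site}{mutant}"
        for site, wild_type in enumerate(aa_seq, start=1)
        for mutant in AMINO_ACIDS
        if mutant != wild_type
    ]


mutations = saturation_mutations("MEE")
print(len(mutations))   # 57
print(mutations[:3])    # ['M1A', 'M1C', 'M1D']
```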

For `Single-site or Multi-site mutagenesis` task

One additional piece of information is required: mutation.

Data type	Interface	Input	Example
`Single SA Sequence`	Two input boxes	`sequence`: the structure-aware sequence `mutation`: the mutation information	`sequence`: MdEvEvTvMpKpLpAp `mutation`: M1H:E2L:E3Q:T4A:M5P:K6Y:L7V:A8P
`Single UniProt ID`	Two input boxes	`sequence`: the UniProt ID `mutation`: the mutation information	`sequence`: O95905 `mutation`: H87Y:V162M:P179L
`Single PDB/CIF structure`	Three input boxes and an upload button	`type`: Indicate whether the structure file is a real PDB structure or an AlphaFold 2 predicted structure. For AF2 (AlphaFold 2) structures, we will apply pLDDT masking. The value must be either "PDB" or "AF2". `chain`: For real PDB structures, since multiple chains may exist in one .pdb file, it is necessary to specify which chain is used. For AF2 structures, the chain is assumed to be A by default. `structure file`: the .pdb/.cif structure file `mutation`: the mutation information	`type`: AF2 `chain`: A `structure file`: O95905.pdb `mutation`: H87Y:V162M:P179L
`Multiple SA Sequences`	An upload button	`file`: the .csv file containing two columns: `sequence` and `mutation`
`Multiple UniProt IDs`	An upload button	`file`: the .csv file containing two columns: `sequence` and `mutation`
`Multiple PDB/CIF Structures`	Two upload buttons	`file`: a .csv file containing four columns: `sequence`, `type`, `chain` and `mutation` `structure files`: a .zip file containing all the structure files

Instruction

Complete task config and then click run button to apply
Provide dataset (and mutation information for Single-site or Multi-site mutagenesis), and then click the “Mutational Effect Predict” button
Download the result

3.3: Inverse Folding Prediction

Predict the residue sequence given the structure backbone.

Instruction

Click the run button to upload your .pdb/.cif file to get the amino acid sequence and structure sequence in section 3.3.1.
Mask the amino acids in the sequence with #.
Enter the masked amino acid sequence into the "masked_aa_seq" input box in section 3.3.2.
Complete some task configs.
Click the run button to get the predicted amino acid sequence.
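The masking step can be done by hand or with a few lines of code. The helper below is a hypothetical convenience, not part of the notebook, and it assumes 1-indexed sites to match the mutation notation used elsewhere on this page.

```python
def mask_residues(aa_seq: str, sites: list[int]) -> str:
    """Replace the residues at the given 1-indexed sites with '#'."""
    residues = list(aa_seq)
    for site in sites:
        if not 1 <= site <= len(aa_seq):
            raise ValueError(f"site {site} is outside the sequence")
        residues[site - 1] = "#"
    return "".join(residues)


print(mask_residues("MEETMKLATM", [2, 3, 10]))  # M##TMKLAT#
```

The resulting string is what would be entered into the "masked_aa_seq" input box.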

Task config

method refers to the prediction method. It could be either argmax or multinomial.
- argmax selects the amino acid with the highest probability.
- multinomial samples an amino acid from the multinomial distribution.
num_samples refers to the number of output amino acid sequences.
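The difference between the two methods can be shown on a single position. The sketch below uses made-up probabilities and a hypothetical function name. It illustrates argmax selection and multinomial sampling in general, not the model's actual decoding code.

```python
import random


def choose_residue(probabilities: dict[str, float], method: str, rng: random.Random) -> str:
    """Pick one amino acid from a per-position probability table."""
    if method == "argmax":
        # Always the single most probable residue, so the result is deterministic.
        return max(probabilities, key=probabilities.get)
    if method == "multinomial":
        # Draw a residue at random, weighted by its probability.
        residues = list(probabilities)
        weights = [probabilities[r] for r in residues]
        return rng.choices(residues, weights=weights, k=1)[0]
    raise ValueError(f"unknown method {method!r}")


# Made-up probabilities for a single masked position.
position_probs = {"A": 0.6, "G": 0.3, "S": 0.1}
rng = random.Random(0)
print(choose_residue(position_probs, "argmax", rng))  # A
print([choose_residue(position_probs, "multinomial", rng) for _ in range(5)])
```

Because argmax is deterministic, requesting several output sequences is mainly useful with multinomial, where each sample can differ.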

Model

Default model is Official pretrained SaProt (650M).

Inverse folding dataset

PDB/CIF file
