SaprotHub v1 (will be deprecated in future)
- 0: Preliminary
- 1: Installation
- 2: Train and Share your model
- 3: Use your model to predict
Different models are designed for different tasks, so it's essential to understand which type your task belongs to.
📍To view the full list of tasks supported by ColabSaprot, please refer to task_list.md.
Here are the task types and their descriptions, so you can recognize your task type based on your task description and objectives.
For classification and regression prediction tasks:
- Protein-level Classification Task
- Protein-level Regression Task
- Residue-level Classification Task
- Protein-protein Classification Task
- Protein-protein Regression Task
For zero-shot prediction tasks:
- Mutational effect prediction
- Inverse folding prediction
Train a model based on SaProt and use it to make predictions.
Task Type | Task Description | Example |
---|---|---|
Protein-level Classification | Classify protein sequences. | - Fold Class Prediction - Localization Prediction - Function Prediction |
Protein-level Regression | Predict the value of some property of a protein sequence. | - Thermal Stability Prediction - Fluorescence Intensity Prediction - Binding Affinity Prediction |
Residue-level Classification | Classify the amino acids in a protein sequence. | - Secondary Structure Prediction - Binding Site Prediction - Active Site Prediction |
Protein-protein Classification | Predict whether the two proteins interact. | - Protein-Protein Interaction (PPI) Prediction - Interaction Type Classification - Disease-Associated Interaction Prediction |
Protein-protein Regression | Predict the strength of the interaction between the two proteins. | - Interaction Strength Prediction - Binding Free Energy Calculation - Interaction Affinity Prediction |
Directly use SaProt (650M) to make predictions.
Task Type | Task Description | Example |
---|---|---|
Mutational Effect Prediction | Predict the mutational effect based on the wild type sequence and mutation information. | - Enzyme Activity Prediction - Virus Fitness Prediction - Driver Mutation Prediction |
Inverse Folding Prediction | Predict the residue sequence given the structure backbone. | - Enzyme Function Optimization - Protein Stability Enhancement - Protein Folding Prediction |
You can use your private data to train and predict. Below are the various data formats corresponding to different data types.
We combine the residue and structure tokens at each residue site to create a Structure-aware sequence (SA sequence), merging both residue and structural information.
The structure tokens are generated by encoding the 3D structure of proteins using Foldseek.
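As a minimal sketch of how an SA sequence is laid out, the snippet below interleaves a residue string with a structure-token string, one pair per site. The function names are hypothetical, and the structure string `dvvvppppap` is simply read off the `MdEvEvTvMpKpLpApTaMp` example used elsewhere on this page; in practice the structure tokens come from Foldseek, not from hand-written strings.

```python
def to_sa_sequence(aa_seq: str, struct_seq: str) -> str:
    """Interleave residue tokens with structure tokens, one pair per site."""
    if len(aa_seq) != len(struct_seq):
        raise ValueError("residue and structure sequences must have equal length")
    # Uppercase residue letter followed by lowercase structure token.
    return "".join(aa.upper() + st.lower() for aa, st in zip(aa_seq, struct_seq))


def split_sa_sequence(sa_seq: str) -> tuple[str, str]:
    """Recover the residue and structure strings from an SA sequence."""
    if len(sa_seq) % 2 != 0:
        raise ValueError("an SA sequence needs two characters per residue")
    return sa_seq[0::2], sa_seq[1::2]


sa = to_sa_sequence("MEETMKLATM", "dvvvppppap")
print(sa)  # MdEvEvTvMpKpLpApTaMp
print(split_sa_sequence(sa))  # ('MEETMKLATM', 'dvvvppppap')
```

An SA sequence is therefore twice as long as the underlying amino acid sequence, which is worth keeping in mind when counting residues.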
Here you can convert your data into SA Sequence format.
- Single AA Sequence
- Single SA Sequence
- Single UniProt ID
- Single PDB/CIF Structure
- Multiple AA Sequences
- Multiple SA Sequences
- Multiple UniProt IDs
- Multiple PDB/CIF Structures
- SaprotHub Dataset
For tasks that require two protein sequences as input (pair classification & pair regression):
- A pair of AA Sequences
- A pair of SA Sequences
- A pair of UniProt IDs
- A pair of PDB/CIF Structures
- Multiple pairs of AA Sequences
- Multiple pairs of SA Sequences
- Multiple pairs of UniProt IDs
- Multiple pairs of PDB/CIF Structures
- Go to the Official SaProtHub Repository to find some datasets.
- Copy the `Dataset ID` for future use.
Function | Link |
---|---|
Get Structure-Aware Sequence | here |
Convert .fa file to .csv dataset (data type: Multiple AA Sequences) | here |
Randomly split your dataset | here |
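For readers who prefer to split a dataset themselves, here is a rough sketch of a random split. It is not the linked tool: the function name, the 80/10/10 ratios and the fixed seed are all assumptions made for illustration. It only relies on the `stage` values (`train`, `valid`, `test`) that this page uses for training datasets.

```python
import random


def assign_stages(rows, train_frac=0.8, valid_frac=0.1, seed=0):
    """Return copies of rows with a 'stage' key set to train/valid/test."""
    rng = random.Random(seed)          # fixed seed keeps the split reproducible
    order = list(range(len(rows)))
    rng.shuffle(order)
    n_train = int(len(rows) * train_frac)
    n_valid = int(len(rows) * valid_frac)
    staged = [dict(row) for row in rows]
    for rank, idx in enumerate(order):
        if rank < n_train:
            staged[idx]["stage"] = "train"
        elif rank < n_train + n_valid:
            staged[idx]["stage"] = "valid"
        else:
            staged[idx]["stage"] = "test"   # everything left over
    return staged


rows = [{"sequence": f"SEQ{i}", "label": i % 2} for i in range(10)]
for row in assign_stages(rows):
    print(row)
```

The rows keep their original order; only the `stage` assignment is randomized.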
- Official pretrained SaProt (35M)
- Official pretrained SaProt (650M)
- Trained by yourself on ColabSaprot
- Shared by peers on SaprotHub
- Saved in your local computer
- Multi-models on SaprotHub
Model type | Used for | Description | Input |
---|---|---|---|
Official pretrained SaProt (35M) | Training | Train a protein language model based on SaProt (35M) with your dataset | - |
Official pretrained SaProt (650M) | Training | Train a protein language model based on SaProt (650M) with your dataset | - |
Trained by yourself on ColabSaprot | Continual training, Prediction | Once you have completed training the model, select this option to use the model you have trained on ColabSaprot for continual training or prediction | Select the model from the dropdown menu |
Shared by peers on SaprotHub | Continual training, Prediction | Use models shared on SaprotHub for continual training or prediction | Enter the model ID |
Saved in your local computer | Continual training, Prediction | Use models saved on your local computer (the .zip file that was saved when training finished) for continual training or prediction | Upload the .zip file |
Multi-models on SaprotHub | Prediction | Ensemble multiple models shared on SaprotHub for prediction. Each sample will be predicted using multiple models. Note that for classification tasks, voting will be used to determine the final predicted category; for regression tasks, the predicted values from each model will be averaged. | Enter the model IDs |
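The two ensembling rules described for `Multi-models on SaprotHub` can be sketched as follows. This is an illustration of voting and averaging in general, not ColabSaprot's code: the function names are hypothetical, and how ties between categories are broken is an assumption (here the first category to reach the top count wins).

```python
from collections import Counter
from statistics import mean


def ensemble_classification(predictions):
    """Majority vote over the category indices predicted by each model."""
    # Counter.most_common keeps first-seen order among equal counts,
    # so ties go to the category that appeared first.
    return Counter(predictions).most_common(1)[0][0]


def ensemble_regression(predictions):
    """Average the values predicted by each model."""
    return mean(predictions)


print(ensemble_classification([1, 0, 1]))    # 1
print(ensemble_regression([0.2, 0.4, 0.6]))
```

Using an odd number of models for binary classification avoids ties altogether.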
- Go to the Official SaProtHub Repository to find a model that meets your requirements.
- Copy the `Model ID` for future use.
Before contributing to SaprotHub, you need to join the SaprotHub Huggingface Organization to gain write access to the subset of repos within the Organization that you have created.
You can contribute to SaprotHub in two ways:
- Once you have uploaded the model to your Huggingface repository using ColabSaprot, you can directly transfer your model to SaprotHub.
- You can manually create a new model repository on SaprotHub, and then upload the model files to this repository.
Before installing SaProt, please SWITCH YOUR RUNTIME TYPE TO GPU!!!
You can check the current runtime type in the upper right corner of the page.
If the current runtime type is CPU, you need to switch it to GPU (either the free T4 or the paid A100) for a better training experience.
Please follow the steps below to switch the runtime to GPU:
- Click the dropdown button
- Select option "Change runtime type"
- Select a GPU
- Click "Save" button
- Each time you switch the runtime, all code blocks need to be re-executed.
Click here for detailed information on each task type.
- Protein-level Classification
- Protein-level Regression
- Residue-level Classification
- Protein-protein Classification
- Protein-protein Regression
Click here for detailed information on each model type.
- Official pretrained SaProt (35M)
- Official pretrained SaProt (650M)
- Trained by yourself on ColabSaprot
- Shared by peers on SaprotHub
- Saved in your local computer
Example datasets are available in this folder and this path `/SaprotHub/upload_files`.
The dataset should be a .csv file with three required columns: `sequence`, `label` and `stage`.
- The content of column `sequence` depends on your data type. See the table.
- The content of column `label` depends on your task type. See the table.
- The column `stage` indicates whether the sample is used for training, validation, or testing.
- Ensure your dataset includes samples for all three stages. The values are: `train`, `valid`, `test`.
- Due to GPU memory limits, protein sequences used for training (where the `stage` column is `train`) will be truncated to the first 1024 amino acids, while sequences for validation and testing will remain uncut.
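Before uploading, it can help to check a dataset against the rules above. The following is a small sketch using only the standard library; the function name and the wording of the messages are made up for this example, and the length check assumes AA sequences (one character per residue), so it would need adjusting for SA sequences.

```python
import csv
import io

REQUIRED_COLUMNS = {"sequence", "label", "stage"}
STAGES = {"train", "valid", "test"}
MAX_TRAIN_RESIDUES = 1024  # truncation length stated above


def check_dataset(csv_text: str) -> list[str]:
    """Return a list of human-readable problems found in the dataset."""
    reader = csv.DictReader(io.StringIO(csv_text))
    problems = []
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        return [f"missing columns: {sorted(missing)}"]
    seen = set()
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        stage = row["stage"]
        if stage not in STAGES:
            problems.append(f"line {line_no}: unknown stage {stage!r}")
            continue
        seen.add(stage)
        # Assumes an AA sequence: one character per residue.
        if stage == "train" and len(row["sequence"]) > MAX_TRAIN_RESIDUES:
            problems.append(f"line {line_no}: training sequence will be truncated")
    for stage in sorted(STAGES - seen):
        problems.append(f"no samples for stage {stage!r}")
    return problems


example = "sequence,label,stage\nMEETMKLATM,0,train\nMEETMKLA,1,valid\n"
print(check_dataset(example))  # ["no samples for stage 'test'"]
```

An empty list means none of the checks in this sketch were triggered; it does not guarantee that training will succeed.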
Data type | Interface | Input | Example |
---|---|---|---|
Multiple AA Sequences | An upload button | file: the .csv file containing three columns: `sequence`, `label` and `stage` | |
Multiple SA Sequences | An upload button | file: the .csv file containing three columns: `sequence`, `label` and `stage` | |
Multiple UniProt IDs | An upload button | file: the .csv file containing three columns: `sequence`, `label` and `stage` | |
Multiple PDB/CIF Structures | Two upload buttons | file: a .csv file containing five columns: `sequence`, `type`, `chain`, `label` and `stage`<br>structure files: a .zip file containing all the structure files | type: Indicate whether the structure file is a real PDB structure or an AlphaFold 2 predicted structure. For AF2 (AlphaFold 2) structures, we will apply pLDDT masking. The value must be either "PDB" or "AF2".<br>chain: For real PDB structures, since multiple chains may exist in one .pdb file, it is necessary to specify which chain is used. For AF2 structures, the chain is assumed to be A by default. |
SaprotHub Dataset | An input box | Dataset ID: SaprotHub Dataset ID | Find more datasets on SaprotHub |
Examples of column `label` for different task types (the data type in these examples is `Multiple SA Sequences`):
Task type | Label | Example | Description |
---|---|---|---|
Protein-level classification | Category index starting from zero | - The task has 2 protein sequence categories: 0, 1. - Each protein sequence has a corresponding category index. | |
Protein-level regression | Numerical values | - Each protein sequence has a corresponding numerical label to represent the value of some property. | |
Residue-level classification | A list of category indices for each amino acid | - The task has 3 amino acid categories: 0, 1, 2. - Each amino acid has a corresponding category index. | |
Parameter | Description |
---|---|
batch_size | `batch_size` depends on the number of training samples. "Adaptive" (the default choice) sets the batch size automatically according to your data size. If your training dataset is large enough, you can use 32, 64, 128, 256, ...; otherwise it can be set to 8, 4 or 2. Note that you cannot use a larger batch size with the default Colab T4 GPU. We strongly suggest subscribing to Colab Pro for an A100 GPU. |
max_epochs | `max_epochs` refers to the maximum number of training epochs. A larger value needs more training time. The best model will be saved after each iteration. You can adjust `max_epochs` to control training duration. Note that the maximum running time of Colab is 12 hours for unsubscribed users or 24 hours for Colab Pro+ users. |
learning_rate | `learning_rate` affects the convergence speed of the model. Through experimentation, we have found that 5.0e-4 is a good default value for the base model `Official pretrained SaProt (650M)` and 1.0e-3 for `Official pretrained SaProt (35M)`. |
For users with some machine learning background who want to further customize the training process, we offer some advanced settings. Simply expand the code cell and modify the values of the variables to take effect.
Parameter | Description |
---|---|
GPU_batch_size | `GPU_batch_size` determines the number of samples in a batch on a single GPU. Note that you need to modify both `GPU_batch_size` and `accumulate_grad_batches` simultaneously, and the `batch_size` selected in the dropdown menu will be overridden. |
accumulate_grad_batches | Due to hardware limitations, we may not be able to use enough samples in a single batch to perform a gradient update for the model. Therefore, we can adjust the `accumulate_grad_batches` parameter, which controls how many batches of samples the model will use for a single gradient update. Note that you need to modify both `GPU_batch_size` and `accumulate_grad_batches` simultaneously, and the `batch_size` selected in the dropdown menu will be overridden. |
num_workers | `num_workers` specifies the number of threads or processes used for parallel data loading and processing. |
seed | `seed` controls the pseudorandom number sequence. |
r | `r` represents the rank of the low-rank decomposition in LoRA. |
lora_dropout | `lora_dropout` specifies the dropout rate applied during LoRA training to prevent overfitting. |
lora_alpha | `lora_alpha` is a scaling factor used to balance the contribution of the low-rank components in LoRA. |
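The relationship between `GPU_batch_size` and `accumulate_grad_batches` can be made concrete with a little arithmetic. The sketch below assumes the usual gradient-accumulation convention, where the number of samples per gradient update is the per-GPU batch size times the number of accumulated batches (times the number of GPUs); the page does not spell this formula out, and the helper names are hypothetical.

```python
def effective_batch_size(gpu_batch_size: int, accumulate_grad_batches: int,
                         num_gpus: int = 1) -> int:
    """Samples contributing to one gradient update."""
    return gpu_batch_size * accumulate_grad_batches * num_gpus


def accumulation_steps(target_batch_size: int, gpu_batch_size: int) -> int:
    """Smallest accumulate_grad_batches that reaches the target batch size."""
    return -(-target_batch_size // gpu_batch_size)  # ceiling division


print(effective_batch_size(4, 16))  # 64
print(accumulation_steps(64, 4))    # 16
```

Under this assumption, a memory-limited GPU that only fits 4 samples at a time can still train with an effective batch of 64 by accumulating 16 batches per update.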
After each validation (default validation interval is half an epoch), the model with the highest performance on the validation set will be automatically saved.
During the training process, if you observe that the current model is at risk of overfitting by looking at the curve, you can interrupt the training at any time.
After interruption, the program will automatically test the model's generalization performance on the test set.
You can interrupt the training in the following ways:
- Use the shortcut key: `Command`/`Ctrl` + `M` + `I`
- In the top menu of the Colab interface, select `Runtime` -> `Interrupt execution`.
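The behavior described here (keep the best model seen at validation time, and still run the test step after an interruption) can be illustrated with a toy loop. This is a conceptual sketch and not ColabSaprot's training code: the function names are hypothetical, validation scores are passed in directly instead of being computed, and a higher score is assumed to be better.

```python
def run_training(validation_scores, evaluate_on_test):
    """Track the best validation score; on interrupt, still run the test step."""
    best_score, best_step = float("-inf"), None
    try:
        for step, score in enumerate(validation_scores):
            # A real loop would train until the next validation point here.
            if score > best_score:
                best_score, best_step = score, step  # stands in for saving a checkpoint
    except KeyboardInterrupt:
        print("Training interrupted; evaluating the best checkpoint so far.")
    return best_step, evaluate_on_test(best_step)


def scores_with_interrupt():
    yield from [0.61, 0.70, 0.68]
    raise KeyboardInterrupt  # simulates Runtime -> Interrupt execution


best_step, test_result = run_training(
    scores_with_interrupt(), lambda step: f"tested checkpoint {step}"
)
print(best_step, test_result)  # 1 tested checkpoint 1
```

Because the checkpoint from step 1 was already recorded when the interrupt arrived, the later, worse validation at step 2 does not affect which model gets tested.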
- Complete the configs and then click the run button
- Complete additional input and then click the “Start Training” button
- Monitor the training process by the progress bar and the plots
- Check test result and save the model
You can upload the model to your Huggingface repository and then contribute it to SaprotHub.
You need to add some description for your model:
- `name`: The name of your model.
- `description`: The description of your model (which task your model is used for).
- `label_meanings`:
  - For classification models, please provide detailed information about the meanings of all labels.
  - For regression models, please provide the numerical range of the values.
For classification models:
Parameter | Value |
---|---|
name | Subcellular_Localization |
description | This model is used for the Subcellular Localization Classification Task. It takes a protein sequence as input and outputs which of the 10 categories the protein belongs to. |
label_meanings | Nucleus, Cytoplasm, Extracellular, Mitochondrion, Cell.membrane, Endoplasmic.reticulum, Plastid, Golgi.apparatus, Lysosome/Vacuole, Peroxisome |
For regression models:
Parameter | Value |
---|---|
name | Thermostability |
description | This model is used for the Thermostability Regression Task. It takes a protein sequence as input and outputs the thermostability of the protein. |
label_meanings | Label corresponds to the protein melting temperature (Tm) normalized using the Min-Max normalization method. |
You can also edit the `README.md` to provide more information in the model card, such as `Dataset description`, `Performance` and so on.
- Click the run button, find your token and log in to Huggingface
- Complete the model card config and then click the run button to upload
- Check your model repo
Click here for detailed information on each task type.
- Protein-level Classification
- Protein-level Regression
- Residue-level Classification
- Protein-protein Classification
- Protein-protein Regression
Click here for detailed information on each model type.
- Trained by yourself on ColabSaprot
- Shared by peers on SaprotHub
- Saved in your local computer
- Multi-models on SaprotHub
Example datasets are available in this folder and this path `/SaprotHub/upload_files`.
Data type | Interface | Input | Example |
---|---|---|---|
Single AA Sequence | An input box | sequence: the amino acid sequence | sequence: MEETMKLATM |
Single SA Sequence | An input box | sequence: the structure-aware sequence | sequence: MdEvEvTvMpKpLpApTaMp |
Single UniProt ID | An input box | sequence: the UniProt ID | sequence: O95905 |
Single PDB/CIF structure | Two input boxes and an upload button | type: Indicate whether the structure file is a real PDB structure or an AlphaFold 2 predicted structure. For AF2 (AlphaFold 2) structures, we will apply pLDDT masking. The value must be either "PDB" or "AF2".<br>chain: For real PDB structures, since multiple chains may exist in one .pdb file, it is necessary to specify which chain is used. For AF2 structures, the chain is assumed to be A by default.<br>structure file: the .pdb/.cif structure file | type: AF2<br>chain: A<br>structure file: O95905.pdb |
Multiple AA Sequences | An upload button | file: the .csv file containing one column: `sequence` | |
Multiple SA Sequences | An upload button | file: the .csv file containing one column: `sequence` | |
Multiple UniProt IDs | An upload button | file: the .csv file containing one column: `sequence` | |
Multiple PDB/CIF Structures | Two upload buttons | file: a .csv file containing three columns: `sequence`, `type` and `chain`<br>structure files: a .zip file containing all the structure files | |
SaprotHub Dataset | An input box | Dataset ID: SaprotHub Dataset ID | Find more datasets on SaprotHub |
- Complete the configs and then click the run button
- Complete additional input and then click the “Make Prediction” button
- Check and download prediction result
- Single-site or Multi-site mutagenesis
- Saturation mutagenesis
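To give a sense of scale for saturation mutagenesis, the sketch below enumerates every single-residue substitution of a sequence, written as wild-type residue, 1-based site, mutant residue (for example `M1A`). It assumes the common definition of saturation mutagenesis over the 20 standard amino acids; whether ColabSaprot enumerates exactly this set is not stated on this page, and the function name is hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def saturation_mutations(aa_seq: str) -> list[str]:
    """List every single-residue substitution, e.g. 'M1A', for a sequence."""
    mutations = []
    for position, wild_type in enumerate(aa_seq, start=1):  # 1-based sites
        for mutant in AMINO_ACIDS:
            if mutant != wild_type:  # skip the no-op "mutation"
                mutations.append(f"{wild_type}{position}{mutant}")
    return mutations


muts = saturation_mutations("MEET")
print(len(muts))  # 76  (4 sites x 19 substitutions)
print(muts[:3])   # ['M1A', 'M1C', 'M1D']
```

The count grows linearly with sequence length: a protein of length L has 19 x L single-site variants under this definition.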
The default model is `Official pretrained SaProt (650M)`.
Mutation information is represented as follows:
mode | mutation information |
---|---|
Single-site mutagenesis | H87Y |
Multi-site mutagenesis | H87Y:V162M:P179L:P179R |
- For `Single-site mutagenesis`, we use a term like "H87Y" to denote the mutation, where the first letter represents the original amino acid, the number in the middle represents the mutation site (indexed starting from 1), and the last letter represents the mutated amino acid.
- For `Multi-site mutagenesis`, we use a colon ":" to connect the single-site mutations, such as "H87Y:V162M:P179L:P179R".
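The notation above is simple enough to parse and check by hand. The following sketch (hypothetical helper names, standard library only) splits a mutation string into its parts and applies it to a plain amino acid sequence, verifying that each stated wild-type residue matches the sequence. It illustrates the notation only and is not how ColabSaprot scores mutations.

```python
import re

MUTATION_RE = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def parse_mutations(spec: str) -> list[tuple[str, int, str]]:
    """Split 'H87Y:V162M' into (wild type, 1-based site, mutant) tuples."""
    parsed = []
    for term in spec.split(":"):
        match = MUTATION_RE.match(term.strip())
        if match is None:
            raise ValueError(f"malformed mutation: {term!r}")
        wild_type, site, mutant = match.groups()
        parsed.append((wild_type, int(site), mutant))
    return parsed


def apply_mutations(aa_seq: str, spec: str) -> str:
    """Return the mutated sequence, checking each wild-type residue."""
    residues = list(aa_seq)
    for wild_type, site, mutant in parse_mutations(spec):
        if not 1 <= site <= len(residues):
            raise ValueError(f"site {site} is outside the sequence")
        if aa_seq[site - 1] != wild_type:
            raise ValueError(f"expected {wild_type} at site {site}, found {aa_seq[site - 1]}")
        residues[site - 1] = mutant
    return "".join(residues)


print(parse_mutations("H87Y:V162M"))  # [('H', 87, 'Y'), ('V', 162, 'M')]
print(apply_mutations("MEETMKLA", "M1H:E2L:E3Q:T4A:M5P:K6Y:L7V:A8P"))  # HLQAPYVP
```

Wild-type residues are checked against the original sequence, so a string that lists the same site twice is accepted and the last substitution wins; that is a choice made in this sketch, not documented tool behavior.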
Example datasets are available in this folder and this path `/SaprotHub/upload_files`.
The mutation dataset is the same as the dataset used for classification/regression prediction tasks, with one additional piece of information required: `mutation`.
Data type | Interface | Input | Example |
---|---|---|---|
Single SA Sequence | Two input boxes | sequence: the structure-aware sequence<br>mutation: the mutation information | sequence: MdEvEvTvMpKpLpAp<br>mutation: M1H:E2L:E3Q:T4A:M5P:K6Y:L7V:A8P |
Single UniProt ID | Two input boxes | sequence: the UniProt ID<br>mutation: the mutation information | sequence: O95905<br>mutation: H87Y:V162M:P179L |
Single PDB/CIF structure | Three input boxes and an upload button | type: Indicate whether the structure file is a real PDB structure or an AlphaFold 2 predicted structure. For AF2 (AlphaFold 2) structures, we will apply pLDDT masking. The value must be either "PDB" or "AF2".<br>chain: For real PDB structures, since multiple chains may exist in one .pdb file, it is necessary to specify which chain is used. For AF2 structures, the chain is assumed to be A by default.<br>structure file: the .pdb/.cif structure file<br>mutation: the mutation information | type: AF2<br>chain: A<br>structure file: O95905.pdb<br>mutation: H87Y:V162M:P179L |
Multiple SA Sequences | An upload button | file: the .csv file containing two columns: `sequence` and `mutation` | |
Multiple UniProt IDs | An upload button | file: the .csv file containing two columns: `sequence` and `mutation` | |
Multiple PDB/CIF Structures | Two upload buttons | file: a .csv file containing four columns: `sequence`, `type`, `chain` and `mutation`<br>structure files: a .zip file containing all the structure files | |
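As an illustration of the two-column layout for `Multiple SA Sequences`, the sketch below writes `sequence` and `mutation` columns with Python's `csv` module. The helper name and the sample rows are invented for the example (the SA sequence is the one shown in the table).

```python
import csv
import io


def build_mutation_csv(records) -> str:
    """Serialize (sequence, mutation) pairs to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["sequence", "mutation"])
    writer.writerows(records)
    return buffer.getvalue()


csv_text = build_mutation_csv([
    ("MdEvEvTvMpKpLpAp", "M1H:E2L"),
    ("MdEvEvTvMpKpLpAp", "T4A"),
])
print(csv_text)
```

To produce a file for upload, the same text can be written to disk, for example with `open("mutations.csv", "w", newline="")`.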
- Complete task config and then click run button to apply
- Provide dataset (and mutation information for Single-site or Multi-site mutagenesis), and then click the “Mutational Effect Predict” button
- Download the result
Predict the residue sequence given the structure backbone.
- Click the run button to upload your .pdb/.cif file to get the amino acid sequence and structure sequence in section 3.3.1.
- Mask the amino acids in the sequence with `#`.
- Enter the masked amino acid sequence into the "masked_aa_seq" input box in section 3.3.2.
- Complete some task configs.
- Click the run button to get the predicted amino acid sequence.
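The masking step can be done by hand, but for long sequences a small helper avoids off-by-one mistakes. The sketch below replaces chosen residues with `#`; the function name is hypothetical, and treating positions as 1-based is a choice made for this example.

```python
def mask_positions(aa_seq: str, positions) -> str:
    """Replace the residues at the given 1-based positions with '#'."""
    to_mask = set(positions)
    bad = [p for p in to_mask if not 1 <= p <= len(aa_seq)]
    if bad:
        raise ValueError(f"positions outside the sequence: {sorted(bad)}")
    return "".join("#" if i in to_mask else aa
                   for i, aa in enumerate(aa_seq, start=1))


print(mask_positions("MEETMKLATM", [2, 3, 10]))    # M##TMKLAT#
print(mask_positions("MEETMKLATM", range(1, 11)))  # ##########
```

Masking every position, as in the second call, asks for the whole sequence to be predicted from the structure; masking a subset keeps the remaining residues fixed.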
- `method` refers to the prediction method. It can be either `argmax` or `multinomial`.
  - `argmax` selects the amino acid with the highest probability.
  - `multinomial` samples an amino acid from the multinomial distribution.
- `num_samples` refers to the number of output amino acid sequences.
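The difference between the two methods is easiest to see on a single masked site. In the sketch below the probabilities are made up for illustration and the function names are hypothetical; a real model outputs a distribution over all amino acids at each masked position.

```python
import random


def pick_argmax(probs: dict) -> str:
    """Return the residue with the highest probability."""
    return max(probs, key=probs.get)


def pick_multinomial(probs: dict, rng: random.Random) -> str:
    """Sample one residue in proportion to its probability."""
    residues = list(probs)
    return rng.choices(residues, weights=[probs[r] for r in residues], k=1)[0]


site_probs = {"A": 0.6, "G": 0.3, "S": 0.1}  # made-up values for one masked site
rng = random.Random(0)
print(pick_argmax(site_probs))  # A
print([pick_multinomial(site_probs, rng) for _ in range(5)])
```

Since `argmax` always returns the same residue for the same probabilities, requesting several sequences is mainly useful with `multinomial`, where repeated sampling can produce different sequences.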
The default model is `Official pretrained SaProt (650M)`.
PDB/CIF file