SaprotHub v1 (will be deprecated in future)

Jin Su edited this page Nov 25, 2024 · 10 revisions

Catalog

0.1: Task Overview

Different models are designed for different tasks, so it's essential to understand which type your task belongs to.

📍To view the full list of tasks supported by ColabSaprot, please refer to task_list.md.

Task type

Here are the task types and their descriptions, so you can recognize your task type from its description and objectives.

For Classification and Regression prediction tasks:

  1. Protein-level Classification Task
  2. Protein-level Regression Task
  3. Residue-level Classification Task
  4. Protein-protein Classification Task
  5. Protein-protein Regression Task

For Zero-shot prediction tasks:

  1. Mutational effect prediction
  2. Inverse folding prediction

Classification and Regression prediction task

Train a model based on SaProt and use it to make predictions.

| Task Type | Task Description | Example |
| --- | --- | --- |
| Protein-level Classification | Classify protein sequences. | Fold Class Prediction; Localization Prediction; Function Prediction |
| Protein-level Regression | Predict the value of some property of a protein sequence. | Thermal Stability Prediction; Fluorescence Intensity Prediction; Binding Affinity Prediction |
| Residue-level Classification | Classify the amino acids in a protein sequence. | Secondary Structure Prediction; Binding Site Prediction; Active Site Prediction |
| Protein-protein Classification | Predict whether the two proteins interact. | Protein-Protein Interaction (PPI) Prediction; Interaction Type Classification; Disease-Associated Interaction Prediction |
| Protein-protein Regression | Predict the strength of the interaction between the two proteins. | Interaction Strength Prediction; Binding Free Energy Calculation; Interaction Affinity Prediction |

Zero-shot prediction task

Directly use SaProt (650M) to make predictions.

| Task Type | Task Description | Example |
| --- | --- | --- |
| Mutational Effect Prediction | Predict the mutational effect based on the wild type sequence and mutation information. | Enzyme Activity Prediction; Virus Fitness Prediction; Driver Mutation Prediction |
| Inverse Folding Prediction | Predict the residue sequence given the structure backbone. | Enzyme Function Optimization; Protein Stability Enhancement; Protein Folding Prediction |

0.2: Dataset Overview

You can use your private data to train and predict. Below are the various data formats corresponding to different data types.

What is an SA (Structure-aware) Sequence

We combine the residue and structure tokens at each residue site to create a Structure-aware sequence (SA sequence), merging both residue and structural information.

The structure tokens are generated by encoding the 3D structure of proteins using Foldseek.

Here you can convert your data into SA Sequence format.
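
To make the pairing of residue and structure tokens concrete, here is a minimal sketch that interleaves the two strings. The function names are hypothetical and not part of ColabSaprot. The structure string below is read off the example SA sequence used later on this page; real structure tokens come from Foldseek.

```python
def build_sa_sequence(aa_seq: str, struct_seq: str) -> str:
    """Interleave residue and structure tokens: one pair per residue site."""
    if len(aa_seq) != len(struct_seq):
        raise ValueError("residue and structure sequences must have equal length")
    # Upper-case residue letter followed by lower-case structure token.
    return "".join(a.upper() + s.lower() for a, s in zip(aa_seq, struct_seq))


def split_sa_sequence(sa_seq: str) -> tuple[str, str]:
    """Recover the residue string and the structure string from an SA sequence."""
    if len(sa_seq) % 2 != 0:
        raise ValueError("an SA sequence holds two characters per residue")
    return sa_seq[0::2], sa_seq[1::2]


sa = build_sa_sequence("MEETMKLATM", "dvvvppppap")
print(sa)  # MdEvEvTvMpKpLpApTaMp
```

An SA sequence is therefore twice as long as the amino acid sequence it was built from.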

Data Type

  1. Single AA Sequence
  2. Single SA Sequence
  3. Single UniProt ID
  4. Single PDB/CIF Structure
  5. Multiple AA Sequences
  6. Multiple SA Sequences
  7. Multiple UniProt IDs
  8. Multiple PDB/CIF Structures
  9. SaprotHub Dataset

For tasks that require two protein sequences as input (pair classification and pair regression):

  1. A pair of AA Sequences
  2. A pair of SA Sequences
  3. A pair of UniProt IDs
  4. A pair of PDB/CIF Structures
  5. Multiple pairs of AA Sequences
  6. Multiple pairs of SA Sequences
  7. Multiple pairs of UniProt IDs
  8. Multiple pairs of PDB/CIF Structures

How to find a SaprotHub Dataset

  1. Go to the Official SaprotHub Repository to find datasets.
  2. Copy the Dataset ID for future use.

Scripts for dataset preparation

Link
Get Structure-Aware Sequence here
Convert .fa file to .csv dataset (data type: Multiple AA Sequences) here
Randomly split your dataset here
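
If you prefer to do the .fa to .csv conversion yourself, the idea can be sketched as follows. This is not the linked script, only an illustration under the assumption that each FASTA record becomes one row in a `sequence` column; the function names are hypothetical. For training you would still need to add the label and stage columns.

```python
import csv
import io


def fasta_to_rows(fasta_text: str) -> list[dict]:
    """Parse FASTA text into one {"sequence": ...} row per record."""
    rows, chunks = [], []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if chunks:
                rows.append({"sequence": "".join(chunks)})
            chunks = []
        else:
            chunks.append(line)
    if chunks:
        rows.append({"sequence": "".join(chunks)})
    return rows


def rows_to_csv(rows: list[dict]) -> str:
    """Serialize rows to CSV text with a single `sequence` column."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["sequence"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


example = ">protein_a\nMEETM\nKLATM\n>protein_b\nMKV\n"
print(rows_to_csv(fasta_to_rows(example)))
```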

0.3: Model Overview

Model type

  1. Official pretrained SaProt (35M)
  2. Official pretrained SaProt (650M)
  3. Trained by yourself on ColabSaprot
  4. Shared by peers on SaprotHub
  5. Saved in your local computer
  6. Multi-models on SaprotHub

| Model type | Used for | Description | Input |
| --- | --- | --- | --- |
| Official pretrained SaProt (35M) | Training | Train a protein language model based on SaProt (35M) with your dataset. | - |
| Official pretrained SaProt (650M) | Training | Train a protein language model based on SaProt (650M) with your dataset. | - |
| Trained by yourself on ColabSaprot | Continual training, Prediction | Once you have finished training a model on ColabSaprot, select this option to use it for continual training or prediction. | Select the model from the dropdown menu |
| Shared by peers on SaprotHub | Continual training, Prediction | Use models shared on SaprotHub for continual training or prediction. | Enter the model ID |
| Saved in your local computer | Continual training, Prediction | Use models saved on your local computer (the .zip file saved when training finished) for continual training or prediction. | Upload the .zip file |
| Multi-models on SaprotHub | Prediction | Ensemble multiple models shared on SaprotHub for prediction. Each sample is predicted by every model. For classification tasks, voting determines the final predicted category; for regression tasks, the predicted values from each model are averaged. | Enter the model IDs |
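
The two ensembling rules for Multi-models on SaprotHub (voting for classification, averaging for regression) can be sketched as follows. This is an illustration, not the ColabSaprot implementation; the function names are hypothetical, and how ties between categories are broken is not stated on this page, so the tie behaviour here is an arbitrary choice.

```python
from collections import Counter
from statistics import mean


def ensemble_classification(predictions: list[int]) -> int:
    """Majority vote over the category predicted by each model."""
    # most_common(1) returns the label with the highest count; on a tie,
    # the label that was seen first wins (an assumption made for this sketch).
    return Counter(predictions).most_common(1)[0][0]


def ensemble_regression(predictions: list[float]) -> float:
    """Average the value predicted by each model."""
    return mean(predictions)


print(ensemble_classification([1, 0, 1]))      # 1
print(ensemble_regression([0.5, 1.0, 1.5]))    # 1.0
```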

How to find a model on SaprotHub

  1. Go to the Official SaprotHub Repository to find a model that meets your requirements.
  2. Copy the Model ID for future use.

0.4: Contribute to SaprotHub

Join SaprotHub Organization

Before contributing to SaprotHub, you need to join the SaprotHub Huggingface Organization. Membership gives you write access to the repositories you create within the organization.

Contribute to SaprotHub

You have two ways to contribute to SaprotHub:

1. Transfer your model to SaprotHub (Recommended)

Once you have uploaded the model to your Huggingface repository using ColabSaprot, you can directly transfer your model to SaprotHub.

2. Create a new model repository and upload model files

You can manually create a new model repository on SaprotHub, and then upload the model files to this repository.

1.1: Switch your runtime type to GPU

⚠️IMPORTANT⚠️

Before installing SaProt, please SWITCH YOUR RUNTIME TYPE TO GPU!!!


Current runtime type

You can check the current runtime type in the upper right corner of the page.

If the current runtime type is CPU, you need to switch it to GPU (either the free T4 or the paid A100) for a better training experience.


Switch runtime type

Please follow the steps below to switch the runtime to GPU:

  1. Click the dropdown button
  2. Select option "Change runtime type"
  3. Select a GPU
  4. Click "Save" button
  5. Re-execute all code blocks. This is required each time you switch the runtime.

2.1: Train your model

Task type

Click here for detailed information on each task type.

  1. Protein-level Classification
  2. Protein-level Regression
  3. Residue-level Classification
  4. Protein-protein Classification
  5. Protein-protein Regression

Base model

Click here for detailed information on each model type.

  1. Official pretrained SaProt (35M)
  2. Official pretrained SaProt (650M)
  3. Trained by yourself on ColabSaprot
  4. Shared by peers on SaprotHub
  5. Saved in your local computer

Training dataset

Example Dataset

Example datasets are available in this folder and at the path /SaprotHub/upload_files.

Dataset Format

The dataset should be a .csv file with three required columns: sequence, label and stage.

  • The content of the sequence column depends on your data type. See the data type table.
  • The content of the label column depends on your task type. See the label format table.
  • The stage column indicates whether the sample is used for training, validation, or testing.

⚠️IMPORTANT⚠️:

  • Ensure your dataset includes samples for all three stages. The values are: train, valid, test.
  • Due to GPU memory limits, protein sequences used for training (where the stage column is train) will be truncated to the first 1024 amino acids, while sequences for validation and testing will remain uncut.

| Data type | Interface | Input | Example |
| --- | --- | --- | --- |
| Multiple AA Sequences | An upload button | file: the .csv file containing three columns: sequence, label and stage | |
| Multiple SA Sequences | An upload button | file: the .csv file containing three columns: sequence, label and stage | |
| Multiple UniProt IDs | An upload button | file: the .csv file containing three columns: sequence, label and stage | |
| Multiple PDB/CIF Structures | Two upload buttons | file: a .csv file containing five columns: sequence, type, chain, label and stage; structure files: a .zip file containing all the structure files | |
| SaprotHub Dataset | An input box | Dataset ID: SaprotHub Dataset ID | Find more datasets on SaprotHub |

For Multiple PDB/CIF Structures, the two extra columns mean the following:

  • type: Indicates whether the structure file is a real PDB structure or an AlphaFold 2 predicted structure. For AF2 (AlphaFold 2) structures, we will apply pLDDT masking. The value must be either "PDB" or "AF2".
  • chain: For real PDB structures, since multiple chains may exist in one .pdb file, it is necessary to specify which chain is used. For AF2 structures, the chain is assumed to be A by default.
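
Before uploading, it can help to check the dataset against the rules above. The following is a minimal sketch, assuming the three-column layout (sequence, label, stage); the function name is hypothetical and ColabSaprot may perform its own, different checks. For Multiple PDB/CIF Structures the type and chain columns would need to be added to the required set.

```python
import csv
import io

REQUIRED_COLUMNS = {"sequence", "label", "stage"}
STAGES = {"train", "valid", "test"}


def check_training_csv(csv_text: str) -> list[str]:
    """Return a list of problems found in a training dataset (empty if none)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        return [f"missing columns: {sorted(missing)}"]
    problems = []
    seen = set()
    for number, row in enumerate(reader, start=2):  # line 1 is the header
        if row["stage"] not in STAGES:
            problems.append(f"line {number}: unknown stage {row['stage']!r}")
        seen.add(row["stage"])
    absent = STAGES - seen
    if absent:
        problems.append(f"no samples for stage(s): {sorted(absent)}")
    return problems


good = "sequence,label,stage\nMEETMKLATM,0,train\nMKVLA,1,valid\nMKT,0,test\n"
bad = "sequence,label,stage\nMEETMKLATM,0,train\nMKVLA,1,val\n"
print(check_training_csv(good))  # []
print(check_training_csv(bad))
```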

Format of column label

Examples of the label column for different task types (the data type in these examples is Multiple SA Sequences)

| Task type | Label | Example | Description |
| --- | --- | --- | --- |
| Protein-level classification | Category index starting from zero | | The task has 2 protein sequence categories: 0, 1. Each protein sequence has a corresponding category index. |
| Protein-level regression | Numerical values | | Each protein sequence has a corresponding numerical label that represents the value of some property. |
| Residue-level classification | A list of category indices, one for each amino acid | | The task has 3 amino acid categories: 0, 1, 2. Each amino acid has a corresponding category index. |
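
The three label shapes can be illustrated with plain Python values. This is only a sketch: the values are made up, the function names are hypothetical, and how a list of residue-level labels is written inside the .csv cell is not specified here, so check the example datasets for the exact serialization.

```python
def residue_count(sa_sequence: str) -> int:
    """An SA sequence stores two characters (residue + structure token) per residue."""
    return len(sa_sequence) // 2


def check_residue_labels(sa_sequence: str, labels: list[int], num_categories: int) -> bool:
    """True if there is exactly one in-range category index per residue."""
    return (
        len(labels) == residue_count(sa_sequence)
        and all(0 <= label < num_categories for label in labels)
    )


protein_level_classification_label = 1   # one category index per protein
protein_level_regression_label = 0.73    # one numerical value per protein (made up)
residue_level_labels = [0, 2, 1, 1]      # one category index per amino acid

print(check_residue_labels("MdEvEvTv", residue_level_labels, num_categories=3))  # True
```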

Dataset for protein-protein tasks

Training config

Regular config

| Parameter | Description |
| --- | --- |
| batch_size | The batch size depends on the number of training samples. "Adaptive" (the default choice) sets the batch size automatically according to your data size. If your training set is large enough, you can use 32, 64, 128, 256, and so on; for smaller training sets, use 8, 4, or 2. Note: the default Colab T4 GPU cannot handle larger batch sizes. We strongly suggest subscribing to Colab Pro for an A100 GPU. |
| max_epochs | The maximum number of training epochs. A larger value needs more training time. The best model is saved after each epoch. You can adjust max_epochs to control the training duration. Note: the maximum running time of Colab is 12 hours for unsubscribed users and 24 hours for Colab Pro+ users. |
| learning_rate | The learning rate affects the convergence speed of the model. Through experimentation, we have found that 5.0e-4 is a good default value for the base model Official pretrained SaProt (650M) and 1.0e-3 for Official pretrained SaProt (35M). |

Advanced Config

For users with some machine learning background who want to further customize the training process, we offer some advanced settings. Simply expand the code cell and modify the values of the variables to take effect.

| Parameter | Description |
| --- | --- |
| GPU_batch_size | The number of samples in a batch on a single GPU. Note: you need to modify GPU_batch_size and accumulate_grad_batches together, and doing so overrides the batch_size selected in the dropdown menu. |
| accumulate_grad_batches | Due to hardware limitations, a single batch may not hold enough samples for a gradient update. accumulate_grad_batches controls how many batches of samples the model uses for a single gradient update. Note: you need to modify GPU_batch_size and accumulate_grad_batches together, and doing so overrides the batch_size selected in the dropdown menu. |
| num_workers | The number of threads or processes used for parallel data loading and processing. |
| seed | The seed that controls the pseudorandom number sequence. |
| r | The rank of the low-rank decomposition in LoRA. |
| lora_dropout | The dropout rate applied during LoRA training to prevent overfitting. |
| lora_alpha | A scaling factor used to balance the contribution of the low-rank components in LoRA. |
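
The relationship between GPU_batch_size and accumulate_grad_batches can be worked out with a little arithmetic. The sketch below assumes the usual gradient-accumulation behaviour on a single GPU, where the number of samples per gradient update is the product of the two values; the helper names are hypothetical and this page does not confirm how ColabSaprot derives these values internally.

```python
import math


def accumulation_steps(target_batch_size: int, gpu_batch_size: int) -> int:
    """Batches to accumulate so one update sees about target_batch_size samples."""
    if gpu_batch_size <= 0 or target_batch_size <= 0:
        raise ValueError("batch sizes must be positive")
    return max(1, math.ceil(target_batch_size / gpu_batch_size))


def effective_batch_size(gpu_batch_size: int, accumulate_grad_batches: int) -> int:
    """Samples contributing to a single gradient update on one GPU."""
    return gpu_batch_size * accumulate_grad_batches


steps = accumulation_steps(target_batch_size=64, gpu_batch_size=4)
print(steps)                           # 16
print(effective_batch_size(4, steps))  # 64
```

In other words, if only 4 samples fit in GPU memory at once, accumulating 16 batches gives roughly the same update as a batch size of 64, at the cost of more time per update.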

Interrupt Training to Avoid Overfitting

After each validation (default validation interval is half an epoch), the model with the highest performance on the validation set will be automatically saved.

⚠️Therefore, avoid interrupting the training before the first validation (or before the curve plots are generated), as this will prevent the model from being created.

During the training process, if you observe that the current model is at risk of overfitting by looking at the curve, you can interrupt the training at any time.

After interruption, the program will automatically test the model's generalization performance on the test set.

You can interrupt the training in the following ways:

  1. Use the shortcut key: Command/Ctrl + M + I
  2. In the top menu of the Colab interface, select Runtime -> Interrupt execution.
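
Why interrupting is safe can be seen from a small sketch of "keep the best model so far". It assumes a validation metric where higher is better (for a loss the comparison would be reversed); the function name and the scores are made up for illustration and do not come from ColabSaprot.

```python
def best_checkpoint(validation_scores: list[float]) -> int:
    """Index of the validation round whose model would be kept (highest score)."""
    best_index, best_score = -1, float("-inf")
    for index, score in enumerate(validation_scores):
        if score > best_score:  # strictly better: earlier rounds win ties
            best_index, best_score = index, score
    return best_index


# Made-up validation scores: performance peaks at the third validation,
# then drops as the model starts to overfit.
scores = [0.61, 0.70, 0.74, 0.73, 0.71]
print(best_checkpoint(scores))  # 2
```

Interrupting after the fifth validation still leaves you with the model from the third one. With no validation yet (an empty list) the function returns -1, which mirrors the warning above: there is no model to save.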

Instruction

  1. Complete the configs and then click the run button
  2. Complete the additional input and then click the “Start Training” button
  3. Monitor the training process using the progress bar and the plots
  4. Check the test result and save the model

2.2: Upload your model

You can upload the model to your Huggingface repository and then contribute it to SaprotHub.

Config

You need to add some description for your model:

  • name: The name of your model.
  • description: The description of your model (which task is your model used for).
  • label_meanings:
    • For a classification model, please provide detailed information about the meanings of all labels.
    • For a regression model, please provide the numerical range of the values.

Example

For classification models:

| Parameter | Value |
| --- | --- |
| name | Subcellular_Localization |
| description | This model is used for the Subcellular Localization Classification Task. It takes a protein sequence as input and outputs which of the 10 categories the protein belongs to. |
| label_meanings | Nucleus, Cytoplasm, Extracellular, Mitochondrion, Cell.membrane, Endoplasmic.reticulum, Plastid, Golgi.apparatus, Lysosome/Vacuole, Peroxisome |

For regression models:

| Parameter | Value |
| --- | --- |
| name | Thermostability |
| description | This model is used for the Thermostability Regression Task. It takes a protein sequence as input and outputs the thermostability of the protein. |
| label_meanings | Label corresponds to the protein melting temperature (Tm) normalized using the Min-Max normalization method. |

You can also edit the README.md to provide more information in the model card, such as Dataset description, Performance and so on.

Instruction

  1. Click the run button, find your token, and log in to Huggingface
  2. Complete the model card config and then click the run button to upload
  3. Check your model repo

3.1: Classification Regression Prediction

Task type

Click here for detailed information on each task type.

  1. Protein-level Classification
  2. Protein-level Regression
  3. Residue-level Classification
  4. Protein-protein Classification
  5. Protein-protein Regression

Model

Click here for detailed information on each model type.

  1. Trained by yourself on ColabSaprot
  2. Shared by peers on SaprotHub
  3. Saved in your local computer
  4. Multi-models on SaprotHub

Dataset

Example Dataset

Example datasets are available in this folder and at the path /SaprotHub/upload_files.

Dataset Format

| Data type | Interface | Input | Example |
| --- | --- | --- | --- |
| Single AA Sequence | An input box | sequence: the amino acid sequence | sequence: MEETMKLATM |
| Single SA Sequence | An input box | sequence: the structure-aware sequence | sequence: MdEvEvTvMpKpLpApTaMp |
| Single UniProt ID | An input box | sequence: the UniProt ID | sequence: O95905 |
| Single PDB/CIF structure | Two input boxes and an upload button | type and chain (described below the table); structure file: the .pdb/.cif structure file | type: AF2; chain: A; structure file: O95905.pdb |
| Multiple AA Sequences | An upload button | file: the .csv file containing one column: sequence | |
| Multiple SA Sequences | An upload button | file: the .csv file containing one column: sequence | |
| Multiple UniProt IDs | An upload button | file: the .csv file containing one column: sequence | |
| Multiple PDB/CIF Structures | Two upload buttons | file: a .csv file containing three columns: sequence, type and chain; structure files: a .zip file containing all the structure files | |
| SaprotHub Dataset | An input box | Dataset ID: SaprotHub Dataset ID | Find more datasets on SaprotHub |

For PDB/CIF structures, type and chain mean the following:

  • type: Indicates whether the structure file is a real PDB structure or an AlphaFold 2 predicted structure. For AF2 (AlphaFold 2) structures, we will apply pLDDT masking. The value must be either "PDB" or "AF2".
  • chain: For real PDB structures, since multiple chains may exist in one .pdb file, it is necessary to specify which chain is used. For AF2 structures, the chain is assumed to be A by default.

Dataset for protein-protein tasks

Instruction

  1. Complete the configs and then click the run button
  2. Complete additional input and then click the “Make Prediction” button
  3. Check and download prediction result

3.2: Mutational Effect Prediction

Mutation Task

  • Single-site or Multi-site mutagenesis
  • Saturation mutagenesis

Model

Default model is Official pretrained SaProt (650M).

Mutation information

Mutation information is represented as follows:

| Mode | Mutation information |
| --- | --- |
| Single-site mutagenesis | H87Y |
| Multi-site mutagenesis | H87Y:V162M:P179L:P179R |

  • For Single-site mutagenesis, we use a term like "H87Y" to denote the mutation, where the first letter represents the original amino acid, the number in the middle represents the mutation site (indexed starting from 1), and the last letter represents the mutated amino acid.
  • For Multi-site mutagenesis, we use a colon ":" to join the single-site mutations, such as "H87Y:V162M:P179L:P179R".
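
The notation above can be parsed and checked with a few lines of code. This is a minimal sketch with hypothetical function names, not the parser used by ColabSaprot. It checks that the original amino acid in each term matches the wild type sequence; how ColabSaprot treats a site that appears more than once is not stated here.

```python
import re

MUTATION_PATTERN = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def parse_mutations(mutation_info: str) -> list[tuple[str, int, str]]:
    """Split 'H87Y:V162M' into [('H', 87, 'Y'), ('V', 162, 'M')]."""
    parsed = []
    for term in mutation_info.split(":"):
        match = MUTATION_PATTERN.match(term.strip())
        if match is None:
            raise ValueError(f"malformed mutation: {term!r}")
        original, site, mutated = match.groups()
        parsed.append((original, int(site), mutated))
    return parsed


def apply_mutations(aa_seq: str, mutation_info: str) -> str:
    """Apply the mutations to an amino acid sequence (sites are 1-indexed)."""
    residues = list(aa_seq)
    for original, site, mutated in parse_mutations(mutation_info):
        if not 1 <= site <= len(residues):
            raise ValueError(f"site {site} is outside the sequence")
        if aa_seq[site - 1] != original:
            raise ValueError(f"expected {original} at site {site}, found {aa_seq[site - 1]}")
        residues[site - 1] = mutated
    return "".join(residues)


print(parse_mutations("H87Y:V162M"))  # [('H', 87, 'Y'), ('V', 162, 'M')]
print(apply_mutations("MEETMKLA", "M1H:E2L:E3Q:T4A:M5P:K6Y:L7V:A8P"))  # HLQAPYVP
```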

Mutation dataset

Example Dataset

Example datasets are available in this folder and at the path /SaprotHub/upload_files.

Dataset Format

For Saturation mutagenesis task

The mutation dataset is the same as the dataset used for classification/regression prediction tasks.
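
No mutation information is needed here, presumably because saturation mutagenesis covers every single-site substitution of the input. The sketch below enumerates those substitutions in the "H87Y" notation. It is an illustration with a hypothetical function name and assumes the 20 standard amino acids; it says nothing about how ColabSaprot enumerates or scores them internally.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def saturation_mutations(aa_seq: str) -> list[str]:
    """Every single-site substitution, written in the 'H87Y' notation."""
    return [
        f"{original}{site}{mutated}"
        for site, original in enumerate(aa_seq, start=1)
        for mutated in AMINO_ACIDS
        if mutated != original
    ]


variants = saturation_mutations("MEET")
print(len(variants))  # 76 = 4 sites x 19 substitutions
print(variants[:3])   # ['M1A', 'M1C', 'M1D']
```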

For Single-site or Multi-site mutagenesis task

One additional piece of information is required: mutation.

| Data type | Interface | Input | Example |
| --- | --- | --- | --- |
| Single SA Sequence | Two input boxes | sequence: the structure-aware sequence; mutation: the mutation information | sequence: MdEvEvTvMpKpLpAp; mutation: M1H:E2L:E3Q:T4A:M5P:K6Y:L7V:A8P |
| Single UniProt ID | Two input boxes | sequence: the UniProt ID; mutation: the mutation information | sequence: O95905; mutation: H87Y:V162M:P179L |
| Single PDB/CIF structure | Three input boxes and an upload button | type and chain (described below the table); structure file: the .pdb/.cif structure file; mutation: the mutation information | type: AF2; chain: A; structure file: O95905.pdb; mutation: H87Y:V162M:P179L |
| Multiple SA Sequences | An upload button | file: the .csv file containing two columns: sequence and mutation | |
| Multiple UniProt IDs | An upload button | file: the .csv file containing two columns: sequence and mutation | |
| Multiple PDB/CIF Structures | Two upload buttons | file: a .csv file containing four columns: sequence, type, chain and mutation; structure files: a .zip file containing all the structure files | |

For PDB/CIF structures, type and chain mean the following:

  • type: Indicates whether the structure file is a real PDB structure or an AlphaFold 2 predicted structure. For AF2 (AlphaFold 2) structures, we will apply pLDDT masking. The value must be either "PDB" or "AF2".
  • chain: For real PDB structures, since multiple chains may exist in one .pdb file, it is necessary to specify which chain is used. For AF2 structures, the chain is assumed to be A by default.

Instruction

  1. Complete the task config and then click the run button to apply it
  2. Provide the dataset (and the mutation information for Single-site or Multi-site mutagenesis), and then click the “Mutational Effect Predict” button
  3. Download the result

3.3: Inverse Folding Prediction

Predict the residue sequence given the structure backbone.

Instruction

  1. Click the run button to upload your .pdb/.cif file to get the amino acid sequence and structure sequence in section 3.3.1.
  2. Mask the amino acids you want to predict by replacing them with #.
  3. Enter the masked amino acid sequence into the "masked_aa_seq" input box in section 3.3.2.
  4. Complete some task configs.
  5. Click the run button to get the predicted amino acid sequence.
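
The masking in step 2 can also be done programmatically. Here is a minimal sketch with a hypothetical helper that replaces the residues at chosen 1-indexed sites with #; the result is what you would paste into the "masked_aa_seq" input box.

```python
def mask_positions(aa_seq: str, sites: list[int], mask_char: str = "#") -> str:
    """Replace the residues at the given 1-indexed sites with the mask character."""
    residues = list(aa_seq)
    for site in sites:
        if not 1 <= site <= len(residues):
            raise ValueError(f"site {site} is outside the sequence")
        residues[site - 1] = mask_char
    return "".join(residues)


print(mask_positions("MEETMKLATM", [2, 3, 9]))  # M##TMKLA#M
```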

Task config

  • method refers to the prediction method. It could be either argmax or multinomial.
    • argmax selects the amino acid with the highest probability.
    • multinomial samples an amino acid from the multinomial distribution.
  • num_samples refers to the number of output amino acid sequences.
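
The difference between the two methods can be sketched for a single masked position. The probabilities below are made up and the function names are hypothetical; in practice the probabilities come from the model, and this is not the ColabSaprot implementation.

```python
import random


def pick_argmax(probabilities: dict[str, float]) -> str:
    """Choose the amino acid with the highest probability (deterministic)."""
    return max(probabilities, key=probabilities.get)


def pick_multinomial(probabilities: dict[str, float], rng: random.Random) -> str:
    """Sample one amino acid in proportion to its probability."""
    letters = list(probabilities)
    weights = [probabilities[letter] for letter in letters]
    return rng.choices(letters, weights=weights, k=1)[0]


# Made-up probabilities for a single masked position.
position_probs = {"A": 0.6, "G": 0.3, "S": 0.1}
rng = random.Random(0)
print(pick_argmax(position_probs))  # A
print([pick_multinomial(position_probs, rng) for _ in range(5)])  # varies with the seed
```

argmax always returns the same sequence for the same input, so asking for several samples is mainly useful with multinomial, which can return a different sequence each time.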

Model

Default model is Official pretrained SaProt (650M).

Inverse folding dataset

PDB/CIF file
