
SaprotHub v1 (will be deprecated in future)

Jin Su edited this page Nov 25, 2024 · 10 revisions

0.1: Task Overview

Different models are designed for different tasks, so it's essential to understand which type your task belongs to.

📍To view the full list of tasks supported by ColabSaprot, please refer to task_list.md.

Task type

The task types are described below so that you can identify which one matches your task's description and objectives.

For Classification and Regression prediction tasks:

  1. Protein-level Classification Task
  2. Protein-level Regression Task
  3. Residue-level Classification Task
  4. Protein-protein Classification Task
  5. Protein-protein Regression Task

For Zero-shot prediction tasks:

  1. Mutational effect prediction
  2. Inverse folding prediction

Classification and Regression prediction task

Train a model based on SaProt and use it to make predictions.

| Task Type | Task Description | Examples |
|---|---|---|
| Protein-level Classification | Classify protein sequences. | Fold Class Prediction; Localization Prediction; Function Prediction |
| Protein-level Regression | Predict the value of some property of a protein sequence. | Thermal Stability Prediction; Fluorescence Intensity Prediction; Binding Affinity Prediction |
| Residue-level Classification | Classify the amino acids in a protein sequence. | Secondary Structure Prediction; Binding Site Prediction; Active Site Prediction |
| Protein-protein Classification | Predict whether two proteins interact. | Protein-Protein Interaction (PPI) Prediction; Interaction Type Classification; Disease-Associated Interaction Prediction |
| Protein-protein Regression | Predict the strength of the interaction between two proteins. | Interaction Strength Prediction; Binding Free Energy Calculation; Interaction Affinity Prediction |

Zero-shot prediction task

Directly use SaProt (650M) to make predictions, without any training.

| Task Type | Task Description | Examples |
|---|---|---|
| Mutational Effect Prediction | Predict the effect of a mutation from the wild-type sequence and the mutation information. | Enzyme Activity Prediction; Virus Fitness Prediction; Driver Mutation Prediction |
| Inverse Folding Prediction | Predict the residue sequence given the backbone structure. | Enzyme Function Optimization; Protein Stability Enhancement; Protein Folding Prediction |
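
To make "wild type sequence and mutation information" concrete, here is a minimal sketch of building a mutant sequence from a wild-type sequence and a single-point mutation. It assumes the common `K2R` notation (wild-type residue, 1-based position, mutant residue); the page does not state which notation ColabSaprot expects, and `apply_mutation` is a hypothetical helper, not part of SaProt. The sketch only prepares the input and does not perform any scoring with the model.

```python
import re

# Assumed notation: <wild-type residue><1-based position><mutant residue>, e.g. "K2R".
MUTATION_RE = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def apply_mutation(wild_type: str, mutation: str) -> str:
    """Return the mutant sequence for a single-point mutation (hypothetical helper)."""
    match = MUTATION_RE.match(mutation)
    if match is None:
        raise ValueError(f"Unrecognized mutation string: {mutation!r}")

    wt_res, pos, mut_res = match.group(1), int(match.group(2)), match.group(3)
    if not 1 <= pos <= len(wild_type):
        raise ValueError(f"Position {pos} is outside the sequence")
    if wild_type[pos - 1] != wt_res:
        raise ValueError(
            f"Expected {wt_res} at position {pos}, found {wild_type[pos - 1]}"
        )

    # Replace the residue at the (1-based) position with the mutant residue.
    return wild_type[: pos - 1] + mut_res + wild_type[pos:]


mutant = apply_mutation("MKTAYIAK", "K2R")
print(mutant)  # MRTAYIAK
```

Checking the wild-type residue before substituting catches off-by-one errors in position numbering, which are easy to make when positions come from a structure file rather than the sequence itself.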

0.2: Dataset Overview

You can use your private data to train and predict. Below are the various data formats corresponding to different data types.

What is an SA (Structure-aware) Sequence

We combine the residue and structure tokens at each residue site to create a Structure-aware sequence (SA sequence), merging both residue and structural information.

The structure tokens are generated by encoding the 3D structure of proteins using Foldseek.
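
As an illustration of "combining the residue and structure tokens at each residue site", the sketch below interleaves an amino-acid string with a structure-token string of the same length. It assumes residues are written in uppercase and structure tokens in lowercase, one of each per site; that casing convention and the helper names are assumptions made for this example, not something stated on this page. The structure string here is a hand-written placeholder rather than real Foldseek output.

```python
def build_sa_sequence(aa_seq: str, struct_seq: str) -> str:
    """Interleave residue and structure tokens, one pair per residue site."""
    if len(aa_seq) != len(struct_seq):
        raise ValueError("Residue and structure sequences must have equal length")
    # Assumed convention: uppercase residue followed by lowercase structure token.
    return "".join(aa.upper() + st.lower() for aa, st in zip(aa_seq, struct_seq))


def split_sa_sequence(sa_seq: str) -> tuple[str, str]:
    """Recover the residue string and the structure-token string from an SA sequence."""
    if len(sa_seq) % 2 != 0:
        raise ValueError("An SA sequence must contain two characters per site")
    return sa_seq[0::2], sa_seq[1::2]


# Placeholder inputs: 4 residues and 4 made-up structure tokens.
sa = build_sa_sequence("MEVQ", "DVPP")
print(sa)  # MdEvVpQp
```

Because every site contributes exactly two characters, an SA sequence is always twice as long as the plain AA sequence it was built from.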

Here you can convert your data into SA Sequence format.

Data Type

  1. Single AA Sequence
  2. Single SA Sequence
  3. Single UniProt ID
  4. Single PDB/CIF Structure
  5. Multiple AA Sequences
  6. Multiple SA Sequences
  7. Multiple UniProt IDs
  8. Multiple PDB/CIF Structures
  9. SaprotHub Dataset

For tasks that require two protein sequences as input (pair classification and pair regression):

  1. A pair of AA Sequences
  2. A pair of SA Sequences
  3. A pair of UniProt IDs
  4. A pair of PDB/CIF Structures
  5. Multiple pairs of AA Sequences
  6. Multiple pairs of SA Sequences
  7. Multiple pairs of UniProt IDs
  8. Multiple pairs of PDB/CIF Structures

How to find a SaprotHub Dataset

  1. Go to the Official SaProtHub Repository and browse the available datasets.
  2. Copy the Dataset ID for future use.

Scripts for dataset preparation

| Script | Link |
|---|---|
| Get Structure-Aware Sequence | here |
| Convert .fa file to .csv dataset (data type: Multiple AA Sequences) | here |
| Randomly split your dataset | here |
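
For readers who want to see what the conversion and splitting steps involve, here is a minimal standard-library sketch. It is not the linked scripts: the column names (`id`, `sequence`, `stage`), the split labels, and the 80/10/10 ratio are all assumptions chosen for illustration, so check the dataset format that ColabSaprot actually expects before using the output.

```python
import csv
import random


def read_fasta(path: str) -> list[tuple[str, str]]:
    """Parse a FASTA file into (header, sequence) pairs."""
    records, header, chunks = [], None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if header is not None:
                    records.append((header, "".join(chunks)))
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def assign_splits(n: int, train: float = 0.8, valid: float = 0.1, seed: int = 0) -> list[str]:
    """Randomly label n items as train/valid/test (ratios are assumptions)."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    n_train, n_valid = int(n * train), int(n * valid)
    stages = [""] * n
    for rank, idx in enumerate(order):
        if rank < n_train:
            stages[idx] = "train"
        elif rank < n_train + n_valid:
            stages[idx] = "valid"
        else:
            stages[idx] = "test"
    return stages


def fasta_to_csv(fasta_path: str, csv_path: str, seed: int = 0) -> None:
    """Write one CSV row per FASTA record; column names are hypothetical."""
    records = read_fasta(fasta_path)
    stages = assign_splits(len(records), seed=seed)
    with open(csv_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["id", "sequence", "stage"])
        for (header, sequence), stage in zip(records, stages):
            writer.writerow([header, sequence, stage])
```

Calling `fasta_to_csv("proteins.fa", "proteins.csv")` (hypothetical file names) would write one row per sequence. Fixing the seed keeps the split reproducible across runs.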