
SaprotHub v2 (latest)

Jin Su edited this page Jan 13, 2025 · 50 revisions

Introduction of options

What is "I want to train my own model"?

The "I want to train my own model" component in ColabSaprot empowers biologists to create customized protein models without much programming or machine learning expertise. Think of training as teaching the model to recognize patterns and make predictions based on your specific protein data - similar to how a student learns from examples.

This versatile training module accepts both protein sequences and structures as input, and potentially supports hundreds of biological tasks (see here). You can train models to:

  • Predict protein-level properties (like function, localization, or stability)

  • Analyze residue-level features (such as binding sites or secondary structure)

  • Study protein-protein interactions

  • Perform various classification and regression tasks relevant to your research

Simply provide your protein data and select your task of interest - ColabSaprot handles all the complex machine learning processes in the background. Whether you're studying enzyme activity, protein folding, or protein-protein interactions, this user-friendly tool helps you build specialized models tailored to your research needs without requiring technical expertise in AI or programming.

What is "using existing models to make prediction"?

The "Use Existing Models to Make Prediction" component allows you to leverage pre-trained or fine-tuned Saprot models for immediate predictions. You can utilize models you've personally trained (via the "I want to train my own model" component), access shared models from the SaprotHub community, or even combine multiple models to enhance prediction accuracy.

This module supports all training tasks and introduces additional capabilities like:

  • Mutation effect prediction

  • Protein sequence design

  • Protein embedding extraction for downstream tasks

For instance, you could predict protein properties using a model trained by another researcher, assess the impact of specific mutations, or design new protein sequences with desired characteristics. The module streamlines the prediction process - simply input your protein data, select the appropriate model(s), and obtain results within minutes. Whether you're exploring protein functions, optimizing sequences, or analyzing mutations, this tool provides a straightforward way to access state-of-the-art protein prediction capabilities without technical complexity.

What is "I want to share my model publicly"?

The "I Want to Share My Model Publicly" module embodies our Open Protein Modeling Consortium (OPMC) initiative, enabling seamless collaboration and knowledge sharing within the biological research community. This module allows biologists to share their high-quality trained models with one click, without requiring the release of their proprietary datasets - making model sharing more appealing to researchers who wish to maintain exclusive access to their valuable experimental data.

Through SaprotHub, our dedicated model-sharing platform, researchers can easily discover relevant models using our specialized search engine (see here) with keyword functionality, access shared models, and perform continuous learning with their own data. When using peer-shared models, researchers are encouraged to provide appropriate citations and credits in their work. The advanced AI technologies underlying this sharing mechanism are detailed in our paper.

This sharing ecosystem promotes a collaborative environment where:

  • Researchers can easily access and build upon each other's work

  • Models can be continuously improved through community contributions

  • Knowledge and resources are freely shared and utilized

  • Collaboration barriers are minimized

By facilitating this open exchange of protein models, we aim to accelerate scientific discovery and foster a collaborative research environment where accessing, sharing, and building upon existing models becomes seamless and straightforward.

As a beginner with no coding or ML experience, what ML concepts should I know to start using ColabSaprot quickly?

To better use ColabSaprot, it is beneficial to have a basic understanding of the following common concepts:

  • Basic idea of model training and model prediction
  • Concepts of classification and regression tasks
  • Basic idea of pre-training and fine-tuning
  • Purpose and differences among training/validation/test sets
  • Several hyperparameters (batch size, learning rate, training epochs)
  • Model overfitting and how to detect it through validation loss curves

ColabSaprot features automated model saving before overfitting occurs and provides some automatic hyperparameter options. For ML beginners, a basic understanding of these concepts is sufficient - you can quickly learn them through ChatGPT or ML blogs. It's a one-time effort. Hope you like it!

  • Batch size
    Choose the batch size based on the size of your training dataset. Adaptive (the preferred default) lets ColabSaprot pick the batch size automatically from your data size. You can also set it manually: for large datasets, use values like 32, 64, 128, or 256; for smaller datasets, use 8, 4, or 2. Note that a manually chosen batch size is not guaranteed to fit in GPU memory.

  • Epoch
    An epoch is one full pass over the training set, and this option sets how many passes to run. More epochs mean longer training. After each epoch, ColabSaprot automatically saves the best-performing model so far, replacing the previously saved one. Please note that Colab has runtime limitations: 12 hours for free users and 24 hours for Colab Pro+ subscribers.

  • Learning rate
    The learning rate affects how quickly the model converges. We find that 5.0e-4 is a good default for the SaProt 650M model and 1.0e-3 for the SaProt 35M model.
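To see how batch size and epochs interact, the following JavaScript sketch counts the number of model updates for a given setup. It assumes the final partial batch is kept and no gradient accumulation is used; the function names and the dataset size are illustrative only, and ColabSaprot's internals may differ.

```javascript
// Number of model updates in one full pass over the training set.
// Assumption: the last, smaller batch is kept rather than dropped.
function stepsPerEpoch(numSamples, batchSize) {
  return Math.ceil(numSamples / batchSize);
}

// Total updates for a whole run: updates per pass times number of passes.
function totalSteps(numSamples, batchSize, epochs) {
  return stepsPerEpoch(numSamples, batchSize) * epochs;
}

// Hypothetical dataset of 1,000 proteins.
console.log(stepsPerEpoch(1000, 32));  // 32
console.log(stepsPerEpoch(1000, 8));   // 125
console.log(totalSteps(1000, 32, 10)); // 320
```

A smaller batch size means more updates per epoch, and doubling the epochs doubles the total number of updates, which is why both settings affect training time.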

What is a good mutation?

A positive score means the mutation is better than the wild type from an evolutionary perspective (the larger the better). See below for more details.

Details

Saprot predicts mutational effects using the log odds ratio at the mutated position, which was proposed by Meier et al. in "Language models enable zero-shot prediction of the effects of mutations on protein function". In the original paper, the calculation is formalized as follows:
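Written in the notation defined in this section, the log odds ratio can be sketched as below. This is a reconstruction from the surrounding definitions, assuming $x_{\setminus T}$ denotes the input sequence with the positions in $T$ masked, and $s_t^{mt}$ and $s_t^{wt}$ denote the mutant and wild-type residues at position $t$:

$$\sum_{t \in T} \left[ \log p\left(x_t = s_t^{mt} \mid x_{\setminus T}\right) - \log p\left(x_t = s_t^{wt} \mid x_{\setminus T}\right) \right]$$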

We denote $V$ as the residue alphabet, $F$ as the Foldseek 3Di alphabet, and $V \times F$ as the Cartesian product of $V$ and $F$. Here $T$ represents all mutations and $s_t \in V$ is the residue type for the mutant and wild-type sequences. Saprot slightly modifies the formula above to adapt it to the structure-aware vocabulary: the probability assigned to each residue is the sum over all tokens that contain that residue type, as shown below:
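A sketch of the modified formula, reconstructed from the definitions in this section (assuming $x_{\setminus T}$ denotes the input with the positions in $T$ masked, and $s_t^{mt}$ and $s_t^{wt}$ denote the mutant and wild-type residues at position $t$):

$$\sum_{t \in T} \left[ \log \sum_{f \in F} p\left(x_t = s_t^{mt} f \mid x_{\setminus T}\right) - \log \sum_{f \in F} p\left(x_t = s_t^{wt} f \mid x_{\setminus T}\right) \right]$$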

Here $f \in F$ is the structure token generated by Foldseek and $s_t f \in V \times F$ is the structure-aware token in our new vocabulary.
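To make the summation over structure tokens concrete, here is a minimal JavaScript sketch of the score calculation. The function names, the two-character token format (residue letter followed by a 3Di letter), and the probabilities are illustrative assumptions, not ColabSaprot's actual code.

```javascript
// Sum the probability of every structure-aware token that carries `residue`.
// Assumption: tokens are two-character strings, residue letter + 3Di letter.
function residueProbability(tokenProbs, residue) {
  let total = 0;
  for (const [token, p] of Object.entries(tokenProbs)) {
    if (token[0] === residue) total += p;
  }
  return total;
}

// Log odds ratio summed over all mutated positions.
// Each entry looks like { wildType: "A", mutant: "V", tokenProbs: {...} }.
function mutationScore(mutations) {
  let score = 0;
  for (const { wildType, mutant, tokenProbs } of mutations) {
    const pMutant = residueProbability(tokenProbs, mutant);
    const pWildType = residueProbability(tokenProbs, wildType);
    score += Math.log(pMutant) - Math.log(pWildType);
  }
  return score;
}

// Toy distribution at one masked position (made-up numbers, two 3Di states).
const example = [{
  wildType: "A",
  mutant: "V",
  tokenProbs: { Ad: 0.10, Ap: 0.10, Vd: 0.30, Vp: 0.10, Gd: 0.40 },
}];
console.log(mutationScore(example) > 0); // true: V is favored over A here
```

In this toy case the tokens containing V carry twice the total probability of those containing A, so the score is log 2, a positive value.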

Step 1: Upload the file to Google Drive. Then right-click the file and choose Share

image

Step 2: Set file permission

image

Step 3: Copy the link and paste it into the box

image image

How to handle long training runs

Training sessions may time out after several hours of browser inactivity. Choose one of these two solutions to ensure successful completion:

Option 1: Avoid losing the connection during long training runs

You can run a short script in the browser console that simulates user activity to keep the page active. The method is from here. Please note that Colab's time limit still applies: your session will be disconnected once the maximum runtime is reached (12 hours for free users, or 24 hours for Colab Pro users).

Step 1: Press F12 or Ctrl + Shift + I to open the console

image

Step 2: Copy the code below and paste it into the console. Then press Enter to run

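// Click Colab's connect button so the page keeps registering activity.
function ConnectButton(){
    console.log("Connect pushed");
    document.querySelector("#top-toolbar > colab-connect-button").shadowRoot.querySelector("#connect").click();
}
// Run it every 60 seconds. The console prints the timer's ID number.
setInterval(ConnectButton,60000);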

image

If you encounter this warning, first type "allow pasting" into the console and press Enter so that your browser lets you paste.

image

Finally, the console prints a number. Record it: you will need it to stop the script once your training is done.

image

Step 3: When your training is done, stop the script. Copy the code below and paste it into the console. Then press Enter to run

clearInterval(recorded_number) // Replace the number with your recorded number

image

Option 2: Resume training from a previous model checkpoint

At each epoch, ColabSaprot automatically saves the model checkpoint with the best performance on the validation dataset. If your session ends early, you can resume training from that checkpoint.

image

A simple way to handle unexpected issues

Step 1: Open the session list

image

Step 2: Delete your session

image

Step 3: Reconnect to a server

image

Share your model to official SaprotHub

Step 1: Join SaprotHub

image

Step 2: Change the owner of your model in the settings

image

FAQs for ColabSaprot

1. Can I open multiple ColabSaprot pages and use them simultaneously?

No. Because of Google Colab's synchronization mechanism, avoid opening multiple ColabSaprot pages at the same time.

2. How can I reconnect when encountering connection issues?

If the program is still running when the connection is lost, it will prevent you from reconnecting to the server, and all buttons on the interface will be unresponsive, as shown below:

image

To solve this problem, you just need to stop the program.

image

After the program stops, wait for a moment and you will be automatically reconnected to the server.

image

You can then click the run button again to continue using ColabSaprot.

3. How can I monitor model performance during training and detect overfitting?

ColabSaprot lets you monitor model performance through real-time visualization of validation set metrics (shown below). Generalization is assessed on a test set of protein samples that are kept completely separate from the training process.

During training, a sustained decline in validation set performance across multiple epochs signals potential overfitting. ColabSaprot addresses this through automatic model checkpointing, preserving optimal model parameters before performance degradation occurs.
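The idea of spotting a sustained decline can be sketched in a few lines of JavaScript. This is only an illustration of the concept: the function names, the patience threshold, and the metric values are hypothetical and do not reflect how ColabSaprot implements checkpointing.

```javascript
// Find the epoch with the best validation metric and count the epochs since.
// Set higherIsBetter to false for losses, true for metrics such as accuracy.
function summarizeValidation(history, higherIsBetter = true) {
  if (history.length === 0) throw new Error("history is empty");
  let bestEpoch = 0;
  for (let epoch = 1; epoch < history.length; epoch++) {
    const improved = higherIsBetter
      ? history[epoch] > history[bestEpoch]
      : history[epoch] < history[bestEpoch];
    if (improved) bestEpoch = epoch;
  }
  const epochsSinceBest = history.length - 1 - bestEpoch;
  return { bestEpoch, bestValue: history[bestEpoch], epochsSinceBest };
}

// Flag likely overfitting once the metric has not improved for `patience` epochs.
function looksOverfit(history, patience = 3, higherIsBetter = true) {
  return summarizeValidation(history, higherIsBetter).epochsSinceBest >= patience;
}

// Hypothetical validation accuracy per epoch (epochs counted from 0).
const valAccuracy = [0.61, 0.70, 0.74, 0.73, 0.72, 0.70];
console.log(summarizeValidation(valAccuracy).bestEpoch); // 2
console.log(looksOverfit(valAccuracy));                   // true
```

Here the metric peaks at epoch 2 and then falls for three epochs in a row, so the checkpoint from epoch 2 is the one worth keeping.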

Note: Saprot, like other protein language models, may perform poorly when trained on extremely small datasets (fewer than several dozen samples, or the provided toy datasets). This limitation is a key challenge in contemporary machine learning.

training_curve_reg_00
