SaprotHub v2 (latest)
The "I want to train my own model" component in ColabSaprot empowers biologists to create customized protein models without much programming or machine learning expertise. Think of training as teaching the model to recognize patterns and make predictions based on your specific protein data - similar to how a student learns from examples.
This versatile training module accepts both protein sequences and structures as input, and potentially supports hundreds of biological tasks (see here). You can train models to:
- Predict protein-level properties (like function, localization, or stability)
- Analyze residue-level features (such as binding sites or secondary structure)
- Study protein-protein interactions
- Perform various classification and regression tasks relevant to your research
Simply provide your protein data and select your task of interest - ColabSaprot handles all the complex machine learning processes in the background. Whether you're studying enzyme activity, protein folding, or protein-protein interactions, this user-friendly tool helps you build specialized models tailored to your research needs without requiring technical expertise in AI or programming.
The "Use Existing Models to Make Prediction" component allows you to leverage pre-trained or fine-tuned Saprot models for immediate predictions. You can utilize models you've personally trained (via the "I want to train my own model" component), access shared models from the SaprotHub community, or even combine multiple models to enhance prediction accuracy.
This module supports all training tasks and introduces additional capabilities like:
- Mutation effect prediction
- Protein sequence design
- Protein embedding extraction for downstream tasks
For instance, you could predict protein properties using a model trained by another researcher, assess the impact of specific mutations, or design new protein sequences with desired characteristics. The module streamlines the prediction process - simply input your protein data, select the appropriate model(s), and obtain results within minutes. Whether you're exploring protein functions, optimizing sequences, or analyzing mutations, this tool provides a straightforward way to access state-of-the-art protein prediction capabilities without technical complexity.
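As a concrete illustration of the embedding-extraction use case, the sketch below trains a simple downstream classifier on pre-computed protein embeddings. This is a generic example, not an actual ColabSaprot interface: the file name embeddings.csv and its column layout are hypothetical placeholders for whatever embeddings you export.

```python
# Minimal sketch: using extracted protein embeddings for a downstream classification task.
# Assumes a hypothetical CSV where each row is one protein's embedding vector plus a
# "label" column; this is NOT ColabSaprot's real output format.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("embeddings.csv")               # hypothetical file of exported embeddings
X = df.drop(columns=["label"]).values            # embedding dimensions
y = df["label"].values                           # e.g. 0 = non-enzyme, 1 = enzyme

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

clf = LogisticRegression(max_iter=1000)          # lightweight downstream model
clf.fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```

Because the embeddings already encode sequence and structure information, even a lightweight classifier like this can serve as a quick baseline before training a full model.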
The "I Want to Share My Model Publicly" module embodies our Open Protein Modeling Consortium (OPMC) initiative, enabling seamless collaboration and knowledge sharing within the biological research community. This module allows biologists to share their high-quality trained models with one click, without requiring the release of their proprietary datasets - making model sharing more appealing to researchers who wish to maintain exclusive access to their valuable experimental data.
Through SaprotHub, our dedicated model-sharing platform, researchers can easily discover relevant models using our specialized search engine (see here) with keyword functionality, access shared models, and perform continuous learning with their own data. When using peer-shared models, researchers are encouraged to provide appropriate citations and credits in their work. The advanced AI technologies underlying this sharing mechanism are detailed in our paper.
This sharing ecosystem promotes a collaborative environment where:
- Researchers can easily access and build upon each other's work
- Models can be continuously improved through community contributions
- Knowledge and resources are freely shared and utilized
- Collaboration barriers are minimized
By facilitating this open exchange of protein models, we aim to accelerate scientific discovery and foster a collaborative research environment where accessing, sharing, and building upon existing models becomes seamless and straightforward.
As a beginner with no coding or ML experience, what ML concepts should I know to quickly get started with ColabSaprot?
To better use ColabSaprot, it is beneficial to have a basic understanding of the following common concepts:
- Basic idea of model training and model prediction
- Concepts of classification and regression tasks
- Basic idea of pre-training and fine-tuning
- Purpose and differences among training/validation/test sets
- Several hyperparameters (batch size, learning rate, training epochs)
- Model overfitting and how to detect it through validation loss curves
ColabSaprot features automated model saving before overfitting occurs and provides some automatic hyperparameter options. For ML beginners, a basic understanding of these concepts is sufficient - you can quickly learn them through ChatGPT or ML blogs. It's a one-time effort. Hope you like it!
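To make the overfitting idea concrete, here is a minimal runnable sketch that monitors the validation loss curve and stops training once it stops improving. It uses a tiny synthetic dataset and a generic scikit-learn model purely for illustration; it is not ColabSaprot's internal logic.

```python
# Sketch: detect overfitting by watching the validation loss and stopping early.
# Synthetic data and a generic scikit-learn MLP are used for illustration only.
import warnings
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

warnings.filterwarnings("ignore")                 # silence per-epoch convergence warnings

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                    # 200 toy samples, 10 features
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

model = MLPRegressor(hidden_layer_sizes=(64,), warm_start=True, max_iter=1, random_state=0)
best_val, best_epoch = float("inf"), -1

for epoch in range(100):
    model.fit(X_train, y_train)                   # one more pass over the training set
    val_loss = mean_squared_error(y_val, model.predict(X_val))
    if val_loss < best_val:                       # validation loss still improving: keep going
        best_val, best_epoch = val_loss, epoch
    elif epoch - best_epoch >= 5:                 # no improvement for 5 epochs: likely overfitting
        print(f"Stopping at epoch {epoch}; best validation loss was at epoch {best_epoch}")
        break
```

ColabSaprot applies the same principle automatically by saving the best checkpoint before the validation curve starts degrading.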
- Batch size: The batch size should be chosen based on the size of your training dataset. Adaptive (the preferred default) automatically determines the batch size from your data size through our internal implementation. You can also set it manually: for large datasets, use values like 32, 64, 128, or 256; for smaller datasets, use 8, 4, or 2. Note that when you set the batch size yourself, it is not guaranteed to fit within GPU memory.
- Epoch: The number of epochs is the number of complete passes over your training data; a larger number requires more training time. Through our internal implementation, the best-performing model is automatically saved or replaced after each training epoch. Please note that Colab has runtime limitations: 12 hours for free users and 24 hours for Colab Pro+ subscribers.
- Learning rate: The learning rate affects the convergence speed of the model. We find that `5.0e-4` is a good default value for the SaProt 650M model and `1.0e-3` for the SaProt 35M model.
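For intuition only, the snippet below sketches one plausible way an adaptive batch size could be derived from the dataset size, together with the learning-rate defaults quoted above. The thresholds and the config layout are illustrative assumptions, not ColabSaprot's actual implementation.

```python
# Illustrative heuristic: pick a smaller batch size for smaller datasets.
# Thresholds and config keys are assumptions, not ColabSaprot's real logic.

def adaptive_batch_size(num_samples: int) -> int:
    if num_samples < 100:
        return 2
    if num_samples < 1_000:
        return 8
    if num_samples < 10_000:
        return 32
    return 128

config = {
    "batch_size": adaptive_batch_size(5_000),  # -> 32 for a 5,000-sample dataset
    "epochs": 10,                              # more epochs means more training time
    "learning_rate": 5.0e-4,                   # suggested default for SaProt 650M
    # "learning_rate": 1.0e-3,                 # suggested default for SaProt 35M
}
print(config)
```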
What is a good mutation?
A positive score means the mutation is predicted to be better than the wild type from an evolutionary perspective (the larger the score, the better). See below for more details.
Details
Saprot predicts mutational effects using the log odds ratio at the mutated position, which was proposed by Meier et al. in Language models enable zero-shot prediction of the effects of mutations on protein function. In the original paper, the calculation is formalized as follows:
$$\sum_{i \in M} \left[ \log p\left(x_i = x_i^{mt} \mid x_{\backslash M}\right) - \log p\left(x_i = x_i^{wt} \mid x_{\backslash M}\right) \right]$$

We denote the set of mutated positions by $M$ and the sequence with all positions in $M$ masked by $x_{\backslash M}$. Here $x_i^{wt}$ and $x_i^{mt}$ are the wild-type and mutant amino acids at position $i$, and $p(\cdot \mid x_{\backslash M})$ is the probability assigned by the masked language model. A positive score thus indicates that the model considers the mutant residues more likely than the wild-type ones at the masked positions.
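For readers who prefer code, here is a small self-contained sketch of that score. The masked per-position log-probabilities would normally come from the language model; here a hypothetical array of random values stands in for them, so only the scoring arithmetic is real.

```python
# Sketch of the masked-marginal mutation score: sum over mutated positions of
# log p(mutant) - log p(wild type). The log-probabilities below are random
# placeholders; in practice they would come from the masked language model.
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def mutation_score(log_probs: np.ndarray, mutations: list[tuple[int, str, str]]) -> float:
    """log_probs: (sequence_length, 20) log-probabilities computed with the mutated
    positions masked; mutations: (position, wild_type, mutant) tuples, 0-indexed."""
    score = 0.0
    for pos, wt, mt in mutations:
        score += log_probs[pos, AA_INDEX[mt]] - log_probs[pos, AA_INDEX[wt]]
    return score

# Toy usage on a 50-residue protein with made-up probabilities:
rng = np.random.default_rng(0)
fake_log_probs = np.log(rng.dirichlet(np.ones(20), size=50))
print(mutation_score(fake_log_probs, [(9, "A", "V")]))  # score for the single mutation A10V
```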
Training sessions may time out after several hours of browser inactivity. Choose one of these two solutions to ensure successful completion:

You can manually add a small script to the browser's web console, which will then simulate human clicks to keep the page active. The method is from here. Please note that, due to Colab's time limit, your session will still be disconnected when the maximum runtime is reached (>12 hours for free users, or >24 hours for Colab Pro users).
```javascript
function ConnectButton(){
    console.log("Connect pushed");
    document.querySelector("#top-toolbar > colab-connect-button").shadowRoot.querySelector("#connect").click()
}
setInterval(ConnectButton,60000);
```
If you encounter this warning, you should first type `allow pasting` into the console and press `Enter` to allow your browser to paste. Finally, you will get a number; please record it. This is the interval ID returned by `setInterval`, and you will use it to terminate the above program when your training is done.
Step 3: When your training is done, terminate the above program. Copy the code below, paste it into the console, and press `Enter` to run it:
```javascript
clearInterval(recorded_number) // Replace recorded_number with the number you recorded earlier
```
At each epoch, ColabSaprot automatically saves the model checkpoint that performs best on the validation dataset, and you can resume training from that checkpoint.
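The sketch below shows what per-epoch best-checkpoint saving and resuming typically look like in a generic PyTorch training loop. It mirrors the behaviour described above but is not ColabSaprot's actual code; `model`, `optimizer`, and the checkpoint layout are placeholders.

```python
# Generic PyTorch-style sketch of saving the best checkpoint and resuming from it.
# Not ColabSaprot's actual implementation; model/optimizer are placeholders.
import torch

def save_if_best(model, optimizer, epoch, val_metric, best_metric, path="best.ckpt"):
    """Overwrite the checkpoint whenever the validation metric improves."""
    if val_metric > best_metric:
        torch.save(
            {
                "epoch": epoch,
                "model_state": model.state_dict(),
                "optimizer_state": optimizer.state_dict(),
                "val_metric": val_metric,
            },
            path,
        )
        return val_metric
    return best_metric

def resume_from_checkpoint(model, optimizer, path="best.ckpt"):
    """Restore weights and optimizer state, and return the epoch to resume from."""
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model_state"])
    optimizer.load_state_dict(ckpt["optimizer_state"])
    return ckpt["epoch"] + 1
```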
Step 1: Join SaprotHub
Due to Google Colab's synchronization mechanism, you should avoid opening multiple ColabSaprot webpages at the same time.
If the program is still running when the connection is lost, it will prevent you from reconnecting to the server, and all buttons on the interface will be unresponsive, as shown below:
To solve this problem, you just need to stop the program.
After the program stops, wait for a moment and you will be automatically reconnected to the server.
Then you can click the run button again to start using ColabSaprot.
ColabSaprot monitors model performance through real-time visualization of validation-set metrics (shown below). Generalization is assessed on a test set of protein samples that are kept completely isolated from the training process.
During training, a sustained decline in validation set performance across multiple epochs signals potential overfitting. ColabSaprot addresses this through automatic model checkpointing, preserving optimal model parameters before performance degradation occurs.
Note: Both Saprot and other protein language models may show limited efficacy when trained on extremely small datasets (fewer than several dozen samples or when using the provided toy datasets). This limitation represents a key challenge in contemporary machine learning.