Baseline Implementation (VariPred) Model #5

Open
merdivane opened this issue May 29, 2023 · 4 comments

@merdivane
Contributor

No description provided.

@ofivite
Contributor

ofivite commented May 29, 2023

VariPred is one specific fine-tuning approach. For a given protein sequence, it:

  1. takes embedding vectors from a pretrained PLM (e.g. ESM) for the wildtype and mutated positions in the sequence
  2. concatenates them
  3. trains a simple feedforward network to predict pathogenicity from the concatenated vector
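
To make steps 2 and 3 concrete, here is a minimal sketch of the forward pass only, in plain Python. It assumes the embeddings have already been extracted from the PLM; the function names, the hidden size, and the toy vectors are all made up for illustration, and this is not VariPred's actual code. Training (fitting the weights against pathogenicity labels) is left out.

```python
import math
import random


def concat_features(wt_emb, mut_emb):
    """Step 2: concatenate the wildtype and mutant embedding vectors."""
    if len(wt_emb) != len(mut_emb):
        raise ValueError("embeddings must have the same dimension")
    return list(wt_emb) + list(mut_emb)


def init_mlp(in_dim, hidden_dim, seed=0):
    """Randomly initialise a one-hidden-layer network (hypothetical sizes)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(in_dim)
    return {
        "W1": [[rng.uniform(-scale, scale) for _ in range(in_dim)]
               for _ in range(hidden_dim)],
        "b1": [0.0] * hidden_dim,
        "w2": [rng.uniform(-scale, scale) for _ in range(hidden_dim)],
        "b2": 0.0,
    }


def predict_pathogenicity(params, features):
    """Step 3 (forward pass only): ReLU hidden layer, then a sigmoid output."""
    hidden = [
        max(0.0, sum(w * x for w, x in zip(row, features)) + b)
        for row, b in zip(params["W1"], params["b1"])
    ]
    logit = sum(w * h for w, h in zip(params["w2"], hidden)) + params["b2"]
    return 1.0 / (1.0 + math.exp(-logit))


# Toy 4-dimensional embeddings; real PLM embeddings are much larger.
wt = [0.1, -0.3, 0.5, 0.2]
mut = [0.4, -0.1, 0.0, 0.2]

features = concat_features(wt, mut)
params = init_mlp(len(features), hidden_dim=8)
score = predict_pathogenicity(params, features)
```

With untrained weights the score is meaningless; it only shows the shape of the computation (an input of twice the embedding dimension, a single probability-like output).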

I would say a simpler baseline is to skip training and instead compute a distance metric between the wildtype and mutant embedding vectors, then check how well it correlates with the target (pathogenicity). This is similar to what is studied in the Nucleotide Transformer paper (Fig. 4).
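
A rough sketch of what such training-free scores could look like, in plain Python. The three metrics below (cosine, Euclidean, Manhattan) are my own pick of common choices, not necessarily the ones used in the paper, and the helper names are hypothetical.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; undefined if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """L2 distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan_distance(u, v):
    """L1 distance between two vectors of equal length."""
    return sum(abs(a - b) for a, b in zip(u, v))


# Candidate metrics to compare; extend as needed.
METRICS = {
    "cosine": cosine_distance,
    "euclidean": euclidean_distance,
    "manhattan": manhattan_distance,
}


def embedding_distances(wt_emb, mut_emb):
    """Compute every candidate distance for one wildtype/mutant pair."""
    if len(wt_emb) != len(mut_emb):
        raise ValueError("embeddings must have the same dimension")
    return {name: fn(wt_emb, mut_emb) for name, fn in METRICS.items()}


# Toy vectors standing in for real PLM embeddings.
example = embedding_distances([1.0, 0.0], [0.0, 1.0])
```

Each variant then gets one number per metric, which can be compared against its pathogenicity label without training anything.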

The best-performing distance metric would become our own baseline and the starting point for setting up the pipeline. We can then try fine-tuning as in VariPred, or possibly other strategies, to improve on it.
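
One possible way to pick the best-performing metric, sketched in plain Python: compute the Pearson correlation between each metric's distances and the binary pathogenicity labels, then take the metric with the largest absolute correlation. Using Pearson here is an assumption on my side (AUROC or a rank correlation would be equally reasonable), and the numbers below are made-up toy values, not results.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def best_metric(distances_by_metric, labels):
    """Rank metrics by |correlation| with the labels and return the top one."""
    scores = {
        name: abs(pearson(values, labels))
        for name, values in distances_by_metric.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy data: 1 = pathogenic, 0 = benign; distances are invented for illustration.
labels = [1, 1, 0, 0]
toy_distances = {
    "cosine": [0.9, 0.8, 0.1, 0.2],
    "euclidean": [0.5, 0.1, 0.4, 0.2],
}

best, scores = best_metric(toy_distances, labels)
```

On real data the comparison should of course be done on a held-out set of labelled variants, so the chosen baseline is not tuned to the evaluation data.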

@ofivite ofivite self-assigned this May 29, 2023
@AllenChienXXX
Collaborator

Do you know the source of this model?

@merdivane
Contributor Author

It's the Nucleotide Transformer.
Model weights are available here: https://huggingface.co/InstaDeepAI
The model can now be used with the transformers library. To use it, install transformers from main: pip install --upgrade git+https://github.com/huggingface/transformers.git
Their paper is worth a look for inspiration: https://www.biorxiv.org/content/10.1101/2023.01.11.523679v1

@ofivite
Contributor

ofivite commented May 30, 2023

I'm actually not sure their model would be useful for us: it's for DNA sequences, but we work with proteins, right?
