Run MMSeqs/FoldSeek with sequences that do not use amino acid sequences #945

Open
danny305 opened this issue Jan 24, 2025 · 4 comments

Comments

@danny305

Just like how FoldSeek replaces the 20 amino acids with 20 structure tokens, is there a way to hack MMSeqs/FoldSeek to use a custom vocab?

I want to make an MMseqs2 database of sequences that use a custom vocab, and then search/cluster and build MSAs using that vocab.

Could you point me in the right direction to do this?

Danny

@milot-mirdita
Member

What application do you have in mind?

The main thing to do is to build a new substitution matrix. In addition to the substitution matrix, we also need background probabilities for the alphabet and a lambda value. These were previously computed on the fly; however, we temporarily removed this code as part of the relicensing in release 16. We will restore this functionality soon.

We have an R script that also does roughly this, but it is not very complete or well tested:
https://github.com/soedinglab/MMseqs2/blob/master/util/format_substitution_matrix.R

The above only holds if you want an alphabet with at most 20/21 letters. This assumption is baked fairly deeply into MMseqs2. A larger alphabet would probably require quite a bit of refactoring/cleanup.
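
To make the "substitution matrix plus background probabilities" part concrete, here is a minimal Python sketch of the usual log-odds construction from aligned-pair counts. It is not taken from MMseqs2 or from the R script above. The function name `log_odds_matrix`, the pseudocount of 1, the half-bit `scale` and the toy three-letter alphabet are all assumptions for illustration. It does not estimate the lambda value, and it does not write any MMseqs2 matrix file format.

```python
import math

def log_odds_matrix(pair_counts, alphabet, scale=2.0):
    """Return (background, scores) from aligned-pair counts.

    pair_counts maps (a, b) to how often letter a was aligned to letter b.
    scale=2.0 gives scores in half-bit units (an assumption, not a
    requirement of any particular tool).
    """
    n = len(alphabet)
    idx = {c: i for i, c in enumerate(alphabet)}
    # Start every cell at a pseudocount of 1 so that log() never sees zero.
    joint = [[1.0] * n for _ in range(n)]
    for (a, b), count in pair_counts.items():
        # Split each count across (a, b) and (b, a) to keep the matrix symmetric.
        joint[idx[a]][idx[b]] += count / 2
        joint[idx[b]][idx[a]] += count / 2
    total = sum(sum(row) for row in joint)
    joint = [[v / total for v in row] for row in joint]
    # Background probability of each letter is the marginal of the joint.
    background = [sum(row) for row in joint]
    scores = [
        [
            round(scale * math.log2(joint[i][j] / (background[i] * background[j])))
            for j in range(n)
        ]
        for i in range(n)
    ]
    return background, scores

# Toy example with a hypothetical three-letter alphabet.
alphabet = "HEC"
counts = {("H", "H"): 90, ("E", "E"): 90, ("C", "C"): 90, ("H", "E"): 10}
background, scores = log_odds_matrix(counts, alphabet)
for letter, row in zip(alphabet, scores):
    print(letter, row)
```

Identical letters aligned more often than chance get positive scores, and rarely aligned pairs get negative ones. For a real alphabet the counts would have to come from trusted alignments of sequences in that alphabet.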

@danny305
Author

Secondary Structure alphabet from STRIDE. So 8 letters.

Do you think this is easily doable?

Sorry we didn't get to meet up in Vancouver!

@danny305
Author

If you could provide me some instructions on how to implement what is needed I would be happy to help.

@milot-mirdita
Member

You can use and adapt the following script we made for Foldseek:
https://github.com/steineggerlab/foldseek-analysis/blob/main/training/create_submat.py

Additionally, you will need to add more letters to the alphabet so that it still has 20 letters; all of these can have the same negative value as the X letter. Also, you will currently need to use MMseqs2 release 15 for this, as releases 16 and 17 temporarily removed support for custom substitution matrices.
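
The padding step could be sketched in Python as follows. This is only an illustration of the idea, not code from either linked script. The helper name `pad_alphabet`, the `x_score=-1` default, the pool of filler letters and the placeholder 8-letter alphabet are all assumptions; take the real X score from the matrix you actually build.

```python
def pad_alphabet(alphabet, scores, target_size=20, x_score=-1,
                 candidates="ACDEFGHIKLMNPQRSTVWY"):
    """Pad an alphabet and its square score matrix up to target_size letters.

    Every cell that involves a filler letter gets x_score, i.e. the same
    negative value used for the X letter (the -1 default is an assumption).
    """
    if len(alphabet) > target_size:
        raise ValueError("alphabet already exceeds the target size")
    # Pick unused letters as fillers; X itself is never used as a filler.
    fillers = [c for c in candidates if c not in alphabet and c != "X"]
    fillers = fillers[: target_size - len(alphabet)]
    if len(alphabet) + len(fillers) < target_size:
        raise ValueError("not enough unused filler letters")
    padded_alphabet = list(alphabet) + fillers
    n = len(alphabet)
    padded = [
        [scores[i][j] if i < n and j < n else x_score for j in range(target_size)]
        for i in range(target_size)
    ]
    return padded_alphabet, padded

# Placeholder 8-letter alphabet and a toy matrix (4 on the diagonal, -2 elsewhere).
small_alphabet = "ABCDEFGH"
small_scores = [[4 if i == j else -2 for j in range(8)] for i in range(8)]
padded_alphabet, padded_scores = pad_alphabet(small_alphabet, small_scores)
print("".join(padded_alphabet))
```

The original 8x8 block is kept unchanged in the top-left corner. Sequences in the database should never contain the filler letters, so their scores only need to be present and negative.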


2 participants