AMPlify ignores sequences containing stop codon indicator #17

Open
jasmezz opened this issue Jan 31, 2025 · 4 comments
Labels: enhancement, good first issue, question

Comments

@jasmezz

jasmezz commented Jan 31, 2025

We noticed that AMPlify strictly sticks to the 20 standard amino acids in input sequences and ignores all others, as stated in its help message:

$ AMPlify -h
[...]
AMPlify v2.0.0
------------------------------------------------------
Predict whether a sequence is AMP or not.
Input sequences should be in fasta format. 
Sequences should be shorter than 201 amino acids long, 
and should not contain amino acids other than the 20 standard ones. 

So far, so clear. But even if a stop codon is indicated with the commonly used asterisk *, the sequence is ignored. I believe this behaviour might not be desired, because several sequence annotation tools (e.g. Pyrodigal, Prodigal, Bakta, Prokka) append the * by default; for Prodigal, Prokka, and Bakta it is not even possible to deactivate the * as stop codon indicator. Thus, one cannot simply use the output from such annotation tools as input for AMPlify without first removing all *.

My feature request is therefore to have AMPlify accept sequences with a stop codon indicator and remove the asterisk internally if necessary.

Minimal reproducible example:

zcat amplify-failed-genes.faa.gz > amplify-failed-genes.faa
AMPlify -s amplify-failed-genes.faa

I'll link another issue where this behaviour was observed.

@warrenlr

Thank you for your message. We understand how the inclusion of stop codon indicators (such as *) in sequence outputs from annotation tools like Prodigal, Prokka, and Bakta can cause issues when used with AMPlify.

While the current behaviour was designed to strictly accept only the 20 standard amino acids to ensure clean inputs, we acknowledge that many annotation tools append the stop codon symbol (*) by default, and this can indeed interfere with direct input into AMPlify.

We appreciate your suggestion to automatically handle stop codons by removing the asterisk internally. This could enhance AMPlify’s usability, especially for users working with outputs from a variety of annotation pipelines (or users who do not know about AMPlify's behaviour/have not read the documentation). We will certainly consider adding this functionality to future versions, as it could streamline workflows and reduce the need for additional preprocessing.

In the meantime, as you’ve mentioned, a simple one-liner in Perl or another scripting language can resolve this issue by removing the asterisks prior to running AMPlify. We will also make sure to update our documentation to better highlight this behaviour for users who may not be familiar with it.
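
For readers who want that preprocessing step spelled out, here is a minimal Python sketch of the workaround. It is not part of AMPlify; the function name `strip_stop_markers` is hypothetical, and it assumes that the stop indicator only needs to be removed from the end of each record. Asterisks inside a sequence are left untouched, so such records would still be ignored.

```python
def strip_stop_markers(fasta_text: str) -> str:
    """Return FASTA text with trailing '*' removed from the end of each record.

    Hypothetical helper, not part of AMPlify. Sequences are written back
    on a single line per record; internal '*' characters are kept as-is.
    """
    records = []  # each entry is [header, concatenated sequence]
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            records.append([line, ""])
        elif records:
            # Sequence lines before the first header are dropped.
            records[-1][1] += line

    out = []
    for header, seq in records:
        out.append(header)
        # rstrip removes every trailing '*', not just one.
        out.append(seq.rstrip("*"))
    return "\n".join(out) + "\n" if out else ""
```

One way to use it would be to read the annotation tool's `.faa` output into a string, pass it through `strip_stop_markers`, write the result to a new file, and give that file to `AMPlify -s`.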

Thanks again for your valuable feedback and interest in AMPlify.
Rene

@jasmezz
Author

jasmezz commented Feb 3, 2025

That sounds great, thanks @warrenlr for considering this request for a next AMPlify release 🚀

@berkeucar
Collaborator

Hi @jasmezz,

Thank you once again for your valuable suggestion regarding AMPlify! We truly appreciate your insights.

After carefully considering your feature request, we have decided to implement functionality to handle stop codon indicators (*), as they provide users with an important distinction between biologically ‘complete’ peptides and right-truncated ones.

With the release of version 2.0.1, we now process asterisks by internally clipping them from sequences. However, users utilizing the predict script will still be able to see the original sequences, including the asterisks, as they were initially provided. This ensures a seamless experience for both training and prediction purposes. Please note that we still do not support asterisks located within the sequence or non-standard amino acids.
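
The rules described above can be illustrated with a short Python sketch. This is not AMPlify's actual code: the function name `clip_for_prediction`, the constant names, the case-sensitive residue check, and the choice to apply the length limit after clipping are all assumptions made for illustration. Only the rules themselves (a trailing asterisk is clipped, the original sequence is kept for display, internal asterisks and non-standard residues are rejected, sequences must be shorter than 201 amino acids) come from this thread and the help message.

```python
# The 20 standard amino acids, one-letter codes (upper case only; assumption).
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")
# Help text: "shorter than 201 amino acids"; applied after clipping (assumption).
MAX_LEN = 200


def clip_for_prediction(seq):
    """Return (original, clipped) if the sequence is usable, else None.

    Hypothetical illustration of the behaviour described for v2.0.1.
    """
    # Clip a single trailing stop codon indicator, if present.
    clipped = seq[:-1] if seq.endswith("*") else seq
    if not clipped or len(clipped) > MAX_LEN:
        return None
    # Any remaining '*' is internal and fails this check,
    # as does any residue outside the standard 20.
    if any(res not in STANDARD_AA for res in clipped):
        return None
    # Keep the original so it can be reported exactly as provided.
    return seq, clipped
```

Under these assumptions, `clip_for_prediction("MKVLA*")` returns the pair `("MKVLA*", "MKVLA")`, while `clip_for_prediction("MK*VLA")` returns `None`.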

We sincerely appreciate your contribution in helping make AMPlify a more comprehensive tool.
Berke

@jasmezz
Author

jasmezz commented Feb 11, 2025

That's really cool, thank you for the quick response and release!
I just have to point out one tiny thing that you missed: the version number in AMPlify --help still shows v2.0.0 instead of v2.0.1. Any chance you could fix this, maybe in a patch release?
