AMPlify ignores sequences containing stop codon indicator #17

jasmezz · 2025-01-31T14:02:47Z

We noticed that AMPlify strictly sticks to the 20 standard amino acids in input sequences and ignores all others, as stated in its help message:

$ AMPlify -h
[...]
AMPlify v2.0.0
------------------------------------------------------
Predict whether a sequence is AMP or not.
Input sequences should be in fasta format. 
Sequences should be shorter than 201 amino acids long, 
and should not contain amino acids other than the 20 standard ones.

So far, so clear. But even if a stop codon is indicated with the commonly used asterisk *, the sequence is ignored. I believe this behaviour might not be desired, because several sequence annotation tools (e.g. Pyrodigal, Prodigal, Bakta, Prokka) append the * by default; for Prodigal, Prokka, and Bakta it is not even possible to deactivate the * as stop codon indicator. Thus, one cannot simply use the output from such annotation tools as input for AMPlify without first removing all *.

My feature request is therefore for AMPlify to accept sequences with a stop codon indicator and to remove the asterisk internally if necessary.

Minimal reproducible example:

Download this FASTA file: amplify-failed-genes.faa.gz (contains two sequences: one too long and one with *)

zcat amplify-failed-genes.faa.gz > amplify-failed-genes.faa
AMPlify -s amplify-failed-genes.faa

I'll link another issue where this behaviour was observed.


warrenlr · 2025-01-31T17:56:48Z

Thank you for your message. We understand how the inclusion of stop codon indicators (such as *) in sequence outputs from annotation tools like Prodigal, Prokka, and Bakta can cause issues when used with AMPlify.

While the current behaviour was designed to strictly accept only the 20 standard amino acids to ensure clean inputs, we acknowledge that many annotation tools append the stop codon symbol (*) by default, and this can indeed interfere with direct input into AMPlify.

We appreciate your suggestion to automatically handle stop codons by removing the asterisk internally. This could enhance AMPlify’s usability, especially for users working with outputs from a variety of annotation pipelines (or users who do not know about AMPlify's behaviour/have not read the documentation). We will certainly consider adding this functionality to future versions, as it could streamline workflows and reduce the need for additional preprocessing.

In the meantime, as you’ve mentioned, a simple one-liner in Perl or another scripting language can resolve this issue by removing the asterisks prior to running AMPlify. We will also make sure to update our documentation to better highlight this behaviour for users who may not be familiar with it.
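
As a minimal sketch of such a preprocessing step (written in Python here rather than Perl), the snippet below removes stop codon indicators from the end of each FASTA record before the file is passed to AMPlify. The function names and the file paths passed to `clean_file` are hypothetical and not part of AMPlify. It assumes that only trailing asterisks should be removed, so internal asterisks are left in place and remain visible.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, parts = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(parts)
            header, parts = line, []
        else:
            parts.append(line)
    if header is not None:
        yield header, "".join(parts)


def strip_stop_markers(lines):
    """Yield FASTA lines with trailing '*' removed from each record.

    Assumption: only asterisks at the very end of a record are stop codon
    indicators; internal '*' characters are deliberately kept.
    Multi-line sequences are joined onto a single line.
    """
    for header, seq in read_fasta(lines):
        yield header
        yield seq.rstrip("*")


def clean_file(src_path, dst_path):
    """Write a cleaned copy of src_path to dst_path (both paths hypothetical)."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for out_line in strip_stop_markers(src):
            dst.write(out_line + "\n")
```

Calling `clean_file("annotated.faa", "annotated.clean.faa")` would then produce a file that can be passed to `AMPlify -s`, provided the remaining sequences meet the length and alphabet requirements from the help message.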

Thanks again for your valuable feedback and interest in AMPlify.
Rene

jasmezz · 2025-02-03T11:25:30Z

That sounds great, thanks @warrenlr for considering this request for a next AMPlify release 🚀

berkeucar · 2025-02-06T20:01:37Z

Hi @jasmezz,

Thank you once again for your valuable suggestion regarding AMPlify! We truly appreciate your insights.

After carefully considering your feature request, we have decided to implement functionality to handle stop codon indications (*), as they provide users with an important distinction between biologically ‘complete’ peptides and right-truncated ones.

With the release of version 2.0.1, we now process asterisks by internally clipping them from sequences. However, users utilizing the predict script will still be able to see the original sequences, including the asterisks, as they were initially provided. This ensures a seamless experience for both training and prediction purposes. Please note that we still do not support asterisks located within the sequence or non-standard amino acids.
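
The behaviour described above can be sketched roughly as follows. This is not AMPlify's actual code: `prepare_sequence`, `STANDARD_AA`, and `MAX_LEN` are hypothetical names, and the sketch assumes that all trailing asterisks are clipped, that the length limit is checked after clipping, and that sequences are upper-case.

```python
# The 20 standard amino acids named in the AMPlify help message.
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")
# Help message: sequences "should be shorter than 201 amino acids long".
MAX_LEN = 200


def prepare_sequence(original):
    """Return (original, model_input) for one input sequence.

    model_input is the sequence with trailing '*' clipped, or None when the
    sequence would still be unsupported (internal '*', non-standard residue,
    empty, or too long). The original string is returned unchanged so it can
    be reported exactly as the user supplied it.
    """
    clipped = original.rstrip("*")  # assumption: clip every trailing '*'
    is_valid = 0 < len(clipped) <= MAX_LEN and set(clipped) <= STANDARD_AA
    return original, (clipped if is_valid else None)
```

Under these assumptions, `prepare_sequence("MKVLA*")` keeps `"MKVLA*"` for display and yields `"MKVLA"` for prediction, whereas `"MK*VLA"` is still rejected because the asterisk is internal.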

We sincerely appreciate your contribution in helping make AMPlify a more comprehensive tool.
Berke

jasmezz · 2025-02-11T16:22:34Z

That's really cool, thank you for the quick response and release!
I just have to point out one tiny thing that you missed: The version number in AMPlify --help still shows v2.0.0 instead of v2.0.1. Any chance you could fix this, maybe in a patch release?

This was referenced Jan 31, 2025

AMPLIFY_PREDICT exits with a ValueError while checking input nf-core/funcscan#373

Closed

Prevent Pyrodigal from adding stop codon indicators nf-core/funcscan#447

Merged

warrenlr added the enhancement, good first issue, and question labels Jan 31, 2025

amizeranschi mentioned this issue Feb 2, 2025

AA sequences ending with * have missing values for molecular_weight and hydrophobicity columns and inconsistent values on the CDS_stop_codon_found column in Ampcombi_summary.tsv nf-core/funcscan#449

Closed
