Bakta having problems handling Ns - invalid DNA characters #87

RotimiDada · 2021-12-04T19:35:34Z

Thank you @oschwengers and the team for introducing this tool. Your adherence to the FAIR principles is a huge contribution.

I have contigs from reference-based alignment and some genes of interest in the contigs. I need to annotate the contigs to get some information for downstream analyses. My problem is with a call from bakta that "fasta sequence contains invalid DNA characters". My guess is that Ns are called invalid DNA characters by bakta.

Here is the content of the log file showing the error message:

15:30:05.858 - ERROR - FASTA - import: Fasta sequence contains invalid DNA characters! id=%s
15:30:05.859 - ERROR - MAIN - wrong genome file format!
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/bakta/main.py", line 124, in main
    contigs = fasta.import_contigs(cfg.genome_path)
  File "/usr/local/lib/python3.9/site-packages/bakta/io/fasta.py", line 26, in import_contigs
    raise ValueError(f'Fasta sequence contains invalid DNA characters! id={record.id}')
ValueError: Fasta sequence contains invalid DNA characters! id=INOLLH026C
15:30:05.862 - INFO - MAIN - removed tmp dir: /tmp/tmpzubdk5dw

Thank you for your help
----Rotimi

oschwengers · 2021-12-04T23:27:41Z

Hi @RotimiDada , thanks for reporting. Since you mentioned aligned contigs:

We strive to have Bakta accept almost all valid IUPAC nucleotide characters in a DNA Fasta file. Currently, these are: ATGC, N and the ambiguity codes MRWSYKVHDB.

The only character that is excluded on purpose is the alignment gap character -, because it is not supported by the 3rd party tools involved in the workflow. Could you therefore check whether contig INOLLH026C contains any?
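
For readers who want to run this check on their own file, here is a minimal Python sketch. It is not Bakta's own validation code: the names ALLOWED and find_invalid_characters are hypothetical, the allowed set is simply the one listed above, and treating lowercase letters as valid is an assumption.

```python
from collections import defaultdict

# Characters listed in the comment above: ATGC, N and the IUPAC ambiguity codes.
ALLOWED = set("ATGCNMRWSYKVHDB")


def find_invalid_characters(fasta_lines):
    """Return {record_id: [(line_number, column, character), ...]}.

    Line numbers and columns are 1-based. Illustrative helper only,
    not Bakta's implementation.
    """
    hits = defaultdict(list)
    record_id = None
    for line_number, line in enumerate(fasta_lines, start=1):
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # Header line: the id is the first whitespace-delimited token.
            fields = line[1:].split()
            record_id = fields[0] if fields else ""
            continue
        for column, char in enumerate(line, start=1):
            # Assumption: lowercase bases are treated like uppercase ones.
            if char.upper() not in ALLOWED:
                hits[record_id].append((line_number, column, char))
    return dict(hits)


if __name__ == "__main__":
    example = [
        ">INOLLH026C example",
        "ATGCNNNNATGC",
        "AATT-CTGAAR",
    ]
    for rid, positions in find_invalid_characters(example).items():
        for line_number, column, char in positions:
            print(f"{rid}: line {line_number}, column {column}: {char!r}")
```

To scan a real file, pass an open file handle, which iterates line by line, e.g. `find_invalid_characters(open("genome.fa"))`, where the file name is a placeholder.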

RotimiDada · 2021-12-05T14:07:13Z

Thank you Oliver for a super fast response. I have checked the contigs and can't seem to find any "-" characters in them.

INOLLH026C.txt

I am attaching the fasta file for you to see if you could also reproduce this error.

By the way, Prokka annotates these files without encountering errors, but I need the annotation to conform to the nomenclature of the databases that I used for calling my genes of interest (e.g. VirulenceFinder), in keeping with FAIR.

oschwengers · 2021-12-06T08:06:15Z

That sounds interesting. Unfortunately, the file you've provided is not the Fasta file. Could you attach the input Fasta file you've used for the annotation so I can take a look at that?

RotimiDada · 2021-12-06T11:38:02Z

Thank you once again Oliver. I am sorry, I did not realize that I had erroneously sent you an annotation file. Please find attached the fasta file.

---Kind regards,
Rotimi
agg3CD_INOLLH026C.fa.zip

oschwengers · 2021-12-06T13:14:50Z

Dear Rotimi,
I found a total of 9 dashes (-) in your Fasta file, for instance in line 277 character 39:
AATATCCTGAAGAGTTTTGCTCCTGGTAATTAATTATT-CTGAATTATTACCTTACATGG

These are not compatible with the 3rd party tools used in the workflow, e.g. Infernal. When I remove all these dashes (Prokka does that automatically), Bakta successfully annotates the amended Fasta file:
bakta.zip

Thank you very much for reporting and bringing up this issue. As this might affect other users as well, I will add an automated removal of dashes soon.
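
Until a release with that fix is available, the manual workaround described above can be sketched in Python as follows. This is an assumption-laden illustration, not the code that was later committed to Bakta, and the function name remove_alignment_gaps is hypothetical. It strips dashes from sequence lines only, because header lines may legitimately contain a dash.

```python
def remove_alignment_gaps(fasta_lines):
    """Return (cleaned_lines, removed_count) with '-' stripped from sequences.

    Header lines (starting with '>') are passed through unchanged.
    Illustrative helper only, not Bakta's implementation.
    """
    cleaned = []
    removed = 0
    for line in fasta_lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # Keep the header as-is, including any dashes in the id.
            cleaned.append(line)
            continue
        removed += line.count("-")
        cleaned.append(line.replace("-", ""))
    return cleaned, removed


if __name__ == "__main__":
    lines = [
        ">INOLLH026C",
        "AATATCCTGAAGAGTTTTGCTCCTGGTAATTAATTATT-CTGAATTATTACCTTACATGG",
    ]
    cleaned, removed = remove_alignment_gaps(lines)
    print("\n".join(cleaned))
    print(f"removed {removed} gap character(s)")
```

Sequence lines that contained dashes end up shorter than their neighbours. If a downstream tool requires uniform line widths, re-wrap the sequences afterwards.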

RotimiDada · 2021-12-06T23:05:58Z

Dear Oliver,

Many thanks for your help. I can also confirm that after removing the dashes, bakta ran successfully. I am sorry for having to make you spot the dashes yourself, after I failed to detect dashes in my first attempt. By the way, thank you for planning to automate dash removal.

Warm regards,
---Rotimi

oschwengers · 2021-12-06T23:11:04Z

You're welcome! For the sake of documentation, a soon-to-come commit will address this issue, so I'll keep it open for a while.
Best regards!

RotimiDada · 2021-12-06T23:14:27Z

My thoughts exactly. Cheers!

RotimiDada added the enhancement label Dec 4, 2021

oschwengers self-assigned this Dec 4, 2021

oschwengers added a commit that referenced this issue Dec 7, 2021

add CI tests for alignment gap removal #87

1672a87

oschwengers added a commit that referenced this issue Dec 7, 2021

remove alignment gaps from input sequences #87

ca3449d

oschwengers closed this as completed Dec 14, 2021

oschwengers added this to the v1.3.0 milestone Jan 5, 2022

plaquette mentioned this issue Jul 13, 2023

bakta encounters issues annotating assemblies with presence of N's #222

Closed
