Bakta having problems handling Ns - invalid DNA characters #87

Closed
RotimiDada opened this issue Dec 4, 2021 · 8 comments

Comments

@RotimiDada

Thank you @oschwengers and the team for introducing this tool. Your adherence to the FAIR principles is a huge contribution.

I have contigs from reference-based alignment and some genes of interest in the contigs. I need to annotate the contigs to get some information for downstream analyses. My problem is with a call from bakta that "fasta sequence contains invalid DNA characters". My guess is that Ns are called invalid DNA characters by bakta.

Here is the content of the log file showing the error message:

15:30:05.858 - ERROR - FASTA - import: Fasta sequence contains invalid DNA characters! id=%s
15:30:05.859 - ERROR - MAIN - wrong genome file format!
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/bakta/main.py", line 124, in main
    contigs = fasta.import_contigs(cfg.genome_path)
  File "/usr/local/lib/python3.9/site-packages/bakta/io/fasta.py", line 26, in import_contigs
    raise ValueError(f'Fasta sequence contains invalid DNA characters! id={record.id}')
ValueError: Fasta sequence contains invalid DNA characters! id=INOLLH026C
15:30:05.862 - INFO - MAIN - removed tmp dir: /tmp/tmpzubdk5dw

Thank you for your help
----Rotimi

@RotimiDada RotimiDada added the enhancement New feature or request label Dec 4, 2021
@oschwengers
Owner

oschwengers commented Dec 4, 2021

Hi @RotimiDada , thanks for reporting. Since you mentioned aligned contigs:

We strive to have Bakta accept almost all valid IUPAC nucleotide characters in a DNA Fasta file. Currently, these are ATGC and the ambiguity codes MRWSYKVHDBN (which include N).

The only character excluded on purpose is the dash (-), because third-party tools involved in the workflow do not support it. Could you therefore check whether contig INOLLH026C contains one?
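
To locate characters outside the alphabet listed above, a small standard-library check can report each offending position. This is a minimal sketch and not Bakta's own validation code: the function name `find_invalid_characters` and the case-insensitive matching are assumptions made for illustration.

```python
import io

# Alphabet quoted in the comment above: ATGC plus the IUPAC ambiguity codes.
# Matching case-insensitively is an assumption, not confirmed Bakta behaviour.
VALID_DNA = set("ATGCMRWSYKVHDBN")


def find_invalid_characters(handle):
    """Yield (record_id, line_number, column, character) for each invalid character.

    Line numbers and columns are 1-based and refer to the raw FASTA file.
    """
    record_id = None
    for line_number, line in enumerate(handle, start=1):
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # The record id is the first whitespace-delimited token of the header.
            parts = line[1:].split()
            record_id = parts[0] if parts else ""
            continue
        for column, character in enumerate(line, start=1):
            if character.upper() not in VALID_DNA:
                yield record_id, line_number, column, character


# Tiny in-memory example (hypothetical data); a real file can be passed via open(path).
example = io.StringIO(">contig_1\nATGCNNRY\nAAT-CTG\n")
for hit in find_invalid_characters(example):
    print(hit)  # ('contig_1', 3, 4, '-')
```

Reporting the line and column makes a stray character easy to find even when a plain text search misses it.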

@oschwengers oschwengers self-assigned this Dec 4, 2021
@RotimiDada
Author

Thank you Oliver for a super fast response. I have checked the contigs and can't seem to find a "-" character in them.

INOLLH026C.txt

I am attaching the fasta file for you to see if you could also reproduce this error.

By the way, Prokka annotates these files without encountering errors, but I need the annotation to conform to the nomenclature of the databases that I used for calling my genes of interest (e.g. VirulenceFinder), in keeping with FAIR.

@oschwengers
Owner

That sounds interesting. Unfortunately, the file you've provided is not the Fasta file. Could you attach the input Fasta file you've used for the annotation so I can take a look at that?

@RotimiDada
Author

Thank you once again Oliver. I am sorry. I didn't realize I had erroneously sent you an annotation file. Please find attached the fasta file.

---Kind regards,
Rotimi
agg3CD_INOLLH026C.fa.zip

@oschwengers
Owner

oschwengers commented Dec 6, 2021

Dear Rotimi,
I found a total of 9 dashes (-) in your Fasta file, for instance in line 277 character 39:
AATATCCTGAAGAGTTTTGCTCCTGGTAATTAATTATT-CTGAATTATTACCTTACATGG

These are not compatible with the 3rd party tools used in the workflow, e.g. Infernal. When I remove all these dashes (Prokka does that automatically), Bakta successfully annotates the amended Fasta file:
bakta.zip

Thank you very much for reporting and bringing up this issue. As this might affect other users as well, I will add an automated removal of dashes soon.
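
Until such an automated removal is available, the manual workaround described above (deleting every dash before annotation) could be sketched as follows. This is an illustrative snippet, not the fix that went into Bakta: the function name `strip_dashes` is hypothetical, and it assumes dashes should be removed only from sequence lines while header lines stay unchanged.

```python
def strip_dashes(fasta_text):
    """Return FASTA text with '-' removed from sequence lines.

    Header lines (starting with '>') are left unchanged, because a dash
    inside a sequence identifier is not an alignment gap.
    """
    cleaned = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            cleaned.append(line)
        else:
            cleaned.append(line.replace("-", ""))
    return "\n".join(cleaned) + "\n"


# Hypothetical in-memory example; for a real file, read it, clean it, and
# write the result to a new path before running the annotation.
print(strip_dashes(">contig-1\nAATTATT-CTGAAT\n"), end="")
# >contig-1
# AATTATTCTGAAT
```

Note that removing gap characters leaves sequence lines with uneven lengths. Some tools that index Fasta files may expect uniform line lengths, so re-wrapping the sequences afterwards may be needed.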

@RotimiDada
Author

Dear Oliver,

Many thanks for your help. I can also confirm that after removing the dashes, bakta ran successfully. I am sorry for having to make you spot the dashes yourself, after I failed to detect dashes in my first attempt. By the way, thank you for planning to automate dash removal.

Warm regards,
---Rotimi

@oschwengers
Owner

You're welcome! For the sake of documentation, a soon-to-come commit will address this issue. Therefore, I'll keep this issue open for a while.
Best regards!

@RotimiDada
Author

My thoughts exactly. Cheers!
