Bakta Prokka SNP Comparison #257

Open
whottel opened this issue Feb 12, 2025 · 3 comments
@whottel

whottel commented Feb 12, 2025

Hello, I was trying out the most recent version of the pipeline using bakta and compared the results to a prokka run on the same set of CRAB sequences. As with a previous issue I raised a few months ago, when the default aligner was changed from roary to panaroo, I found that the choice of gene annotation tool had a significant impact on the resulting SNP matrix and its interpretation.

Please find attached an excel file that includes a comparison of output matrices and core genome metrics.

Matrix Comparison.xlsx

Up to this point I have been using prokka and roary, so the first matrix is essentially the status quo from my point of view. To focus on one part of the matrix: S19-S23 are all within two SNPs of each other, fewer than 10 SNPs from a few other sequences in the analysis, and no more than 51 SNPs from any other sequence.

In the second matrix (bakta/roary), S19-S23 now look to be split into two subclusters and, more surprising to me, are now >1000 SNPs apart from all other sequences.

Since the default annotator/aligner is bakta/panaroo, I ran the same analysis that way as well for the third matrix, which gives another slightly different interpretation. S19-S23 are no longer drastically different from the others, as they were with bakta/roary, but there are other differences; for example, S22 no longer clusters with S19-S21 and S23.

The final matrix is generated by BugSeq’s refMLST method and appears to most closely resemble the prokka/roary matrix.
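
For anyone who wants to make this kind of comparison repeatable, here is a minimal Python sketch that clusters samples by a SNP threshold and lists the pairs whose linked/unlinked status changes between two runs. It is only an illustration: the function names, the threshold of 10 SNPs, and the toy distances are all assumptions made up for the example and are not taken from the attached spreadsheet or from the pipeline.

```python
from itertools import combinations


def snp_clusters(samples, dist, threshold):
    """Single-linkage clusters: two samples are linked when their
    pairwise SNP distance is <= threshold (hypothetical helper)."""
    parent = {s: s for s in samples}

    def find(s):
        # Follow parent links to the cluster representative.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for a, b in combinations(samples, 2):
        if dist[frozenset((a, b))] <= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for s in samples:
        groups.setdefault(find(s), set()).add(s)
    return {frozenset(g) for g in groups.values()}


def changed_pairs(samples, dist_a, dist_b, threshold):
    """Pairs that are within the threshold in one matrix but not the other."""
    changed = []
    for a, b in combinations(samples, 2):
        key = frozenset((a, b))
        if (dist_a[key] <= threshold) != (dist_b[key] <= threshold):
            changed.append((a, b, dist_a[key], dist_b[key]))
    return changed


# Toy data only: three made-up samples and made-up distances.
samples = ["A", "B", "C"]
run_1 = {frozenset("AB"): 1, frozenset("AC"): 8, frozenset("BC"): 9}
run_2 = {frozenset("AB"): 1, frozenset("AC"): 1200, frozenset("BC"): 1201}
THRESHOLD = 10  # illustrative cutoff, not a recommendation

clusters_1 = snp_clusters(samples, run_1, THRESHOLD)
clusters_2 = snp_clusters(samples, run_2, THRESHOLD)
differences = changed_pairs(samples, run_1, run_2, THRESHOLD)

print(clusters_1)
print(clusters_2)
print(differences)
```

With the toy values, the first run puts all three samples in one cluster, while the second run separates C from A and B, which is the same kind of shift described above for the different annotator/aligner combinations.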

I can share the FASTQ files if you are interested.

Thanks,
Wes

@erinyoung
Member

I think we should write a paper together.

I also want to compare using annotations from pgap in addition to prokka and roary.

For core genome comparison, I'd like to add in pirate, ppanggolin, ksnp4, poppunk, fastani, skani, and mash.

We could add in bugseq too.

And then I really want to throw a wrench into the analysis by using only the chromosomal sequence.

The focus would be on the utility of these tools for public health outbreak investigations. (Something that expands on https://pubmed.ncbi.nlm.nih.gov/31682222/)

Wanna collaborate?!?!?!?

@whottel
Author

whottel commented Feb 12, 2025

That sounds great!
I would need to get the okay from my lab leadership especially if we want to include BugSeq.

@whottel
Author

whottel commented Feb 21, 2025

Hi Erin,

In case you did not see the email I sent to your utah.gov address, could we set up a call to discuss this collaboration?

Thanks,
Wes
