Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Panaroo vs. Roary #188

Closed
whottel opened this issue Apr 12, 2024 · 9 comments
Closed

Panaroo vs. Roary #188

whottel opened this issue Apr 12, 2024 · 9 comments

Comments

@whottel
Copy link

whottel commented Apr 12, 2024

Hello,

I went back through an old cluster analysis to see what differences in interpretation would potentially arise in using with an older grandeur version that uses roary versus newer versions that are now using panaroo.
Please find a summary excel file with this comparison:
Grandeur_Version_Matrix_Comparisons_Issue.xlsx

This comparison includes sequences that were part of a 2017 P. mirabilis outbreak. Grandeur with roary identified 0-3 SNPs among these outbreak sequences. With other versions of grandeur that use panaroo, two sequences seem to form their own subcluster and differ by >200 SNPs from the other members of the outbreak cluster. Surprisingly, 2017-A and 2017-D are from the same sample but again with panaroo differ by hundreds of SNPs. Also, included is a third-party refMLST analysis, which generally agrees with the roary SNP matrix as far as interpretation.

@erinyoung
Copy link
Member

Roary is no longer maintained. (https://github.com/sanger-pathogens/Roary). I ran into many issues with Roary that several other people had also run into, such as having samples randomly being dropped from the analysis.

When it came time to update Grandeur, I knew I needed to replace Roary with something currently maintained. The options I chose to experiment with were:

Admittedly, I didn't have a lot of time for this comparison. So in addition to my sample set (a cluster of seven Serratia samples), I relied on the literature. The two most influential papers were file:///Users/eriny/Downloads/s13059-020-02090-4.pdf and https://www.microbiologyresearch.org/content/journal/mgen/10.1099/mgen.0.000690.

All of these tools tout themselves as being a roary replacement, but they are all slightly different from roary and from each other.

Here's a figure that may or may not be helpful:
https://link.springer.com/article/10.1186/s13059-020-02090-4/figures/2

In summary, for core genome analysis, panaroo does identify more pairwise SNPs that Roary.

The default arguments for panaroo in Grandeur are --clean-mode strict --remove-invalid-genes, which is supposedly the best way to remove possible contamination. As observed in the file that you sent to me, the total number of genes accepted for each isolate in the analysis is smaller (i.e. 2017-30-24 goes from 3,775 genes identified by roary to 3,702 genes identified by panaroo). Panaroo is more selective as to what is actually a gene. Genes are then matched to other isolates through other methods. Roary used blast, and panaroo uses cd-hit. It looks like panaroo is using genes in the core genome that roary does not match together. Your Version v3.6.20231219 (roary) has a core genome of 3,187 genes, while version 4+ has 3,255 genes. I imagine these genes hold most of your additional SNPs.

This unfortunately, is going to be a similar story with both ppanggolin and pirate as well.

Panaroo was used for Grandeur due to its results being basically the same in my sample set and ease of use. I had issues running pirate with singularity (perl issues ftw!), and ppanggolin is much more complicated to use.

I think there is a way to adjust panaroo to run more like roary (https://gthlab.au/panaroo/#/gettingstarted/params), but the sample set that I was using doesn't look like a good fit for this. Are the samples in your comparison on the SRA? It looks like it would be perfect for fine tuning some arguments.

@whottel
Copy link
Author

whottel commented Apr 18, 2024

Hi Erin,

Thanks for providing those details.
Most of these in my example were sequenced by CDC and have not been uploaded to NCBI. However, I did reach out and there are fine with me sharing the files with you. I will send a shared drive link.

@erinyoung
Copy link
Member

I have good news!

  1. I was able to replicate your findings

Yay?

  1. I also ran these files through ppanggolin, just to see if those results to be more like roary.

These results were from using the contigs.

ppanggolin all --fastas list_of_fasta.txt --cpu 20 --outdir out
ppanggolin msa -p out/pangenome.h5 --cpu 20 --outdir out2 --phylo
snp-dists out2/core_genome_alignment.aln

This designates 3612 "persistent" genes, which is more than both roary and panaroo.

The SNPs identified, though, were much smaller. Suspiciously so.

snp-dists 0.8.2 2017-30-06      2017-30-24      2017-30-A       519946  572111
2017-30-06      0       0       0       1       0
2017-30-24      0       0       0       1       1
2017-30-A       0       0       0       1       1
519946  1       1       1       0       1
572111  0       1       1       1       0

@erinyoung
Copy link
Member

Now for some bad news.

I ran your samples by adjusting several of panaroo's parameters listed at https://gtonkinhill.github.io/panaroo/#/gettingstarted/params

I didn't see a big difference in the core genome size. When I looked at the results from snp-dists, I don't see a real difference there either.

The SNP matrices of my exploration today.

# setting -f to 0.8 (core = 3,248)
==> 0.8_grandeur/snp-dists/snp_matrix.txt <==						
snp-dists 0.8.2	2017-30-A	519946	Proteus_mirabilis_GCF_000069965.1	2017-30-24	2017-30-06	572111
2017-30-A	0	73	16890	207	201	48
519946	73	0	16413	278	274	90
Proteus_mirabilis_GCF_000069965.1	16890	16413	0	17111	17095	16373
2017-30-24	207	278	17111	0	8	253
2017-30-06	201	274	17095	8	0	249
572111	48	90	16373	253	249	0
						
# setting -f to 0.9 (core = 3,240)
==> 0.9_grandeur/snp-dists/snp_matrix.txt <==						
snp-dists 0.8.2	2017-30-A	572111	2017-30-24	Proteus_mirabilis_GCF_000069965.1	519946	2017-30-06
2017-30-A	0	48	207	16055	77	201
572111	48	0	253	15546	94	249
2017-30-24	207	253	0	16276	282	8
Proteus_mirabilis_GCF_000069965.1	16055	15546	16276	0	15580	16260
519946	77	94	282	15580	0	278
2017-30-06	201	249	8	16260	278	0
						
# default grandeur (core = 3,257)
==> grandeur/snp-dists/snp_matrix.txt <==						
snp-dists 0.8.2	Proteus_mirabilis_GCF_000069965.1	2017-30-24	2017-30-A	572111	519946	2017-30-06
Proteus_mirabilis_GCF_000069965.1	0	18694	18468	17988	17992	18673
2017-30-24	18694	0	207	289	278	8
2017-30-A	18468	207	0	84	73	201
572111	17988	289	84	0	126	285
519946	17992	278	73	126	0	274
2017-30-06	18673	8	201	285	274	0
						
# using --clean-mode moderate (core = 3,264)
==> moderate_grandeur/snp-dists/snp_matrix.txt <==						
snp-dists 0.8.2	2017-30-A	Proteus_mirabilis_GCF_000069965.1	572111	2017-30-06	519946	2017-30-24
2017-30-A	0	18469	84	201	73	208
Proteus_mirabilis_GCF_000069965.1	18469	0	17989	18674	17993	18696
572111	84	17989	0	285	126	290
2017-30-06	201	18674	285	0	274	9
519946	73	17993	126	274	0	279
2017-30-24	208	18696	290	9	279	0

# not using -a (core = 	3,257)	
==> noa_grandeur/snp-dists/snp_matrix.txt <==						
snp-dists 0.8.2	2017-30-24	2017-30-06	2017-30-A	Proteus_mirabilis_GCF_000069965.1	572111	519946
2017-30-24	0	8	207	18694	289	278
2017-30-06	8	0	201	18673	285	274
2017-30-A	207	201	0	18468	84	73
Proteus_mirabilis_GCF_000069965.1	18694	18673	18468	0	17988	17992
572111	289	285	84	17988	0	126
519946	278	274	73	17992	126	0
						
# using --clean-mode sensitive (core = 3,264)
==> sensitive_grandeur/snp-dists/snp_matrix.txt <==						
snp-dists 0.8.2	2017-30-06	2017-30-A	Proteus_mirabilis_GCF_000069965.1	519946	2017-30-24	572111
2017-30-06	0	201	18674	274	9	285
2017-30-A	201	0	18469	73	208	84
Proteus_mirabilis_GCF_000069965.1	18674	18469	0	17993	18696	17989
519946	274	73	17993	0	279	126
2017-30-24	9	208	18696	279	0	290
572111	285	84	17989	126	290	0

@erinyoung
Copy link
Member

Looks like I can't use bugseq, so I can't compare refmlst. It looks like it uses minimap2 to identify genes of interest and it maps reads onto those genes.

@whottel
Copy link
Author

whottel commented Apr 23, 2024

Thanks for running all those tests, but still not a clear picture of what is going on here.
I am surprised by the ppanggolin result showing everything within 0-1 SNP despite supposedly including the most genes in the analysis.

@erinyoung
Copy link
Member

Thanks for running all those tests, but still not a clear picture of what is going on here.
I am surprised by the ppanggolin result showing everything within 0-1 SNP despite supposedly including the most genes in the analysis.

My apologies for my late response.

I've copied the gene_presence_absence.Rtab for panaroo and roary.

Roary uses blast to see if genes are the same between organisms. Panaroo uses CD-hit. As such, they determine different genes are shared between the samples.

For example, Roary determines arnC_1 is shared by all these samples, while panaroo does not. Similarly, arnC_4 is predicted to be shared by all samples with panaroo, but not roary. I think roary might be more effective at grouping similar genes together, which panaroo is grouping genes that might be more distant, hence why there are more SNPs.

The gene presence/absence files:
panaroo_gene_presence_absence.Rtab.txt
roary_gene_presence_absence.Rtab.txt

I have not found a combination of parameters for panaroo to adjust it such that it matches that of roary.

Pathogen Surveillance uses PIRATE (https://github.com/nf-core/pathogensurveillance/tree/dev) which is also blast based, so this workflow might be helpful to you. I would like to add PIRATE to my comparison, but I have not gotten to work with my local environment (perl issues with singularity).

The next version of grandeur (#191) will support using roary instead of panaroo by setting the param.aligner to roary (--aligner roary on the command line) so you can continue to use roary.

@erinyoung
Copy link
Member

I think bactopia also has a pirate option for pangenome analysis too.

@erinyoung
Copy link
Member

Also, I JUST FOUND THIS PAPER/TOOL (https://github.com/maxgmarin/panqc). This should add some additional QC to panaroo. I'm going to try this tomorrow (or sometime soon).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants