fastgather slower than sourmash gather for more complex samples? #312

Open
zxl124 opened this issue Apr 27, 2024 · 8 comments

Comments

@zxl124

zxl124 commented Apr 27, 2024

I was testing fastgather with a fecal microbiome sample, and noticed that while fastgather does use much less memory than sourmash gather, it actually takes much longer to finish. With SRR606249, which was used in #214, I can confirm fastgather is both faster and more memory efficient than gather. The difference is that my test sample (see attached signature file) is more complex, i.e. it has a larger signature file than SRR606249.

Test results:
fastgather (branchwater v0.9.3): on 8 cpus, 572 minutes, 19.3G peak rss.
gather (sourmash v4.8.8): on 1 cpu, 170 minutes, 59.8G peak rss.

Test parameters:
k=51, database file is GTDB-RS207 all genomes at k=51.
commands:
sourmash scripts fastgather fecal_ref.sig k51/gtdb-rs207.genomic.k51.zip -k 51 -o test_fastgather.csv 2> test_fastgather.log
sourmash gather fecal_ref.sig k51/gtdb-rs207.genomic.k51.zip -k 51 -o test_gather.csv > test_gather.log

Please see the signature file and fastgather log.
fecal_ref.sig.zip
test_fastgather.log

Any idea why and if there's a quick remedy for this?

@ctb
Collaborator

ctb commented Apr 29, 2024

wow, very unexpected. What happens if you do -c 8 with fastgather, to make sure you're using all 8 CPUs?

@ctb
Collaborator

ctb commented Apr 29, 2024

I will say fastgather uses a much dumber algorithm than sourmash gather so that key parts of the loop can be parallelized. I could dimly see a situation where a community with certain types of structure could go faster with sourmash gather, but ... it's hard to understand exactly how!

@zxl124
Author

zxl124 commented Apr 29, 2024

> wow, very unexpected. What happens if you do -c 8 with fastgather, to make sure you're using all 8 CPUs?

I will rerun with -c 8 and report back. But given that the user time reported by the time command was 4269 minutes, about 7.5 times the real time, it's likely that fastgather was already using all 8 cpus.
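
The arithmetic behind that estimate can be checked in a few lines of Python. This is only a sketch using the numbers quoted in this thread; the function name is made up for illustration, and the last line assumes that gather's CPU time is roughly its 170-minute wall-clock time (it ran on 1 cpu, but its user time was not reported).

```python
def effective_cores(user_minutes, real_minutes):
    """Average number of busy cores: CPU (user) time divided by wall-clock time."""
    return user_minutes / real_minutes


# fastgather: 4269 minutes of user time over 572 minutes of real time.
fastgather_cores = effective_cores(4269, 572)
print(f"fastgather kept about {fastgather_cores:.1f} of 8 cores busy")

# ASSUMPTION: gather's CPU time is approximated by its 170-minute wall time.
cpu_time_ratio = 4269 / 170
print(f"fastgather used roughly {cpu_time_ratio:.0f}x the CPU time of gather")
```

So the slowdown is not explained by idle cores: under that assumption, fastgather did far more total work than gather on this sample.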

@zxl124
Author

zxl124 commented Apr 29, 2024

> I will say fastgather uses a much dumber algorithm than sourmash gather so that key parts of the loop can be parallelized. I could dimly see a situation where a community with certain types of structure could go faster with sourmash gather, but ... it's hard to understand exactly how!

I don't know how the algorithm behind fastgather works, but I do see that with my sample it took >1600 iterations, while with SRR606249 it only took ~80 iterations. One question I had: according to the documentation, sourmash gather runs a prefetch step, but I don't know whether fastgather does that too. I was trying to find a way to feed prefetch results to fastgather, but I don't know how.
If you are interested, this sample is metagenomic sequencing data of this human fecal sample.

@ctb
Collaborator

ctb commented Apr 29, 2024

Ironically, the only thing that fastgather does is a prefetch - but it does it every iteration, because it can be parallelized. So it's running 1600 searches across a large database. And I can see why that would be slow!

This also means that you've got a community with over 1600 distinct matches - so, a pretty complex community.
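
To make the per-iteration cost concrete, here is a toy Python model of a greedy loop that rescans the whole database on every iteration, following the description above. It is not the branchwater Rust code: sketches are plain sets of integers, and `toy_fastgather` and the database contents are hypothetical names invented for this illustration.

```python
def toy_fastgather(query, database):
    """Greedy decomposition that rescans the whole database every iteration.

    query: set of integer hashes; database: dict mapping name -> set of hashes.
    Toy model only; all names and data here are hypothetical.
    """
    remaining = set(query)
    results = []
    comparisons = 0
    while remaining:
        # "Prefetch" step: compare the remaining query against every sketch.
        # Each comparison is independent, so this scan parallelizes easily.
        overlaps = {}
        for name, hashes in database.items():
            comparisons += 1
            overlaps[name] = len(remaining & hashes)
        best = max(overlaps, key=overlaps.get)
        if overlaps[best] == 0:
            break  # nothing left in the database matches
        results.append((best, overlaps[best]))
        remaining -= database[best]  # subtract the match, then search again
    return results, comparisons


db = {
    "genomeA": {1, 2, 3, 4},
    "genomeB": {3, 4, 5},
    "genomeC": {6, 7},
    "unrelated": {100, 101},
}
matches, n_comparisons = toy_fastgather({1, 2, 3, 4, 5, 6, 7, 8}, db)
print(matches)        # three matches, in the order they were found
print(n_comparisons)  # 4 iterations x 4 sketches = 16
```

In this model the number of comparisons grows as (iterations) x (database size), so a sample that needs ~1600 iterations costs about 20 times as many full scans as one that needs ~80.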

Thanks for the sketch! It will help us benchmark.

There is an option that we've been exploring to combine genome sketches into species-level pangenome sketches, see sourmash-bio/sourmash#2903 (and links therein). It's still research level but my bet is that it would solve this particular problem, if all you're interested in is species level matches. If you're interested in individual strains and/or genomes, you'd need to follow up with more detailed searches, but the first species-level breakdown would be much faster.
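
As a rough illustration of the idea (not the approach in the linked issue, which may differ in its details), merging genome sketches into one sketch per species can be modeled as a set union. The function name, genome names, and species labels below are all hypothetical.

```python
from collections import defaultdict


def merge_by_species(genome_sketches, species_of):
    """Union the hash sets of all genomes that share a species label.

    genome_sketches: dict mapping genome name -> set of integer hashes.
    species_of: dict mapping genome name -> species label.
    Toy model only; real sketches carry more than a bare set of hashes.
    """
    pangenomes = defaultdict(set)
    for genome, hashes in genome_sketches.items():
        pangenomes[species_of[genome]] |= hashes
    return dict(pangenomes)


genomes = {
    "strain1": {1, 2, 3},
    "strain2": {2, 3, 4},
    "strain3": {10, 11},
    "strain4": {11, 12},
}
labels = {
    "strain1": "species_X",
    "strain2": "species_X",
    "strain3": "species_Y",
    "strain4": "species_Y",
}
species_db = merge_by_species(genomes, labels)
print(len(genomes), "genome sketches ->", len(species_db), "species sketches")
```

A search then compares the query against one sketch per species instead of one per genome, which shrinks every scan and typically needs fewer iterations, at the price of losing strain-level resolution.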

@zxl124
Author

zxl124 commented Apr 30, 2024

> wow, very unexpected. What happens if you do -c 8 with fastgather, to make sure you're using all 8 CPUs?

I can confirm that adding -c 8 does not change the performance of fastgather.

@zxl124
Author

zxl124 commented Aug 9, 2024

An update: fastgather is only slower than gather when the sample and the database are both complex. When I switched to the much smaller "representative genomes" version of GTDB release 214, with the same signature file, gather (v4.8.8) took 25m49s to finish and fastgather (v0.9.5) took 15m4s.

@ctb
Collaborator

ctb commented Aug 25, 2024

thanks @zxl124 - that's still really slow 😭

I've been doing some benchmarking here,

sourmash-bio/sourmash#3232

and also I've been increasingly immersed in the Rust code. There are definitely a few things that I/you/we can try, including -
