fastgather slower than sourmash gather for more complex samples? #312
Comments
wow, very unexpected. What happens if you do |
I will say |
I will rerun with |
I don't know how the algorithm behind |
Ironically, the only thing that fastgather does is a prefetch - but it does it on every iteration, because that step can be parallelized. So it's running 1600 searches across a large database, and I can see why that would be slow! This also means that you've got a community with over 1600 distinct matches - so, a pretty complex community. Thanks for the sketch! It will help us benchmark. There is an option that we've been exploring to combine genome sketches into species-level pangenome sketches, see sourmash-bio/sourmash#2903 (and links therein). It's still research-level, but my bet is that it would solve this particular problem if all you're interested in is species-level matches. If you're interested in individual strains and/or genomes, you'd need to follow up with more detailed searches, but the first species-level breakdown would be much faster. |
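To make the cost argument in the comment above concrete, here is a minimal toy sketch in Python. It is not the fastgather or sourmash code: sketches are modelled as plain sets of integers, and the function names `gather_rescan` and `gather_prefetch_once`, as well as the example data, are hypothetical. The first function searches the whole database on every iteration, as the comment describes; the second is one possible alternative, assumed here purely for illustration, that searches the database once and then iterates only over the candidates that overlapped the query. Both count how many sketch comparisons they perform.

```python
from typing import Dict, List, Set, Tuple

Sketch = Set[int]  # toy stand-in for a set of k-mer hashes


def gather_rescan(query: Sketch, database: Dict[str, Sketch]) -> Tuple[List[Tuple[str, int]], int]:
    """Greedy decomposition that searches the whole database every iteration."""
    remaining = set(query)
    results = []
    comparisons = 0
    while remaining:
        best_name, best_overlap = None, 0
        for name in sorted(database):  # full database scan on each round
            comparisons += 1
            overlap = len(remaining & database[name])
            if overlap > best_overlap:
                best_name, best_overlap = name, overlap
        if best_name is None:  # nothing in the database matches any more
            break
        results.append((best_name, best_overlap))
        remaining -= database[best_name]  # remove the hashes just explained
    return results, comparisons


def gather_prefetch_once(query: Sketch, database: Dict[str, Sketch]) -> Tuple[List[Tuple[str, int]], int]:
    """Greedy decomposition that scans the database once, then reuses the candidates."""
    comparisons = 0
    candidates = {}
    for name in sorted(database):  # single full database scan
        comparisons += 1
        if query & database[name]:
            candidates[name] = database[name]
    remaining = set(query)
    results = []
    while remaining and candidates:
        best_name, best_overlap = None, 0
        for name in sorted(candidates):  # only overlapping sketches are revisited
            comparisons += 1
            overlap = len(remaining & candidates[name])
            if overlap > best_overlap:
                best_name, best_overlap = name, overlap
        if best_name is None:
            break
        results.append((best_name, best_overlap))
        remaining -= candidates.pop(best_name)
    return results, comparisons


# Hypothetical toy data: three overlapping "genomes" and two unrelated ones.
query = {1, 2, 3, 4, 5, 6, 7}
database = {
    "genomeA": {1, 2, 3, 4},
    "genomeB": {3, 4, 5},
    "genomeC": {6},
    "unrelated1": {100, 101},
    "unrelated2": {200},
}

rescan_results, rescan_comparisons = gather_rescan(query, database)
prefetch_results, prefetch_comparisons = gather_prefetch_once(query, database)
print(rescan_results, rescan_comparisons)
print(prefetch_results, prefetch_comparisons)
```

In this toy model the rescanning variant's comparison count grows roughly as (number of matches) × (database size), which is why a sample with over 1600 matches against a large database is expensive unless the per-iteration search is parallelized very effectively. Real timings also depend on I/O, memory layout, and threading, none of which this sketch models.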
I can confirm that adding |
This is to update that |
thanks @zxl124 - that's still really slow 😭 I've been doing some benchmarking here, and also I've been increasingly immersed in the Rust code. There are definitely a few things that I/you/we can try, including -
|
I was testing fastgather with a fecal microbiome sample and noticed that, while fastgather uses much less memory than sourmash gather, it takes much longer to finish. With SRR606249, which was used in #214, I can confirm that fastgather is both faster and more memory-efficient than gather. The difference is that my test sample (see attached signature file) is more complex, i.e. it has a larger signature file than SRR606249.
Test results:
fastgather (branchwater v0.9.3): 8 CPUs, 572 minutes, 19.3G peak RSS.
gather (sourmash v4.8.8): 1 CPU, 170 minutes, 59.8G peak RSS.
Test parameters:
k=51, database file is GTDB-RS207 all genomes at k=51.
commands:
sourmash scripts fastgather fecal_ref.sig k51/gtdb-rs207.genomic.k51.zip -k 51 -o test_fastgather.csv 2> test_fastgather.log
sourmash gather fecal_ref.sig k51/gtdb-rs207.genomic.k51.zip -k 51 -o test_gather.csv > test_gather.log
Please see the signature file and fastgather log.
fecal_ref.sig.zip
test_fastgather.log
Any idea why this happens, and whether there's a quick remedy?