fastgather slower than sourmash gather for more complex samples? #312

zxl124 · 2024-04-27T23:08:22Z

I was testing fastgather with a fecal microbiome sample and noticed that, while fastgather uses much less memory than sourmash gather, it takes much longer to finish. With SRR606249, which was used in #214, I can confirm that fastgather is both faster and more memory-efficient than gather. The difference is that my test sample (see attached signature file) is more complex, i.e. it has a larger signature file than SRR606249.

Test results:
fastgather (branchwater v0.9.3): on 8 cpus, 572 minutes, 19.3G peak rss.
gather (sourmash v4.8.8): on 1 cpu, 170 minutes, 59.8G peak rss.

Test parameters:
k=51, database file is GTDB-RS207 all genomes at k=51.
commands:
sourmash scripts fastgather fecal_ref.sig k51/gtdb-rs207.genomic.k51.zip -k 51 -o test_fastgather.csv 2> test_fastgather.log
sourmash gather fecal_ref.sig k51/gtdb-rs207.genomic.k51.zip -k 51 -o test_gather.csv > test_gather.log

Please see the signature file and fastgather log.
fecal_ref.sig.zip
test_fastgather.log

Any idea why this happens, and whether there's a quick remedy for it?


ctb · 2024-04-29T14:45:53Z

wow, very unexpected. What happens if you do -c 8 with fastgather, to make sure you're using all 8 CPUs?

ctb · 2024-04-29T14:47:18Z

I will say fastgather uses a much dumber algorithm than sourmash gather so that key parts of the loop can be parallelized. I could dimly see a situation where a community with certain types of structure could go faster with sourmash gather, but ... it's hard to understand exactly how!

zxl124 · 2024-04-29T17:00:02Z

> wow, very unexpected. What happens if you do -c 8 with fastgather, to make sure you're using all 8 CPUs?

I will rerun with -c 8 and report back. But given that the user time reported by the time command was 4269 minutes, ~7.5 times the real time, it's likely that fastgather was already using all 8 cpus.
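
As a quick check of that estimate, the arithmetic can be written out. This is only a sketch using the two timings quoted in this thread; the variable names are made up for illustration.

```python
# Timings reported in this thread, in minutes.
real_minutes = 572    # wall-clock time of the fastgather run
user_minutes = 4269   # total CPU time summed over all threads
n_cpus = 8            # cpus available to the job

# Average number of cores kept busy over the whole run.
avg_busy_cores = user_minutes / real_minutes
# Fraction of the available cpus that were in use on average.
utilization = avg_busy_cores / n_cpus

print(f"average busy cores: {avg_busy_cores:.2f}")
print(f"utilization of {n_cpus} cpus: {utilization:.0%}")
```

A ratio close to 8 suggests the run was already spread across all 8 cpus, so an explicit -c 8 would not be expected to change much.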

zxl124 · 2024-04-29T17:05:54Z

> I will say fastgather uses a much dumber algorithm than sourmash gather so that key parts of the loop can be parallelized. I could dimly see a situation where a community with certain types of structure could go faster with sourmash gather, but ... it's hard to understand exactly how!

I don't know how the algorithm behind fastgather works, but I do see that my sample took >1600 iterations, while SRR606249 only took ~80 iterations. One question I had: according to the documentation, sourmash gather runs a prefetch step, but I don't know whether fastgather does that too. I was trying to find a way to feed prefetch results to fastgather, but I don't know how.
If you are interested, this sample is metagenomic sequencing data of this human fecal sample.

ctb · 2024-04-29T18:40:19Z

Ironically, the only thing that fastgather does is a prefetch - but it does it every iteration, because it can be parallelized. So it's running 1600 searches across a large database. And I can see why that would be slow!

This also means that you've got a community with over 1600 distinct matches - so, a pretty complex community.
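
To make the cost concrete, here is a toy Python model of a greedy loop that re-searches the whole database on every iteration, as described above. It is only a sketch under stated assumptions: sketches are modeled as plain Python sets of integers, the names (`greedy_gather_rescan`, `threshold`, `comparisons`) are made up for illustration, and none of this is the actual branchwater code.

```python
def greedy_gather_rescan(query, database, threshold=1):
    """Toy greedy loop: rescan every database sketch on each iteration."""
    remaining = set(query)
    results = []
    comparisons = 0  # number of query-vs-sketch overlap computations
    while remaining:
        best_name, best_overlap = None, set()
        # Full database scan; each comparison is independent of the others,
        # which is what makes this step easy to run in parallel.
        for name, hashes in database.items():
            comparisons += 1
            overlap = remaining & hashes
            if len(overlap) > len(best_overlap):
                best_name, best_overlap = name, overlap
        if len(best_overlap) < threshold:
            break  # nothing left that matches well enough
        results.append((best_name, len(best_overlap)))
        remaining -= best_overlap  # remove the hashes explained by this match
    return results, comparisons


query = set(range(1, 11))
database = {
    "A": {1, 2, 3, 4, 5},
    "B": {4, 5, 6, 7},
    "C": {8, 9},
    "D": {100, 101},  # never overlaps the query, but is rescanned every round
}
results, comparisons = greedy_gather_rescan(query, database)
print(results)
print(comparisons)
```

In this model the work grows as iterations × database size, so a sample needing >1600 iterations does roughly 20 times as many full-database searches as one needing ~80.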

Thanks for the sketch! It will help us benchmark.

There is an option that we've been exploring to combine genome sketches into species-level pangenome sketches, see sourmash-bio/sourmash#2903 (and links therein). It's still research level but my bet is that it would solve this particular problem, if all you're interested in is species level matches. If you're interested in individual strains and/or genomes, you'd need to follow up with more detailed searches, but the first species-level breakdown would be much faster.
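
One simple way to picture the species-level idea is to take the union of the hashes of all genomes assigned to the same species. The Python below is a toy sketch under that assumption, not the method used in sourmash-bio/sourmash#2903; the function name, genome names and species labels are all hypothetical.

```python
from collections import defaultdict


def merge_by_species(genome_sketches, species_of):
    """Union the hash sets of all genomes that share a species label."""
    merged = defaultdict(set)
    for genome, hashes in genome_sketches.items():
        merged[species_of[genome]] |= hashes
    return dict(merged)


# Hypothetical toy data: three genomes, two species.
genome_sketches = {
    "genome1": {1, 2, 3},
    "genome2": {2, 3, 4},
    "genome3": {10, 11},
}
species_of = {"genome1": "s__A", "genome2": "s__A", "genome3": "s__B"}

pangenomes = merge_by_species(genome_sketches, species_of)
print(len(genome_sketches), "genome sketches ->", len(pangenomes), "species sketches")
```

Fewer, larger sketches mean fewer distinct matches to iterate over, at the price of losing strain-level resolution in the first pass.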

zxl124 · 2024-04-30T03:47:17Z

> wow, very unexpected. What happens if you do -c 8 with fastgather, to make sure you're using all 8 CPUs?

I can confirm that adding -c 8 does not change the performance of fastgather.

zxl124 · 2024-08-09T20:55:28Z

An update: fastgather is only slower than gather when the sample and the database are both complex. When I switched to the much smaller "representative genomes" version of GTDB release 214, with the same signature file, gather (v4.8.8) took 25m49s to finish and fastgather (v0.9.5) took 15m4s.

ctb · 2024-08-25T20:06:20Z

thanks @zxl124 - that's still really slow 😭

I've been doing some benchmarking here,

sourmash-bio/sourmash#3232

and also I've been increasingly immersed in the Rust code. There are definitely a few things that I/you/we can try, including -

- using fastmultigather with a RocksDB index (built with sourmash scripts index; fastgather can't use them currently, or at least not efficiently, see "why can't fastgather use rocksdb databases?" #223)
- removing unmatched hashes, see "implement fastgather optimization: remove query hashes that are not matched by anything in the prefetch stage" #178
- implementing the smarter algorithm used in regular sourmash (which I call pygather), but in Rust - I think we actually already have it implemented, but it is not being used anywhere.
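
The last two ideas can be pictured with a toy Python model: search the full database once, drop query hashes that nothing matched (the optimization named in #178), then iterate only over the surviving candidates. This is a sketch under assumptions, with sketches modeled as plain sets of integers and all names (`prefetch`, `gather_with_prefetch`, `comparisons`) made up for illustration; it is not the pygather or branchwater implementation.

```python
def prefetch(query, database, threshold=1):
    """One full scan: keep only sketches that overlap the query enough."""
    candidates = {}
    for name, hashes in database.items():
        overlap = hashes & query
        if len(overlap) >= threshold:
            candidates[name] = overlap  # store only the shared hashes
    return candidates


def gather_with_prefetch(query, database, threshold=1):
    """Toy greedy loop that scans the database once, then only candidates."""
    candidates = prefetch(set(query), database, threshold)
    comparisons = len(database)  # cost of the single full scan
    # Drop query hashes that no candidate matches; they can never be explained.
    remaining = set(query) & set().union(*candidates.values())
    results = []
    while remaining and candidates:
        comparisons += len(candidates)
        best = max(candidates, key=lambda n: len(candidates[n] & remaining))
        overlap = candidates[best] & remaining
        if len(overlap) < threshold:
            break
        results.append((best, len(overlap)))
        remaining -= overlap
        del candidates[best]  # its remaining overlap is now empty
    return results, comparisons


query = set(range(1, 11))
database = {
    "A": {1, 2, 3, 4, 5},
    "B": {4, 5, 6, 7},
    "C": {8, 9},
    "D": {100, 101},  # filtered out by the single prefetch
}
results, comparisons = gather_with_prefetch(query, database)
print(results)
print(comparisons)
```

In this model each iteration after the first scan touches only the candidates, so the per-iteration cost depends on the number of matching sketches rather than on the size of the whole database.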

ctb mentioned this issue Apr 29, 2024

fastgather is faster than fastmultigather in loading the database #268

Open
