
Improvement of multidog #12

Open
alethere opened this issue Oct 15, 2020 · 2 comments

@alethere

Hi, I've been using updog to genotype GBS data for a whole genome as part of my PhD thesis, and thus I have very big datasets (normally on the order of 300K variants, though one of them happened to be 1.8M variants). When I started genotyping, multidog was not yet implemented, so I wrote my own parallel implementation of flexdog.

Initially I had major issues with memory and time efficiency on my computer cluster, which I managed to solve with some nifty tricks. Now I saw that the new multidog is available, and I've been doing some tests (not done yet) in which my approach seems to be almost 1000x more memory efficient and 40x faster than the current multidog implementation. According to profvis, running 5k markers on 25 cores took 11332 MB and 28520 ms with multidog, versus 11.6 MB and 730 ms with my implementation. I attach the profvis object for you to check (to view it, unzip the file, load it into R with result <- readRDS(), and then, after loading the profvis library, use print(result) and select the "Data" tab, not the "Flame Graph" tab).
updog_test_profiling.zip
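
For reference, viewing the attached profile would look roughly like this (the .rds filename after unzipping is a placeholder here, not necessarily the actual name in the attachment):

# Sketch of the loading steps described above; the filename is hypothetical.
library(profvis)
result <- readRDS("updog_test_profiling.rds")  # replace with the actual file inside the zip
print(result)  # then select the "Data" tab rather than the "Flame Graph" tab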

I'd be happy to collaborate on the code, but it would imply a major rewrite of how multidog works. Is that okay? Should I just submit a pull request? (I haven't used GitHub much, so a bit of guidance would be helpful.)

Cheers,
Alejandro Thérèse Navarro

@dcgerard
Owner

This looks really cool!

If I am reading the profiling correctly, it seems most of the memory allocation is happening on the following lines in multidog()?

inddf <- do.call("rbind", lapply(outlist, function(x) x[[1]]))
snpdf <- do.call("rbind", lapply(outlist, function(x) x[[2]]))

In which case maybe data.table::rbindlist() would be a quick way to correct the memory issues? Not sure how much this would help with speed.
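
For example, a rough sketch of that swap (untested, and assuming each element of outlist is a two-element list as in the lines quoted above):

# Sketch only: swap do.call("rbind", ...) for data.table::rbindlist(),
# converting back to plain data frames to keep the current output format.
library(data.table)
inddf <- as.data.frame(rbindlist(lapply(outlist, function(x) x[[1]])))
snpdf <- as.data.frame(rbindlist(lapply(outlist, function(x) x[[2]])))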

I would love to see your solution! If you want to give GitHub a try, then you could fork the updog repo, modify the repo, and submit a pull request: https://guides.github.com/activities/forking/

I would suggest just adding a file with your functions, rather than modifying multidog(), so we can add a unit test that they produce the same output. We can later change the current version of multidog() to something like multidog_old(). A lot of functions depend on the output being formatted the way multidog() returns it (a list with two data frames, inddf and snpdf), so I would like to keep the output format the same.
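
As a sketch of what such a test could look like (multidog_new() is a placeholder name for your implementation; the data and arguments follow the usual uitdewilligen example from the updog documentation):

# testthat-style sketch; multidog_new() is hypothetical and stands in for
# the contributed parallel implementation.
library(testthat)
library(updog)

test_that("new implementation matches multidog()", {
  data("uitdewilligen")
  refmat <- t(uitdewilligen$refmat)
  sizemat <- t(uitdewilligen$sizemat)
  fit_old <- multidog(refmat = refmat, sizemat = sizemat,
                      ploidy = uitdewilligen$ploidy, model = "norm", nc = 2)
  fit_new <- multidog_new(refmat = refmat, sizemat = sizemat,
                          ploidy = uitdewilligen$ploidy, model = "norm", nc = 2)
  expect_equal(fit_old$snpdf, fit_new$snpdf)
  expect_equal(fit_old$inddf, fit_new$inddf)
})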

If you prefer collaborating by email, that's fine too!

Thanks!
David

@alethere
Author

Nice, I'll prepare a pull request then. My code is adapted to my own use at the moment, so I need to generalize it before adding it, but that shouldn't take too long.

Indeed, it seems that it's the list modification, rather than the parallel loop itself, that's taking up a lot of the memory and time. I have some ideas on how to handle these issues, at least for very large datasets. I'll send you an e-mail with the ideas; let's see what you think.

Cheers,
Alejandro

dcgerard added a commit that referenced this issue Jan 20, 2021
- I added a .combine function for foreach(). We will see if this improves the profiling.
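
Conceptually, that change is along these lines (a sketch, not the committed code; fit_one_snp() is a placeholder for the per-SNP work multidog() does with flexdog()):

# Sketch of combining results as foreach() goes, instead of building a
# list and rbind-ing afterwards. fit_one_snp() is hypothetical.
library(foreach)
library(doParallel)

combine_fits <- function(x, y) {
  list(inddf = rbind(x$inddf, y$inddf),
       snpdf = rbind(x$snpdf, y$snpdf))
}

cl <- parallel::makeCluster(2)
registerDoParallel(cl)
out <- foreach(i = seq_len(nrow(refmat)), .combine = combine_fits) %dopar% {
  # each iteration returns list(inddf = per-individual rows, snpdf = one-row summary)
  fit_one_snp(refmat[i, ], sizemat[i, ])
}
parallel::stopCluster(cl)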
dcgerard added a commit that referenced this issue Jan 20, 2021
- I now use the iterators package to send only parts of the data to each process, rather than sending the whole dataset.
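
Roughly, the idea looks like this (a sketch only; the chunk size and fit_chunk() are placeholders, and combine_fits() is the helper from the sketch above):

# Sketch of chunking with the iterators package so each worker only
# receives its slice of the count matrices. fit_chunk() is hypothetical.
library(foreach)
library(iterators)

chunksize <- 100
out <- foreach(ref_chunk  = iter(refmat,  by = "row", chunksize = chunksize),
               size_chunk = iter(sizemat, by = "row", chunksize = chunksize),
               .combine = combine_fits) %dopar% {
  fit_chunk(ref_chunk, size_chunk)  # returns list(inddf = ..., snpdf = ...)
}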