
Improvement of multidog #12

Open
alethere opened this issue Oct 15, 2020 · 2 comments

@alethere

Hi, I've been using updog to genotype GBS data for a whole genome as part of my PhD thesis, and thus I have very big datasets (normally on the order of 300K variants, though one of them happened to be 1.8M variants). When I started genotyping, multidog was not yet implemented, so I wrote my own parallel implementation of flexdog.

Initially I had major issues with memory and time efficiency on my computer cluster, which I managed to solve with some nifty tricks. Now I saw that the new multidog is available, and I've been doing some tests (not done yet) in which my approach seems to be almost 1000x more memory efficient and 40x faster than the current multidog implementation. According to profvis, running 5k markers on 25 cores took 11332 MB and 28520 ms with multidog, versus 11.6 MB and 730 ms with my implementation. I attach the profvis object for you to check (to view it, unzip the file, load it into R with result <- readRDS(), and then, after loading the profvis library, use print(result) and select the "Data" tab, not the "Flame Graph" tab).
updog_test_profiling.zip
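
For reference, viewing the attached profile would look roughly like this (the .rds filename after unzipping is a placeholder here, not necessarily the actual name in the attachment):

# Sketch of the loading steps described above; the filename is hypothetical.
library(profvis)
result <- readRDS("updog_test_profiling.rds")  # replace with the actual file inside the zip
print(result)  # then select the "Data" tab rather than the "Flame Graph" tab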

I'd be happy to collaborate on the code, but it would imply a major rewrite of how multidog works. Is that okay? Should I just submit a pull request? (I haven't used GitHub much, so a bit of guidance would be helpful.)

Cheers,
Alejandro Thérèse Navarro

@dcgerard
Owner

This looks really cool!

If I am reading the profiling correctly, it seems most of the memory allocation is happening on the following lines in multidog()?

inddf <- do.call("rbind", lapply(outlist, function(x) x[[1]]))
snpdf <- do.call("rbind", lapply(outlist, function(x) x[[2]]))

In which case maybe data.table::rbindlist() would be a quick way to correct the memory issues? Not sure how much this would help with speed.
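
For example, a rough sketch of that swap (untested, and assuming each element of outlist is a two-element list as in the lines quoted above):

# Sketch only: swap do.call("rbind", ...) for data.table::rbindlist(),
# converting back to plain data frames to keep the current output format.
library(data.table)
inddf <- as.data.frame(rbindlist(lapply(outlist, function(x) x[[1]])))
snpdf <- as.data.frame(rbindlist(lapply(outlist, function(x) x[[2]])))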

I would love to see your solution! If you want to give GitHub a try, then you could fork the updog repo, modify the repo, and submit a pull request: https://guides.github.com/activities/forking/

I would suggest just adding a file with your functions, rather than modifying multidog(), so we can add a unit test that they produce the same output. We can later change the current version of multidog() to something like multidog_old(). A lot of functions depend on the output being formatted the way multidog() returns it (a list with two data frames, inddf and snpdf), so I would like to keep the output format the same.
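
As a sketch of what such a test could look like (multidog_new() is a placeholder name for your implementation; the data and arguments follow the usual uitdewilligen example from the updog documentation):

# testthat-style sketch; multidog_new() is hypothetical and stands in for
# the contributed parallel implementation.
library(testthat)
library(updog)

test_that("new implementation matches multidog()", {
  data("uitdewilligen")
  refmat <- t(uitdewilligen$refmat)
  sizemat <- t(uitdewilligen$sizemat)
  fit_old <- multidog(refmat = refmat, sizemat = sizemat,
                      ploidy = uitdewilligen$ploidy, model = "norm", nc = 2)
  fit_new <- multidog_new(refmat = refmat, sizemat = sizemat,
                          ploidy = uitdewilligen$ploidy, model = "norm", nc = 2)
  expect_equal(fit_old$snpdf, fit_new$snpdf)
  expect_equal(fit_old$inddf, fit_new$inddf)
})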

If you prefer collaborating by email, that's fine too!

Thanks!
David

@alethere
Author

Nice, I'll prepare a pull request then. My code is adapted to my own use at the moment, so I need to generalize it before adding it, but that shouldn't take too long.

Indeed, it seems that it's the list modification, rather than the parallel loop itself, that's taking up a lot of the memory and time. I have some ideas on how to handle these issues, at least for very large datasets. I'll send you an e-mail with the ideas; let's see what you think.

Cheers,
Alejandro

dcgerard added a commit that referenced this issue Jan 20, 2021
- I added a .combine function for foreach(). We will see if this improves the profiling.
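
Conceptually, that change is along these lines (a sketch, not the committed code; fit_one_snp() is a placeholder for the per-SNP work multidog() does with flexdog()):

# Sketch of combining results as foreach() goes, instead of building a
# list and rbind-ing afterwards. fit_one_snp() is hypothetical.
library(foreach)
library(doParallel)

combine_fits <- function(x, y) {
  list(inddf = rbind(x$inddf, y$inddf),
       snpdf = rbind(x$snpdf, y$snpdf))
}

cl <- parallel::makeCluster(2)
registerDoParallel(cl)
out <- foreach(i = seq_len(nrow(refmat)), .combine = combine_fits) %dopar% {
  # each iteration returns list(inddf = per-individual rows, snpdf = one-row summary)
  fit_one_snp(refmat[i, ], sizemat[i, ])
}
parallel::stopCluster(cl)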
dcgerard added a commit that referenced this issue Jan 20, 2021
- I now use the iterators package to send only parts of the data to each process, rather than sending the whole dataset.
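
Roughly, the idea looks like this (a sketch only; the chunk size and fit_chunk() are placeholders, and combine_fits() is the helper from the sketch above):

# Sketch of chunking with the iterators package so each worker only
# receives its slice of the count matrices. fit_chunk() is hypothetical.
library(foreach)
library(iterators)

chunksize <- 100
out <- foreach(ref_chunk  = iter(refmat,  by = "row", chunksize = chunksize),
               size_chunk = iter(sizemat, by = "row", chunksize = chunksize),
               .combine = combine_fits) %dopar% {
  fit_chunk(ref_chunk, size_chunk)  # returns list(inddf = ..., snpdf = ...)
}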