Improvement of multidog #12
This looks really cool! If I am reading the profiling correctly, it seems most of the memory issues occur during the following lines:

inddf <- do.call("rbind", lapply(outlist, function(x) x[[1]]))
snpdf <- do.call("rbind", lapply(outlist, function(x) x[[2]]))

which appear at line 364 in commit 6a23e6e. In that case, I would love to see your solution! If you want to give GitHub a try, you could fork the updog repo, modify it, and submit a pull request: https://guides.github.com/activities/forking/ I would suggest just adding a file with your functions rather than modifying the existing code. If you prefer collaborating by email, that's fine too! Thanks!
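For reference, the fork-and-pull-request workflow suggested above might look roughly like the following shell session. The username placeholder, branch name, and file name are all illustrative, not part of this thread:

```shell
# Fork the updog repository on GitHub first, then clone your fork
# (YOUR-USERNAME is a placeholder for your GitHub account name):
git clone https://github.com/YOUR-USERNAME/updog.git
cd updog

# Work on a topic branch (name is hypothetical)
git checkout -b faster-multidog

# Add a new file with your functions rather than editing existing ones,
# e.g. R/multidog_fast.R (file name is illustrative)
git add R/multidog_fast.R
git commit -m "Add memory-efficient parallel genotyping functions"
git push origin faster-multidog
# Finally, open a pull request from your fork's page on GitHub.
```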
Nice, I'll prepare a pull request then. My code is adapted to my own use at the moment, so I need to generalize it before adding it, but that shouldn't take too long. Indeed, it seems that it's the list modification, rather than the parallel loop itself, that's taking up a lot of memory and time. I have some ideas on how to handle these issues, at least for very large datasets. I'll send you an e-mail with the ideas; let's see what you think. Cheers,
- I added a .combine function for foreach(). We will see if this improves the profiling.
- I now use the iterators package to send only parts of the data to each process, rather than sending the whole dataset.
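The two changes above can be sketched together in a minimal R example. The toy matrix and the per-chunk computation are placeholders (in updog the real work would be a call to flexdog() per SNP); the point is the combination of iter() chunking with a .combine function:

```r
library(foreach)
library(doParallel)
library(iterators)

cl <- makeCluster(2)
registerDoParallel(cl)

# Toy read-count matrix standing in for real GBS data (hypothetical):
# 1000 SNPs in rows, 4 individuals in columns.
refmat <- matrix(rpois(1000 * 4, lambda = 10), nrow = 1000)

# iter() hands each worker a 100-row chunk instead of the whole matrix,
# and .combine = rbind stitches per-chunk results together as they
# arrive, avoiding a big intermediate list followed by do.call("rbind", ...).
snpdf <- foreach(chunk = iter(refmat, by = "row", chunksize = 100),
                 .combine = rbind) %dopar% {
  # Placeholder per-SNP computation; in updog this is where
  # flexdog() would be applied to each row of the chunk.
  data.frame(mean_depth = rowMeans(chunk))
}

stopCluster(cl)
nrow(snpdf)  # 1000 rows, one per SNP
```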
Hi, I've been using updog to genotype GBS data for a whole genome as part of my PhD thesis, so I have very big datasets (normally on the order of 300K variants, though one of them reached 1.8M variants). When I started genotyping, multidog was not yet implemented, so I wrote my own parallel implementation of flexdog.
Initially I had major memory- and time-efficiency issues on my computer cluster, which I managed to solve with some nifty tricks. Now that the new multidog is available, I've been running some tests (not done yet) in which my approach appears to be almost 1000x more memory efficient and 40x faster than the current multidog implementation. According to profvis readings, running 5K markers on 25 cores, multidog took 11,332 MB and 28,520 ms, whereas my implementation took 11.6 MB and 730 ms. I attach the profvis object for you to check: unzip the file, load it into R with result <- readRDS(), then, after loading the profvis library, use print(result) and select the "Data" tab rather than the "Flame Graph" tab. updog_test_profiling.zip
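For context, a profile like the attached one could be produced along these lines. The multidog() arguments shown (refmat, sizemat, ploidy, model, nc) mirror its documented interface, but the input objects and the saved file name are illustrative, not taken from this thread:

```r
library(profvis)
library(updog)

# refmat / sizemat: reference and total read-count matrices,
# assumed to be already loaded into the session.
p <- profvis({
  mout <- multidog(refmat = refmat, sizemat = sizemat,
                   ploidy = 4, model = "norm", nc = 25)
})

saveRDS(p, "updog_test_profiling.rds")  # illustrative file name

# Later, to inspect the saved profile:
# result <- readRDS("updog_test_profiling.rds")
# print(result)  # then choose the "Data" tab
```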
I'd be happy to collaborate on the code, but it would imply a major rewrite of how multidog works. Is that okay? Should I just open a pull request? (I haven't used GitHub much, so a bit of guidance would be helpful.)
Cheers,
Alejandro Thérèse Navarro