-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Additional speed improvements for 'agg' from profiling #56
Additional speed improvements for 'agg' from profiling #56
Conversation
* 'paste0' is slow on the full input 'dt' * switch to generating interval name for each unique interval, then quick data.table join with original 'dt'
I used the profvis R package to identify which parts of Under the hood it uses "uses data collected by Rprof, which is part of the base R distribution. At each time interval (profvis uses a default interval of 10ms), the profiler stops the R interpreter, looks at the current function call stack, and records it to a file. Because it works by sampling, the result isn’t deterministic. Each time you profile your code, the result will be slightly different."
|
all = T, | ||
by = diagnostic_id_cols | ||
) | ||
diagnostic_dt <- dt[expected_dt, on = diagnostic_id_cols] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This section checks that all expected rows in dt
exist. profvis
identified this merge as taking about ~19 seconds for the largest aggregation to all-ages. Switching to the data.table join reduced it to about ~8 seconds.
This speed difference is explained in an upcoming data.table vignette https://github.com/zeomal/data.table/blob/patch-1/vignettes/datatable-joins.Rmd#L293
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need to be done by merge?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is equivalent to the merge
function. So this is the same as diagnostic_dt <- merge(dt, expected_dt, by = diagnostic_id_cols, all.y = TRUE)
Just below the speed section in that vignette
Base R's `merge` can be used for all the traditional left, right, inner etc. type of joins - they will not be as efficient as `data.table`'s `[` operator. Below is a table summary of the syntax comparisons using `data.table`'s `[` and R's `merge` (assuming only one key, `key` for simplicity):
| Join | `data.table` | `merge` |
|-------------------|:----------------------------------|:----------------------------------------|
| Left | `Y[X, on = "key"]` | `merge(X, Y, by = "key", all.x = TRUE)` |
| Right | `X[Y, on = "key"]` | `merge(X, Y, by = "key", all.y = FALSE)` |
| Inner `X` on `Y` | `X[Y, on = "key", nomatch = NULL]` | `merge(X, Y, by = "key")` |
| Full | (check above) | `merge(X, Y, by = "key", all = TRUE)` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll include a reference to this in the data.table hub page
all = T, | ||
by = diagnostic_id_cols | ||
) | ||
diagnostic_dt <- dt[expected_dt, on = diagnostic_id_cols] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need to be done by merge?
In the PR description, maybe include the speed vignette output before the changes as well as after the changes? Not crucial here, but could be useful if you create similar PRs in the future. |
Added a section to the packageTemplate wiki now https://github.com/ihmeuw-demographics/packageTemplate/wiki/Profiling-R-Functions |
Describe changes
I used R profiling tools to speed up the aggregation function when working with a large dataset. Included examples as comments to show how I profiled.
Here are the timings from the performance vignette with these changes #55
What issues are related
Related to #47
Checklist
Packages Repositories
ihmeuw-demographics
R packages?devtools::check()
locally?devtools::document()
?ihmeuw-demographics
code style?docker-base
ordocker-internal
? If so follow directions in those repositories to rebuild and redeploy the images.