
Additional speed improvements for 'agg' from profiling #56

Merged

chacalle merged 5 commits into feature/vignette_performance from feature/profiling_improvements on Dec 10, 2020

Conversation

@chacalle (Collaborator) commented Dec 9, 2020

Describe changes

I used R profiling tools to speed up the aggregation function when working with a large dataset, and included examples in the comments below to show how I profiled.

Here are the timings from the performance vignette (#55) with these changes:

> agg_timings
    col_stem    col_type n_draws         method n_input_rows user.self sys.self elapsed
 1:      age    interval       1     data.table       13,632     0.081    0.005   0.088
 2:      age    interval       1 hierarchyUtils       13,632     4.001    0.213   4.269
 3:      age    interval      10     data.table      136,320     0.155    0.025   0.183
 4:      age    interval      10 hierarchyUtils      136,320     3.310    0.192   3.526
 5:      age    interval     100     data.table    1,363,200     0.739    0.235   0.989
 6:      age    interval     100 hierarchyUtils    1,363,200     9.141    1.084  10.406
 7:      age    interval    1000     data.table   13,632,000     6.993    2.573   9.762
 8:      age    interval    1000 hierarchyUtils   13,632,000    55.354    8.871  65.327
 9:      sex categorical       1     data.table       13,632     0.009    0.001   0.009
10:      sex categorical       1 hierarchyUtils       13,632     0.086    0.007   0.092
11:      sex categorical      10     data.table      136,320     0.063    0.009   0.073
12:      sex categorical      10 hierarchyUtils      136,320     0.759    0.048   0.817
13:      sex categorical     100     data.table    1,363,200     0.605    0.082   0.700
14:      sex categorical     100 hierarchyUtils    1,363,200     6.453    0.558   7.169
15:      sex categorical    1000     data.table   13,632,000     5.692    1.021   6.987
16:      sex categorical    1000 hierarchyUtils   13,632,000    57.967    5.711  65.056

What issues are related

Related to #47

Checklist


  • Have you read the contributing guidelines for ihmeuw-demographics R packages?
  • Have you successfully run devtools::check() locally?
  • Have you updated or added function (and vignette if applicable) documentation? Did you update the 'man' and 'NAMESPACE' files with devtools::document()?
  • Have you added in tests for the changes included in the PR?
  • Do the changes follow the ihmeuw-demographics code style?
  • Do the changes need to be immediately included in a new build of docker-base or docker-internal? If so follow directions in those repositories to rebuild and redeploy the images.
  • Do the changes require updates to other repositories which use this package? If yes, make the necessary updates in those repos, and consider integration tests for those repositories.

@chacalle added the enhancement label Dec 9, 2020
@chacalle self-assigned this Dec 9, 2020
@chacalle (Collaborator, Author) commented Dec 9, 2020

I used the profvis R package to identify which parts of hierarchyUtils::agg were taking up the most time.

Under the hood it uses data collected by Rprof, which is part of the base R distribution: "At each time interval (profvis uses a default interval of 10ms), the profiler stops the R interpreter, looks at the current function call stack, and records it to a file. Because it works by sampling, the result isn’t deterministic. Each time you profile your code, the result will be slightly different."
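For reference, a minimal sketch of the underlying base R workflow (without the profvis front end); slow_function() here is just a placeholder for whatever code is being profiled:

# start the sampling profiler, writing call-stack samples to a file every 10 ms
Rprof("profile.out", interval = 0.01)
slow_function()   # placeholder for the code being profiled
Rprof(NULL)       # stop profiling

# summarise time attributed to each function by self time
summaryRprof("profile.out")$by.self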

profvis returns an interactive graphical interface that shows how much time was spent on each line and links back to the exact lines of code in the package. For example, below I can see that roughly 18 seconds of the profiling period were spent within check_agg_scale_subtree_dt. This lets me pick exact spots to speed up.

[Screenshot: profvis output showing time spent in check_agg_scale_subtree_dt]

library(hierarchyUtils)
library(data.table)
library(profvis)

n_draws <- 1000

# default variables for aggregation timings
age_mapping <- data.table(age_start = c(0, seq(0, 90, 5)), age_end = c(Inf, seq(5, 95, 5)))
sex_mapping <- data.table(parent = "all", child = c("male", "female"))
agg_id_vars <- list(
  location = 1,
  year_start = seq(1950, 2020, 1),
  sex = c("male", "female"),
  age_start = seq(0, 95, 1),
  value1 = 1, value2 = 1
)

# create input dataset
agg_id_vars <- copy(agg_id_vars)
agg_id_vars[["draw"]] <- 1:n_draws
input_dt <- do.call(CJ, agg_id_vars)

# add interval end columns
input_dt[, year_end := year_start + 1]
input_dt[, age_end := age_start + 1]
input_dt[age_start == 95, age_end := Inf]

# identify value and id cols
value_cols <- grep("value", names(input_dt), value = TRUE)
id_cols <- names(input_dt)[!names(input_dt) %in% value_cols]

profvis::profvis(
  expr = {
    hierarchyUtils_output_dt <- agg(
      dt = input_dt,
      id_cols = id_cols, value_cols = value_cols,
      col_stem = "age", col_type = "interval",
      mapping = age_mapping
    )
  },
  interval = 0.005
)
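
(A note on the interval = 0.005 argument above: it samples the call stack every 5 ms instead of profvis's 10 ms default, which gives a somewhat finer-grained picture for a long-running call like this at the cost of a larger profile.)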

R/agg_scale.R
all = T,
by = diagnostic_id_cols
)
diagnostic_dt <- dt[expected_dt, on = diagnostic_id_cols]
@chacalle (Collaborator, Author) commented Dec 9, 2020
This section checks that all expected rows in dt exist. profvis identified this merge as taking roughly 19 seconds for the largest aggregation to all-ages. Switching to the data.table join reduced it to roughly 8 seconds.

This speed difference is explained in an upcoming data.table vignette https://github.com/zeomal/data.table/blob/patch-1/vignettes/datatable-joins.Rmd#L293
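
For context, a rough sketch of how the two forms could be compared on a large toy table (hypothetical data and column names, not the package's actual diagnostic check):

library(data.table)

# hypothetical toy data: every (id1, id2) combination is expected,
# but a handful of observed rows are missing from dt
expected_dt <- CJ(id1 = 1:5000, id2 = 1:2000)              # 10 million expected rows
dt <- expected_dt[-sample(.N, 100)][, value := rnorm(.N)]  # observed rows
id_cols <- c("id1", "id2")

# base-R-style merge (analogous to the step profvis flagged as slow)
system.time(merge_dt <- merge(dt, expected_dt, by = id_cols, all.y = TRUE))

# data.table's `[` join used in this PR; expected rows with no match get NA
system.time(join_dt <- dt[expected_dt, on = id_cols])

# the expected-but-missing combinations show up with an NA value
join_dt[is.na(value)]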

Collaborator:

Does this need to be done by merge?

@chacalle (Collaborator, Author):

This is equivalent to the merge function, so it is the same as `diagnostic_dt <- merge(dt, expected_dt, by = diagnostic_id_cols, all.y = TRUE)`.

From just below the speed section in that vignette:

Base R's `merge` can be used for all the traditional left, right, inner etc. type of joins - they will not be as efficient as `data.table`'s `[` operator. Below is a table summary of the syntax comparisons using `data.table`'s `[` and R's `merge` (assuming only one key, `key` for simplicity):

| Join              | `data.table`                       | `merge`                                  |
|-------------------|:----------------------------------|:----------------------------------------|
| Left  | `Y[X, on = "key"]`                 | `merge(X, Y, by = "key", all.x = TRUE)`  |
| Right  | `X[Y, on = "key"]`                 | `merge(X, Y, by = "key", all.y = TRUE)`  |
| Inner `X` on `Y` | `X[Y, on = "key", nomatch = NULL]` | `merge(X, Y, by = "key")`                |
| Full  | (check above)                                | `merge(X, Y, by = "key", all = TRUE)`    |
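
For illustration, a quick sketch (hypothetical toy tables) verifying the right-join row of that table:

library(data.table)

X <- data.table(key = c("a", "b", "c"), x = 1:3)
Y <- data.table(key = c("b", "c", "d"), y = 4:6)

join_dt  <- X[Y, on = "key"]                       # keeps every row of Y
merge_dt <- merge(X, Y, by = "key", all.y = TRUE)  # same result via merge

# identical contents, ignoring row order
fsetequal(join_dt, merge_dt)
#> [1] TRUE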

@chacalle (Collaborator, Author):

I'll include a reference to this in the data.table hub page

@chacalle requested a review from krpaulson, December 9, 2020

@krpaulson (Collaborator) commented

In the PR description, maybe include the speed vignette output before the changes as well as after the changes? Not crucial here, but could be useful if you create similar PRs in the future.

@chacalle (Collaborator, Author) commented

I've now added a section to the packageTemplate wiki: https://github.com/ihmeuw-demographics/packageTemplate/wiki/Profiling-R-Functions

@chacalle merged commit 4902f23 into feature/vignette_performance Dec 10, 2020
@chacalle deleted the feature/profiling_improvements branch December 10, 2020