
Additional speed improvements for 'agg' from profiling #56

Merged

chacalle merged 5 commits into feature/vignette_performance from feature/profiling_improvements on Dec 10, 2020

Conversation

@chacalle (Collaborator) commented Dec 9, 2020

Describe changes

I used R profiling tools to speed up the aggregation function when working with a large dataset, and included examples in the comments below to show how I profiled.

Here are the timings from the performance vignette (#55) with these changes:

> agg_timings
    col_stem    col_type n_draws         method n_input_rows user.self sys.self elapsed
 1:      age    interval       1     data.table       13,632     0.081    0.005   0.088
 2:      age    interval       1 hierarchyUtils       13,632     4.001    0.213   4.269
 3:      age    interval      10     data.table      136,320     0.155    0.025   0.183
 4:      age    interval      10 hierarchyUtils      136,320     3.310    0.192   3.526
 5:      age    interval     100     data.table    1,363,200     0.739    0.235   0.989
 6:      age    interval     100 hierarchyUtils    1,363,200     9.141    1.084  10.406
 7:      age    interval    1000     data.table   13,632,000     6.993    2.573   9.762
 8:      age    interval    1000 hierarchyUtils   13,632,000    55.354    8.871  65.327
 9:      sex categorical       1     data.table       13,632     0.009    0.001   0.009
10:      sex categorical       1 hierarchyUtils       13,632     0.086    0.007   0.092
11:      sex categorical      10     data.table      136,320     0.063    0.009   0.073
12:      sex categorical      10 hierarchyUtils      136,320     0.759    0.048   0.817
13:      sex categorical     100     data.table    1,363,200     0.605    0.082   0.700
14:      sex categorical     100 hierarchyUtils    1,363,200     6.453    0.558   7.169
15:      sex categorical    1000     data.table   13,632,000     5.692    1.021   6.987
16:      sex categorical    1000 hierarchyUtils   13,632,000    57.967    5.711  65.056

What issues are related

Related to #47

Checklist


  • Have you read the contributing guidelines for ihmeuw-demographics R packages?
  • Have you successfully run devtools::check() locally?
  • Have you updated or added function (and vignette if applicable) documentation? Did you update the 'man' and 'NAMESPACE' files with devtools::document()?
  • Have you added in tests for the changes included in the PR?
  • Do the changes follow the ihmeuw-demographics code style?
  • Do the changes need to be immediately included in a new build of docker-base or docker-internal? If so follow directions in those repositories to rebuild and redeploy the images.
  • Do the changes require updates to other repositories which use this package? If yes, make the necessary updates in those repos, and consider integration tests for those repositories.

@chacalle added the enhancement label Dec 9, 2020
@chacalle self-assigned this Dec 9, 2020
@chacalle (Collaborator, Author) commented Dec 9, 2020

I used the profvis R package to identify which parts of hierarchyUtils::agg were taking up the most time.

Under the hood it uses data collected by Rprof, which is part of the base R distribution: "At each time interval (profvis uses a default interval of 10ms), the profiler stops the R interpreter, looks at the current function call stack, and records it to a file. Because it works by sampling, the result isn’t deterministic. Each time you profile your code, the result will be slightly different."
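For reference, a minimal sketch of the underlying base R workflow (without the profvis front end); slow_function() here is just a placeholder for whatever code is being profiled:

# start the sampling profiler, writing call-stack samples to a file every 10 ms
Rprof("profile.out", interval = 0.01)
slow_function()   # placeholder for the code being profiled
Rprof(NULL)       # stop profiling

# summarise time attributed to each function by self time
summaryRprof("profile.out")$by.self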

profvis returns an interactive graphical interface that shows how much time was spent on each line and links back to the exact lines of code in the package. For example, below I can see that roughly 18 seconds of the profiling period were spent within check_agg_scale_subtree_dt. This lets me pick exact spots to speed up.

[Screenshot: profvis output showing time spent in check_agg_scale_subtree_dt]

library(hierarchyUtils)
library(data.table)
library(profvis)

n_draws <- 1000

# default variables for aggregation timings
age_mapping <- data.table(age_start = c(0, seq(0, 90, 5)), age_end = c(Inf, seq(5, 95, 5)))
sex_mapping <- data.table(parent = "all", child = c("male", "female"))
agg_id_vars <- list(
  location = 1,
  year_start = seq(1950, 2020, 1),
  sex = c("male", "female"),
  age_start = seq(0, 95, 1),
  value1 = 1, value2 = 1
)

# create input dataset
agg_id_vars <- copy(agg_id_vars)
agg_id_vars[["draw"]] <- 1:n_draws
input_dt <- do.call(CJ, agg_id_vars)

# add interval end columns
input_dt[, year_end := year_start + 1]
input_dt[, age_end := age_start + 1]
input_dt[age_start == 95, age_end := Inf]

# identify value and id cols
value_cols <- grep("value", names(input_dt), value = TRUE)
id_cols <- names(input_dt)[!names(input_dt) %in% value_cols]

profvis::profvis(
  expr = {
    hierarchyUtils_output_dt <- agg(
      dt = input_dt,
      id_cols = id_cols, value_cols = value_cols,
      col_stem = "age", col_type = "interval",
      mapping = age_mapping
    )
  },
  interval = 0.005
)
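
(A note on the interval = 0.005 argument above: it samples the call stack every 5 ms instead of profvis's 10 ms default, which gives a somewhat finer-grained picture for a long-running call like this at the cost of a larger profile.)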

R/agg_scale.R
all = T,
by = diagnostic_id_cols
)
diagnostic_dt <- dt[expected_dt, on = diagnostic_id_cols]
@chacalle (Collaborator, Author) commented Dec 9, 2020
This section checks that all expected rows in dt exist. profvis identified this merge as taking roughly 19 seconds for the largest aggregation to all-ages. Switching to the data.table join reduced it to roughly 8 seconds.

This speed difference is explained in an upcoming data.table vignette https://github.com/zeomal/data.table/blob/patch-1/vignettes/datatable-joins.Rmd#L293
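
For context, a rough sketch of how the two forms could be compared on a large toy table (hypothetical data and column names, not the package's actual diagnostic check):

library(data.table)

# hypothetical toy data: every (id1, id2) combination is expected,
# but a handful of observed rows are missing from dt
expected_dt <- CJ(id1 = 1:5000, id2 = 1:2000)              # 10 million expected rows
dt <- expected_dt[-sample(.N, 100)][, value := rnorm(.N)]  # observed rows
id_cols <- c("id1", "id2")

# base-R-style merge (analogous to the step profvis flagged as slow)
system.time(merge_dt <- merge(dt, expected_dt, by = id_cols, all.y = TRUE))

# data.table's `[` join used in this PR; expected rows with no match get NA
system.time(join_dt <- dt[expected_dt, on = id_cols])

# the expected-but-missing combinations show up with an NA value
join_dt[is.na(value)]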

Collaborator:

Does this need to be done by merge?

@chacalle (Collaborator, Author):

This is equivalent to the merge function, so it is the same as `diagnostic_dt <- merge(dt, expected_dt, by = diagnostic_id_cols, all.y = TRUE)`.

From just below the speed section in that vignette:

Base R's `merge` can be used for all the traditional left, right, inner etc. type of joins - they will not be as efficient as `data.table`'s `[` operator. Below is a table summary of the syntax comparisons using `data.table`'s `[` and R's `merge` (assuming only one key, `key` for simplicity):

| Join              | `data.table`                       | `merge`                                  |
|-------------------|:----------------------------------|:----------------------------------------|
| Left  | `Y[X, on = "key"]`                 | `merge(X, Y, by = "key", all.x = TRUE)`  |
| Right  | `X[Y, on = "key"]`                 | `merge(X, Y, by = "key", all.y = TRUE)`  |
| Inner `X` on `Y` | `X[Y, on = "key", nomatch = NULL]` | `merge(X, Y, by = "key")`                |
| Full  | (check above)                                | `merge(X, Y, by = "key", all = TRUE)`    |
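
For illustration, a quick sketch (hypothetical toy tables) verifying the right-join row of that table:

library(data.table)

X <- data.table(key = c("a", "b", "c"), x = 1:3)
Y <- data.table(key = c("b", "c", "d"), y = 4:6)

join_dt  <- X[Y, on = "key"]                       # keeps every row of Y
merge_dt <- merge(X, Y, by = "key", all.y = TRUE)  # same result via merge

# identical contents, ignoring row order
fsetequal(join_dt, merge_dt)
#> [1] TRUE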

@chacalle (Collaborator, Author):

I'll include a reference to this in the data.table hub page

@chacalle requested a review from krpaulson, December 9, 2020

@krpaulson (Collaborator) commented

In the PR description, maybe include the speed vignette output before the changes as well as after the changes? Not crucial here, but could be useful if you create similar PRs in the future.

@chacalle (Collaborator, Author) commented

I've now added a section to the packageTemplate wiki: https://github.com/ihmeuw-demographics/packageTemplate/wiki/Profiling-R-Functions

@chacalle merged commit 4902f23 into feature/vignette_performance Dec 10, 2020
@chacalle deleted the feature/profiling_improvements branch December 10, 2020