---
execute:
freeze: auto
---
```{r, message = FALSE, warning = FALSE, echo = FALSE}
knitr::opts_knit$set(root.dir = fs::dir_create(tempfile()))
knitr::opts_chunk$set(collapse = TRUE, comment = "#>", eval = TRUE)
```
```{r, message = FALSE, warning = FALSE, echo = FALSE}
library(targets)
library(tarchetypes)
```
# Performance {#performance}
This chapter explains simple options and settings to improve the efficiency of your `targets` pipelines. It also explains how to monitor the progress of a running pipeline.^[`cue = tar_cue(file = FALSE)` is no longer recommended for [cloud storage](https://books.ropensci.org/targets/data.html#cloud-storage). As of <https://github.com/ropensci/targets/pull/1181> (`targets` version >= 1.3.2.9003), this unwise shortcut is no longer necessary.]
::: {.callout-note collapse="true"}
## Summary: efficiency
Basic efficiency:
* Storage: choose [efficient data storage formats](https://docs.ropensci.org/targets/reference/tar_target.html#storage-formats) for targets with large data output.
* Memory: consider `memory = "transient"` and the `garbage_collection` option for high-memory tasks.
* Overhead: if each task runs quickly, batch the workload into a smaller number of targets to reduce overhead.
Parallel processing:
* Distributed computing: consider [distributed computing with `crew`](#crew).
* Worker storage: parallelize the data processing with `storage = "worker"` and `retrieval = "worker"`.
* Local targets: consider `deployment = "main"` for quick targets that do not need [parallel workers](#crew).
Esoteric optimizations:
* Metadata: set `seconds_meta_append`, `seconds_meta_upload`, and `seconds_reporter` to be kind to the local file system and R console.
* Timestamps: Set `trust_timestamps = TRUE` to avoid recomputing hashes of targets that are already up to date.
:::
::: {.callout-note collapse="true"}
## Summary: monitoring
* `targets` has functions like `tar_progress()` and `tar_watch()` to monitor the progress of the pipeline.
* Profiling with the [`proffer`](https://r-prof.github.io/proffer/) package can help discover [bottlenecks](https://en.wikipedia.org/wiki/Bottleneck_(software)).
:::
## Basic efficiency
These basic tips help most pipelines run efficiently.
### Storage
The default [data storage format](https://docs.ropensci.org/targets/reference/tar_target.html#storage-formats) is [RDS](https://rdrr.io/r/base/readRDS.html), which can be slow and bulky for large data. For large data pipelines, consider [alternative formats](https://docs.ropensci.org/targets/reference/tar_target.html#storage-formats) to more efficiently store and manage your data. Set the storage format using `tar_option_set()` or `tar_target()`:
```{r, eval = FALSE}
tar_option_set(format = "qs")
```
::: {.callout-tip collapse="true"}
### Tips about data formats
Some formats such as `"qs"` work on all kinds of data, whereas others like `"feather"` work only on data frames. Most non-default formats store data faster and in smaller files than the default `"rds"` format, but they require extra packages to be installed. For example, `format = "qs"` requires the `qs` package, and `format = "feather"` requires the `arrow` package.
For extremely large datasets that cannot fit into memory, consider `format = "file"` to treat the data as a file on disk. Downstream targets are free to load only the subsets of the data they need.
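For example, here is a minimal sketch of a `format = "file"` target, where the command writes the data to disk and returns the file path. The `write_big_data()` function and the path are hypothetical placeholders.
```{r, eval = FALSE}
tar_target(
  big_data_file,
  {
    # write_big_data() and the path below are hypothetical placeholders.
    path <- "data/big_data.parquet"
    write_big_data(path)
    path # a format = "file" target must return the file path(s)
  },
  format = "file"
)
```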
:::
### Memory
`targets` makes decisions about how long to keep target results in memory and when to run garbage collection. You can customize this behavior with the `memory` and `garbage_collection` arguments of `tar_option_set()`.^[As of version 1.8.0.9011, the default behavior of `tar_make()` (encoded with `memory = "auto"`) is to release [dynamic](#dynamic) branches as soon as possible and keep other targets in memory until the pipeline is finished. In addition, in `targets` >= 1.8.0.9003, the `garbage_collection` option can be a non-negative integer to control how often garbage collection happens (e.g. every 10th target).]
```{r, eval = FALSE}
tar_option_set(
  memory = "transient",
  garbage_collection = 10 # can be an integer in targets >= 1.8.0.9003
)
```
These arguments are available in `tar_target()` as well, although there `garbage_collection` is a `TRUE`/`FALSE` value indicating whether to run garbage collection just before that specific target runs.
```{r, eval = FALSE}
tar_target(
  name = example_data,
  command = get_data(),
  memory = "transient",
  garbage_collection = TRUE
)
```
::: {.callout-tip collapse="true"}
### About memory and garbage collection
`memory = "transient"` tells `targets` to remove data from the R environment as soon as it is no longer needed. However, the computer memory itself is not freed until garbage collection is run, and even then, R may not decrease the size of its heap. You can run garbage collection yourself with the `gc()` function in R.
Transient memory and garbage collection have tradeoffs: the pipeline reads data from storage far more often, and these data reads take additional time. In addition, garbage collection is usually a slow operation, and repeated garbage collections could slow down a pipeline with thousands of targets.
`garbage_collection = 100` tells `targets` to run `gc()` every 100th active target, both locally and on [each parallel worker](#crew). The default value is 1000 if not specified directly.
:::
### Overhead
Each target incurs overhead, and it is not good practice to create thousands of targets which each run quickly. Instead, consider grouping the same amount of work into a smaller number of targets. See the sections on [what a target should do](https://books.ropensci.org/targets/targets.html#what-a-target-should-do) and [how much a target should do](https://books.ropensci.org/targets/targets.html#how-much-a-target-should-do).
::: {.callout-tip collapse="true"}
### About batching
Simulation studies and other iterative stochastic pipelines may need to run thousands of independent random replications. For these pipelines, consider [batching](https://books.ropensci.org/targets/dynamic.html#performance-and-batching) to reduce the number of targets while preserving the number of replications. In [batching](https://books.ropensci.org/targets/dynamic.html#performance-and-batching), each batch is a [dynamic branch](https://books.ropensci.org/targets/dynamic.html) target that performs a subset of the replications. For 1000 replications, you might want 40 batches of 25 replications each, 10 batches with 100 replications each, or a different balance depending on the use case. Functions `tarchetypes::tar_rep()`, `tarchetypes::tar_map_rep()`, and [`stantargets::tar_stan_mcmc_rep_summary()`](https://wlandau.github.io/stantargets/articles/mcmc_rep.html) are examples of [target factories](https://wlandau.github.io/targetopia/contributing.html#target-factories) that set up the batching structure without needing to understand [dynamic branching](https://books.ropensci.org/targets/dynamic.html).
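As a minimal sketch, the target list in `_targets.R` might use `tarchetypes::tar_rep()` like this, where `simulate_once()` is a hypothetical user-defined function:
```{r, eval = FALSE}
library(targets)
library(tarchetypes)
list(
  # 40 batches x 25 replications per batch = 1000 total replications,
  # but only 40 targets' worth of overhead.
  tar_rep(
    simulations,
    simulate_once(), # hypothetical user-defined function
    batches = 40,
    reps = 25
  )
)
```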
:::
## Parallel processing
These tips add parallel processing to your pipeline and help you use it effectively.
### Distributed computing
Consider distributed computing with [`crew`](https://wlandau.github.io/crew/), as explained at <https://books.ropensci.org/targets/crew.html>. The `targets` package knows how to run independent tasks in parallel and wait for tasks that depend on upstream dependencies. [`crew`](https://wlandau.github.io/crew/) supports backends such as [`crew.cluster`](https://wlandau.github.io/crew.cluster/) for traditional clusters and [`crew.aws.batch`](https://wlandau.github.io/crew.aws.batch/) for AWS Batch.
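As a minimal sketch, `_targets.R` can register a local controller with a couple of workers. Cluster and cloud backends follow the same pattern with controllers from their respective packages.
```{r, eval = FALSE}
library(crew)
library(targets)
# Run up to 2 targets at a time on local parallel worker processes.
tar_option_set(controller = crew_controller_local(workers = 2))
```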
### Worker storage
If you run `tar_make()` [with a `crew` controller](#crew), then parallel processes will run your targets, but the main R process still manages all the [data](#data) by default. To delegate [data](#data) management to the parallel [`crew`](#crew) workers, set the `storage` and `retrieval` settings in `tar_target()` or `tar_option_set()`:
```{r, eval = FALSE}
tar_option_set(storage = "worker", retrieval = "worker")
```
But be sure those workers have access to the data: they must either find the [local data](#data), or the targets must use [cloud storage](#cloud-storage).
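The same settings also work for individual targets. In the sketch below, `fit_model()` and `large_data` are hypothetical, and the data must live somewhere every worker can reach, such as a shared file system or cloud storage.
```{r, eval = FALSE}
tar_target(
  model,
  fit_model(large_data), # fit_model() and large_data are hypothetical
  storage = "worker",   # the worker saves the result itself
  retrieval = "worker"  # the worker loads large_data itself
)
```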
### Local targets
In [distributed computing](#crew) with `targets`, not every target needs to run on a remote worker. For targets that run quickly and cheaply, consider setting `deployment = "main"` in `tar_target()` to run them on the main local process:
```{r, eval = FALSE}
tar_target(dataset, get_dataset(), deployment = "main")
tar_target(summary, compute_summary_statistics(), deployment = "main")
```
## Esoteric optimizations
These next tips are advanced and obscure, but they may still help.
### Metadata
By default, `tar_make()` writes to the R console and [local metadata files](#data) up to hundreds of times per second. And if you opt into [cloud storage](#cloud-storage), it uploads local metadata files to the cloud every few seconds. Frequent uploads can slow down system performance, and the `seconds_meta_upload` argument controls how often the [local metadata files](#data) upload to the cloud. Arguments `seconds_meta_append` and `seconds_reporter` use memory to buffer writes to disk and to the R console, respectively, but they risk temporarily losing or showing incomplete information at a given moment, so they are not recommended except when the file system or R console is extremely sensitive or overburdened.
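Here is a sketch of what throttled metadata and console output might look like in a call to `tar_make()`. The specific values are illustrative, not recommendations.
```{r, eval = FALSE}
tar_make(
  seconds_meta_append = 15, # write metadata to disk at most every 15 seconds
  seconds_meta_upload = 30, # upload metadata to the cloud at most every 30 seconds
  seconds_reporter = 5      # refresh R console progress at most every 5 seconds
)
```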
::: {.callout-caution collapse="true"}
### Caution: delayed metadata and progress messages
If `seconds_meta_append` is 15, then `tar_make()` waits at least 15 seconds before updating the [local metadata files](#data). It spends at least 15 seconds collecting a backlog of new metadata, then writes all that metadata in bulk. `seconds_reporter` does the same for R console progress messages. Be warned: a long-running local target may block the R session and make the actual delay much longer. So when a target completes, `tar_make()` may not notify you immediately, and the target may not appear up to date in the metadata until long after it actually finishes. Please be patient and allow the pipeline to continue until the end.
:::
### Time stamps
`targets` computes a [hash](https://en.wikipedia.org/wiki/Hash_function) to check if each target is up to date. However, because a [hash](https://en.wikipedia.org/wiki/Hash_function) can be a slow operation, `targets` tries to avoid recomputing hashes when feasible. The shortcut uses file modification time stamps: if the time stamp has not changed since last `tar_make()`, then the hash should be the same as last time, so `targets` should not need to recompute it. By default, `targets` only performs this shortcut on sufficiently modern file systems with trustworthy high-precision time stamps, but you can override this with the `trust_timestamps` argument of `tar_option_set()`.
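For example, in `_targets.R`:
```{r, eval = FALSE}
# Trust file modification time stamps: skip recomputing the hash of a
# target whose time stamp has not changed since the last tar_make().
tar_option_set(trust_timestamps = TRUE)
```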
## Monitoring progress
Even the most efficient `targets` pipelines can take time to complete because the user-defined tasks themselves are slow. There are convenient ways to monitor the progress of a running pipeline:
1. `tar_poll()` continuously refreshes a text summary of runtime progress in the R console (see the sketch after this list). Run it in a new R session at the project root directory. (Only supported in `targets` version 0.3.1.9000 and higher.)
1. `tar_visnetwork()`, `tar_progress_summary()`, `tar_progress_branches()`, and `tar_progress()` show runtime information at a single moment in time.
1. `tar_watch()` launches a Shiny app that automatically refreshes the graph every few seconds.
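As a minimal sketch of the first option, open a second R session at the project root while `tar_make()` runs in the first. The polling interval below is illustrative.
```{r, eval = FALSE}
# Refresh a text summary of pipeline progress every 5 seconds.
tar_poll(interval = 5)
```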
::: {.callout-tip collapse="true"}
## Example: monitoring the pipeline with `tar_watch()`
```{r, eval = FALSE}
# Define an example target script file with a slow pipeline.
library(targets)
library(tarchetypes)
tar_script({
  sleep_run <- function(...) {
    Sys.sleep(10)
  }
  list(
    tar_target(settings, sleep_run()),
    tar_target(data1, sleep_run(settings)),
    tar_target(data2, sleep_run(settings)),
    tar_target(data3, sleep_run(settings)),
    tar_target(model1, sleep_run(data1)),
    tar_target(model2, sleep_run(data2)),
    tar_target(model3, sleep_run(data3)),
    tar_target(figure1, sleep_run(model1)),
    tar_target(figure2, sleep_run(model2)),
    tar_target(figure3, sleep_run(model3)),
    tar_target(conclusions, sleep_run(c(figure1, figure2, figure3)))
  )
})
# Launch the app in a background process.
# You may need to refresh the browser if the app is slow to start.
# The graph automatically refreshes every 10 seconds.
tar_watch(seconds = 10, outdated = FALSE, targets_only = TRUE)
# Now run the pipeline and watch the graph change.
tar_make()
```

`tar_watch_ui()` and `tar_watch_server()` make this functionality available to other apps through a Shiny module.
:::
## Profiling
Profiling tools like [`proffer`](https://r-prof.github.io/proffer/) pinpoint the specific places where code runs slowly. It is important to identify these bottlenecks before you try to optimize. Steps:
1. Install the [`proffer`](https://r-prof.github.io/proffer/) R package and its dependencies.
1. Run `proffer::pprof(tar_make(callr_function = NULL))` on your project, as in the sketch after these steps.
1. Examine the flame graph to figure out which R functions are taking the most time.
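For step 2, a sketch of the call and why it is written that way:
```{r, eval = FALSE}
# callr_function = NULL runs tar_make() in the current R session instead
# of an external subprocess, so the profiler can observe the pipeline.
proffer::pprof(tar_make(callr_function = NULL))
```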
## Resource usage
The [`autometric`](https://wlandau.github.io/autometric) package can monitor the CPU and memory consumption of the various processes in a `targets` pipeline. This is mainly useful for high-performance computing workloads with [parallel workers](#crew). Please read <https://wlandau.github.io/crew/articles/logging.html> for details and examples.