# (PART) Scaling views {-}
# Introduction {#scaling}
Scaling visualization to large amounts of data requires a combination of computer engineering, statistical summaries, and creativity.
The previous part of this book covered linked views and demonstrated some techniques that can help put the famous infovis mantra into practice: "Overview first, zoom and filter, then details on demand". In other words, don't try to show all the raw data in a single view -- show interesting summaries, then provide interactive tools to extract more information. From a statistical perspective, this mantra tends to work well because, as Hadley Wickham puts it: "Visualizations surprise, but don't scale well. Models (i.e., summaries) scale well, but don't surprise". That is, a model helps reduce the amount of information to display, but it cannot point out what important information it fails to capture. By providing interactive tools that reveal the detailed information behind a particular summary, you provide a better framework for questioning the assumption(s) inherent in your summarized overview of the data.
Of course, when dealing with non-trivial models (i.e., summaries), it can be quite useful to leverage a statistical computing environment.
<!--
Especially with interactive graphics (which allow for zoom+pan), it's tempting to avoid summaries and simply plot all the raw data. This can lead to interesting engineering problems that ultimately miss the larger picture. That being said, even if you've thought carefully about leveraging summaries, it helps to have a good sense of the technical capabilities and limitations of your rendering tools.
-->
These are helpful concepts to keep in mind when designing an exploratory interface to large-scale data, and several figures later on reinforce them, but for now we focus on the limitations of rendering lots of graphical elements on a page.
Roughly speaking, when a **plotly** object is printed, the bulk of the time is spent translating R code into an R list.
That list is then serialized as JSON (via `jsonlite::toJSON()`) and must match the JSON specification (i.e., schema) defined by the JavaScript library, which uses that JSON to render the widget.
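As a quick way to see these two steps in action, you can inspect the intermediate objects directly (the example below is just an illustrative sketch): `plotly_build()` returns the R list that gets serialized, and `plotly_json()` displays the JSON that plotly.js ultimately receives.
```r
library(plotly)

p <- plot_ly(x = rnorm(100), y = rnorm(100))

b <- plotly_build(p)   # the underlying R list (an htmlwidget)
names(b$x)             # data, layout, config, ...
length(b$x$data)       # number of traces

plotly_json(p)         # the JSON specification sent to plotly.js
```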
# Scaling across views
<!-- Move trelliscopejs (navigating many views) section here? -->
# Scaling within views
## Build vs. run time
The concept of build versus run-time performance is an important topic related to scaling interactive data visualization. Since latency in interactive graphics is known to make exploratory data analysis a more challenging task [@2014-latency], systems that optimize run-time over build-time performance are typically preferable. This is especially true for visualizations that *others* are consuming, but in a typical EDA context, where the person creating the visualization is the main consumer, build-time performance also matters because it presents a hurdle to a productive analysis.
Recall the diagram of what happens when you print a **plotly** object in Figure \@ref(fig:intro-printing).
<!--
Especially for crossfilter-esque applications (see Section \@ref(crossfilter)), performing computational work in advance (i.e., at build time) can lead to a responsive experience with graphical queries on the order of millions to billions of data points [@2013-immens; @nanocubes; @2019-falcon]. Typically, to reach this kind of scale, you need a highly customized system that doesn't allow itself to be integrated easily in a larger EDA workflow. Regardless, there are clever ideas that we can use from this work, which is explored in Chapter \@ref().
-->
* SVG vs. Web-GL
* Draw comparison to pdf/png
* Borrow examples from workshop
A quick and easy way to try to improve render performance is to use Canvas-based rendering (instead of vector-based SVG) with `toWebGL(p)`. Switching from vector to Canvas is generally a good idea when dealing with more than about 30,000 vectors, but in this case we're only dealing with a couple hundred vector paths, so switching from vector to Canvas for our map won't significantly improve rendering performance; in fact, we'll lose some nice SVG-exclusive features (the plotly.js team is getting close to eliminating these limitations!). Instead, what we could (and should!) do is reduce the number of points along each path (technically speaking, we'll reduce the complexity of the SVG `d` attribute).
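The original map example isn't included in this draft, so here is a rough sketch of both options using the North Carolina counties shapefile that ships with **sf** as a stand-in (about 100 polygon paths):
```r
library(plotly)
library(sf)

# ~100 county polygons, each with a fairly detailed boundary
nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# Option 1: Canvas/WebGL rendering. Helps most with tens of thousands of
# elements; with a couple hundred paths it won't buy much.
toWebGL(plot_ly(nc))

# Option 2: reduce the number of points along each path by simplifying the
# geometry itself (tolerance units depend on the CRS; rmapshaper::ms_simplify()
# is an alternative that better preserves shared boundaries)
nc_simple <- st_simplify(nc, dTolerance = 1000)
plot_ly(nc_simple)
```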
* What's in a plotly object?
* What happens at print-time?
* Build-time versus render time
* Build time
* profvis
* Render time
* SVG vs. Web-GL rendering
* Data summary/simplification
If you’ve ever found `ggplotly()` slow to print, chances are the bulk of the time is spent building the R list and sending the JSON to plotly.js. For many htmlwidgets the build time is negligible, but for more complex widgets like **plotly**, a lot of things need to happen, especially for `ggplotly()`, since we call `ggplot2::ggplot_build()` and then crawl and map that data structure to plotly.js. In a **shiny** app, both the build and render stages are required on initial load, but the `plotlyProxy()` interface provides a way to ‘cache’ expensive build (and render!) operations and update a graph by modifying just specific components of the figure (via plotly.js functions). Outside of a ‘reactive context’ like **shiny**, you can use `htmlwidgets::saveWidget()` to ‘cache’ the results of the build step to disk and send the file to someone else (or host it online); then only the render step is required to view the graph.
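For instance, here is a rough sketch of both ideas, using `diamonds` as a stand-in dataset: **profvis** can show where the build time goes, and `htmlwidgets::saveWidget()` caches the build step to disk.
```r
library(plotly)
library(ggplot2)
library(profvis)

gg <- ggplot(diamonds, aes(carat, price)) + geom_point(alpha = 0.1)

# Profile the build step: ggplot_build() plus the mapping to plotly.js
profvis({
  plotly_build(ggplotly(gg))
})

# 'Cache' the build step by saving the widget to disk; opening the HTML
# file later only requires the render step
htmlwidgets::saveWidget(ggplotly(gg), "diamonds.html")
```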
# Benchmarks
This chapter provides some general performance benchmarks for **plotly** as of version `r packageVersion('plotly')`. The intention is to provide some intuition about reasonable thresholds and expectations when working with common trace types. Recall that, in general, we care more about run-time performance, especially for visualizations shared with others.
## Build-time
For build-time performance, we
## Run-time
## Scatterplots
### SVG
SVG scatterplots are notoriously slow; they typically remain responsive up to about 10,000 points.
```r
plot_ly(x = rnorm(1e4), y = rnorm(1e4), alpha = 0.1)
```
### WebGL
Switching from SVG to WebGL can result in _huge_ improvements in responsiveness (i.e., run-time efficiency) for scatterplots, simply by switching the trace type from `scatter` to `scattergl`:
```r
plot_ly(x = rnorm(1e6), y = rnorm(1e6), alpha = 0.1, type = "scattergl")
```
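The `pointcloud` trace type trades some styling flexibility for even faster WebGL rendering: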
```r
plot_ly(x = rnorm(1e6), y = rnorm(1e6), type = "pointcloud")
```
## Line charts
### SVG
Compared to SVG scatterplots, SVG line charts can handle a much larger amount of data, partially thanks to the way SVG paths work. The main bottleneck with SVG scatterplots is the work the browser must do to manage all the DOM elements (one element per circle). On the other hand, with paths, many data points can be encoded in a single path.
```r
y <- sample(c(-1, 1), 1e6, TRUE)
x <- seq(Sys.time(), Sys.time() + length(y) - 1, by = "1 sec")
plot_ly(x = x, y = cumsum(y)) %>% add_lines()
```
### WebGL
https://workshops.cpsievert.me/20171118/slides/day1/#29
https://workshops.cpsievert.me/20171118/slides/day1/#30
```r
y <- sample(c(-1, 1), 1e7, TRUE)
x <- seq(Sys.time(), Sys.time() + length(y) - 1, by = "1 sec")
# convert the SVG line trace to its WebGL equivalent
plot_ly(x = x, y = cumsum(y)) %>% add_lines() %>% toWebGL()
```
## Polygons
### SVG
# Leveraging summaries
For larger data, on the order of 100 million or more observations, it probably doesn't make much sense to send all the raw data to the browser, even if your visualization software can handle that much data. Instead, it pays to know how to effectively leverage statistical summaries in your visualizations.
<!--
Sometimes you have to consider way more views than you can possibly digest visually. In Chapters \@ref(graphical-queries) and \@ref(linking-views-with-shiny), we explore some useful techniques for implementing the popular visualization mantra from @details-on-demand:
> "Overview first, zoom and filter, then details-on-demand."
In fact, Figure \@ref(fig:shiny-corrplot) from Section \@ref(scoping-events) provides an example of this mantra put into practice. The correlation matrix provides an overview of the correlation structure between all the variables, and by clicking a cell, it populates a scatterplot between those two specific variables. This works fine with tens or hundreds of variables, but once you have thousands or tens of thousands of variables, this technique begins to fall apart. At that point, you may be better off defining a range of correlations that you're interested in exploring, or better yet, incorporating another measure (e.g., a test statistic) and then focusing on views that match certain criteria.
-->
## Large n
### Numerical variables
@bigvis outlines a general framework (coined "bin-summarise-smooth") that balances the computational and statistical trade-offs one must confront to visualize large numerical data in a scalable, responsive, and careful way. The core idea is to leverage the computational efficiency of what @data-cube defines as distributive and algebraic statistics. Examples of distributive statistics are the minimum, maximum, count, and sum. These statistics are distributive because they can be computed on disjoint subsets of the data and then combined to give the result for the full data.
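As a rough, base-R sketch of the bin-summarise idea (the cited **bigvis** package provides optimized implementations of these steps), the example below reduces ten million observations to 100 binned counts before plotting:
```r
library(plotly)

x <- rnorm(1e7)

# Bin: 100 equal-width bins
breaks <- seq(min(x), max(x), length.out = 101)
bins <- cut(x, breaks, include.lowest = TRUE)

# Summarise: a distributive statistic (the count) per bin
counts <- as.numeric(table(bins))
midpoints <- (head(breaks, -1) + tail(breaks, -1)) / 2

# Only 100 bars are sent to the browser, not 10 million points
plot_ly(x = midpoints, y = counts) %>% add_bars()
```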
* Nod to @bigvis's description of Gray's distributive/algebraic/holistic.
* How much of this is implemented in bigvis?
### Discrete categories
Here, scaling in the number of observations is not really the problem; the question is how to handle scaling in the number of categories (one possible approach is sketched after the note below).
* Drill-down techniques
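As a rough sketch of this idea (assuming the **forcats** package; the simulated categories are just a stand-in), the overview below shows the 20 most frequent categories and lumps everything else into "Other", which a drill-down view could then expand:
```r
library(plotly)
library(forcats)

# One million observations spread over 5,000 distinct categories
cats <- sample(paste0("category-", 1:5000), 1e6, replace = TRUE)

# Keep the 20 most frequent categories; lump the rest into "Other"
lumped <- fct_lump(factor(cats), n = 20)
counts <- sort(table(lumped), decreasing = TRUE)

plot_ly(x = names(counts), y = as.numeric(counts)) %>% add_bars()
```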
## Large p
When faced with a large number of variables or, more generally speaking, groups of distributions, we typically want to do one or more of the following (a rough sketch follows this list):
* Find variables that account for the majority of the variation in the data.
* Discover 'unusual' variables.
* Find clusters of similar variables.
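One rough way to approach the first two tasks (not necessarily the approach taken in the openFDA example linked below) is a principal components decomposition; the simulated `wide` data frame here is just a placeholder:
```r
library(plotly)

# 50 numeric variables; in practice this could be thousands
wide <- as.data.frame(matrix(rnorm(1e4 * 50), ncol = 50))

pc <- prcomp(wide, scale. = TRUE)
var_explained <- pc$sdev^2 / sum(pc$sdev^2)

# How much of the overall variation do the leading components capture?
plot_ly(x = seq_along(var_explained), y = var_explained) %>%
  add_bars() %>%
  layout(
    xaxis = list(title = "Principal component"),
    yaxis = list(title = "Proportion of variance")
  )
```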
openFDA example <https://talks.cpsievert.me/20180816/>
## Large conditional distributions
### A case study with cranlogs
<https://workshops.cpsievert.me/20171118/slides/day1/#21>
```{r}
```
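A possible starting point for this case study, assuming the **cranlogs** package (the package names and time window below are placeholders):
```r
library(plotly)
library(cranlogs)

# Daily download counts for a few packages over the last month
d <- cran_downloads(
  packages = c("plotly", "ggplot2", "shiny"),
  when = "last-month"
)

plot_ly(d, x = ~date, y = ~count, color = ~package) %>% add_lines()
```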
### A case study with openFDA data
A rehash of the end of this talk?