linked-views.Rmd

# (PART) Linking multiple views {-}

# Introduction
\sectionmark{Introduction}

Linking of multiple data views offers a powerful approach to visualization as well as communication of structure in high-dimensional data. In particular, linking of multiple 1-2 dimensional statistical graphics can often lead to insight that a single view could not possibly reveal. For decades, statisticians and computer scientists have been using and authoring systems for multiple linked views, many of which can be found in [ASA's video library](http://stat-graphics.org/movies). Some noteworthy videos include [focussing and linking](http://stat-graphics.org/movies/focussing-linking.html), [missing values](http://stat-graphics.org/movies/missing-data.html), and [exploring Tour De France data](http://stat-graphics.org/movies/tour-de-france.html) [@xgobi; @mondrianbook].

These early systems were incredibly sophisticated, but the interactive graphics they produce are not easily shared, replicated, or incorporated in a larger document. Web technologies offer the infrastructure to address these issues, which is a big reason why many modern interactive graphics systems are now web based. When talking about interactive _web-based_ graphics, it's important to recognize the difference between a web application and a purely client-side webpage, especially when it comes to saving, sharing, and hosting the result. 

A web application relies on a client-server relationship where the client's (i.e., end user) web browser requests content from a remote server. This model is necessary whenever the webpage needs to execute computer code that is not natively supported by the client's web browser. As Chapter \@ref(linking-views-with-shiny) details, the flexibility that a web application framework, like **shiny**, offers is an incredibly productive and powerful way to link multiple data views; but when it comes to distributing a web application, it introduces a lot of complexity and computational infrastructure that may or may not be necessary.

Figure \@ref(fig:server-client) is a basic illustration of the difference between a web application and a purely client-side webpage. Thanks to `JavaScript` and `HTML5`, purely client-side webpages can still be dynamic without any software dependencies besides a modern web browser. In fact, Section \@ref(graphical-queries) outlines **plotly**'s graphical querying framework for linking multiple plots entirely client-side, which makes the result very easy to distribute (see Chapter \@ref(saving)). There are, of course, many useful examples of linked and dynamic views that cannot be easily expressed as a database query, but a surprising amount actually can, and the remainder can likely be quickly implemented as a **shiny** web application.

```{r server-client, echo = FALSE, fig.cap = "(ref:server-client)"}
knitr::include_graphics("images/server-client.svg")
```

The graphical querying framework implemented by **plotly** is inspired by @Buja:1991vh, where direct manipulation of graphical elements in multiple linked plots is used to perform database queries and visually reveal high-dimensional structure in real-time. @Cook:2007uk goes on to argue this framework is preferable to posing database queries dynamically via menus, as described by @Ahlberg:1991, and goes on to state that, "Multiple linked views are the optimal framework for posing queries about data". The next section shows you how to implement similar graphical queries in a standalone webpage using R code.

# Client-side linking

## Graphical queries

\indexc{highlight\_key()}
\indexc{highlight()}

This section focuses on a particular approach to linking views known as graphical (database) queries using the R package **plotly**. With **plotly**, one can write R code to pose graphical queries that operate entirely client-side in a web browser (i.e., no special web server or callback to R is required). In addition to teaching you how to pose queries with the `highlight_key()` function, this section shows you how to control queries that are triggered and visually rendered via the `highlight()` function.

Figure \@ref(fig:link-intro) shows a scatterplot of the relationship between weight and miles per gallon of 32 cars. It also uses `highlight_key()` to assign the number of cylinders to each point so that when a particular point is 'queried', all points with the same number of cylinders are highlighted (the number of cylinders is displayed with text just for demonstration purposes). By default, a mouse click triggers a query, and a double-click clears the query, but both of these events can be customized through the `highlight()` function. By typing `help(highlight)` in your R console, you can learn more about what events are supported for turning graphical queries `on` and `off`.

```r
library(plotly)
mtcars %>%
  highlight_key(~cyl) %>%
  plot_ly(
    x = ~wt, y = ~mpg, text = ~cyl, mode = "markers+text", 
    textposition = "top", hoverinfo = "x+y"
  ) %>%
  highlight(on = "plotly_hover", off = "plotly_doubleclick")
```

```{r link-intro, echo = FALSE, fig.cap = "(ref:link-intro)"}
include_vimeo("317134071")
```

Generally speaking, `highlight_key()` assigns data values to graphical marks so that when graphical mark(s) are *directly manipulated* through the `on` event, it uses the corresponding data values (call it `$SELECTION_VALUE`) to perform an SQL query of the following form.

```sql
SELECT * FROM mtcars WHERE cyl IN $SELECTION_VALUE
```

For a more useful example, let's use graphical querying to pose interactive queries of the `txhousing` dataset. This data contains monthly housing sales in Texan cities acquired from the [TAMU real estate center](http://recenter.tamu.edu/) and made available via the **ggplot2** package. Figure \@ref(fig:txmissing) shows the median house price in each city over time which produces a rather busy (spaghetti) plot. To help combat the overplotting, we could add the ability to click a particular data point along a line to highlight that particular city. This interactive ability is enabled by simply using `highlight_key()` to declare that the `city` variable be used as the querying criteria within the graphical querying framework. 

One subtlety to be aware of in terms of what makes Figure \@ref(fig:txmissing) possible is that every point along a line may have a different data value assigned to it. In this case, since the `city` column is used as both the visual grouping *and* querying variable, we effectively get the ability to highlight a group by clicking on any point along that line. Section \@ref(trellis-linking) has examples of using different grouping and querying variables to query multiple related groups of visual geometries at once, which can be a powerful technique.^[This sort of idea relates closely to the notion of generalized selections as described in @heer2008generalized.]

```r
# load the `txhousing` dataset
data(txhousing, package = "ggplot2")

# declare `city` as the SQL 'query by' column
tx <- highlight_key(txhousing, ~city)

# initiate a plotly object
base <- plot_ly(tx, color = I("black")) %>% 
  group_by(city)

# create a time series of median house price
base %>%
  group_by(city) %>%
  add_lines(x = ~date, y = ~median)
```

```{r txmissing, echo = FALSE, fig.cap = "(ref:txmissing)"}
include_vimeo("317137926")
```

\index{Persistent selection}

Querying a city via direct manipulation is somewhat helpful for focussing on a particular time series, but it's not so helpful for querying a city by name and/or comparing multiple cities at once. As it turns out, **plotly** makes it easy to add a selectize.js powered dropdown widget for querying by name (aka indirect manipulation) by setting `selectize = TRUE`.^[The title that appears in the dropdown can be controlled via the `group` argument in the `highlight_key()` function. The primary purpose of the `group` argument is to isolate one group of linked plots from others.] When it comes to comparing multiple cities, we want to be able to both retain previous selections (`persistent = TRUE`) as well as control the highlighting color (`dynamic = TRUE`). This video explains how to use these features in Figure \@ref(fig:txmissing-modes) to compare pricing across different cities.

```{r, echo = FALSE}
# hack to isolate the querying
tx <- highlight_key(txhousing, ~city, " ")
base <- plot_ly(tx, color = I("black")) %>% 
  group_by(city)
time_series <- base %>%
  group_by(city) %>%
  add_lines(x = ~date, y = ~median)
```

```r
highlight(
  time_series, 
  on = "plotly_click", 
  selectize = TRUE, 
  dynamic = TRUE, 
  persistent = TRUE
)
```

```{r txmissing-modes, echo = FALSE, fig.cap = "(ref:txmissing-modes)"}
if (knitr::is_html_output()) {
  highlight(time_series, on = "plotly_click", selectize = TRUE, dynamic = TRUE, persistent = TRUE)
} else {
  knitr::include_graphics("images/txmissing-modes.png")
}
```

By querying a few different cities in Figure \@ref(fig:txmissing-modes), one obvious thing we can learn is that not every city has complete pricing information (e.g., South Padre Island, San Marcos, etc.). To learn more about what cities are missing information as well as how that missingness is structured, Figure \@ref(fig:txmissing-linked) links a view of the raw time series to a dot-plot of the corresponding number of missing values per city. In addition to making it easy to see how cities rank in terms of missing house prices, it also provides a way to query the corresponding time series (i.e., reveal the structure of those missing values) by brushing cities in the dot-plot. This general pattern of linking aggregated views of the data to more detailed views fits the famous and practical information visualization advice from @details-on-demand: "Overview first, zoom and filter, then details on demand".

\index{Data-plot-pipeline}

```{r, echo = FALSE}
tx <- highlight_key(txhousing, ~city, "  ")
base <- plot_ly(tx, color = I("black")) %>% 
  group_by(city)
time_series <- base %>%
  group_by(city) %>%
  add_lines(x = ~date, y = ~median)
```

```r
# remember, `base` is a plotly object, but we can use dplyr verbs to
# manipulate the input data 
# (`txhousing` with `city` as a grouping and querying variable)
dot_plot <- base %>%
  summarise(miss = sum(is.na(median))) %>%
  filter(miss > 0) %>%
  add_markers(
    x = ~miss, 
    y = ~forcats::fct_reorder(city, miss), 
    hoverinfo = "x+y"
  ) %>%
  layout(
    xaxis = list(title = "Number of months missing"),
    yaxis = list(title = "")
  ) 

subplot(dot_plot, time_series, widths = c(.2, .8), titleX = TRUE) %>%
  layout(showlegend = FALSE) %>%
  highlight(on = "plotly_selected", dynamic = TRUE, selectize = TRUE)
```

```{r txmissing-linked, echo = FALSE, fig.cap = "(ref:txmissing-linked)"}
if (knitr::is_html_output()) {
  dot_plot <- base %>%
    summarise(miss = sum(is.na(median))) %>%
    filter(miss > 0) %>%
    add_markers(x = ~miss, y = ~forcats::fct_reorder(city, miss), hoverinfo = "x+y") %>%
    layout(
      xaxis = list(title = "Number of months missing"),
      yaxis = list(title = "")
    ) 

  subplot(dot_plot, time_series, widths = c(0.2, 0.8), titleX = TRUE) %>%
    layout(showlegend = FALSE) %>%
    highlight(on = "plotly_selected", dynamic = TRUE, selectize = TRUE)
} else {
  knitr::include_graphics("images/txmissing-linked.png")
}
```

How does **plotly** know to highlight the time series when markers in the dot-plot are selected? The answer lies in what data values are embedded in the graphical markers via `highlight_key()`. When 'South Padre Island' is selected, as in Figure \@ref(fig:pipeline-diagram), it seems as though the logic says to simply change the color of any graphical elements that match that value, but the logic behind **plotly**'s graphical queries is a bit more subtle and powerful. Another, more accurate, framing of the logic is to first imagine a linked database query being performed behind the scenes (as in Figure \@ref(fig:pipeline-diagram)). When 'South Padre Island' is selected, it first filters the aggregated dot-plot data down to just that one row, then it filters down the raw time-series data down to every row with 'South Padre Island' as a city. The drawing logic will then call [`Plotly.addTrace()`](https://plot.ly/javascript/plotlyjs-function-reference/#plotlyaddtraces) with the newly filtered data which adds a new graphical layer representing the selection, allowing us to have fine-tuned control over the visual encoding of the data query.

```{r pipeline-diagram, echo = FALSE, fig.cap = "(ref:pipeline-diagram)"}
knitr::include_graphics("images/pipeline.svg")
```

The biggest advantage of drawing an entirely new graphical layer with the filtered data is that it becomes easy to leverage [statistical trace types](https://plot.ly/r/statistical-charts/) for producing summaries that are conditional on the query. Figure \@ref(fig:txhousing-aggregates) leverages this functionality to dynamically produce probability densities of house price in response to a query event. Section \@ref(statistical-queries) has more examples of leveraging statistical trace types with graphical queries.

\index{layout()@\texttt{layout()}!barmode@\texttt{barmode}!overlay}

```{r, echo = FALSE}
tx <- highlight_key(txhousing, ~city, "   ")
base <- plot_ly(tx, color = I("black")) %>% 
  group_by(city)
time_series <- base %>%
  group_by(city) %>%
  add_lines(x = ~date, y = ~median)
```

```r
hist <- add_histogram(
  base,
  x = ~median, 
  histnorm = "probability density"
)
subplot(time_series, hist, nrows = 2) %>%
  layout(barmode = "overlay", showlegend = FALSE) %>%
  highlight(
    dynamic = TRUE, 
    selectize = TRUE, 
    selected = attrs_selected(opacity = 0.3)
  )
```

\index{add\_trace()@\texttt{add\_trace()}!add\_histogram()@\texttt{add\_histogram()}!histnorm@\texttt{histnorm}}
\index{layout()@\texttt{layout()}!barmode@\texttt{barmode}}
\indexc{attrs\_selected()}

```{r txhousing-aggregates, echo = FALSE, fig.cap = "(ref:txhousing-aggregates)"}
if (knitr::is_html_output()) {
  hist <- base %>% add_histogram(x = ~median, histnorm = "probability density")
subplot(time_series, hist, nrows = 2) %>%
  layout(barmode = "overlay", showlegend = FALSE) %>%
  highlight(dynamic = TRUE, selectize = TRUE, selected = attrs_selected(opacity = 0.3))
} else {
  knitr::include_graphics("images/txhousing-aggregates.png")
}
```

Another neat consequence of drawing a completely new layer is that we can control the  plotly.js attributes in that layer through the `selected` argument of the `highlight()` function. In Figure \@ref(fig:txhousing-aggregates) we use it to ensure the new highlighting layer has some transparency to more easily compare the city specific distribution to the overall distribution.

<!--
Figure \@ref(fig:txhousing-miss) is a fairly simple example of how **plotly**'s graphical querying framework allows you to think of linking of plots similar to how you'd link tables in a SQL database. This approach, although perhaps not completely intuitive at first, grants a nice balance between flexibility and productiveness when creating linked views.
-->

This section is designed to help give you a foundation for leveraging graphical queries in your own work. Hopefully by now you have a rough idea what graphical queries are, how they can be useful, and how to create them with `highlight_key()` and `highlight()`. Understanding the basic idea is one thing, but applying it effectively to new problems is another thing entirely. To help spark your imagination and demonstrate what's possible, Section \@ref(querying-examples) has numerous subsections each with numerous examples of graphical queries in action. 


## Highlight versus filter events {#filter}

\index{Crosstalk filters}

Section \@ref(graphical-queries) provides an overview of **plotly**'s framework for *highlight* events, but it also supports *filter* events. These events trigger slightly different logic:

* A highlight event dims the opacity of existing marks, then adds an additional graphical layer representing the selection.
* A filter event completely removes existing marks and rescale axes to the remaining data.^[When using `ggplotly()`, you need to specify `dynamicTicks = TRUE`.]

Figure \@ref(fig:filter-highlight) provides a quick visual depiction in the difference between filter and highlight events. At least currently, filter events must be fired from filter widgets from the **crosstalk** package, and these widgets expect an object of class `SharedData` as input. As it turns out, the `highlight_key()` function, introduced in Section \@ref(graphical-queries), creates a `SharedData` instance and is essentially a wrapper for `crosstalk::SharedData$new()`. 

```{r}
class(highlight_key(mtcars))
```

Figure \@ref(fig:filter-highlight) demonstrates the main difference in logic between filter and highlight events. Notice how, in the code implementation, the 'querying variable' definition for filter events is part of the filter widget. That is, `city` is defined as the variable of interest in `filter_select()`, not in the creation of `tx`. That is (intentionally) different from the approach for highlight events, where the 'querying variable' is a property of the dataset behind the graphical elements. 

\index{Crosstalk filters!filter\_select()@\texttt{filter\_select()}}

```r
library(crosstalk)

# generally speaking, use a "unique" key for filter, 
# especially when you have multiple filters!
tx <- highlight_key(txhousing)
gg <- ggplot(tx) + geom_line(aes(date, median, group = city))
filter <- bscols(
  filter_select("id", "Select a city", tx, ~city),
  ggplotly(gg, dynamicTicks = TRUE),
  widths = c(12, 12)
)

tx2 <- highlight_key(txhousing, ~city, "Select a city")
gg <- ggplot(tx2) + geom_line(aes(date, median, group = city))
select <- highlight(
  ggplotly(gg, tooltip = "city"), 
  selectize = TRUE, persistent = TRUE
)

bscols(filter, select)
```

```{r filter-highlight, echo = FALSE, fig.cap = "(ref:filter-highlight)"}
include_vimeo("307598256")
```

When using multiple filter widgets to filter the same dataset, as done in Figure \@ref(fig:multiple-filter-widgets), you should avoid referencing a non-unique querying variable (i.e., key-column) in the `SharedData` object used to populate the filter widgets. Remember that the default behavior of `highlight_key()` and `SharedData$new()` is to use the row-index (which is unique). This ensures the intersection of multiple filtering widgets queries the correct subset of data.

\index{Crosstalk filters!filter\_slider()@\texttt{filter\_slider()}}
\index{Crosstalk filters!filter\_checkbox()@\texttt{filter\_checkbox()}}

```r
library(crosstalk)
tx <- highlight_key(txhousing)
widgets <- bscols(
  widths = c(12, 12, 12),
  filter_select("city", "Cities", tx, ~city),
  filter_slider("sales", "Sales", tx, ~sales),
  filter_checkbox("year", "Years", tx, ~year, inline = TRUE)
)
bscols(
  widths = c(4, 8), widgets, 
  plot_ly(tx, x = ~date, y = ~median, showlegend = FALSE) %>% 
    add_lines(color = ~city, colors = "black")
)
```

```{r multiple-filter-widgets, echo=FALSE, fig.cap="(ref:multiple-filter-widgets)"}
include_vimeo("307598347")
```

\index{Graphical queries with leaflet}

As Figure \@ref(fig:plotly-leaflet-filter) demonstrates, filter and highlight events can work in conjunction with various **htmlwidgets**. In fact, since the semantics of filter are more well defined than highlight, linking filter events across **htmlwidgets** via **crosstalk** should generally be more well supported.^[All R packages with **crosstalk** support are currently listed here: https://rstudio.github.io/crosstalk/widgets.html]

```r
library(leaflet)

eqs <- highlight_key(quakes)
stations <- filter_slider(
  "station", "Number of Stations", 
  eqs, ~stations
)

p <- plot_ly(eqs, x = ~depth, y = ~mag) %>% 
  add_markers(alpha = 0.5) %>% 
  highlight("plotly_selected")

map <- leaflet(eqs) %>% 
  addTiles() %>% 
  addCircles()

bscols(
  widths = c(6, 6, 3), 
  p, map, stations
)
```

```{r plotly-leaflet-filter, echo=FALSE, fig.cap="(ref:plotly-leaflet-filter)"}
include_vimeo("307597495")
```

When combining filter and highlight events, one (current) limitation to be aware of is that the highlighting variable has to be nested inside filter variable(s). For example, in Figure \@ref(fig:gapminder-filter-highlight), we can filter by continent and highlight by country, but there is currently no way to highlight by continent and filter by country.

```r
library(gapminder)
g <- highlight_key(gapminder, ~country)
continent_filter <- filter_select(
  "filter", "Select a country", 
  g, ~continent
)

p <- plot_ly(g) %>%
  group_by(country) %>%
  add_lines(x = ~year, y = ~lifeExp, color = ~continent) %>%
  layout(xaxis = list(title = "")) %>%
  highlight(selected = attrs_selected(showlegend = FALSE))

bscols(continent_filter, p, widths = 12)
```

```{r gapminder-filter-highlight, echo=FALSE, fig.cap="(ref:gapminder-filter-highlight)"}
include_vimeo("307598672")
```

<!--
PROBLEM: If you wanted to filter by continent and highlight by country using gapminder, you can't currently do that as plotly will lose the row index information. 

Have `highlight_key()` support a 

```r
library(gapminder)
g <- SharedData$new(gapminder)
continent_filter <- filter_select("filter", "Select a country", g, ~continent)

# PROPOSED SOLUTION: have this sort of use of highlight_key() somehow retain the row index information, but still query by country
p <- plot_ly(g) %>%
  highlight_key(~country) %>%
  group_by(country) %>%
  add_lines(x = ~year, y = ~lifeExp, color = ~continent) %>%
  layout(xaxis = list(title = ""))

bscols(continent_filter, p, widths = 12)
```

-->


## Linking animated views

The graphical querying framework (Section \@ref(graphical-queries)) works in tandem with key-frame animations Chapter \@ref(animating-views). Figure \@ref(fig:gapminder-highlight-animation) extends Figure \@ref(fig:animation-ggplotly) by layering on linear models specific to each frame and specifying `continent` as a key variable. As a result, one may interactively highlight any continent they wish, and track the relationship through the animation. In the animated version of Figure \@ref(fig:animation-ggplotly), the user highlights the Americas, which makes it much easier to see that the relationship between GDP per capita and life expectancy was very strong starting in the 1950s, but progressively weakened throughout the years.

```r
g <- highlight_key(gapminder, ~continent)
gg <- ggplot(g, aes(gdpPercap, lifeExp, 
  color = continent, frame = year)) +
  geom_point(aes(size = pop, ids = country)) +
  geom_smooth(se = FALSE, method = "lm") +
  scale_x_log10()
highlight(ggplotly(gg), "plotly_hover")
```

```{r gapminder-highlight-animation, echo = FALSE, fig.cap = "(ref:gapminder-highlight-animation)"}
include_vimeo("307789187")
```

In addition to highlighting objects within an animation, objects may also be linked between animations. Figure \@ref(fig:animation-gapminder) links two animated views: on the left-hand side is population density by country, and on the right-hand side is GDP per capita versus life expectancy. By default, all of the years are shown in black and the current year is shown in red. By pressing play to animate through the years, we can see that all three of these variables have increased (on average) fairly consistently over time. By linking the animated layers, we may condition on an interesting region of this data space to make comparisons in the overall relationship over time. 

For example, in Figure \@ref(fig:animation-gapminder), countries below the 50th percentile in terms of population density are highlighted in blue, then the animation is played again to reveal a fairly interesting difference in these groups. From 1952 to 1977, countries with a low population density seem to enjoy large increases in GDP per capita and moderate increases in life expectancy, then in the early 80s, their GPD seems to decrease while the life expectancy greatly increases. In comparison, the high-density countries seem to enjoy a more consistent and steady increase in both GDP and life expectancy. Of course, there are a handful of exceptions to the overall trend, such as the noticeable drop in life expectancy for a handful of countries during the nineties, which are mostly African countries feeling the effects of war.

The `gapminder` data does not include a measure of population density, but the `gap` dataset (included with the **plotlyBook** R package) adds a column containing the population per square kilometer (`popDen`), which helps implement Figure \@ref(fig:animation-gapminder). In order to link the animated layers (i.e., red points), we need another version of `gap` that marks the country variable as the link between the plots (`gapKey`).

\index{layout()@\texttt{layout()}!2D Axes!type@\texttt{type}}

```r
data(gap, package = "plotlyBook")

gapKey <- highlight_key(gap, ~country)

p1 <- plot_ly(gap, y = ~country, x = ~popDen, hoverinfo = "x") %>%
  add_markers(alpha = 0.1, color = I("black")) %>%
  add_markers(
    data = gapKey, 
    frame = ~year, 
    ids = ~country, 
    color = I("red")
  ) %>%
  layout(xaxis = list(type = "log"))

p2 <- plot_ly(gap, x = ~gdpPercap, y = ~lifeExp, size = ~popDen, 
              text = ~country, hoverinfo = "text") %>%
  add_markers(color = I("black"), alpha = 0.1) %>%
  add_markers(
    data = gapKey, 
    frame = ~year, 
    ids = ~country, 
    color = I("red")
  ) %>%
  layout(xaxis = list(type = "log"))

subplot(p1, p2, nrows = 1, widths = c(0.3, 0.7), titleX = TRUE) %>%
  hide_legend() %>%
  animation_opts(1000, redraw = FALSE) %>%
  layout(hovermode = "y", margin = list(l = 100)) %>%
  highlight(
    "plotly_selected", 
    color = "blue", 
    opacityDim = 1, 
    hoverinfo = "none"
  )
```

```{r animation-gapminder, echo = FALSE, fig.cap = "(ref:animation-gapminder)"}
include_vimeo("307789070")
```


<!--
IDEAS:
  * Link a grand tour to parallel coords (pedestrians examples?)
  * Demonstrate high variance in density estimation via binning (i.e., same data and different anchor points for the bins can result in very different values for the binned frequencies)
  -->