Recycling rules #13

Open
hadley opened this issue Jul 24, 2018 · 17 comments
Labels
interface 🎁 External interface of functions

Comments

@hadley
Member

hadley commented Jul 24, 2018

i.e. only recycle vectors of length 1 to length of longest.

Describe the rules for recycling vectors of length 0. @jimhester did we discuss those rules for glue?

@hadley
Member Author

hadley commented Jul 24, 2018

Base R is all over the place, so I don't think we can get much inspiration from there:

# cbind() and rbind() silently drop length-0 inputs
cbind(x = 1, y = numeric(0))
#>      x
#> [1,] 1
rbind(x = 1, y = numeric(0))
#>   [,1]
#> x    1

# data.frame() errors
data.frame(x = 1, y = numeric(0))
#> Error in data.frame(x = 1, y = numeric(0)): arguments imply differing number of rows: 1, 0

# infix operators recycle to length 0
1 + numeric(0)
#> numeric(0)
TRUE & logical()
#> logical(0)

# paste recycles to longest
paste(character(), c("x", "y"), sep = ".")
#> [1] ".x" ".y"

It seems to me like it would be safest to make recycling a length-0 vector an error.

@hadley
Member Author

hadley commented Jul 24, 2018

The advantage of silently recycling to length zero is code like this:

xs <- list(integer(0), 1L, 1:3)
lapply(xs, function(x) tibble::tibble(x = x, y = 1))

Which would need to become:

xs <- list(integer(0), 1L, 1:3)
lapply(xs, function(x) tibble::tibble(x = x, y = rlang::rep_along(x, 1)))
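
For comparison, the explicit version can be sketched in base R alone. This assumes `data.frame()` as a stand-in for `tibble::tibble()` and `rep()` in place of the helper; the constant is repeated to the length of `x`, so the empty case needs no recycling at all:

```r
xs <- list(integer(0), 1L, 1:3)

# Repeat the constant explicitly to the length of `x` (an illustrative
# base R stand-in, not the tibble implementation).
dfs <- lapply(xs, function(x) data.frame(x = x, y = rep(1, length(x))))

# Row counts follow the length of `x`, including the empty case.
vapply(dfs, nrow, integer(1))
#> [1] 0 1 3
```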

@hadley
Member Author

hadley commented Jul 25, 2018

Places where this arises:

  • tibble::tibble()
  • glue::glue()
  • purrr::map2(), purrr::pmap()

@krlmlr
Member

krlmlr commented Aug 2, 2018

Motivation for recycling to length zero: length-one columns (in the presence of columns of other length) often contain constant values which aren't important to keep if the non-length-one data is discarded.

How about providing recycling helpers?

library(tidyverse)
tibble(!!!recycle_pure(x = 1, y = character()))
#> # A tibble: 0 x 2
#> # ... with 2 variables: x <dbl>, y <chr>

tibble(!!!recycle_one_or_longest(x = 1, y = character()))
#> # A tibble: 1 x 2
#>       x y    
#>   <dbl> <chr>
#> 1     1 <NA>

tibble(!!!recycle_safe(x = 1, y = character()))
# Error

Created on 2018-08-02 by the reprex package (v0.2.0).

@lionel-
Member

lionel- commented Aug 24, 2018

I think recycling to length zero can be seen as one aspect of a typed form of "nil punning" for vectors in R. Punning is a lisp idiom that reduces the need to deal with edge cases by returning an infectious sentinel object whenever a proper return value can't be computed. The sentinel object propagates upward throughout the computation tree, until some caller substitutes a proper value or throws an error. With nil punning the type of the sentinel is always nil but with R vectors the type may vary and coercion semantics apply.

Here is an example that takes advantage of the infectiousness of empty vectors. Say we are taking an input vector of variable length, which might be empty, and we'd like the output to be 1 element longer, with various operations on the variable part of the output vector:

function(x) {
  n <- length(x) + 1
  out <- vector("list", n)

  out[[1]] <- "first"

  # Pun: `seq2()` returns an empty vector if it can't compute a normal value
  index <- rlang::seq2(2, n)

  # Pun: `[<-` is a noop rather than an error if `index` is empty.
  # In other words the RHS (which in this case is also empty but
  # could be a scalar) is recycled to length 0.
  out[index] <- lapply(x, `/`, 100)

  # Pun: `order()` returns an empty integer vector if `x` is empty
  order <- order(x)

  # Pun: `order + 1` is empty if `order` is empty.
  # This is recycling to length 0.
  order <- order + 1

  # Pun: Can safely reorder vector because c(1, empty) is a noop (except for the type)
  out <- out[c(1, order)]

  out
}

@hadley
Member Author

hadley commented Aug 27, 2018

This should probably be a component of a bigger discussion about vectorisation: how to do it, and how to document it.

@DavisVaughan
Member

DavisVaughan commented Jun 11, 2019

This comment holds what I think are the correct rules:

tidyverse/tibble#435 (comment)

(The additional comments seem to imply that there is agreement here, but I wanted to write them down)

This exactly matches the broadcasting rules in rray, which were derived from xtensor/NumPy. Essentially when comparing two objects the recycling rules are:

  • If the lengths of both are equivalent, do nothing
  • If the length of one object is 1, recycle that to the length of the other object
  • Otherwise, error

I am a big fan of the fact that zero-length objects are not special-cased here. The rules have the following implications:

  • Common length of 2 and 2 is 2
  • Common length of 2 and 1 is 2
  • Common length of 2 and 0 is an error
  • Common length of 0 and 1 is 0

@lionel-
Member

lionel- commented Jun 17, 2019

The current thinking about tidyverse rules is that the common size of n and 0 is 0, not an error. The xtensor broadcasting rules are interesting. However making 0 a normal case rather than a special case doesn't seem sufficient justification for changing the rules. These rules need to be justified in terms of the consequences for users and programmers. There are two goals to balance: maximising the composability of vectorised functions, and producing intuitive behaviour for users.

As argued above, n -> 0 recycling might be viewed as "empty punning", similar to nil punning in lisp. We use an empty vector to represent an absence of correct value with type information, and turn the current computation into a no-op instead of throwing an error. The goal of punning is to reduce the number of edge cases that a programmer has to deal with. It can lead to surprising behaviour, but lisp shows that if done well it is a net gain for the practice of programming. However, for vector manipulations in R, empty punning might be more surprising than helpful.

One clear example of helpful empty punning is rlang::seq2(), which returns an empty vector when no increasing sequence can be computed. This composes well with looping over vector components because it reduces an indexing operation to a no-op when the inputs are outside the boundary conditions. This behaviour is both useful and intuitive. However, while this is a case of empty punning, this is not an instance of recycling.

Can we come up with valid uses of empty punning for data manipulation? One interesting recycling case in vctrs is vec_slice<- (and [<- in base R) where the RHS is recycled with the index. 1 -> 0 definitely helps because it makes sense to be size-agnostic when using vectors of length 1:

x[my_index] <- 1

If my_index is empty, we ignore the RHS. If it isn't, we recycle it to full length. This use case is consistent with Kirill's and Hadley's examples above where the motivation of 1 -> 0 recycling is combining vectors of arbitrary lengths with optional constants. What about n -> 0 recycling? It doesn't seem to make sense here:

x[my_index] <- 1:3

Why would we expect my_index to be either size 0 or exactly size 3? This seems like a strange combination of assumptions to make. And we don't need to make such an assumption to use empty punning effectively:

# In case of empty punning, RHS is empty and there is no need for `n -> 0` recycling:
idx <- seq2(start, end)
x[idx] <- y[idx]

Similarly, I don't see what the practical purpose of n -> 0 recycling would be in these cases:

1:3 + int()
#> integer(0)

purrr::map2(1:5, int(), ~ list(.x, .y))
#> list()

Overall I'd tend to agree that the broadcasting rules are the most obviously intuitive, and that they might help to catch data manipulation errors early. I'm wary that we might prohibit valid composition idioms that we have not considered here, but I couldn't come up with any clear idiom or pattern. Chances are that such idioms, if they exist, would be obscure and hard to read.

@krlmlr
Member

krlmlr commented Jun 17, 2019

We currently have:

glue::glue("{1:3}, {1} and {integer()} are recyclable")
tibble::tibble(1:3, 1, integer())
#> Tibble columns must have consistent lengths, only values of length one are recycled:
#> * Length 0: Column `integer()`
#> * Length 3: Column `1:3`

Created on 2019-06-17 by the reprex package (v0.3.0)

I'd argue both behaviors are surprising in different ways.

  • The missing message in glue() might just go unnoticed
  • Users might have expectations about behavior of zero-length vectors

What's the safer behavior?

Throwing an error is more conservative, and could be relaxed to recycling towards zero later if really necessary. It's also a simpler rule compared to adding 0 as a special case.

For subset assignment, the length of the vectors must match, or the RHS must have length one. x[integer()] <- 1:3 should throw an error.
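
As a rough illustration of that rule, here is a base R sketch assuming a hypothetical wrapper `assign_strict()` that takes integer positions (base `[<-` itself is more permissive, and this is not how vctrs implements `vec_slice<-`):

```r
# Hypothetical helper: strict subset assignment. The RHS must have the
# same length as the index, or length 1.
assign_strict <- function(x, index, value) {
  n <- length(index)
  if (length(value) != n && length(value) != 1L) {
    stop("Can't recycle `value` of length ", length(value),
         " to the index length ", n, ".")
  }
  # A length-1 value is recycled to the index length, which may be 0.
  x[index] <- rep_len(value, n)
  x
}

x <- c(10, 20, 30)
assign_strict(x, integer(), 1)  # empty index: `x` is returned unchanged
#> [1] 10 20 30
assign_strict(x, 2:3, 0)        # length-1 RHS recycled to length 2
#> [1] 10  0  0
# assign_strict(x, integer(), 1:3) signals an error
```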

@DavisVaughan
Member

DavisVaughan commented Jun 17, 2019

I completely agree with @krlmlr here

(Update: I now see that @lionel-'s example is in agreement with @krlmlr's comment.)

@lionel-
Member

lionel- commented Jun 17, 2019

FTR Kirill's post and mine are in agreement.

@hadley
Member Author

hadley commented Jun 18, 2019

To be clear, you are all arguing that there should be no common size for integer(2) and integer(0), right? (And that the common size of integer(1) and integer(0) is zero, because vectors of length 1 can be recycled to any length)

@lionel-
Member

lionel- commented Jun 18, 2019

Right, instead of there being two rules (0 size swallows all sizes, 1 is recycled to longest), a single rule might be better (1 is recycled to any other size), unless we find good patterns for the zero-eats-all rule.

I think at least the glue package is depending on full recycling to 0, cc @jimhester.

@jimhester

Makes sense to me

@lionel-
Member

lionel- commented Jul 5, 2019

FYI we have started using the new rules in vctrs 0.2.0. Length-zero vectors can only be combined with length-zero and length-one vectors.

@jennybc
Member

jennybc commented Jul 5, 2019

Didn't @DavisVaughan state the vctrs recycling rules really crisply in some thread, maybe in a nice-looking table? That would be nice to copy/paste here, if true.

@DavisVaughan
Member

DavisVaughan commented Jul 6, 2019

This is the image used in ?vec_recycle now. Keep in mind that m = 0 is valid here and the rules still apply; 1 is the only special case.

[image: recycling rules diagram from ?vec_recycle]

6 participants