count() and add_count() could be much faster #6806

Open
DavisVaughan opened this issue Mar 23, 2023 · 1 comment

DavisVaughan commented Mar 23, 2023

Right now, both functions eventually call summarise(n = n()) or mutate(n = n()), which can be very slow with many groups. We already have vec_count(), which should be much faster than count() in that case. For add_count(), we could either add a vctrs primitive that works like a windowed count, or build on vec_count()'s result with an additional call to vec_match().
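
To make the second option concrete, here is a minimal sketch of an add_count()-style result built from vec_count() plus vec_match(). The data frame `df` and its columns are made up for illustration, and this is only one way the idea might look, not the proposed implementation.

```r
library(vctrs)

# Hypothetical toy data: `g` is the grouping column.
df <- data.frame(g = c("a", "b", "a", "c", "a", "b"), x = 1:6)

# Keep the keys as a data frame so the same code works for several columns.
keys <- df["g"]

# One row per unique key, with its count.
counts <- vec_count(keys, sort = "none")

# For every original row, find the position of its key in the count table,
# then look up the count there.
loc <- vec_match(keys, counts$key)
df$n <- counts$count[loc]

df
# n is 3, 2, 3, 1, 3, 2
```

The result has one count per input row, in the original row order, which is what mutate(n = n()) would give per group.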

We'd have to think through how weighted counts would work. Maybe vec_count() needs support for a weight argument (a double vector).
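
Since vec_count() has no weight argument today, the sketch below only illustrates the semantics a weighted count would need, using vec_group_id() and base R's rowsum(). The helper name `weighted_count()` and the toy data are hypothetical.

```r
library(vctrs)

# Hypothetical helper: sum the weights `wt` within each unique key.
weighted_count <- function(keys, wt) {
  # Integer group identifier for each row of `keys`.
  id <- as.integer(vec_group_id(keys))

  # Sum weights per group; reorder = FALSE keeps groups in order of
  # first appearance, matching the order returned by vec_unique().
  sums <- rowsum(wt, id, reorder = FALSE)

  out <- vec_unique(keys)
  out$n <- as.vector(sums)
  out
}

keys <- data.frame(g = c("a", "b", "a", "c"))
res <- weighted_count(keys, wt = c(1, 2, 3, 4))
res
# g = a, b, c with n = 4, 2, 4
```

A native weight argument could avoid the separate grouping pass, but the expected output would be the same.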

Motivation is something like this, and flights isn't even that big. Roughly 55k groups here.

library(dplyr)
library(nycflights13)

bench::mark(
  count(flights, dep_time, dep_delay),
  vctrs::vec_count(flights[c("dep_time", "dep_delay")]),
  check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#>   expression                                                 min  median itr/s…¹
#>   <bch:expr>                                            <bch:tm> <bch:t>   <dbl>
#> 1 count(flights, dep_time, dep_delay)                    419.6ms 441.4ms    2.27
#> 2 vctrs::vec_count(flights[c("dep_time", "dep_delay")])   17.3ms  21.5ms   42.7 
#> # … with 2 more variables: mem_alloc <bch:byt>, `gc/sec` <dbl>, and abbreviated
#> #   variable name ¹​`itr/sec`

We'd also need to handle the fact that ... and wt are data-masking, probably with add_computed_columns(), the way distinct() does.
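
As a rough illustration of that last point, the data-masked expressions can be evaluated into real columns first and the result handed to vec_count(). The sketch below uses transmute() as a stand-in for the internal add_computed_columns() helper; `count_sketch()` is a hypothetical name, and it assumes ungrouped input and ignores wt.

```r
library(dplyr)
library(vctrs)

# Hypothetical helper, assuming `data` is ungrouped.
count_sketch <- function(data, ...) {
  # Evaluate the data-masked expressions in `...` into concrete key columns.
  keys <- transmute(data, ...)

  # Count the unique key rows, sorted by key.
  out <- vec_count(keys, sort = "key")

  res <- out$key
  res$n <- out$count
  res
}

df <- tibble(x = 1:6)
res <- count_sketch(df, odd = x %% 2 == 1)
res
# odd = FALSE, TRUE with n = 3, 3
```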

krlmlr commented Dec 18, 2024

Is this solved by duckplyr? 😁
