refactor puzzling loop in attributes.R #1327

maelle · 2024-03-26T14:03:51Z

Lines 557 to 564 in 36ad3c9

    
             for (i in seq_along(value)) { 
        
               tmp <- value[[i]] 
        
               length(tmp) <- 0 
        
               length(tmp) <- length(vs) 
        
               tmp[index] <- value[[i]] 
        
               value[[i]] <- tmp 
        
             } 
        
           }

maelle · 2024-03-26T14:04:10Z

L559 can be deleted, the rest is more difficult.

maelle · 2024-03-26T14:09:17Z

value is a named list.

the code above is for the case where we don't have as many values as vertices. we create a list as long as the number of vertices. then the indices are used to fill the list elements. the other elements remain NULL.

surely there's an elegant way to do this?!

maelle · 2024-03-26T14:09:22Z

elegant and readable

maelle · 2024-03-29T11:13:52Z

first fill with NA, then make NA NULL? 🤔 no, they are NA in the current version.

maelle · 2024-03-29T14:25:56Z

a good thing is that the lines are covered by tests

maelle · 2024-03-29T14:26:43Z

first make the loop easier then use purrr::reduce?

maelle · 2024-03-29T14:41:40Z

      tmp <- rep(NA, length(vs))
      tmp[index] <- value[[i]]
      value[[i]] <- tmp

maelle · 2024-03-29T14:42:33Z

oooh for value a vector, we get NA as fillers but for value a named list, we get NULL as fillers.

clpippel · 2024-04-02T19:31:03Z

My attempt to isolate the essence of the coding:

vs           <- c(1,2,3,4)
index        <- c(1,4)
value        <- list(list(a= 10, d = 14))
names(value) <- "color"

i            <- 1
tmp          <- value[[i]]         # Copy structure of value.
length(tmp)  <- 0                  # Set list to empty, while removing names.
length(tmp)  <- length(vs)         # Extend to all vertices with NULL.
tmp[index]   <- value[[i]]         # Copy indexed elements,
value[[1]]   <- tmp                # into value.
names(value[[i]])

The effect of length(tmp) <- 0 is to remove the names initially present in value.
When leaving out the statement the names in value in this example are: [1] "a" "d" "" "" .

In this example using the old logic the expression
>value[[1]]["a"] will give

$<NA>
NULL

thereby blocking indexing by name. In the new logic the result will be:

$a
[1] 10

Removing the names could be a matter of defensive coding.
All the code need is some explanation.

clpippel · 2024-04-03T11:04:13Z

Create an empty named list.

L558, L559 are equivalent to:
tmp <- setNames(list(), character(0)) # Easy to read.
or alternatively
tmp <- list(dummy = 1)[NULL] # Faster.
or
tmp <- value[[i]][NULL] # As current coding.

To replace lines L558, L559, L560 by a single line:
tmp <- 'length<-'(list(dummy = 1)[NULL], length(vs))

maelle · 2024-04-04T09:52:43Z

thanks! @krlmlr helped me shorten the code, I'll soon finish up the related PR.

clpippel · 2024-04-07T08:48:19Z

My thoughts.
The proposed solution is indeed concise. It uses the match() function, which can be costly if the graph has many vertices.
A similar solution can be achieved in base R by lapply() instead of map().
A base R solution using indexing by a vector is more verbose, but faster.

# -------------------------------
library(purrr)
library(microbenchmark)
n <- 1E5; m <- 1E2;
vs                <- seq(n)
index             <- seq(m)
names(index)      <- index
value             <- list(rev = as.list(rev(index)), nchar = as.list(nchar(index)) )

microbenchmark(
"This"  = { value_nw1 <- value
           for (i in seq_along(value)) { 
             tmp            <- value[[i]] 
             length(tmp)    <- 0 
             length(tmp)    <- length(vs) 
             tmp[index]     <- value[[i]] 
             value_nw1[[i]] <- tmp 
           } 
          },
"Purrr"  = { value_nw2 <- purrr::map(value, ~.x[match(seq_along(vs), index)]) },
"Rmatch" = { value_nw3 <- lapply(value, function(x) { x[match(seq_along(vs), index)]} ) },
"Rbase"  = { value_nw4 <-
  lapply(
    value,
    function(x)
      { tmp <- x[NULL]; length(tmp) <- length(vs); tmp[index] <- x; tmp}
  )
},
times= 100,
unit="relative"
)
# Unit: relative
#    expr      min       lq     mean   median       uq       max neval
#    This 9.187689 6.649874 4.097758 6.472774 5.603021 1.0953775   100
#   Purrr 4.552082 3.618245 2.618367 3.620971 3.551356 0.9894297   100
#  Rmatch 3.655005 3.473635 2.954367 3.458746 3.226840 1.0147469   100
#   Rbase 1.000000 1.000000 1.000000 1.000000 1.000000 1.0000000   100
identical(value_nw2, value_nw3)
# [1] TRUE
identical(value_nw1, value_nw4)
# [1] TRUE

maelle · 2024-04-09T12:35:37Z

note to self: use rlang's "purrr" standalone file

krlmlr · 2024-04-09T12:56:51Z

Thanks, good catch. Using match() indeed causes an overhead. Let's go with lapply() (or map() from https://github.com/r-lib/rlang/blob/main/R/standalone-purrr.R) plus the Rbase variant plus comments why we chose this solution: don't want to slow down code that is used for many operations.

For continuous benchmarking, https://github.com/lorenzwalthert/touchstone does a decent job, but we'd need a good set of fast test cases -- can be small initially and grow over time.

krlmlr · 2024-04-09T12:57:53Z

For reference, my own analysis, with two other variants.

n <- 1E5
m <- 1E2
vs <- seq(n)
index <- seq(m)
names(index) <- index
value <- list(rev = as.list(rev(index)), nchar = as.list(nchar(index)))

set_value_at <- function(value, idx, length_out) {
  out <- value[NULL]
  length(out) <- length_out
  out[idx] <- value
  unname(out)
}

bench::mark(
  "This" = {
    value_nw1 <- value
    for (i in seq_along(value)) {
      tmp <- value[[i]]
      length(tmp) <- 0
      length(tmp) <- length(vs)
      tmp[index] <- value[[i]]
      value_nw1[[i]] <- tmp
    }
    lapply(value_nw1, unname)
  },
  "Rmatch" = {
    value_nw2 <- lapply(value, function(x) {
      unname(x[match(seq_along(vs), index)])
    })
  },
  "vec_match" = {
    value_nw2 <- lapply(value, function(x) {
      unname(x[vctrs::vec_match(seq_along(vs), index)])
    })
  },
  "value_at" = {
    value_nw3 <- lapply(value, function(x) {
      set_value_at(x, index, length(vs))
    })
  },
  "Rbase" = {
    value_nw4 <- lapply(
      value,
      function(x) {
        tmp <- x[NULL]
        length(tmp) <- length(vs)
        tmp[index] <- x
        tmp
      }
    )
    lapply(value_nw4, unname)
  }
)
#> # A tibble: 5 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 This         2.79ms   3.01ms      316.    8.79MB     68.1
#> 2 Rmatch       1.48ms   1.76ms      543.    4.42MB    125. 
#> 3 vec_match    2.36ms   2.62ms      321.    4.99MB     72.9
#> 4 value_at   733.41µs   1.12ms      681.    4.61MB    176. 
#> 5 Rbase         774µs   1.15ms      759.    4.59MB    206.

^{Created on 2024-04-09 with reprex v2.1.0}

krlmlr · 2024-04-09T13:01:00Z

Looking at the reprex results, the self-contained function doesn't seem to be that bad. With a larger input:

n <- 1E6
m <- 1E3
vs <- seq(n)
index <- seq(m)
names(index) <- index
value <- list(rev = as.list(rev(index)), nchar = as.list(nchar(index)))

set_value_at <- function(value, idx, length_out) {
  out <- value[NULL]
  length(out) <- length_out
  out[idx] <- value
  unname(out)
}

bench::mark(
  "This" = {
    value_nw1 <- value
    for (i in seq_along(value)) {
      tmp <- value[[i]]
      length(tmp) <- 0
      length(tmp) <- length(vs)
      tmp[index] <- value[[i]]
      value_nw1[[i]] <- tmp
    }
    lapply(value_nw1, unname)
  },
  "Rmatch" = {
    value_nw2 <- lapply(value, function(x) {
      unname(x[match(seq_along(vs), index)])
    })
  },
  "vec_match" = {
    value_nw2 <- lapply(value, function(x) {
      unname(x[vctrs::vec_match(seq_along(vs), index)])
    })
  },
  "value_at" = {
    value_nw3 <- lapply(value, function(x) {
      set_value_at(x, index, length(vs))
    })
  },
  "Rbase" = {
    value_nw4 <- lapply(
      value,
      function(x) {
        tmp <- x[NULL]
        length(tmp) <- length(vs)
        tmp[index] <- x
        tmp
      }
    )
    lapply(value_nw4, unname)
  }
)
#> # A tibble: 5 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 This        17.47ms   18.5ms      51.2      50MB     79.6
#> 2 Rmatch      21.03ms   22.5ms      44.3    42.2MB     44.3
#> 3 vec_match   29.86ms   33.8ms      29.9    49.6MB     39.8
#> 4 value_at     7.85ms    9.5ms     104.     45.8MB    173. 
#> 5 Rbase        9.34ms   11.5ms      89.6    45.8MB    122.

^{Created on 2024-04-09 with reprex v2.1.0}

clpippel · 2024-04-09T17:29:37Z

If no named list is needed, set_value_at() can be simplified to

set_value_at2 <- function(value, idx, length_out) {
  out      <- vector(mode='list', length = length_out)
  out[idx] <- value
  out
}

clpippel · 2024-04-09T19:22:02Z

Another idea: Using magrittr s pipe %>%, assuming dependency is not a drawback.
Update: not faster.

#-------------------------------------------------------------------------------
library(purrr)
library(bench)
 
n <- 1E6
m <- 1E3
vs <- seq(n)
index <- seq(m)
names(index) <- index
value <- list(rev = as.list(rev(index)), nchar = as.list(nchar(index)))

set_value_at <- function(value, idx, length_out) {
  out <- value[NULL]
  length(out) <- length_out
  out[idx] <- value
  unname(out)
}

bench::mark(
  "This" = {
    value_nw1 <- value
    for (i in seq_along(value)) {
      tmp <- value[[i]]
      length(tmp) <- 0
      length(tmp) <- length(vs)
      tmp[index] <- value[[i]]
      value_nw1[[i]] <- tmp
    }
    lapply(value_nw1, unname)
  },
  "Rmatch" = {
    value_nw2a <- lapply(value, function(x) {
      unname(x[match(seq_along(vs), index)])
    })
  },
  "vec_match" = {
    value_nw2b <- lapply(value, function(x) {
      unname(x[vctrs::vec_match(seq_along(vs), index)])
    })
  },
  "value_at" = {
    value_nw3a <- lapply(value, function(x) {
      set_value_at(x, index, length(vs))
    })
  },
  "pipe_at" = {
    value_nw3b <- lapply(value, function(x) { 
      vector(mode='list', length = length(vs)) %>% {.[index] <- x; .}
      # set_value_at(x, index, length(vs))
    })
  },
  "Rbase" = {
    value_nw4 <- lapply(
      value,
      function(x) {
        tmp <- x[NULL]
        length(tmp) <- length(vs)
        tmp[index] <- x
        tmp
      }
    )
    lapply(value_nw4, unname)
  }
)
# n = 1E6, m = 1E3
# A tibble: 6 × 13
#   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result           memory              time             gc                
#   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>           <list>              <list>           <list>            
# 1 This         23.9ms  24.37ms      40.1    45.8MB     45.8     7     8      175ms <named list [2]> <Rprofmem [91 × 3]> <bench_tm [15]>  <tibble [15 × 3]> 
# 2 Rmatch       22.6ms  22.75ms      43.8      42MB     73.0     6    10      137ms <named list [2]> <Rprofmem [49 × 3]> <bench_tm [16]>  <tibble [16 × 3]> 
# 3 vec_match    44.5ms  45.06ms      22.1    49.6MB     14.7     6     4      271ms <named list [2]> <Rprofmem [48 × 3]> <bench_tm [10]>  <tibble [10 × 3]> 
# 4 value_at     13.3ms   13.6ms      70.7    45.8MB     70.7    12    12      170ms <named list [2]> <Rprofmem [76 × 3]> <bench_tm [24]>  <tibble [24 × 3]> 
# 5 pipe_at       1.8ms   1.93ms     502.     15.3MB     82.9   115    19      229ms <named list [2]> <Rprofmem [42 × 3]> <bench_tm [134]> <tibble [134 × 3]>
# 6 Rbase          13ms   13.5ms      73.8    45.8MB     56.8    13    10      176ms <named list [2]> <Rprofmem [51 × 3]> <bench_tm [23]>  <tibble [23 × 3]>

clpippel · 2024-04-10T09:30:37Z

Magrittr s pipe %>% is not faster as I thought before.
Using no names seems to be the difference.

library(purrr)
library(bench)
 
n <- 1E6; m <- 1E3
vs <- seq(n)
index <- seq(m)
names(index) <- index
value <- list(rev = as.list(rev(index)), nchar = as.list(nchar(index)))

set_value_at <- function(value, idx, length_out) {
  out <- value[NULL]
  length(out) <- length_out
  out[idx] <- value
  unname(out)
}

set_value_at2 <- function(value, idx, length_out) {
  out      <- vector(mode='list', length = length_out)
  out[idx] <- value
  out
}

bench::mark(
  "at_unname()" = {
    value_nw3a <- lapply(value, function(x) {
      set_value_at(x, index, length(vs))
    })
  },
  "at_noname()" = {
    value_nw3a <- lapply(value, function(x) {
      set_value_at2(x, index, length(vs))
    })
  },
  "at_noname" = {
    value_nw3a <- lapply(value, function(x) {
      tmp <- vector(mode='list', length = length(vs)); tmp[index] <- x; tmp
    })
  },
  "at_pipe" = {
    value_nw3b <- lapply(value, function(x) { 
      vector(mode='list', length = length(vs)) %>% {.[index] <- x; .}
    })
  }
)

# A tibble: 4 × 13
#   expression     min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
#   <bch:expr> <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
# 1 at_unname… 13.31ms  13.7ms      73.3    45.8MB     178.     7    17     95.5ms
# 2 at_noname…  1.79ms  2.68ms     393.     15.3MB     174.    68    30    172.9ms
# 3 at_noname   1.79ms  2.98ms     365.     15.3MB     176.    60    29    164.6ms
# 4 at_pipe     1.81ms  2.87ms     396.     15.3MB     219.    56    31    141.5ms
# ℹ 4 more variables: result <list>, memory <list>, time <list>, gc <list>

maelle · 2024-05-07T09:58:23Z

see current state of #1330

maelle added the upkeep maintenance, infrastructure, and similar label Mar 26, 2024

maelle closed this as not planned Won't fix, can't repro, duplicate, stale Mar 29, 2024

maelle reopened this Mar 29, 2024

This was referenced Mar 29, 2024

refactor: use 'better' logic for updating vertex attributes #1330

Merged

refactor other loop in attributes.R #1332

Open

maelle added the hackathon-topic 📆 label Apr 9, 2024

maelle mentioned this issue Apr 9, 2024

How do we check changes don't affect performance #1338

Open

krlmlr mentioned this issue Apr 9, 2024

Investigate vec_match() performance for integers r-lib/vctrs#1927

Open

krlmlr removed the hackathon-topic 📆 label Apr 9, 2024

krlmlr assigned maelle Apr 9, 2024

aviator-app bot closed this as completed in #1330 Jul 4, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor puzzling loop in attributes.R #1327

refactor puzzling loop in attributes.R #1327

maelle commented Mar 26, 2024

maelle commented Mar 26, 2024

maelle commented Mar 26, 2024

maelle commented Mar 26, 2024

maelle commented Mar 29, 2024 •

edited

Loading

maelle commented Mar 29, 2024

maelle commented Mar 29, 2024

maelle commented Mar 29, 2024

maelle commented Mar 29, 2024

clpippel commented Apr 2, 2024 •

edited

Loading

clpippel commented Apr 3, 2024 •

edited

Loading

maelle commented Apr 4, 2024

clpippel commented Apr 7, 2024

maelle commented Apr 9, 2024

krlmlr commented Apr 9, 2024

krlmlr commented Apr 9, 2024

krlmlr commented Apr 9, 2024

clpippel commented Apr 9, 2024 •

edited

Loading

clpippel commented Apr 9, 2024 •

edited

Loading

clpippel commented Apr 10, 2024

maelle commented May 7, 2024

refactor puzzling loop in attributes.R #1327

refactor puzzling loop in attributes.R #1327

Comments

maelle commented Mar 26, 2024

maelle commented Mar 26, 2024

maelle commented Mar 26, 2024

maelle commented Mar 26, 2024

maelle commented Mar 29, 2024 • edited Loading

maelle commented Mar 29, 2024

maelle commented Mar 29, 2024

maelle commented Mar 29, 2024

maelle commented Mar 29, 2024

clpippel commented Apr 2, 2024 • edited Loading

clpippel commented Apr 3, 2024 • edited Loading

maelle commented Apr 4, 2024

clpippel commented Apr 7, 2024

maelle commented Apr 9, 2024

krlmlr commented Apr 9, 2024

krlmlr commented Apr 9, 2024

krlmlr commented Apr 9, 2024

clpippel commented Apr 9, 2024 • edited Loading

clpippel commented Apr 9, 2024 • edited Loading

clpippel commented Apr 10, 2024

maelle commented May 7, 2024

maelle commented Mar 29, 2024 •

edited

Loading

clpippel commented Apr 2, 2024 •

edited

Loading

clpippel commented Apr 3, 2024 •

edited

Loading

clpippel commented Apr 9, 2024 •

edited

Loading

clpippel commented Apr 9, 2024 •

edited

Loading