Feature: Memoize to sidecar cache file #228

chrysn · 2024-11-20T10:36:07Z

What's the problem this feature will solve?

XLConnect is slow and memory hungry in some scenarios; that seems to be a known and nontrivial concern.

When the Excel files that are loaded change infrequently compared to R code changes, and data is only read and not written to Excel files, a viable solution might be the creation of sidecar memoization files.

Describe the solution you'd like

I suggest that the load function gets a new boolean parameter "memoize" (probably off-by-default). If that parameter is set, XLConnect comes up with a file name (maybe "oldfilename.xlsx.cache"). If that cache file does not exist, or is older than the file being loaded, regular XLConnect loading is performed, and the resulting data frame saveRDS()'d in the cache file, along with the list of any arguments that influence loading outcome. Otherwise, the cache file is loaded. The loading step contains a check for whether the arguments stored there match the ones in the function; if not, it falls back to regular loading and cache recreation.

Alternative Solutions

There are some variations on how the file name is best created, and what information is stored in the cache file:

It may be a good idea to store the XLConnect and R versions in there, so that after an upgrade, cache files are recreated. (Better err on the side of caution).
It may make sense to have a cache file per set of arguments (think "myfile.xlsx.readWorksheet(sheet=Tabelle).cache"), or some base64 encoding thereof; file name length limitations may force this into subfolders or through hashes. The main point is that if multiple loads are performed in a single application on a single file, the caches should not constantly invalidate each other.
It may make sense to store the cache files locally rather than next to the Excel file. On UNIX platforms, XDG_CACHE_HOME might be a good place.

A viable alternative is creating such a memoization function ad hoc or in some personal library; however, that adds hard-to-explain code to scripts primarily created by users who are more familiar with domain concepts than programming concepts.

Additional context

I'm unsure whether XLConnect supports loading data from Excel files that is spread out over multiple files (say, cross-file references or anything like that). If so, the cache file may need to keep a list of files that were loaded during the initial creation run, and check all their dates before using the cached file.

spoltier · 2024-11-22T15:56:29Z

Thanks for your request, @chrysn!

It seems generic memoization CRAN packages could potentially address your need without much specific code.

For example, this seems to fulfill your main requirements:

library(memoise)
library(XLConnect)

# Create a filesystem cache in an appropriate location
cache_dir <- file.path(
  Sys.getenv("XDG_CACHE_HOME", "~/.cache"),
  "xlconnect"
)

xl_cache <- cache_filesystem(cache_dir)

# Memoize the XLConnect loading function
memoized_loadWorkbook <- memoise(loadWorkbook, 
                                cache = xl_cache,
                                # Hash function will include file modification time
                                hash = function(x) {
                                  rlang::hash(list(x, file.mtime(x)))
                                })

See r-lib/memoise or https://www.r-bloggers.com/2020/04/caching-in-r/ for other packages

chrysn added the enhancement label Nov 20, 2024

chrysn changed the title ~~Feature: Memoize to sidecar file~~ Feature: Memoize to sidecar cache file Nov 20, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature: Memoize to sidecar cache file #228

Feature: Memoize to sidecar cache file #228

chrysn commented Nov 20, 2024

spoltier commented Nov 22, 2024

Feature: Memoize to sidecar cache file #228

Feature: Memoize to sidecar cache file #228

Comments

chrysn commented Nov 20, 2024

What's the problem this feature will solve?

Describe the solution you'd like

Alternative Solutions

Additional context

spoltier commented Nov 22, 2024