Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature: Memoize to sidecar cache file #228

Open
chrysn opened this issue Nov 20, 2024 · 1 comment
Open

Feature: Memoize to sidecar cache file #228

chrysn opened this issue Nov 20, 2024 · 1 comment

Comments

@chrysn
Copy link

chrysn commented Nov 20, 2024

What's the problem this feature will solve?

XLConnect is slow and memory hungry in some scenarios; that seems to be a known and nontrivial concern.

When the Excel files that are loaded change infrequently compared to R code changes, and data is only read and not written to Excel files, a viable solution might be the creation of sidecar memoization files.

Describe the solution you'd like

I suggest that the load function gets a new boolean parameter "memoize" (probably off-by-default). If that parameter is set, XLConnect comes up with a file name (maybe "oldfilename.xlsx.cache"). If that cache file does not exist, or is older than the file being loaded, regular XLConnect loading is performed, and the resulting data frame saveRDS()'d in the cache file, along with the list of any arguments that influence loading outcome. Otherwise, the cache file is loaded. The loading step contains a check for whether the arguments stored there match the ones in the function; if not, it falls back to regular loading and cache recreation.

Alternative Solutions

There are some variations on how the file name is best created, and what information is stored in the cache file:

  • It may be a good idea to store the XLConnect and R versions in there, so that after an upgrade, cache files are recreated. (Better err on the side of caution).
  • It may make sense to have a cache file per set of arguments (think "myfile.xlsx.readWorksheet(sheet=Tabelle).cache"), or some base64 encoding thereof; file name length limitations may force this into subfolders or through hashes. The main point is that if multiple loads are performed in a single application on a single file, the caches should not constantly invalidate each other.
  • It may make sense to store the cache files locally rather than next to the Excel file. On UNIX platforms, XDG_CACHE_HOME might be a good place.

A viable alternative is creating such a memoization function ad hoc or in some personal library; however, that adds hard-to-explain code to scripts primarily created by users who are more familiar with domain concepts than programming concepts.

Additional context

I'm unsure whether XLConnect supports loading data from Excel files that is spread out over multiple files (say, cross-file references or anything like that). If so, the cache file may need to keep a list of files that were loaded during the initial creation run, and check all their dates before using the cached file.

@chrysn chrysn changed the title Feature: Memoize to sidecar file Feature: Memoize to sidecar cache file Nov 20, 2024
@spoltier
Copy link
Member

Thanks for your request, @chrysn!

It seems generic memoization CRAN packages could potentially address your need without much specific code.

For example, this seems to fulfill your main requirements:

library(memoise)
library(XLConnect)

# Create a filesystem cache in an appropriate location
cache_dir <- file.path(
  Sys.getenv("XDG_CACHE_HOME", "~/.cache"),
  "xlconnect"
)

xl_cache <- cache_filesystem(cache_dir)

# Memoize the XLConnect loading function
memoized_loadWorkbook <- memoise(loadWorkbook, 
                                cache = xl_cache,
                                # Hash function will include file modification time
                                hash = function(x) {
                                  rlang::hash(list(x, file.mtime(x)))
                                })

See r-lib/memoise or https://www.r-bloggers.com/2020/04/caching-in-r/ for other packages

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants