Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow parallelization of getSheetNames #262

Closed
sgrote opened this issue Sep 27, 2021 · 2 comments
Closed

Allow parallelization of getSheetNames #262

sgrote opened this issue Sep 27, 2021 · 2 comments

Comments

@sgrote
Copy link

sgrote commented Sep 27, 2021

Is your feature request related to a problem? Please describe.
It does not seem to be possible to run openxlsx::getSheetNames in parallel. Here's my code:

getSheetNames <- function(excel_files, parallel = FALSE, BPPARAM = BiocParallel::bpparam()) {
  if (parallel) {
    sheet_list <- BiocParallel::bplapply(excel_files,
      FUN = function(excel_file) {
        if (!file.exists(excel_file)) {
          return(NULL)
        }
        openxlsx::getSheetNames(excel_file)
      },
      BPPARAM = BPPARAM
    )
  } else {
    sheet_list <- lapply(excel_files,
      FUN = function(excel_file) {
        if (!file.exists(excel_file)) {
          warning("File '", excel_file, "' could not be found.")
          return(NULL)
        }
        openxlsx::getSheetNames(excel_file)
      }
    )
  }
  names(sheet_list) <- excel_files
  sheet_list
}

Now, getSheetNames(excel_files, parallel = TRUE) sometimes throws an error

 Error: BiocParallel errors
  element index: 1
  first error: cannot open file '/tmp/Rtmpxttndw/_excelXMLRead/[Content_Types].xml': No such file or directory 

or

Error: BiocParallel errors
  element index: 1
  first error: cannot open the connection 

Describe the solution you'd like
My guess would be that each of the parallel tasks writes the same temporary file.
Maybe using a directory or a file based on tempfile() would be a solution.

Describe alternatives you've considered
Parallelization does work with readxl::excel_sheets, but that is much slower than openxlsx::getSheetNames.
Another alternative is to not run it in parallel, but with a growing number of excel files this will be more time consuming.

@jmbarbone
Copy link
Contributor

I'm not familiar with BiocParallel but I don't have an issue trying to read with the parallel package and the current dev version of openxlsx

library(openxlsx)
library(parallel)
packageVersion("openxlsx")
#> [1] '4.2.4.9000'

files <- sapply(1:10, function(x) tempfile(fileext = ".xlsx"))
invisible(sapply(files, function(x) {
  write.xlsx(list(a = data.frame(1)), x)
}))

cl <- makeCluster(2)
res <- parSapply(cl, files, getSheetNames)
stopCluster(cl)

names(res) <- basename(names(res))
res
#> filed028503a5e45.xlsx filed0286ced491a.xlsx filed0285d1e260b.xlsx 
#>                   "a"                   "a"                   "a" 
#> filed0283073600e.xlsx filed02835787e93.xlsx filed0281e6a4f69.xlsx 
#>                   "a"                   "a"                   "a" 
#> filed028396510da.xlsx filed0283efb25a2.xlsx filed028127869b2.xlsx 
#>                   "a"                   "a"                   "a" 
#> filed028696535d4.xlsx 
#>                   "a"
all(file.remove(files))
#> [1] TRUE

Created on 2021-09-27 by the reprex package (v2.0.1)

There is a line in getSheetNames() that creates a temporary directory:

openxlsx/R/wrappers.R

Lines 3401 to 3404 in 0cc668b

xmlDir <- file.path(tempdir(), "_excelXMLRead")
xmlFiles <- unzip(file, exdir = xmlDir)
on.exit(unlink(xmlDir, recursive = TRUE), add = TRUE)

Changing this could work, but I haven't been able to get the same issues.

@sgrote
Copy link
Author

sgrote commented Sep 29, 2021

Hi Jordan,
thanks for the quick reply and fix! Your example seemed to work because makeCluster uses the default type = "PSOCK".
Using a fork cluster instead, leads to the same issues as in my example above:

library(openxlsx)
library(parallel)

files <- sapply(1:10, function(x) tempfile(fileext = ".xlsx"))
invisible(sapply(files, function(x) {
  write.xlsx(list(a = data.frame(1)), x)
}))

cl <- makeCluster(4, type = "FORK")
res <- parSapply(cl, files, getSheetNames)
# Error in checkForRemoteErrors(val) : 
#  3 nodes produced errors; first error: cannot open file '/tmp/RtmpjScHsw/_excelXMLRead/_rels/.rels': No such file or directory

stopCluster(cl)

The issue of using the same tempdir() in forked processes is also discussed e.g. here.

This demonstrates that only the first version uses distinct tempdirs in parallel processes:

# parallel::makeCluster PSOCK
cl <- parallel::makeCluster(2, type = "PSOCK")
res <- parallel::parSapply(cl, 1:4, tempdir)
parallel::stopCluster(cl)
res
# [1] "/tmp/RtmpTMgTqX" "/tmp/RtmpTMgTqX" "/tmp/RtmpF6A2EO" "/tmp/RtmpF6A2EO"

# parallel::makeCluster FORK
cl <- parallel::makeCluster(2, type = "FORK")
res <- parallel::parSapply(cl, 1:4, tempdir)
parallel::stopCluster(cl)
res
# [1] "/tmp/RtmpoEql3Z" "/tmp/RtmpoEql3Z" "/tmp/RtmpoEql3Z" "/tmp/RtmpoEql3Z"

# parallel::mclapply
res <- parallel::mclapply(
 1:4, tempdir
)
unlist(res)
# [1] "/tmp/RtmpoEql3Z" "/tmp/RtmpoEql3Z" "/tmp/RtmpoEql3Z" "/tmp/RtmpoEql3Z"

# BiocParallel
BiocParallel::register(BiocParallel::MulticoreParam(2))
res <- BiocParallel::bplapply(
 1:4, tempdir
)
unlist(res)
# [1] "/tmp/RtmpoEql3Z" "/tmp/RtmpoEql3Z" "/tmp/RtmpoEql3Z" "/tmp/RtmpoEql3Z"

So while using parallel::makeCluster(..., type = "PSOCK") clearly can be used for parallelization of getSheetNames,
I think it would still be great to use tempfile() instead of tempdir() to also allow parallelization with other commonly used approaches like parallel::mclapply or BiocParallel::bplapply.
Also, type = "PSOCK" seems to be slower than type = "FORK", at least in my test-case.

JanMarvin added a commit that referenced this issue Sep 30, 2021
changes in temp directory related to #262
netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this issue Jan 4, 2025
# openxlsx 4.2.7.1

* It's now possible to insert a hyperlinked image by passing a URL,
  relative or absolute file path, or mailto string to the new
  `address` parameter of `insertImage()`.

# openxlsx 4.2.7

* Fixed warning on `dataValidation(..., type = "list")`
  ([#342](ycphs/openxlsx#342))

* Added optional argument to `loadWorkbook` to decide if empty/blank
  cells should be converted to NA_character_ (the default) or left
  blank as is

* `saveWorkbook()` now succeeds when called after the user has set
  column widths for a range of columns (e.g. 1:2), saved the workbook,
  then set column widths for a new range that is inclusive of the
  previous one (e.g. 1:5)
  ([#493](ycphs/openxlsx#493)).

## Improvements

* Improve detectDates
  ([#288](ycphs/openxlsx#288))

* Preserve window size and position, also `getWindowSize()` and
  `setWindowSize()`
  ([466](ycphs/openxlsx#466))

# openxlsx 4.2.6

* Fix external links
  ([#410](ycphs/openxlsx#410))

* Do not add unneccessary sheetPr node
  ([#409](ycphs/openxlsx#409))

* Add support for `namedRegion`s having dots and other special
  characters ([#338](ycphs/openxlsx#338)).

* Add type blanks and not blanks to conditional formatting
  ([#311](ycphs/openxlsx#311))

# openxlsx 4.2.5

## Fixes

* `openxlsx_setOp()` now works with named list
  ([#215](ycphs/openxlsx#215))

* `loadWorkbook()` imports `inlineStr`. Values remain `inlineStr` when
  writing the workbook with `saveWorkbook()`. Similar `read.xlsx` and
  `readWorkbook` import `inlineStr`.

* `read.xlsx()` no longer changes random seed
  ([#183](ycphs/openxlsx#183))

* fixed a regression that caused fonts to be read in incorrectly
  ([#207](ycphs/openxlsx#207))

* add option to save as read only recommended
  ([#201](ycphs/openxlsx#201))

* fixed writing hyperlink formulas
  ([#200](ycphs/openxlsx#200))

* `write.xlsx()` now throws an error if it doesn't have write
  permissions ([#190](ycphs/openxlsx#190))

* `write.xlsx()` now again uses the default of `overwrite = TRUE` for
  saving files ([#249](ycphs/openxlsx#249))

* `as.character.formula()` exported to warn about potential conflicts
  with other packages
  ([#312](ycphs/openxlsx#312),
  [#315](ycphs/openxlsx#315))

## Improvements

* `options()` are more consistently set in functions (see:
  [#289](ycphs/openxlsx#262))

* `Workbook$show()` no longer fails when called in a 0 sheet
  workbook([#240](ycphs/openxlsx#240))

* `read.xlsx()` again accepts `.xlsm` files
([#205](ycphs/openxlsx#205),
[#209](ycphs/openxlsx#209))

* `makeHyperlinkString()` does no longer require a sheet argument
  ([#57](ycphs/openxlsx#57),
  [#58](ycphs/openxlsx#58))

* improvements in how `openxlsx` creates temporary directories (see
  [#262](ycphs/openxlsx#262))

* `writeData()` calls `force(x)` to evaluate the object before options
  are set ([#264](ycphs/openxlsx#264))

* `createComment()` now correctly handles `integers` in `width` and
  `height` ([#275](ycphs/openxlsx#275))

* `setStyles()` accepts `halign="justify"`
  ([#305](ycphs/openxlsx#305))
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants