Allow parallelization of getSheetNames #262

sgrote · 2021-09-27T14:14:45Z

Is your feature request related to a problem? Please describe.
It does not seem to be possible to run openxlsx::getSheetNames in parallel. Here's my code:

getSheetNames <- function(excel_files, parallel = FALSE, BPPARAM = BiocParallel::bpparam()) {
  if (parallel) {
    sheet_list <- BiocParallel::bplapply(excel_files,
      FUN = function(excel_file) {
        if (!file.exists(excel_file)) {
          return(NULL)
        }
        openxlsx::getSheetNames(excel_file)
      },
      BPPARAM = BPPARAM
    )
  } else {
    sheet_list <- lapply(excel_files,
      FUN = function(excel_file) {
        if (!file.exists(excel_file)) {
          warning("File '", excel_file, "' could not be found.")
          return(NULL)
        }
        openxlsx::getSheetNames(excel_file)
      }
    )
  }
  names(sheet_list) <- excel_files
  sheet_list
}

Now, getSheetNames(excel_files, parallel = TRUE) sometimes throws an error

 Error: BiocParallel errors
  element index: 1
  first error: cannot open file '/tmp/Rtmpxttndw/_excelXMLRead/[Content_Types].xml': No such file or directory

or

Error: BiocParallel errors
  element index: 1
  first error: cannot open the connection

Describe the solution you'd like
My guess would be that each of the parallel tasks writes the same temporary file.
Maybe using a directory or a file based on tempfile() would be a solution.

Describe alternatives you've considered
Parallelization does work with readxl::excel_sheets, but that is much slower than openxlsx::getSheetNames.
Another alternative is to not run it in parallel, but with a growing number of excel files this will be more time consuming.

The text was updated successfully, but these errors were encountered:

jmbarbone · 2021-09-28T00:52:30Z

I'm not familiar with BiocParallel but I don't have an issue trying to read with the parallel package and the current dev version of openxlsx

library(openxlsx)
library(parallel)
packageVersion("openxlsx")
#> [1] '4.2.4.9000'

files <- sapply(1:10, function(x) tempfile(fileext = ".xlsx"))
invisible(sapply(files, function(x) {
  write.xlsx(list(a = data.frame(1)), x)
}))

cl <- makeCluster(2)
res <- parSapply(cl, files, getSheetNames)
stopCluster(cl)

names(res) <- basename(names(res))
res
#> filed028503a5e45.xlsx filed0286ced491a.xlsx filed0285d1e260b.xlsx 
#>                   "a"                   "a"                   "a" 
#> filed0283073600e.xlsx filed02835787e93.xlsx filed0281e6a4f69.xlsx 
#>                   "a"                   "a"                   "a" 
#> filed028396510da.xlsx filed0283efb25a2.xlsx filed028127869b2.xlsx 
#>                   "a"                   "a"                   "a" 
#> filed028696535d4.xlsx 
#>                   "a"
all(file.remove(files))
#> [1] TRUE

^{Created on 2021-09-27 by the reprex package (v2.0.1)}

There is a line in getSheetNames() that creates a temporary directory:

openxlsx/R/wrappers.R

Lines 3401 to 3404 in 0cc668b

    
           xmlDir <- file.path(tempdir(), "_excelXMLRead") 
        
           xmlFiles <- unzip(file, exdir = xmlDir) 
        
           on.exit(unlink(xmlDir, recursive = TRUE), add = TRUE)

Changing this could work, but I haven't been able to get the same issues.

sgrote · 2021-09-29T08:31:41Z

Hi Jordan,
thanks for the quick reply and fix! Your example seemed to work because makeCluster uses the default type = "PSOCK".
Using a fork cluster instead, leads to the same issues as in my example above:

library(openxlsx)
library(parallel)

files <- sapply(1:10, function(x) tempfile(fileext = ".xlsx"))
invisible(sapply(files, function(x) {
  write.xlsx(list(a = data.frame(1)), x)
}))

cl <- makeCluster(4, type = "FORK")
res <- parSapply(cl, files, getSheetNames)
# Error in checkForRemoteErrors(val) : 
#  3 nodes produced errors; first error: cannot open file '/tmp/RtmpjScHsw/_excelXMLRead/_rels/.rels': No such file or directory

stopCluster(cl)

The issue of using the same tempdir() in forked processes is also discussed e.g. here.

This demonstrates that only the first version uses distinct tempdirs in parallel processes:

# parallel::makeCluster PSOCK
cl <- parallel::makeCluster(2, type = "PSOCK")
res <- parallel::parSapply(cl, 1:4, tempdir)
parallel::stopCluster(cl)
res
# [1] "/tmp/RtmpTMgTqX" "/tmp/RtmpTMgTqX" "/tmp/RtmpF6A2EO" "/tmp/RtmpF6A2EO"

# parallel::makeCluster FORK
cl <- parallel::makeCluster(2, type = "FORK")
res <- parallel::parSapply(cl, 1:4, tempdir)
parallel::stopCluster(cl)
res
# [1] "/tmp/RtmpoEql3Z" "/tmp/RtmpoEql3Z" "/tmp/RtmpoEql3Z" "/tmp/RtmpoEql3Z"

# parallel::mclapply
res <- parallel::mclapply(
 1:4, tempdir
)
unlist(res)
# [1] "/tmp/RtmpoEql3Z" "/tmp/RtmpoEql3Z" "/tmp/RtmpoEql3Z" "/tmp/RtmpoEql3Z"

# BiocParallel
BiocParallel::register(BiocParallel::MulticoreParam(2))
res <- BiocParallel::bplapply(
 1:4, tempdir
)
unlist(res)
# [1] "/tmp/RtmpoEql3Z" "/tmp/RtmpoEql3Z" "/tmp/RtmpoEql3Z" "/tmp/RtmpoEql3Z"

So while using parallel::makeCluster(..., type = "PSOCK") clearly can be used for parallelization of getSheetNames,
I think it would still be great to use tempfile() instead of tempdir() to also allow parallelization with other commonly used approaches like parallel::mclapply or BiocParallel::bplapply.
Also, type = "PSOCK" seems to be slower than type = "FORK", at least in my test-case.

changes in temp directory related to #262

# openxlsx 4.2.7.1 * It's now possible to insert a hyperlinked image by passing a URL, relative or absolute file path, or mailto string to the new `address` parameter of `insertImage()`. # openxlsx 4.2.7 * Fixed warning on `dataValidation(..., type = "list")` ([#342](ycphs/openxlsx#342)) * Added optional argument to `loadWorkbook` to decide if empty/blank cells should be converted to NA_character_ (the default) or left blank as is * `saveWorkbook()` now succeeds when called after the user has set column widths for a range of columns (e.g. 1:2), saved the workbook, then set column widths for a new range that is inclusive of the previous one (e.g. 1:5) ([#493](ycphs/openxlsx#493)). ## Improvements * Improve detectDates ([#288](ycphs/openxlsx#288)) * Preserve window size and position, also `getWindowSize()` and `setWindowSize()` ([466](ycphs/openxlsx#466)) # openxlsx 4.2.6 * Fix external links ([#410](ycphs/openxlsx#410)) * Do not add unneccessary sheetPr node ([#409](ycphs/openxlsx#409)) * Add support for `namedRegion`s having dots and other special characters ([#338](ycphs/openxlsx#338)). * Add type blanks and not blanks to conditional formatting ([#311](ycphs/openxlsx#311)) # openxlsx 4.2.5 ## Fixes * `openxlsx_setOp()` now works with named list ([#215](ycphs/openxlsx#215)) * `loadWorkbook()` imports `inlineStr`. Values remain `inlineStr` when writing the workbook with `saveWorkbook()`. Similar `read.xlsx` and `readWorkbook` import `inlineStr`. * `read.xlsx()` no longer changes random seed ([#183](ycphs/openxlsx#183)) * fixed a regression that caused fonts to be read in incorrectly ([#207](ycphs/openxlsx#207)) * add option to save as read only recommended ([#201](ycphs/openxlsx#201)) * fixed writing hyperlink formulas ([#200](ycphs/openxlsx#200)) * `write.xlsx()` now throws an error if it doesn't have write permissions ([#190](ycphs/openxlsx#190)) * `write.xlsx()` now again uses the default of `overwrite = TRUE` for saving files ([#249](ycphs/openxlsx#249)) * `as.character.formula()` exported to warn about potential conflicts with other packages ([#312](ycphs/openxlsx#312), [#315](ycphs/openxlsx#315)) ## Improvements * `options()` are more consistently set in functions (see: [#289](ycphs/openxlsx#262)) * `Workbook$show()` no longer fails when called in a 0 sheet workbook([#240](ycphs/openxlsx#240)) * `read.xlsx()` again accepts `.xlsm` files ([#205](ycphs/openxlsx#205), [#209](ycphs/openxlsx#209)) * `makeHyperlinkString()` does no longer require a sheet argument ([#57](ycphs/openxlsx#57), [#58](ycphs/openxlsx#58)) * improvements in how `openxlsx` creates temporary directories (see [#262](ycphs/openxlsx#262)) * `writeData()` calls `force(x)` to evaluate the object before options are set ([#264](ycphs/openxlsx#264)) * `createComment()` now correctly handles `integers` in `width` and `height` ([#275](ycphs/openxlsx#275)) * `setStyles()` accepts `halign="justify"` ([#305](ycphs/openxlsx#305))

jmbarbone added a commit to jmbarbone/openxlsx that referenced this issue Sep 28, 2021

changes in temp directory related to ycphs#262

92924b5

jmbarbone mentioned this issue Sep 28, 2021

changes in temp directory related to #262 #263

Merged

JanMarvin added a commit that referenced this issue Sep 30, 2021

Merge pull request #263 from jmbarbone/262-temp-files

02f524a

changes in temp directory related to #262

JanMarvin closed this as completed Sep 30, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow parallelization of getSheetNames #262

Allow parallelization of getSheetNames #262

sgrote commented Sep 27, 2021

jmbarbone commented Sep 28, 2021

sgrote commented Sep 29, 2021 •

edited

Loading

Allow parallelization of getSheetNames #262

Allow parallelization of getSheetNames #262

Comments

sgrote commented Sep 27, 2021

jmbarbone commented Sep 28, 2021

sgrote commented Sep 29, 2021 • edited Loading

sgrote commented Sep 29, 2021 •

edited

Loading