-
Notifications
You must be signed in to change notification settings - Fork 76
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow parallelization of getSheetNames #262
Comments
I'm not familiar with BiocParallel but I don't have an issue trying to read with the library(openxlsx)
library(parallel)
packageVersion("openxlsx")
#> [1] '4.2.4.9000'
files <- sapply(1:10, function(x) tempfile(fileext = ".xlsx"))
invisible(sapply(files, function(x) {
write.xlsx(list(a = data.frame(1)), x)
}))
cl <- makeCluster(2)
res <- parSapply(cl, files, getSheetNames)
stopCluster(cl)
names(res) <- basename(names(res))
res
#> filed028503a5e45.xlsx filed0286ced491a.xlsx filed0285d1e260b.xlsx
#> "a" "a" "a"
#> filed0283073600e.xlsx filed02835787e93.xlsx filed0281e6a4f69.xlsx
#> "a" "a" "a"
#> filed028396510da.xlsx filed0283efb25a2.xlsx filed028127869b2.xlsx
#> "a" "a" "a"
#> filed028696535d4.xlsx
#> "a"
all(file.remove(files))
#> [1] TRUE Created on 2021-09-27 by the reprex package (v2.0.1) There is a line in Lines 3401 to 3404 in 0cc668b
Changing this could work, but I haven't been able to get the same issues. |
Hi Jordan, library(openxlsx)
library(parallel)
files <- sapply(1:10, function(x) tempfile(fileext = ".xlsx"))
invisible(sapply(files, function(x) {
write.xlsx(list(a = data.frame(1)), x)
}))
cl <- makeCluster(4, type = "FORK")
res <- parSapply(cl, files, getSheetNames)
# Error in checkForRemoteErrors(val) :
# 3 nodes produced errors; first error: cannot open file '/tmp/RtmpjScHsw/_excelXMLRead/_rels/.rels': No such file or directory
stopCluster(cl) The issue of using the same This demonstrates that only the first version uses distinct # parallel::makeCluster PSOCK
cl <- parallel::makeCluster(2, type = "PSOCK")
res <- parallel::parSapply(cl, 1:4, tempdir)
parallel::stopCluster(cl)
res
# [1] "/tmp/RtmpTMgTqX" "/tmp/RtmpTMgTqX" "/tmp/RtmpF6A2EO" "/tmp/RtmpF6A2EO"
# parallel::makeCluster FORK
cl <- parallel::makeCluster(2, type = "FORK")
res <- parallel::parSapply(cl, 1:4, tempdir)
parallel::stopCluster(cl)
res
# [1] "/tmp/RtmpoEql3Z" "/tmp/RtmpoEql3Z" "/tmp/RtmpoEql3Z" "/tmp/RtmpoEql3Z"
# parallel::mclapply
res <- parallel::mclapply(
1:4, tempdir
)
unlist(res)
# [1] "/tmp/RtmpoEql3Z" "/tmp/RtmpoEql3Z" "/tmp/RtmpoEql3Z" "/tmp/RtmpoEql3Z"
# BiocParallel
BiocParallel::register(BiocParallel::MulticoreParam(2))
res <- BiocParallel::bplapply(
1:4, tempdir
)
unlist(res)
# [1] "/tmp/RtmpoEql3Z" "/tmp/RtmpoEql3Z" "/tmp/RtmpoEql3Z" "/tmp/RtmpoEql3Z"
So while using |
changes in temp directory related to #262
# openxlsx 4.2.7.1 * It's now possible to insert a hyperlinked image by passing a URL, relative or absolute file path, or mailto string to the new `address` parameter of `insertImage()`. # openxlsx 4.2.7 * Fixed warning on `dataValidation(..., type = "list")` ([#342](ycphs/openxlsx#342)) * Added optional argument to `loadWorkbook` to decide if empty/blank cells should be converted to NA_character_ (the default) or left blank as is * `saveWorkbook()` now succeeds when called after the user has set column widths for a range of columns (e.g. 1:2), saved the workbook, then set column widths for a new range that is inclusive of the previous one (e.g. 1:5) ([#493](ycphs/openxlsx#493)). ## Improvements * Improve detectDates ([#288](ycphs/openxlsx#288)) * Preserve window size and position, also `getWindowSize()` and `setWindowSize()` ([466](ycphs/openxlsx#466)) # openxlsx 4.2.6 * Fix external links ([#410](ycphs/openxlsx#410)) * Do not add unneccessary sheetPr node ([#409](ycphs/openxlsx#409)) * Add support for `namedRegion`s having dots and other special characters ([#338](ycphs/openxlsx#338)). * Add type blanks and not blanks to conditional formatting ([#311](ycphs/openxlsx#311)) # openxlsx 4.2.5 ## Fixes * `openxlsx_setOp()` now works with named list ([#215](ycphs/openxlsx#215)) * `loadWorkbook()` imports `inlineStr`. Values remain `inlineStr` when writing the workbook with `saveWorkbook()`. Similar `read.xlsx` and `readWorkbook` import `inlineStr`. * `read.xlsx()` no longer changes random seed ([#183](ycphs/openxlsx#183)) * fixed a regression that caused fonts to be read in incorrectly ([#207](ycphs/openxlsx#207)) * add option to save as read only recommended ([#201](ycphs/openxlsx#201)) * fixed writing hyperlink formulas ([#200](ycphs/openxlsx#200)) * `write.xlsx()` now throws an error if it doesn't have write permissions ([#190](ycphs/openxlsx#190)) * `write.xlsx()` now again uses the default of `overwrite = TRUE` for saving files ([#249](ycphs/openxlsx#249)) * `as.character.formula()` exported to warn about potential conflicts with other packages ([#312](ycphs/openxlsx#312), [#315](ycphs/openxlsx#315)) ## Improvements * `options()` are more consistently set in functions (see: [#289](ycphs/openxlsx#262)) * `Workbook$show()` no longer fails when called in a 0 sheet workbook([#240](ycphs/openxlsx#240)) * `read.xlsx()` again accepts `.xlsm` files ([#205](ycphs/openxlsx#205), [#209](ycphs/openxlsx#209)) * `makeHyperlinkString()` does no longer require a sheet argument ([#57](ycphs/openxlsx#57), [#58](ycphs/openxlsx#58)) * improvements in how `openxlsx` creates temporary directories (see [#262](ycphs/openxlsx#262)) * `writeData()` calls `force(x)` to evaluate the object before options are set ([#264](ycphs/openxlsx#264)) * `createComment()` now correctly handles `integers` in `width` and `height` ([#275](ycphs/openxlsx#275)) * `setStyles()` accepts `halign="justify"` ([#305](ycphs/openxlsx#305))
Is your feature request related to a problem? Please describe.
It does not seem to be possible to run
openxlsx::getSheetNames
in parallel. Here's my code:Now,
getSheetNames(excel_files, parallel = TRUE)
sometimes throws an erroror
Describe the solution you'd like
My guess would be that each of the parallel tasks writes the same temporary file.
Maybe using a directory or a file based on
tempfile()
would be a solution.Describe alternatives you've considered
Parallelization does work with
readxl::excel_sheets
, but that is much slower thanopenxlsx::getSheetNames
.Another alternative is to not run it in parallel, but with a growing number of excel files this will be more time consuming.
The text was updated successfully, but these errors were encountered: