Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

unexpected: read.xlsx alters random number generation sequence #183

Closed
ugobob99 opened this issue Apr 29, 2021 · 9 comments
Closed

unexpected: read.xlsx alters random number generation sequence #183

ugobob99 opened this issue Apr 29, 2021 · 9 comments

Comments

@ugobob99
Copy link

Expected Behavior

Loading an excel file via read.xlsx() does not affect the sequence of generated random number

Actual Behavior

Using read.xlsx() results in a change in the random number sequence.

Steps to Reproduce the Problem

(please attach an example xlsx file if possible)

The problem is independent of the specific excel file loaded, so no file is attached.
Interstingly, the issue does not appear with R 3.4.1 and openxlsx 4.0.17
The following transcript, run in a fresh R session, documents the behaviour:

1> library(openxlsx)
2> set.seed(100)
3> runif(5)
     [1] 0.30776611 0.25767250 0.55232243 0.05638315 0.46854928
4> set.seed(100)
5> x <- read.xlsx("an excel file.xlsx")
6> runif(5)
      [1] 0.2046122 0.3575249 0.3594751 0.6902905 0.5358112
      # the result above should be identical to that of line 3, but is not

sessionInfo()

R version 3.6.3 (2020-02-29)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17763)

Matrix products: default

locale:
[1] LC_COLLATE=Italian_Italy.1252  LC_CTYPE=Italian_Italy.1252    LC_MONETARY=Italian_Italy.1252 LC_NUMERIC=C                  
[5] LC_TIME=Italian_Italy.1252    

attached base packages:
[1] graphics  grDevices stats     utils     methods   base     

other attached packages:
[1] openxlsx_4.2.3

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5          lattice_0.20-38     mvtnorm_1.1-1       corpcor_1.6.9       gtools_3.8.2        SuppDists_1.1-9.5  
 [7] MASS_7.3-51.5       sas7bdat_0.5        grid_3.6.3          nlme_3.1-144        zip_2.1.1           stringi_1.5.3      
[13] ismev_1.42          Matrix_1.2-18       lhs_1.1.1           splines_3.6.3       tools_3.6.3         numDeriv_2016.8-1.1
[19] compiler_3.6.3      mgcv_1.8-31         evd_2.3-3        
  • Version of openxlsx: ‘4.2.3’
  • Version of R: 3.6.3
  • Version of RStudio: 1.3.1093
  • OS: Windows 10, but the same unexpected behaviour appears under Linux (redhat)
@JanMarvin
Copy link
Collaborator

That is expected with calls from and to c++ (in our case Rcpp). Iirc there is a way to stop this (get and set rngscope), but I'd say that's something the user should expect and nothing special in openxlsx. The same happens with readr on my system (needed a package using c++ code involved).

@ugobob99
Copy link
Author

@JanMarvin thanks for the comment.
I admit I do not have experience with Rcpp nor with calling c++ from R.
Do you mean that any call from/to c++ will alter the sequence?
Or that any read of an excel file uses the random number generator, and therefore alters the sequence?

@JanMarvin
Copy link
Collaborator

During the lunch break I couldn't reproduce this behavior with a minimal example (might be, that the Rcpp guys changed things on their end; there was chatter in the Rcpp repo a while ago and now I could not reproduce this SO). Chances are, that their might be things on our end (the c++ code is included in a rather unusual way) interfering with the seed. But still, I'd say that's up to the user to figure and happens most likely with many different packages embedding c++ code. As a rule of thumb don't run interfering code.

To answer your question: I think that the seed is altered by the c++ call. Therefore your random numbers are different. To my understanding, and I wouldn't call myself an expert on the topic, this is due to the call to the c++ functions and how they handle the RNGscope.

@ycphs
Copy link
Owner

ycphs commented Jun 25, 2021

read.xlsx creates a temporary directory.

This results in a change of seed.

set.seed(5)
rnorm(1)
tempdir() #changes seed
rnorm(1)

set.seed(5)
rnorm(1)

I hope this explains the behaviour. Maybe this is a topic for R-Core.

@JanMarvin
Copy link
Collaborator

Good catch! Question is: do we want to preserve the seed? I'd still be in favor of leaving this to the user.

set.seed(123)
x <- .Random.seed

ref <- rnorm(123)
identical(ref, rnorm(123)) # FALSE

.Random.seed <- x
identical(ref, rnorm(123)) # TRUE

@ycphs
Copy link
Owner

ycphs commented Jun 26, 2021

Leaving it to the user seems a good idea, which could be handled by the options of the package. Every function which creates a temporary directory or file would need a correction for the random seed. To check that we would need to create also loads of new tests to check for the seed too.

@ugobob99
Copy link
Author

I may be wrong, but the documentation for tempdir() states that is simply returns the per-session temporary directory which is created before the interpreter is started.
On my installation, this is comfirmed:

> set.seed(5) 
> runif(1) 
[1] 0.2002145 
> set.seed(5) 
> tempdir() 
[1] "/tmp/RtmpaUl2EE" 
> runif(1) [1] 0.2002145
tempdir() 
[1] "/tmp/RtmpaUl2EE" 

So the problem is not tempdir() per se, but the fact that a temporary directory or file is created by read.xlsx() which uses a random name and is not presumably using tempfile(), which is careful not to disturb the random seed, according to this experiment:

> set.seed(5)
> runif(1)
[1] 0.2002145
> set.seed(5)
> tempfile()
[1] "/tmp/RtmpaUl2EE/file1ffce4b88902b"
> runif(1)
[1] `0.2002145
``


@jmbarbone
Copy link
Contributor

I guess the line in question might be this one?

openxlsx/R/readWorkbook.R

Lines 142 to 143 in 89ce583

xmlDir <-
file.path(tempdir(), paste0(sample(LETTERS, 10), collapse = ""), "_excelXMLRead")

@ycphs ycphs closed this as completed in 6abdeeb Jul 11, 2021
ycphs added a commit that referenced this issue Jul 11, 2021
@dernst
Copy link

dernst commented Oct 25, 2022

Just a comment because we had a problem recently which was also caused by this code segment. We had multiple child processes reading xls files (via mclapply). The child processes have the same tempdir() and start off with the same random number generator state. That is above code snippet would resolve to the same directory name (xmlDir) for each child process, each of which is trying to unzip an excel file into the same directory. This causes a conflict of course.

So anyhow thanks for the fix :) Using tempfile() should be the correct fix assuming R uses tempnam(3) or something similar which make sure that the tempfiles don't get re-used between different processes.

netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this issue Jan 4, 2025
# openxlsx 4.2.7.1

* It's now possible to insert a hyperlinked image by passing a URL,
  relative or absolute file path, or mailto string to the new
  `address` parameter of `insertImage()`.

# openxlsx 4.2.7

* Fixed warning on `dataValidation(..., type = "list")`
  ([#342](ycphs/openxlsx#342))

* Added optional argument to `loadWorkbook` to decide if empty/blank
  cells should be converted to NA_character_ (the default) or left
  blank as is

* `saveWorkbook()` now succeeds when called after the user has set
  column widths for a range of columns (e.g. 1:2), saved the workbook,
  then set column widths for a new range that is inclusive of the
  previous one (e.g. 1:5)
  ([#493](ycphs/openxlsx#493)).

## Improvements

* Improve detectDates
  ([#288](ycphs/openxlsx#288))

* Preserve window size and position, also `getWindowSize()` and
  `setWindowSize()`
  ([466](ycphs/openxlsx#466))

# openxlsx 4.2.6

* Fix external links
  ([#410](ycphs/openxlsx#410))

* Do not add unneccessary sheetPr node
  ([#409](ycphs/openxlsx#409))

* Add support for `namedRegion`s having dots and other special
  characters ([#338](ycphs/openxlsx#338)).

* Add type blanks and not blanks to conditional formatting
  ([#311](ycphs/openxlsx#311))

# openxlsx 4.2.5

## Fixes

* `openxlsx_setOp()` now works with named list
  ([#215](ycphs/openxlsx#215))

* `loadWorkbook()` imports `inlineStr`. Values remain `inlineStr` when
  writing the workbook with `saveWorkbook()`. Similar `read.xlsx` and
  `readWorkbook` import `inlineStr`.

* `read.xlsx()` no longer changes random seed
  ([#183](ycphs/openxlsx#183))

* fixed a regression that caused fonts to be read in incorrectly
  ([#207](ycphs/openxlsx#207))

* add option to save as read only recommended
  ([#201](ycphs/openxlsx#201))

* fixed writing hyperlink formulas
  ([#200](ycphs/openxlsx#200))

* `write.xlsx()` now throws an error if it doesn't have write
  permissions ([#190](ycphs/openxlsx#190))

* `write.xlsx()` now again uses the default of `overwrite = TRUE` for
  saving files ([#249](ycphs/openxlsx#249))

* `as.character.formula()` exported to warn about potential conflicts
  with other packages
  ([#312](ycphs/openxlsx#312),
  [#315](ycphs/openxlsx#315))

## Improvements

* `options()` are more consistently set in functions (see:
  [#289](ycphs/openxlsx#262))

* `Workbook$show()` no longer fails when called in a 0 sheet
  workbook([#240](ycphs/openxlsx#240))

* `read.xlsx()` again accepts `.xlsm` files
([#205](ycphs/openxlsx#205),
[#209](ycphs/openxlsx#209))

* `makeHyperlinkString()` does no longer require a sheet argument
  ([#57](ycphs/openxlsx#57),
  [#58](ycphs/openxlsx#58))

* improvements in how `openxlsx` creates temporary directories (see
  [#262](ycphs/openxlsx#262))

* `writeData()` calls `force(x)` to evaluate the object before options
  are set ([#264](ycphs/openxlsx#264))

* `createComment()` now correctly handles `integers` in `width` and
  `height` ([#275](ycphs/openxlsx#275))

* `setStyles()` accepts `halign="justify"`
  ([#305](ycphs/openxlsx#305))
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants