Receiving data after shutting down workers results in a segfault #306

Closed
wlandau opened this issue Sep 27, 2023 · 9 comments
@wlandau
Contributor

wlandau commented Sep 27, 2023

The following clustermq-only reprex is a simplified version of what targets is trying to do. (I omit w$cleanup() to test the w$send_shutdown().) Not every run segfaults, but many runs do.

options(clustermq.scheduler = "multiprocess")
library(clustermq)
w <- workers(2L, log_worker = TRUE)
queue <- seq_len(10L)
running <- integer(0L)
done <- integer(0L)
while (length(done) < 100L) {
  result <- w$recv()
  if (!is.null(result)) {
    message("done task ", result)
    done <- c(done, result)
    running <- setdiff(running, result)
  }
  if (length(running) < 2L && length(queue) > 0L) {
    next_task <- queue[1L]
    message("send task ", next_task)
    queue <- queue[-1L]
    running <- c(running, next_task)
    w$send(cmd = index, index = next_task)
  } else if (length(queue) > 0L) {
    w$send_wait()
  } else {
    w$send_shutdown()
  }
}

On a segfault, the error log of the worker reads:

2023-09-25 06:34:27.520023 | Master: tcp://haggunenon:7807
2023-09-25 06:34:27.521489 | connecting to: tcp://haggunenon:7807
2023-09-25 06:34:27.581130 | > call 1 (0.007s wait)
2023-09-25 06:34:27.637110 | > call 2 (0.003s wait)
2023-09-25 06:34:27.709159 | > call 3 (0.020s wait)
2023-09-25 06:34:27.761167 | > call 4 (0.003s wait)
Error in w$poll() : Unexpected peer disconnect

I am using Ubuntu for this test. (On Mac OS, as I have said, w$recv() hangs in a much simpler example.)

R version 4.3.0 (2023-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS

Matrix products: default
BLAS:   /home/landau/R/R-4.3.0/lib/libRblas.so 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/New_York
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] clustermq_0.9.0

loaded via a namespace (and not attached):
[1] compiler_4.3.0   R6_2.5.1         tools_4.3.0      rstudioapi_0.14 
[5] Rcpp_1.0.11      codetools_0.2-19
@wlandau wlandau changed the title Trouble using the new 0.9.0 interface Segfaults using the new 0.9.0 interface Sep 27, 2023
@wlandau
Contributor Author

wlandau commented Sep 27, 2023

@mschubert, you had suggested in #303 (comment) that I post a new issue to follow up on specific problems using #303, so I hope this helps.

@luwidmer

luwidmer commented Sep 27, 2023

I just tried @wlandau's example (I added a message() before the send shutdown for clarity) on Windows, and it deterministically dies with an assertion failure:

Rscript clustermq-090-test.R
send task 1
done task 1
send task 2
done task 2
send task 3
send task 4
done task 3
send task 5
done task 4
send task 6
done task 5
send task 7
done task 6
send task 8
done task 7
send task 9
done task 8
send task 10
done task 9
send_shutdown
Assertion failed: check () (../zeromq-4.3.4/src/msg.cpp:414)

This is on R 4.3.0, see sessionInfo()

> sessionInfo()
R version 4.3.0 (2023-04-21 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default


locale:
[1] LC_COLLATE=German_Switzerland.utf8  LC_CTYPE=German_Switzerland.utf8    LC_MONETARY=German_Switzerland.utf8
[4] LC_NUMERIC=C                        LC_TIME=German_Switzerland.utf8    

time zone: Europe/Zurich
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] clustermq_0.9.0

loaded via a namespace (and not attached):
[1] compiler_4.3.0   tools_4.3.0      Rcpp_1.0.11      codetools_0.2-19

@mschubert
Owner

mschubert commented Oct 9, 2023

You seem to have a bug in your example code, where you keep trying to receive data from workers after they are all shut down (loop goes to 100, tasks go to 10).

Minimal code to reproduce the same behavior:

options(clustermq.scheduler = "multiprocess")
library(clustermq)
w <- workers(1L, log_worker = TRUE)
w$recv()
w$send_shutdown()
w$recv() # invalid vector index

However, this should throw an error in R, not crash the session.
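
One way to get an R error instead of a crash is to guard the calls on the R side. What follows is a minimal sketch and not the fix that went into clustermq: make_guarded_pool() and its arguments are hypothetical names, and the stand-in functions replace a real worker pool so the snippet runs without clustermq installed. It assumes that each send_shutdown() call stops exactly one worker.

```r
# Hypothetical guard, not part of clustermq: count the workers that are
# still alive and refuse to receive once none remain.
make_guarded_pool <- function(n_workers, recv_fun, shutdown_fun) {
  active <- n_workers
  list(
    recv = function() {
      if (active < 1L) {
        stop("no active workers left to receive from", call. = FALSE)
      }
      recv_fun()
    },
    send_shutdown = function() {
      if (active < 1L) {
        stop("all workers are already shut down", call. = FALSE)
      }
      shutdown_fun()
      active <<- active - 1L
      invisible(active)
    },
    active = function() active
  )
}

# Stand-in functions so the sketch runs on its own.
pool <- make_guarded_pool(
  n_workers = 1L,
  recv_fun = function() NULL,
  shutdown_fun = function() invisible(NULL)
)
pool$recv()
pool$send_shutdown()
# The second recv() now fails with a regular R condition that can be caught.
outcome <- tryCatch(pool$recv(), error = function(e) conditionMessage(e))
```

With a real pool one would presumably pass the pool's own recv and send_shutdown methods as recv_fun and shutdown_fun, but that wiring is untested here.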

@mschubert mschubert added the bug label Oct 9, 2023
@mschubert mschubert changed the title Segfaults using the new 0.9.0 interface Receiving data after shutting down workers results in a segfault Oct 9, 2023
@wlandau
Contributor Author

wlandau commented Oct 10, 2023

You seem to have a bug in your example code, where you keep trying to receive data from workers after they are all shut down (loop goes to 100, tasks go to 10).

Hmm... I tried to fix the original example to avoid calling a shutdown too many times:

options(clustermq.scheduler = "multiprocess")
library(clustermq)
w <- workers(2L, log_worker = TRUE)
active <- 2L
queue <- seq_len(10L)
running <- integer(0L)
done <- integer(0L)
while (length(done) < 100L) {
  result <- w$recv()
  if (!is.null(result)) {
    message("done task ", result)
    done <- c(done, result)
    running <- setdiff(running, result)
  }
  if (length(running) < 2L && length(queue) > 0L) {
    next_task <- queue[1L]
    message("send task ", next_task)
    queue <- queue[-1L]
    running <- c(running, next_task)
    w$send(cmd = index, index = next_task)
  } else if (length(queue) > 0L) {
    w$send_wait()
  } else if (active > 0L) {
    w$send_shutdown()
    active <- active - 1L
  }
}

It hung for several minutes without printing any messages to the R console. The log files show:

2023-10-10 15:51:44.776181 | Master: tcp://CENSORED:9786
2023-10-10 15:51:44.779649 | connecting to: tcp://CENSORED:9786
Error : Connection failed after 10001 ms

and

2023-10-10 15:51:44.776096 | Master: tcp://CENSORED:9786
2023-10-10 15:51:44.779632 | connecting to: tcp://CENSORED:9786
Error : Connection failed after 10001 ms
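
Independently of the connection failure in these logs, the loop above still runs until length(done) reaches 100 although only 10 tasks exist, so it would keep calling w$recv() after both workers are shut down. Below is a sketch of a loop that stops once the last worker has been shut down. run_tasks() and make_mock_pool() are hypothetical names. The mock pool is a single-process stand-in, so the snippet runs without clustermq, and it assumes that every recv() corresponds to one worker that then waits for exactly one reply.

```r
# Sketch only: `w` is any object exposing recv(), send(), send_wait() and
# send_shutdown(), the methods used on the worker pool in this thread.
run_tasks <- function(w, n_workers, tasks) {
  active <- n_workers
  queue <- tasks
  running <- integer(0L)
  done <- integer(0L)
  # Stop as soon as the last worker is shut down, so recv() is never
  # called on an empty pool.
  while (active > 0L) {
    result <- w$recv()
    if (!is.null(result)) {
      done <- c(done, result)
      running <- setdiff(running, result)
    }
    if (length(running) < n_workers && length(queue) > 0L) {
      next_task <- queue[1L]
      queue <- queue[-1L]
      running <- c(running, next_task)
      w$send(cmd = index, index = next_task)
    } else if (length(queue) > 0L) {
      w$send_wait()
    } else {
      w$send_shutdown()
      active <- active - 1L
    }
  }
  done
}

# Hypothetical mock pool: an inbox of worker messages, where NA means
# "worker is ready" and an integer is a finished task.
make_mock_pool <- function(n_workers) {
  inbox <- rep(NA_integer_, n_workers)
  list(
    recv = function() {
      if (length(inbox) == 0L) stop("recv() called with no workers left")
      msg <- inbox[1L]
      inbox <<- inbox[-1L]
      if (is.na(msg)) NULL else msg
    },
    # The mock "worker" finishes immediately and reports its index back.
    send = function(cmd, index) inbox <<- c(inbox, index),
    send_wait = function() inbox <<- c(inbox, NA_integer_),
    send_shutdown = function() invisible(NULL)
  )
}

done <- run_tasks(make_mock_pool(2L), n_workers = 2L, tasks = seq_len(10L))
```

The mock never forces the cmd argument, so cmd = index is passed through unevaluated. Whether the real pool behaves the same way at shutdown is not verified here.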

I used the CRAN version because I could not compile the development version.

remotes::install_github("mschubert/clustermq")
Using github PAT from envvar GITHUB_PAT
Downloading GitHub repo mschubert/clustermq@HEAD
'/usr/bin/git' clone --depth 1 --no-hardlinks --recurse-submodules https://github.com/zeromq/libzmq.git /var/folders/4v/vh7xp8553lsbl49svl48g7p00000gp/T//RtmpGZmLLp/remotes16e914d76c6fb/mschubert-clustermq-ed2bf6e/src/libzmq
Cloning into '/var/folders/4v/vh7xp8553lsbl49svl48g7p00000gp/T//RtmpGZmLLp/remotes16e914d76c6fb/mschubert-clustermq-ed2bf6e/src/libzmq'...
'/usr/bin/git' clone --depth 1 --no-hardlinks --recurse-submodules https://github.com/zeromq/cppzmq.git /var/folders/4v/vh7xp8553lsbl49svl48g7p00000gp/T//RtmpGZmLLp/remotes16e914d76c6fb/mschubert-clustermq-ed2bf6e/src/cppzmq
Cloning into '/var/folders/4v/vh7xp8553lsbl49svl48g7p00000gp/T//RtmpGZmLLp/remotes16e914d76c6fb/mschubert-clustermq-ed2bf6e/src/cppzmq'...
── R CMD build ──────────────────────────────────────────────────────────────────────────
✔  checking for file ‘/private/var/folders/4v/vh7xp8553lsbl49svl48g7p00000gp/T/RtmpGZmLLp/remotes16e914d76c6fb/mschubert-clustermq-ed2bf6e/DESCRIPTION’ ...
─  preparing ‘clustermq’: (1.1s)
✔  checking DESCRIPTION meta-information ...
─  cleaning src
─  running ‘cleanup’
─  checking for LF line-endings in source and make files and shell scripts (1.1s)
─  checking for empty or unneeded directories (2s)
   Removed empty directory ‘clustermq/src/libzmq/build_qnx/nto/aarch64/le’
   Removed empty directory ‘clustermq/src/libzmq/build_qnx/nto/aarch64’
   Removed empty directory ‘clustermq/src/libzmq/build_qnx/nto/x86_64/o’
   Removed empty directory ‘clustermq/src/libzmq/build_qnx/nto/x86_64’
   Removed empty directory ‘clustermq/src/libzmq/build_qnx/nto’
   Removed empty directory ‘clustermq/src/libzmq/builds/openwrt’
─  building ‘clustermq_0.9.0.tar.gz’
* installing *source* package ‘clustermq’ ...
** using staged installation
* no system libzmq found -> using bundled libzmq
autoreconf: export WARNINGS=
autoreconf: Entering directory '.'
autoreconf: configure.ac: not using Gettext
autoreconf: running: aclocal -I config --force -I config
m4:configure.ac:9: ERROR: end of file in string
autom4te: error: /opt/homebrew/opt/m4/bin/m4 failed with exit status: 1
aclocal: error: /opt/homebrew/Cellar/autoconf/2.71/bin/autom4te failed with exit status: 1
autoreconf: error: aclocal failed with exit status: 1
autogen.sh: error: autoreconf exited with status 1
./configure: line 61: die: command not found
./configure: line 64: ./configure: No such file or directory
make: *** No targets specified and no makefile found.  Stop.
ERROR: configuration failed for package ‘clustermq’
* removing ‘/Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library/clustermq’
* restoring previous ‘/Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library/clustermq’
Warning messages:
1: In utils::install.packages(pkgs = pkgs, lib = lib, repos = myrepos,  :
  installation of package ‘/var/folders/4v/vh7xp8553lsbl49svl48g7p00000gp/T//RtmpGZmLLp/file16e913101b078/clustermq_0.9.0.tar.gz’ had non-zero exit status
2: In utils::install.packages(pkgs = pkgs, lib = lib, repos = myrepos,  :
  installation of package ‘/var/folders/4v/vh7xp8553lsbl49svl48g7p00000gp/T//RtmpGZmLLp/file16e913101b078/clustermq_0.9.0.tar.gz’ had non-zero exit status

@mschubert
Owner

mschubert commented Oct 10, 2023

Ah, I'm rewriting the configure script and this will have some rough edges until everything is done.

Can you try 5612364 from GitHub?

The "unable to connect" on macOS is a mystery to me: see #311. But this happens on CI too.

@wlandau
Contributor Author

wlandau commented Oct 11, 2023

Thanks for working on this. I got a similar compilation error:

> remotes::install_github("mschubert/clustermq", ref = "5612364c52f17ba98b241a3f1f7e067c02bad3fe")
Using github PAT from envvar GITHUB_PAT
Downloading GitHub repo mschubert/clustermq@5612364c52f17ba98b241a3f1f7e067c02bad3fe
'/usr/bin/git' clone --depth 1 --no-hardlinks --recurse-submodules https://github.com/zeromq/libzmq.git /var/folders/4v/vh7xp8553lsbl49svl48g7p00000gp/T//Rtmp0lFXsv/remotesdea51cd77bdf/mschubert-clustermq-5612364/src/libzmq
Cloning into '/var/folders/4v/vh7xp8553lsbl49svl48g7p00000gp/T//Rtmp0lFXsv/remotesdea51cd77bdf/mschubert-clustermq-5612364/src/libzmq'...
'/usr/bin/git' clone --depth 1 --no-hardlinks --recurse-submodules https://github.com/zeromq/cppzmq.git /var/folders/4v/vh7xp8553lsbl49svl48g7p00000gp/T//Rtmp0lFXsv/remotesdea51cd77bdf/mschubert-clustermq-5612364/src/cppzmq
Cloning into '/var/folders/4v/vh7xp8553lsbl49svl48g7p00000gp/T//Rtmp0lFXsv/remotesdea51cd77bdf/mschubert-clustermq-5612364/src/cppzmq'...
── R CMD build ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────
✔  checking for file ‘/private/var/folders/4v/vh7xp8553lsbl49svl48g7p00000gp/T/Rtmp0lFXsv/remotesdea51cd77bdf/mschubert-clustermq-5612364/DESCRIPTION’ ...
─  preparing ‘clustermq’: (1.6s)
✔  checking DESCRIPTION meta-information ...
─  cleaning src
─  running ‘cleanup’
─  checking for LF line-endings in source and make files and shell scripts (549ms)
─  checking for empty or unneeded directories (2.1s)
   Removed empty directory ‘clustermq/src/libzmq/build_qnx/nto/aarch64/le’
   Removed empty directory ‘clustermq/src/libzmq/build_qnx/nto/aarch64’
   Removed empty directory ‘clustermq/src/libzmq/build_qnx/nto/x86_64/o’
   Removed empty directory ‘clustermq/src/libzmq/build_qnx/nto/x86_64’
   Removed empty directory ‘clustermq/src/libzmq/build_qnx/nto’
   Removed empty directory ‘clustermq/src/libzmq/builds/openwrt’
─  building ‘clustermq_0.9.0.tar.gz’
* installing *source* package ‘clustermq’ ...
** using staged installation
sed: include/zmq_utils.h.orig: No such file or directory
autogen.sh: error: could not find autoreconf.  autoconf and automake are required to run autogen.sh.
./configure: line 35: die: command not found
./configure: line 38: ./configure: No such file or directory
make: *** No targets specified and no makefile found.  Stop.
ERROR: configuration failed for package ‘clustermq’
* removing ‘/Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library/clustermq’
Warning messages:
1: In utils::install.packages(pkgs = pkgs, lib = lib, repos = myrepos,  :
  installation of package ‘/var/folders/4v/vh7xp8553lsbl49svl48g7p00000gp/T//Rtmp0lFXsv/filedea533c9091/clustermq_0.9.0.tar.gz’ had non-zero exit status
2: In utils::install.packages(pkgs = pkgs, lib = lib, repos = myrepos,  :
  installation of package ‘/var/folders/4v/vh7xp8553lsbl49svl48g7p00000gp/T//Rtmp0lFXsv/filedea533c9091/clustermq_0.9.0.tar.gz’ had non-zero exit status

@mschubert
Owner

autoconf and automake are required to run autogen.sh

I see now: coreutils and automake are required to compile from GitHub.

@wlandau
Contributor Author

wlandau commented Oct 11, 2023

I installed coreutils and automake, and I still saw a compilation error with sed: include/zmq_utils.h.orig: No such file or directory. Maybe #312 will help?

@mschubert
Owner

It's a typo in the configure script that has been fixed in the current HEAD (but it didn't stop compilation on GHA CI, so I'm not sure why it did for you).
