Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix ExecuteProcess return value and disable sigchld handler #2

Merged
merged 2 commits into from
Apr 19, 2022

Conversation

benlorenz
Copy link

This seems to work with julia master on macos.

Note: this PR targets the branch from oscar-system#726 and I merged some parts of master.

@codecov
Copy link

codecov bot commented Apr 14, 2022

Codecov Report

❗ No coverage uploaded for pull request base (mh/exec@1d9026d). Click here to learn what that means.
The diff coverage is n/a.

@@            Coverage Diff             @@
##             mh/exec       #2   +/-   ##
==========================================
  Coverage           ?   78.04%           
==========================================
  Files              ?       48           
  Lines              ?     3690           
  Branches           ?        0           
==========================================
  Hits               ?     2880           
  Misses             ?      810           
  Partials           ?        0           

@benlorenz
Copy link
Author

benlorenz commented Apr 14, 2022

That crash in nightly is new, I re-ran the tests in the oscar-system repo on my branch and now it also crashes like this, but the previous attempt from yesterday was better, as only the GAP tests crashed instead of the julia tests: https://github.com/oscar-system/GAP.jl/actions/runs/2162169485/attempts/1

@fingolfin
Copy link
Owner

The diffs here were difficult to read as they brought in unrelated changes. I've now rebased the mh/exec branch, perhaps you can do the same with your branch based on it?

The crashes in nightly may or may not be related to oscar-system#801 but in any case they may be gone (for now) in GAP.jl master, simply because we are right now not using GAP_pkg_juliainterface_jll but rather compile that code from scratch (this will be the case automatically until GAP_pkg_juliainterface_jll is updated to match the latest C source changes). We'll see.

It might not be enough, though, in the end it could be necessary to rebuild libjulia_jll to match recent Julia master changes (I've not yet checked that at all!) and then of course also rebuild GAP_jll and other of our JLLs using it... They keep breaking their kernel API / ABI sigh

@benlorenz
Copy link
Author

I rebased the branch, sorry about the unrelated changes, I needed some updates to make it work at all with nightly but wanted to avoid the latest changes which were incompatible with the Oscar master. At least on ubuntu the julia-nightly crash has disappeared (https://github.com/oscar-system/GAP.jl/runs/6023028755?check_suite_focus=true ) and all tests were successful, macos is queued.

@benlorenz
Copy link
Author

Tests here are fine, but the latest julia nightly did indeed break existing binaries, e.g. CxxWrap fails to precompile when re-running the Polymake.jl tests: https://github.com/oscar-system/Polymake.jl/runs/6023944340?check_suite_focus=true#step:6:298

Copy link
Owner

@fingolfin fingolfin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Esp. for my stupid res vs. res.status bug sigh.

But unfortunately it is more complicated than that: First off, we indeed perform no tests for ExecuteProcess at all, some should be added to make sure it works.

But more seriously, ExecuteProcess is only one of two (resp. three, depending on what you count) GAP mechanisms for loading subprocesses, and it is not even the one that needs the signal handler. The other approach are the so-called "iostreams": where ExecuteProcess is about starting a subprocess and waiting until it terminates, "iostreams" are more advanced and allow communicating with a child process running in the background. These also need to be replaced, in the same way as ExecuteProcess. And that then of course also needs to be tested... sigh
(The third way is via a GAP package called "io", which among other things has a ton of low-level wrappers for all kinds of POSIX APIs ... including fork and execve

So a possible plan would be:

  • add some tests for all this functionality
  • once those are in place, start to replace it (the parts in GAP itself, at least) gradually
  • finally rebuild GAP_jll with the builtin-features
  • (and perhaps eventually also patch the GAP package io, though it is unclear to me how that should reasonably work; maybe more realistic is to adjust packages using it to use slightly higher level APIs instead... sigh)

@@ -96,6 +96,13 @@ function initialize(argv::Vector{String})
# TODO: turn this into a proper libgap API
unsafe_store!(cglobal((:SyLoadSystemInitFile, libgap), Int64), 0)

# remember sigchld handler
if Sys.isbsd() || Sys.islinux()
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GAP.jl only runs on POSIX anyway, so this check shouldn't be necessary:-)

if Sys.isbsd() || Sys.islinux()
sigchld = Sys.isbsd() ? 20 : 17
# nullptr == SIG_DFL
prev_sigchld_hdl = ccall(:signal, Ptr{Cvoid}, (Cint, Ptr{Cvoid}), sigchld, Ptr{Cvoid}(0))
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there actually a handler set? Or does it return NULL?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually it will be NULL, but there might be other software as well that sets a sigchld handler so to be safe I wanted to reinstate whatever was set before GAP replaces it instead of the default.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh sure, I agree this is the right way to do it. I was just wondering

@fingolfin
Copy link
Owner

Oh, there are a bunch of GAP packages relying on each of these mechanisms. So one way to "test" things is to try out stuff in those packages. Though I think a few hand written dedicated tests on the long run will be more helpful

@benlorenz
Copy link
Author

But unfortunately it is more complicated than that: First off, we indeed perform no tests for ExecuteProcess at all, some should be added to make sure it works.

The package installation tests do use the ExecuteProcess and they did fail without my fix, when trying this I had some print statements in the julia to check what it was executing and to check the return code.

@benlorenz
Copy link
Author

A different approach would be to just set the sigchld handler directly before creating a subprocess and restoring it when it is complete, this is what julia does on linux and what it did on macos before the change to kevent. (And also what we do in polymake now.)

But maybe we can first try if just removing the waitpid(-1) loop solves the issue on nightly? My hope is that this would be quicker than all the other replacements, though I am not really sure what happens with the kevent if there is a sigchld handler that just ignores that pid.

@fingolfin
Copy link
Owner

Ahhh right, I forgot that of course PackageManager runs external processes.

Note that the ExecuteProcess function really doesn't need the GAP SIGCHLD handler at all. To the contrary: it stores the current handler, then sets a default handler, and at the end restores all. See https://github.com/gap-system/gap/blob/04ef6d7a8ce59e988f76cc912707f1e1a62e3244/src/sysfiles.c#L2940-L2951 and surrounding code.

Getting rid of the waitpid(-1) loop sounds like a good idea anyway (also when GAP is not built with Julia), though I hope @ChrisJefferson will chime in just because he worked on that code extensively a couple years ago and might remember what the reason for it was, if any

Finally replacing the GAP iostream code by a Julia implementation shouldn't be that hard in principle (I expect that the most annoying bit will be handling errors, converting between Julia and GAP errors... we'll see)

@fingolfin fingolfin merged commit 44d4081 into fingolfin:mh/exec Apr 19, 2022
@fingolfin fingolfin deleted the bl/sigchld branch April 19, 2022 15:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants