Fix deadlocked nix-daemon zombies on darwin #3294 #6052

Merged

Conversation

roberth
Member

@roberth roberth commented Feb 6, 2022

This changes the representation of the interrupt callback list to be safe to use during interrupt handling.

Holding a lock while executing arbitrary functions is something to avoid in general, because of the risk of deadlock.

Such a deadlock occurs in #3294 where ~CurlDownloader tries to deregister its interrupt callback.

This appears to happen during a triggerInterrupt() call made by the daemon connection's MonitorFdHup thread. I cannot confirm this part from the stack trace; it is based on reading the code, so there is no absolute certainty, but it is the most plausible explanation I have found.

Fixes #3294

At first I tested a simpler solution: copying the list of interrupt handlers before running them. That would still be fragile, however, because it does not satisfy the requirement that insertions into and deletions from the interrupt handler list take effect immediately, with disastrous consequences if and when code depends on this for lifetime safety.
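
For readers following along, here is a minimal sketch of one way to run callbacks without holding the lock while still letting removals take effect for callbacks that have not started yet. It is not the code from this PR: the class name `InterruptCallbacks`, the `Token` type, and the `add`/`remove`/`trigger` methods are all hypothetical names chosen for illustration. The idea is to key callbacks by a monotonically increasing token and to take the lock only long enough to fetch the next callback.

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <mutex>
#include <utility>

// Illustrative sketch only; names are hypothetical, not the Nix API.
class InterruptCallbacks {
public:
    using Token = std::uint64_t;

    // Register a callback and return a token that identifies it.
    Token add(std::function<void()> cb) {
        std::lock_guard<std::mutex> lock(mutex_);
        Token t = nextToken_++;
        callbacks_.emplace(t, std::move(cb));
        return t;
    }

    // Deregister a callback. Safe to call from inside a running callback,
    // because trigger() does not hold the mutex while callbacks execute.
    void remove(Token t) {
        std::lock_guard<std::mutex> lock(mutex_);
        callbacks_.erase(t);
    }

    // Run every registered callback in token order.
    void trigger() {
        Token next = 0;
        for (;;) {
            std::function<void()> cb;
            {
                // Hold the lock only while looking up the next entry.
                std::lock_guard<std::mutex> lock(mutex_);
                auto it = callbacks_.lower_bound(next);
                if (it == callbacks_.end())
                    break;
                next = it->first + 1;
                cb = it->second; // copy, so the entry may be erased meanwhile
            }
            cb(); // executed with the lock released
        }
    }

private:
    std::mutex mutex_;
    Token nextToken_ = 0;
    std::map<Token, std::function<void()>> callbacks_;
};
```

Because the map is consulted again before each callback, a callback removed during `trigger()` is skipped if it has not started yet, which the copy-the-whole-list approach cannot guarantee. In this sketch a callback that is already running on another thread can still finish after `remove` returns, so it does not by itself provide full lifetime safety.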

It is unclear to me why the deadlock would only occur on darwin, but I can speculate that there might be some nondeterminism involved that just happens to work out fine on Linux, but not on macOS.

@edolstra edolstra added this to the nix-2.7 milestone Feb 7, 2022
@Mic92
Member

Mic92 commented Feb 11, 2022

We (@flokli and I) also saw deadlocks on NixOS recently that were only resolved by killing all processes in the systemd service.
This might be related. I will have a closer look the next time it happens.

@edolstra edolstra merged commit e2422c4 into NixOS:master Feb 21, 2022
Development

Successfully merging this pull request may close these issues.

Nix-daemon process leaking on nix-darwin Mac build machine
3 participants