Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BoehmGCStackAllocator: ignore stack protection page #4264

Conversation

roberth
Copy link
Member

@roberth roberth commented Nov 16, 2020

Fix potential cause of #4178 (comment)

roberth added a commit to hercules-ci/nixpkgs that referenced this pull request Nov 16, 2020
@roberth roberth marked this pull request as ready for review November 17, 2020 14:35
@roberth
Copy link
Member Author

roberth commented Nov 17, 2020

Update: Boehm GC seems to be incompatible with coroutines. I haven't gotten to the bottom of this, but disabling GC during source addition seems like an acceptable workaround.
For details, see the inline comments.

@edolstra
Copy link
Member

Eh. That seems super ugly. It's probably better to go back to the old situation (i.e. not having GC roots on coroutine stacks).

roberth added a commit to hercules-ci/nixpkgs that referenced this pull request Nov 17, 2020
@roberth
Copy link
Member Author

roberth commented Nov 17, 2020

Eh. That seems super ugly.

I agree. I wish it was simple.

It's probably better to go back to the old situation (i.e. not having GC roots on coroutine stacks).

What would that old situation be?

I know #4030 introduced a case of evaluation in coroutines but #3931 predates that.
For #4030 this can be accomplished by enumerating the files before dumping to the daemon, but that doesn't attempt to fix the problem causing #3931.

It seems likely that the mere existence of a coroutine can crash the collector. The gdb stack traces all point to the collector code that finds roots in the stacks of all threads.

These seem to be the options to move forward:

  1. use threads instead of coroutines. This could be a drop-in replacement, but probably has a slight performance cost.
  2. use callbacks instead of coroutines. A lot of work, less readable, better performance.
  3. investigate precisely why the collector crashes and fix the real issue.
  4. figure out what else might cause nixFlakes's nix-build segfaults in boehm-gc #3931. Might lead to another case of eval on coroutine or some new insight.
  5. try to avoid GC allocations while any coroutines are running, which we won't be able to guarantee; crash will resurface.
  6. try to avoid GC allocations while any coroutines are running and disable the collector while we know a coroutine exists. Like previous but avoids crash. Instead causes a space leak while any coroutine exists. (this PR + enumerate files before dumping)
  7. something else?

I hope this can be fixed soon. 1 and 6 seem to be the shortest paths to a Nix that doesn't crash.

@Ericson2314
Copy link
Member

Honestly, we might consider replacing coroutines with plain old threads?

@Ma27
Copy link
Member

Ma27 commented Nov 24, 2020

Yesterday I debugged a nixops deployment where the evaluation resulted in a segfault (that was hidden by nixops not showing the output). Finally, a minimal NixOS configuration to reproduce the problem with nix (Nix) 2.4pre20201118_79aa7d9 was this:

{
  minimal_segfault_example = { pkgs, lib, config, ... }: {
    environment.etc."foo".text = "${builtins.path {
      path = pkgs.path;
      filter = pkgs.lib.cleanSourceFilter;
      name = "nixpkgs-${pkgs.lib.version}";
    }}";
  };
}

Long story short, this PR appears to solve the problem. In case anybody else stumbles upon this issue as well, the workaround is quite easy, just do nix-shell -p nix and re-run nixops deploy.

Unfortunately I didn't closely follow the changes related to that, so I don't have an actual opinion on how to proceed here.

cc @lheckemann

@jonringer
Copy link
Contributor

For others who land on this page.

Until there's a proper fix, you may try adding GC_DONT_GC=1 to the command.

GC_DONT_GC=1 nix build ...

this seems to allow IFD to complete successfully for me

dhess pushed a commit to hackworthltd/hacknix that referenced this pull request Jan 2, 2021
dhess pushed a commit to hackworthltd/hacknix that referenced this pull request Jan 3, 2021
dhess pushed a commit to hackworthltd/hacknix that referenced this pull request Jan 3, 2021
@stale
Copy link

stale bot commented Jun 2, 2021

I marked this as stale due to inactivity. → More info

@stale stale bot added the stale label Jun 2, 2021
@dhess
Copy link

dhess commented Jun 2, 2021

I suspect this is still a problem.

@stale stale bot removed the stale label Jun 2, 2021
@roberth
Copy link
Member Author

roberth commented Jun 2, 2021

We might have learned something more if we'd merged the first commit. We can still do that.

At least that's a proper fix for the problem where GC causes an apparent "stack overflow".


Crashes should be top priority, right? For me they are, but the lack of action from the maintainers of the project on this issue has made me follow their example. I regret that, but I'm glad I could avoid an alternative that would have been worse.

If someone could make a decision, that would be nice.

@Ma27
Copy link
Member

Ma27 commented Jun 2, 2021

My example from above still segfaults on 2.4 and evaluates fine on 2.3, so this is still a problem.

That said I'm starting to wonder if it was a good idea to "disband the Nix core team" (RFC0044) in favor of the steering committee. The first team was solely "responsible" for the development of Nix itself, the second one is responsible for the development of the entire ecosystem. Having an unfixed crash on a - well - unstable, but heavily & officially used version of Nix for months - even though there's a working fix - is IMHO a symptom of that, just as a rather non-transparent roadmap is.

Don't get me wrong - I use nixUnstable myself and I'm fairly grateful for the work that has been done here, but I'm starting to ask myself if such a team that used to bring in a certain kind of structure is what's missing in the ongoing Nix 2.4 development.

But anways, cc @edolstra for the actual problem.

@roberth
Copy link
Member Author

roberth commented Jul 1, 2021

Done in #4944

@roberth roberth closed this Jul 1, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants