Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Detect and warn on stack overflow #6928

Merged
merged 2 commits into from
Oct 14, 2018
Merged

Conversation

damaxwell
Copy link
Contributor

This PR fleshes out PoC #6376.

Closes #271.

src/signal.cr Show resolved Hide resolved
src/signal.cr Outdated
# amount larger than a typical stack frame.
stack_top = Pointer(Void).new([email protected] - 4096)
if stack_top <= addr < Fiber.current.@stack_bottom
LibC.dprintf 2, "Potential stack overflow (e.g., infinite or very deep recursion)\n"
Copy link
Member

@asterite asterite Oct 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would just say "Stack overflow" (with or without that infinite/recursion hint), but I would totally remove the "Invalid memory access" message from above. The user should see just one of those messages, not both. And even if it's a potential stack overflow but not certainly a stack overflow, I think the changes of that a super low so I'd rather have the user see "Stack overflow" and not confuse him with potential things (i.e. "Potential stack overflow? Well, maybe that's not it...").

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additionally, I would raise a StackOverflowError when this happen. Right now it's a bit slow to show the full trace for it, but we should improve that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good idea, but I don't think we can raise an exception from the SIGSEGV handler, because it runs on its own altstack, not the stack that segfaulted (it's full).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, that's true... When I tried it I watched it print "Unhandled exception: ...", but I can't seem to rescue it. So printing an error and exiting is fine. After all, a stack overflow is probably a bug in the app and it usually happens when you start it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forgive my overly cautious error messages. I'll go out on a limb and claim (with certainty!) a stack overflow.

src/signal.cr Outdated
stack_top = Pointer(Void).new([email protected] - 4096)

if stack_top <= addr < Fiber.current.@stack_bottom
LibC.dprintf 2, "Stack overflow (invalid memory access at address 0x%lx)\n", addr
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps expand Stack overflow part into Stack overflow: infinite or too deep recursion (copied from previous PR by @ysbaddaden)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was there originally (edited a bit) but I took @asterite's comment

I would just say "Stack overflow" (with or without that infinite/recursion hint)

as subtext that he felt the hint was superfluous. Otherwise, why bring up "with or without"? So I nixed it. I can revert to the original language if there is a general preference.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, I wouldn't mix "stack ovefflow" with "invalid memory access" in the same message. Most developers don't know or don't care about this fact. Just "Stack overflow" is fine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. In @ysbaddaden 's original work he also reported the address of the seg fault being handled, even when it was a stack overflow. Easy enough to not do this and shorten the error message yet further.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, don't listen to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough. ;-)

But your original point is very well taken: don't bury the main point (stack overflow) on the second line of the error message. That's fixed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's try to settle on something, then let's change the message once. My personal opinion:

  • I'd keep the infinite or too deep recursion hint. It gives an explanation for the blunt stack overflow, which otherwise can be a little intimidating when not familiar with it.
  • I'd drop the invalid memory access at 0x... since it's 99% likely to be a stack overflow, and debugging with gdb/lldb will give access to the actual address for the remaining 1%.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consensus seemed to be reached. The error message has the hint, but no mention of the memory address.

stack_size = rlim.rlim_cur
@stack = Pointer(Void).new(@stack_bottom.address - stack_size)
end
{% end %}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That should work for the main thread, but is it correct for created threads? I don't know what's the default stack size for them. Maybe we should use pthread_getattr_np(3) and pthread_attr_getstacksize(3) for created thread?

Copy link
Contributor

@RX14 RX14 Oct 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not worry about that for now until after threaded GC works. Otherwise it's stalling the PR. I'd like this in 0.27.0

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, but maybe we can add a comment about it :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, a TODO: would be nice

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO's with notes added.

@damaxwell damaxwell force-pushed the stack-overflow branch 2 times, most recently from b00a070 to e578379 Compare October 13, 2018 01:56
Copy link
Contributor

@RX14 RX14 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code is good, just one more simple spec.

@@ -220,3 +220,22 @@ describe "at_exit" do
OUTPUT
end
end

describe "seg fault" do
it "detects stack overflow" do
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about a spec which stack overflows the main stack, not just a fiber stack?

How long does this spec run for (time)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Each spec in kernel_spec.cr costs about 3 seconds apiece to run for a total of 36 seconds on my laptop. The one I added is no different. So I was leery about adding too many of these.

In my actual testing on the various platforms I used the following:

  • Random pointer deref does not trigger stack overflow
  • Trivial infinite recursion on main stack triggers stack overflow
  • Infinite recursion on main stack with modest stack allocations triggers stack overflow
  • Trivial infinite recursion on fiber stack triggers stack overflow (this became the official spec)
  • Infinite recursion on fiber with modest stack allocation triggers stack overflow

All of these take about the same amount of time, except on one of the BSDs (freebsd, I think) the default stack size is gargantuan and ulimit didn't seem to work to restrict it. So testing for the main stack took an inordinate amount of time (a minute or two). This led to the choice of using the fibre stack for the official suite if there was going to be just one spec.

Please let me know which of these you'd like me to add.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@damaxwell I'd go for 1, 3 and 5. Speeding up the stack overflow by allocating a 512 byte staticarray should help the speed of the test on the BSDs, and testing main and fiber stacks is a good idea.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new specs are added, @RX14. It was FreeBSD that seems to have a default stack size of 0.5G, but ulimit works just fine to limit it; not sure what I had done wrong. It takes 2 minutes (in a VM) to exhaust a stack that big. I left a comment in the spec to note the potential trouble and remedy.

@RX14 RX14 added this to the 0.27.0 milestone Oct 14, 2018
@RX14 RX14 merged commit aca0aca into crystal-lang:master Oct 14, 2018
@damaxwell damaxwell deleted the stack-overflow branch October 14, 2018 15:17
@ysbaddaden
Copy link
Contributor

This spec fails systematically on Ubuntu 14.04 for me when running make std_spec:

  1) seg fault detects stack overflow on the main stack

     Failure/Error: error.should contain("Stack overflow")
       Expected:   "Invalid memory access (signal 11) at address 0x7ff92e7c2f08 <..snip...>"
       to include: "Stack overflow"

     # spec/std/kernel_spec.cr:249

@damaxwell
Copy link
Contributor Author

Thanks. It was tested under Ubuntu 16.04 and passes there. I'll get 14.04 up and running and see what's different.

@damaxwell
Copy link
Contributor Author

@ysbaddaden: the failure happens because gmake 3.81 sets the stack rlimit to RLIM_INFINITY=-1. So our computation of the stack top is wrong. Later versions of gmake reset the stack rlimit to something sane before forking out child processes; see this discussion. I'm looking into a fix.

@asterite
Copy link
Member

asterite commented Nov 2, 2018

@damaxwell @ysbaddaden Maybe a bit late, but thank you so much for this! Segfaults that are stack overflows are probably the most common errors reported here, and now that traffic will probably be reduced a lot.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants