
Segfault when repeatedly calling endpoint and compiled in release mode #790

Closed
timvisee opened this issue Oct 9, 2018 · 12 comments
Labels: no bug (The reported bug was confirmed nonexistent)

Comments

timvisee commented Oct 9, 2018

I seem to be getting a segfault when repeatedly calling a Rocket endpoint when running a build in release mode. I'm calling this endpoint from a browser using an axios GET request. I'm not quite sure whether Rocket itself is causing this segfault, as I'm not too familiar with debugging and solving these issues. Because of that I'm mostly guessing in the wild and trying various things, which I'll report here. Maybe it's caused by a different crate or by the latest Rust nightly, in which case I will escalate the issue.

Simplify endpoint/route
I attempted to simplify the Rocket endpoint as much as possible, by just returning a Json<bool> and without doing any special logic inside the endpoint/route itself. The segfault kept happening.
The endpoint that was still causing segfaults can be seen here.
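
For reference, the stripped-down route looked roughly like this (a sketch assuming Rocket 0.4's rocket_contrib::json::Json; the path here is a placeholder, not the exact one from the project):

#![feature(proc_macro_hygiene, decl_macro)]
#[macro_use] extern crate rocket;
extern crate rocket_contrib;

use rocket_contrib::json::Json;

// Stripped-down route: no state and no logic, just a constant JSON body.
#[get("/visualizer/points")]
fn visualizer_points() -> Json<bool> {
    Json(true)
}

fn main() {
    rocket::ignite()
        .mount("/", routes![visualizer_points])
        .launch();
}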

System allocator
The project uses the default allocator (jemalloc). Switching to the system allocator using the following code does prevent the segfault from happening; however, this shouldn't be the solution:

#![feature(alloc_system, allocator_api)]
extern crate alloc_system;
use alloc_system::System;

// Use the operating system's allocator instead of the default (jemalloc).
#[global_allocator]
static A: System = System;

Valgrind
I ran the project through valgrind with the default options and got various different traces. I don't know whether these are useful, as I'm not too familiar with these segfaults. Here is one of those traces:

==25874== Memcheck, a memory error detector
==25874== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==25874== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==25874== Command: ./target/release/cant-touch-this
==25874==
Initializing core...
Not loading templates, no file exists
🔧  Configured for production.
    => address: 0.0.0.0
    => port: 8000
    => log: critical
    => workers: 2
    => secret key: provided
    => limits: forms = 32KiB
    => keep-alive: 5s
    => tls: disabled
    => [extra] template_dir: "./res/templates"
🚀  Rocket has launched from http://0.0.0.0:8000
==25874==
==25874== Process terminating with default action of signal 11 (SIGSEGV)
==25874==  Bad permissions for mapped region at address 0x6600000
==25874==    at 0x3DFF98: je_tcache_bin_flush_small (tcache.c:0)
==25874==    by 0x3C1C6A: je_tcache_dalloc_small (tcache.h:419)
==25874==    by 0x3C1C6A: je_arena_sdalloc (arena.h:1499)
==25874==    by 0x3C1C6A: je_isdalloct (jemalloc_internal.h:1195)
==25874==    by 0x3C1C6A: je_isqalloc (jemalloc_internal.h:1205)
==25874==    by 0x3C1C6A: isfree (jemalloc.c:1921)
==25874==    by 0x14D8F1: core::ptr::drop_in_place (in /home/timvisee/projects/cant-touch-this/target/release/cant-touch-this)
==25874==    by 0x1536C5: cant_touch_this::web::server::rocket_route_fn_visualizer_points (in /home/timvisee/projects/cant-touch-this/target/release/cant-touch-this)
==25874==    by 0x287AB9: <F as rocket::handler::Handler>::handle (in /home/timvisee/projects/cant-touch-this/target/release/cant-touch-this)
==25874==    by 0x2A250A: rocket::rocket::Rocket::route_and_process (in /home/timvisee/projects/cant-touch-this/target/release/cant-touch-this)
==25874==    by 0x29FF15: <rocket::rocket::Rocket as hyper::server::Handler>::handle (in /home/timvisee/projects/cant-touch-this/target/release/cant-touch-this)
==25874==    by 0x28EA78: <hyper::server::Worker<H>>::handle_connection (in /home/timvisee/projects/cant-touch-this/target/release/cant-touch-this)
==25874==    by 0x2A6E07: hyper::server::listener::spawn_with::{{closure}} (in /home/timvisee/projects/cant-touch-this/target/release/cant-touch-this)
==25874==    by 0x2A4646: std::sys_common::backtrace::__rust_begin_short_backtrace (in /home/timvisee/projects/cant-touch-this/target/release/cant-touch-this)
==25874==    by 0x2A4C16: _ZN3std9panicking3try7do_call17h53241b36e4781d0eE.llvm.1788207313042385272 (in /home/timvisee/projects/cant-touch-this/target/release/cant-touch-this)
==25874==    by 0x3B9C09: __rust_maybe_catch_panic (lib.rs:102)
==25874==
==25874== HEAP SUMMARY:
==25874==     in use at exit: 310,727 bytes in 1,269 blocks
==25874==   total heap usage: 4,587 allocs, 3,318 frees, 710,817 bytes allocated
==25874==
==25874== LEAK SUMMARY:
==25874==    definitely lost: 0 bytes in 0 blocks
==25874==    indirectly lost: 0 bytes in 0 blocks
==25874==      possibly lost: 4,256 bytes in 14 blocks
==25874==    still reachable: 306,471 bytes in 1,255 blocks
==25874==                       of which reachable via heuristic:
==25874==                         stdstring          : 10,938 bytes in 243 blocks
==25874==         suppressed: 0 bytes in 0 blocks
==25874== Rerun with --leak-check=full to see details of leaked memory
==25874==
==25874== For counts of detected and suppressed errors, rerun with: -v
==25874== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
fish: “and valgrind ./target/release/c…” terminated by signal SIGSEGV (Address boundary error)

And two other traces (Gist).

Question
So, the question of course is: what is causing these segfaults, and how can we solve it? Is it caused by an issue in Rocket or in another crate, or is it my own wrongdoing?

Sadly, I have no idea at all what the cause could be. I saw quite a few serde-related entries for (de)serializing JSON before, so I thought serde was the problem. However, the issue kept occurring when I started returning a plain String.

Version and system information:

Type        Version
Rocket      v0.4-dev @ f857f81 (from git)
Host        Ubuntu 18.04 (Linux 4.15.0-33-generic x86_64)
rustc       rustc 1.31.0-nightly (423d81098 2018-10-08)
cargo       cargo 1.31.0-nightly (ad6e5c003 2018-09-28)
My project  timvisee/cant-touch-this#89af935

Additional notes

  • This only occurs in --release mode.
  • It seems to have been happening for about two weeks (while I was using 46da03c).
  • I'm calling the endpoint every 50 ms, so about 20 times a second.
  • It looks like about 39 requests succeed, after which the ~40th fails.
  • Returning a String (with "".into()) instead of a Json value doesn't prevent the segfault.
  • Normal (successful) requests aren't shown in the (valgrind) logs, as I've set logging to critical.
timvisee added a commit to timvisee/cant-touch-this that referenced this issue Oct 9, 2018
SergioBenitez (Member) commented Oct 9, 2018

Unfortunately, we're missing some debug symbols that could tell us where the issue lies. Can you enable debug symbols in release mode and try running valgrind again? See this Stack Overflow question for how.
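
For example, adding the following to Cargo.toml keeps debug symbols in release builds:

[profile.release]
debug = true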

timvisee (Author) commented Oct 9, 2018

@SergioBenitez Here is a valgrind trace with debug = true for the release profile:

https://gist.github.com/timvisee/07ce1fb24bd6d2014a07bada9e71e2fd

I noticed I can also enable verbose logging and full leak checking for extra details. If that would be useful, please let me know, because these logs seem to be quite big.

SergioBenitez (Member) commented

Thanks! This may be related to rust-lang/rust#54478.

Please run valgrind with --keep-stacktraces=alloc-and-free --track-origins=yes.
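
For reference, the full invocation against the release binary from the log above would be something like:

valgrind --keep-stacktraces=alloc-and-free --track-origins=yes ./target/release/cant-touch-this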

SergioBenitez added the triage (A bug report being investigated) label Oct 10, 2018
SergioBenitez changed the title from "Segfault when repeatedly calling endpoint in release" to "Segfault when repeatedly calling endpoint and compiled in release mode" Oct 10, 2018
SergioBenitez (Member) commented Oct 10, 2018

@timvisee I tried building your project to diagnose this, but it looks like it requires a native library, libleap, which appears to be the Leap SDK. Unfortunately, the SDK is gated behind registration, so I'm not able to install it.

In any case, I took a closer look at the source of your application. There is no unsafe code from Rocket that gets run on drop, and there's no unvetted unsafe code between the handler and the drop. As such, I'd be hard-pressed to find that Rocket is at fault here, though I don't mean to discount it fully.

I tried replicating what you have in your source code, minus the parts that touch the SDK, starting a Linux (Debian 9) server on a 24-core machine and ramming it with hundreds of millions of requests to try to elicit a segfault. Alas, no segfault.

I wonder if the use of the native library could be contributing here. One thing I would try is removing all of the calls to .manage() in the construction of Rocket to see if that affects anything, since it looks like you're managing structures that point to FFI structures.
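
For illustration, a rough sketch of that experiment (the managed state types here are hypothetical stand-ins, not the project's real ones):

#![feature(proc_macro_hygiene, decl_macro)]
#[macro_use] extern crate rocket;

// Hypothetical stand-ins for the FFI-backed structures the project manages.
struct SensorState;
struct TemplateStore;

#[get("/")]
fn index() -> &'static str {
    "ok"
}

fn main() {
    rocket::ignite()
        .mount("/", routes![index])
        // Temporarily disabled to rule out the managed FFI-backed state:
        // .manage(SensorState)
        // .manage(TemplateStore)
        .launch();
}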

In any case, once you're able to run valgrind with those parameters, we should have a pretty good idea of what's at fault here.

timvisee (Author) commented

@SergioBenitez Thank you for looking into the problem so thoroughly.

I'm sure you're correct that Rocket shouldn't be the culprit here, as it doesn't have any unsafe code that is relevant to this.

The Leap SDK indeed requires user registration, and it is quite a challenge to install it properly anyway. If you're interested, I pulled the whole Leap SDK out of the project in the rocket-segfault-debug-no-leap branch, so you should be able to compile it now. The project should start a server providing a very basic web interface. This interface has a Visualize toggle; when enabled, it initiates the repeated requests, as you'll be able to see in your browser's network inspector. The project isn't really functional anymore, but the segfault still occurs.

As I've mentioned, fully removing the use of the leap-rs library (I believe this is the native library you are talking about) doesn't prevent the segfault from occurring. Commenting out the two .manage(...) calls I'm using doesn't solve the problem either.

One interesting finding is that the program doesn't seem to crash anymore when not running it through valgrind. I don't know whether that is normal. It's also quite interesting that you don't seem to be getting a segfault with a similar configuration; I wonder whether the same happens when running my project.

Anyhow, here are two different segfault traces with the --keep-stacktraces=alloc-and-free --track-origins=yes parameters you suggested (run on f0e32cc6):
https://gist.github.com/timvisee/3d0ee795cda3b633a66d20a015d5dc4a

jebrosen (Collaborator) commented

Your latest traces from valgrind look pretty bad -- jemalloc and the system allocator appear to be clashing with each other, which would explain why using the system allocator "fixes" everything. That said, I've had problems using valgrind in the past when jemalloc is involved so I'm not very confident in that analysis.

Would it be possible to compile to the x86_64-unknown-linux-musl target and see if it makes a difference, in debug and release mode? That might help isolate the issue to jemalloc or glibc's allocator.
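
For reference, the usual steps for that (assuming the musl toolchain, e.g. the musl-tools package on Debian/Ubuntu, is installed):

rustup target add x86_64-unknown-linux-musl
cargo build --target x86_64-unknown-linux-musl             # debug
cargo build --release --target x86_64-unknown-linux-musl   # release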

timvisee (Author) commented

@jebrosen First of all, thank you for looking into it.

Interesting. Are you suggesting that valgrind may be causing a different allocator to be used? Because the project doesn't seem to crash at this moment when running it directly (so not through valgrind), while previously it did.

I'll attempt to compile for the musl target soon.

timvisee (Author) commented Oct 11, 2018

A minor update. I'm currently testing the project on a different machine (running Manjaro Linux), which shows different problems.

First of all, the temporary patch for using the system allocator does not work on this system. It crashes anyway, no matter what allocator is used.

Also, the project seems to crash immediately (without sending any request to Rocket) on this machine when I have a Leap Motion device connected to the computer through a USB port, in the project where the leap library is still available. Disconnecting the device does indeed prevent the problem from happening on start. This build doesn't crash when doing repeated requests to a Rocket endpoint. I believe that this does indeed suggest Rocket isn't the core of the problem here. The leap library doesn't seem to be the problem either, though, as fully removing the library from the project doesn't prevent the segfaults from happening, as I've said.
I would really like to try removing each library separately to figure out what might be causing it, but that is quite a lot of work (if even possible), so I'll drop that idea.

When running the project without the leap library in it, I'm getting similar results to what I reported on the other machine. When running it normally, no segfault occurs now. When running it through valgrind, a segfault comes up when making requests to Rocket.

Oh, and I haven't compiled for musl yet.
Edit: I can't test compiling for musl, sadly, because some crate requires ring, which doesn't seem to support musl targets.

By the way, if you believe I should be searching for the issue somewhere else, because it probably isn't directly related to Rocket, you can close this issue.

jebrosen (Collaborator) commented

> Are you suggesting that valgrind may be causing a different allocator to be used?

No, not that. In the past I've had valgrind itself crash or misreport because it didn't understand how jemalloc worked. Either valgrind or jemalloc has changed since then, and they appear to be compatible right now; I was just trying to point out that some of the reports might (still) be false positives.

> I can't test compiling for musl, sadly, because some crate requires ring, which doesn't seem to support musl targets.

I deploy to the musl target, so I know it's possible. Was it really ring itself that failed, or were you maybe missing part of the musl toolchain? Regardless, I don't think testing with musl at this point will help solve the original problem any faster than the other things you've tested.

Do you have valgrind reports (with the extra flags from before) for any of the new failures?

SergioBenitez (Member) commented

I believe that what @jebrosen is suggesting - that Valgrind is causing the segfault due to its strange interactions with jemalloc - is correct (see rust-lang/rust#49183). @timvisee Do you see segfaults or incorrect outputs in your Rocket application, with or without jemalloc, without using valgrind? If not, I think there's sufficient evidence to close this issue.

SergioBenitez (Member) commented

Given what we've learned, I'm fairly certain there isn't an issue with Rocket here. As such, I'm closing this out.

SergioBenitez added the no bug (The reported bug was confirmed nonexistent) label and removed the triage (A bug report being investigated) label Oct 22, 2018
timvisee (Author) commented

Sadly, because I'm quite busy at the moment, I couldn't experiment any further, so sorry for not responding.

I'm sure you're right though. Thanks a bunch anyway for looking into the problem!
