tests/python/pants_test/base:exception_sink_integration is flaky #8127
Seen in #8123.
Seen again in #8099.
Seen again on master.
Seen in #8143.
Seen again in master. This is probably our highest-priority flaky test, as it seems to just hang fairly frequently.
Seen again on the OSX shard in #8153. The timeout for this one is now 540, and it takes about 30 seconds to run locally on OSX, so something strange is happening. Maybe we're being forced to re-bootstrap or recompile? Or it is just hanging.
Seen in #8150.
Seen in #8192. It's not the first time I've seen it, but it is the first time I've commented here. Overall, I think there is no doubt that this test regularly exceeds its timeout.
Seen again in master.
Seen again in #8201.
Seen again in #8221 on the OSX shard.
Seen in #8223.
Seen again in the OSX platform-specific tests shard, with a timeout of 540. @stuhood we should probably lower the timeout to less than 540, because this appears to be an issue with hanging forever; that way it fails eagerly. I do not think this is an issue with trying to re-bootstrap.
Seen in #8233 in the OSX platform-specific tests shard.
Seen in #8276 in the OSX platform-specific tests shard.
Seen again in #8113.
Seen again in #8088.
Seen again in #8406.
Seen again in #8452.
I'm looking into this today. I agree with Stu that this is likely our highest-priority flake. Locally, I ran a script to repeat the test until failure. On the first run, it took 71 attempts to fail; on the second, 131 attempts. That translates to failure rates of roughly 1.4% and 0.8%, respectively. In CI, the rate seems closer to 20%, so I'm going to try debugging in CI instead.
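For reference, a minimal sketch of that kind of repeat-until-failure harness (the exact `./pants test` invocation is an assumption here, not the script actually used):

```python
#!/usr/bin/env python3
"""Re-run a test target until it fails, counting attempts."""
import subprocess
import sys

# Hypothetical invocation; substitute the real test command as needed.
CMD = ["./pants", "test", "tests/python/pants_test/base:exception_sink_integration"]

attempt = 0
while True:
    attempt += 1
    result = subprocess.run(CMD, capture_output=True)
    if result.returncode != 0:
        # Report how long it took to reproduce, then dump the failing output.
        print(f"Failed on attempt {attempt}")
        sys.stdout.buffer.write(result.stdout)
        sys.stderr.buffer.write(result.stderr)
        sys.exit(1)
    print(f"Attempt {attempt} passed")
```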
On a successful OSX shard, the test takes 5 minutes to run. Locally on OSX, it takes 30-35 seconds. Something seems to be going on with Travis. These were the individual tests that took longer than local execution:
EDIT: the common denominator for all of these tests is `tests/python/pants_test/base/test_exception_sink_integration.py`, lines 118 to 141 (at 9ef7954).
### Skip some exception sink integration tests on macOS
These shards have chronically flaked by hanging since at least July (#8127). They are our most egregious Python flake. We will still run these four tests on Linux and only skip them on macOS, as this is a macOS-specific issue (see the sketch below).

### Skip `remote::tests::dropped_request_cancels`
This seems to flake roughly 40% of the time (#8405). It is our most egregious Rust flake.

### Tweak some other tests
We mark some other tests as flaky and bump their timeouts as relevant, to hopefully stabilize CI further.
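A minimal sketch of a macOS-only skip in the style of the stdlib test framework (the decorator placement and test name here are illustrative, not the actual change):

```python
import sys
import unittest


class ExceptionSinkIntegrationTest(unittest.TestCase):
    # Skip only when running on macOS; the test still runs on Linux.
    @unittest.skipIf(sys.platform == "darwin", "Hangs on macOS; see #8127.")
    def test_dumps_traceback_on_sigabrt(self):
        ...
```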
### Problem
The setup and teardown of each request made to the nailgun server in `pantsd` had become quite complicated over time, and consequently slower than it needed to be.

### Solution
Port `pantsd`'s nailgun server to Rust using the `nails` crate. Additionally, remove the `Exiter` class, which had accumulated excess responsibilities that can instead be handled by returning `ExitCode` values. Finally, fix a few broken windows, including: double logging to pantsd, double help output, closed-file errors on pantsd shutdown, and redundant setup codepaths.

### Result
There is less code to maintain, and runs of `./pants --enable-pantsd help` take ~1.7s, of which ~400ms are spent in the server.

Fixes #9448, fixes #8243, fixes #8206, fixes #8127, fixes #7653, fixes #7613, fixes #7597.
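A hedged sketch of the Exiter-to-ExitCode shape described above (the names are illustrative; the actual Pants code differs):

```python
import sys
from typing import List

# An alias like this makes intent explicit in signatures.
ExitCode = int


def run(args: List[str]) -> ExitCode:
    # Instead of routing through an Exiter object that wrapped sys.exit()
    # and accumulated side responsibilities, just return the code.
    if not args:
        return 1
    return 0


if __name__ == "__main__":
    # The single place where the process actually exits.
    sys.exit(run(sys.argv[1:]))
```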
I'm monitoring this one to decide what to do.
### Problem
A while back we started capturing core dumps "globally" in Travis, but in practice we have never consumed them. I'm fairly certain that they are causing the OSX shards that test sending `SIGABRT` (which, if core dumps are enabled, will trigger a core dump) to `pantsd` to:

1. Be racey: while the core is dumping, the process is non-responsive and can't be killed, leading to errors like:
```
FAILURE: failure while terminating pantsd: failed to kill pid 28775 with signals (<Signals.SIGTERM: 15>, <Signals.SIGKILL: 9>)
```
2. Run out of disk space: we've seen mysterious "out of disk" errors on the OSX shards, and core dumps are large.

### Solution
Disable core dumps everywhere. If we end up needing them in the future, we can enable them on a case-by-case basis.

### Result
Fixes #8127.

[ci skip-rust-tests] [ci skip-jvm-tests]
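The actual change was presumably made in the Travis configuration (e.g. via `ulimit -c 0`), but for illustration, the equivalent from within a Python process looks like this:

```python
import resource

# Setting the core-file size limit to 0 prevents SIGABRT (and other
# core-dumping signals) from writing a core file, so the process can be
# killed promptly and no disk space is consumed by dumps.
resource.setrlimit(resource.RLIMIT_CORE, (0, 0))
```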
When run locally, this test completes relatively quickly. But in some fraction of runs, it seems to hang forever, triggering the 360-second test timeout in Travis.