test: locklinear.go failures with "lockmany: too slow" #32986
Is the machine just overloaded?
The timings for 1000 ops being all over the map - 42ms, 13ms, 9ms, 5ms, 22ms - is a sign that the machine is just overloaded. Maybe if the test can't even get something consistent for 1000 ops it should just skip itself.
This turns out not to be builder-specific — it has failed on a number of slowish builders.
2021-01-06T18:41:54-4c668b2/linux-ppc64-buildlet
CC @aclements
2021-09-22T15:46:33-5ee32ff/solaris-amd64-oraclerel
I did a large number of all.bash runs on FreeBSD for other purposes, but got a fair number of locklinear failures in the process. On freebsd-amd64-12_2, 10 of 898 runs failed in locklinear (1.1%). On freebsd-amd64-12_3, 6 of 912 runs failed in locklinear (0.66%). Those are fairly different rates given the number of runs, which I find a little surprising, but I think they give us a sense of the ambient failure rate of this test when it's run in the context of everything else all.bash does.
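As a quick sanity check on those two rates (my own back-of-the-envelope calculation, not part of the original comment), a simple two-proportion z-test suggests the gap between 1.1% and 0.66% is within what sampling noise alone would produce at these sample sizes:

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	// Two-proportion z-test on the observed locklinear failure rates:
	// 10 failures in 898 runs (freebsd-amd64-12_2) vs. 6 in 912 (freebsd-amd64-12_3).
	x1, n1 := 10.0, 898.0
	x2, n2 := 6.0, 912.0
	p1, p2 := x1/n1, x2/n2
	pooled := (x1 + x2) / (n1 + n2)
	se := math.Sqrt(pooled * (1 - pooled) * (1/n1 + 1/n2))
	z := (p1 - p2) / se
	fmt.Printf("p1=%.4f p2=%.4f z=%.2f\n", p1, p2, z) // z ≈ 1.0: not a significant difference
}
```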
2022-02-11T20:01:24-bcee121/windows-386-2008 (Note that […])
Given that this test won't be run by users as part of […].

(That is: we should make a decision as to whether the test is important enough to deflake. If it is, we should deflake it; otherwise, we should suppress the flakes.)
cc @golang/runtime |
I think the test is useful, i.e. verifying that the asymptotic behavior of contended locks is linear.

Looking more closely at the test, it already has a few workarounds for flakiness that appear to have been added in the few months after it landed. There's an inherent flakiness to this kind of test, especially since it calls […].

One thing that jumps out at me, though, is that the test doesn't ever block to clean up goroutines after each attempt, so theoretically it's possible that one failure could exacerbate subsequent ones if the problem is transient oversubscription (especially given the […]). Another thing that seems kind of suspicious is that […]. Maybe it's possible that on these slow platforms there's some kind of cache coherency breakdown? It's hard for me to tell whether the […].

I'm inclined to do two things to make this test more robust: […]
And then we wait to see if it fails again. Does this seem reasonable? If we don't want to perturb things too much, we can just do (2). |
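A minimal sketch of the goroutine-cleanup point above (hypothetical code; the real test's structure differs): time each attempt only after every goroutine it spawned has exited, so one slow attempt cannot leak leftover load into the next measurement.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// timeLocked is a hypothetical sketch, not the test's actual code: it times
// `workers` goroutines each doing `ops` Lock/Unlock pairs on a shared mutex,
// and crucially waits for every goroutine to exit before returning.
func timeLocked(workers, ops int) time.Duration {
	var mu sync.Mutex
	var wg sync.WaitGroup
	start := time.Now()
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < ops; j++ {
				mu.Lock()
				mu.Unlock()
			}
		}()
	}
	wg.Wait() // block until this attempt's goroutines are all done
	return time.Since(start)
}

func main() {
	fmt.Println(timeLocked(8, 1000), timeLocked(16, 1000))
}
```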
I made the aforementioned changes, and it definitely helps a bit. Even on my linux/amd64 GCP instance I got some results that were all over the place, forcing a number of retries before my changes; there's still often at least one retry after them.

But, thinking more about the test, comparing individual runs is always going to be fraught, so I instead made sure 10 runs are done apiece and ran a t-test on the results, just to see what would happen. The p-value threshold I set is 0.005; if a comparison doesn't meet that, I discard the results and pass the test. If it does, then I just make sure the 2n case's mean is within 3x of the n case's. AFAICT this is more robust than before? In theory, if there's no significant difference (due to e.g. noise) or the difference is inverted for some reason, we can just discard the result entirely. We can also adjust our p-value threshold until we feel comfortable with the rate of false positives (though the 0.005 threshold and the fairly lax constraints should already make that very low, and it should still detect real regressions).

It does unfortunately make the test slower, but the methods are more rigorous. This method of testing also seems like it could be useful for other tests in this directory (like […]).
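A rough sketch of that decision logic (hypothetical code, assuming an API shaped like go-moremath's `stats.TwoSampleWelchTTest` rather than the CL's actual helpers):

```go
package main

import (
	"fmt"

	"github.com/aclements/go-moremath/stats"
)

// checkLinear is a hypothetical sketch: timesN and times2N hold ~10 timings
// for the n-goroutine and 2n-goroutine cases. It only reports a failure when
// the difference is statistically significant (p < 0.005) AND the 2n mean
// exceeds 3x the n mean; anything else is treated as noise or as acceptable.
func checkLinear(timesN, times2N []float64) error {
	s1 := stats.Sample{Xs: timesN}
	s2 := stats.Sample{Xs: times2N}
	r, err := stats.TwoSampleWelchTTest(s1, s2, stats.LocationDiffers)
	if err != nil {
		return err
	}
	if r.P >= 0.005 {
		return nil // no significant difference; discard the comparison as noise
	}
	if s2.Mean() <= 3*s1.Mean() {
		return nil // significant, but still within the linearity bound
	}
	return fmt.Errorf("2n case took %.1fx the n case (p=%.4g)", s2.Mean()/s1.Mean(), r.P)
}

func main() {
	n := []float64{1.0, 1.1, 0.9, 1.05, 0.95}
	n2 := []float64{2.1, 2.0, 2.2, 1.9, 2.05}
	fmt.Println(checkLinear(n, n2)) // <nil>: ~2x is within the 3x bound
}
```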
It occurs to me that I never stated this plainly: I don't think this failure represents a real regression. I think it's very clear from all my runs that the invariant it intends to check is generally upheld, and I also don't see some significant regression here in terms of how often it's upheld. What I understand this issue to be about is a flaky test and trying to make it less flaky. |
I agree, and I think the current labels ([…]).

But I also think we do need to make a decision about what information the test is providing. If these […].

So I do think the action here is to decide: (a) is the information provided by the test worth the cost to deflake? and (b) if so, how should we deflake it? (If the answer to (a) is “no”, then we can just delete it instead. 😅)
Looking at the test body and the most recent […].
That looks like this arbitrary-timeout cutoff: […]

In addition to that cutoff, there is the 5-consecutive-failure cutoff (which results in failure if the running time was above the timeout), and an additional “5 inversions means too noisy” cutoff (https://cs.opensource.google/go/go/+/master:test/locklinear.go;l=54-57;drc=97677273532cf1a4e8b181c242d89c0be8c92bb6). But the test doesn't actually increase n […].

So I wonder: would it make sense to increase n […]?
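A hypothetical sketch of that “grow n instead of just retrying” idea (names and structure are purely illustrative, not the test's actual code): when the linearity check fails, double n and try again at the larger size rather than re-running at the same size.

```go
package main

import (
	"fmt"
	"time"
)

// growNUntilLinear doubles n on each failed check, up to a cap, rather than
// retrying the same size. measure is any timing function for n contenders.
func growNUntilLinear(measure func(n int) time.Duration, start, max int) error {
	for n := start; n <= max; n *= 2 {
		t1 := measure(n)
		t2 := measure(2 * n)
		if t2 < 3*t1 { // 2n within ~3x of n: linear enough for this test
			return nil
		}
	}
	return fmt.Errorf("contention did not look linear even at n=%d", max)
}

func main() {
	// Stand-in measurement that scales linearly, just to exercise the loop.
	linear := func(n int) time.Duration { return time.Duration(n) * time.Microsecond }
	fmt.Println(growNUntilLinear(linear, 1000, 16000)) // <nil>
}
```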
Change https://go.dev/cl/409894 mentions this issue: |
WDYT about my previous comment on just running a t-test (like benchstat)? I agree that the noise reduction attempts here aren't particularly rigorous. That, and removing the arbitrary cutoff (I left it in for now, but yeah it likely just makes things worse; a consistent failure due to high constant overheads would be better, so we can just fix the test). I put up https://go-review.googlesource.com/c/go/+/409894 if you'd like to take a look. There's a ton of copied code, but the same infrastructure could come in handy for other tests. It's unfortunate that we can't really depend on x/perf here, because that would avoid all the duplication. Honestly, this is kind of a performance test, and I think we should treat it like one. That might mean pulling this out and merging it into the rest of the performance tests once we have a dashboard. |
TBH my statistics chops were never great to begin with, and have only dulled further with time. 😅 I agree that a t-test seems like the right approach in general, although I'm not sure what an appropriate p-value threshold would be for this kind of test. I do worry a bit that the test could decay if some regression causes it to consistently miss the significance threshold. Maybe we could guard against that by keeping the “increasing n” […].
@mknyszek, I think you've got the right idea with a t-test here. In particular, don't have any retries; instead just collect several samples from both the less and more contended scenarios to build up two distributions, and then make a decision based on those distributions. I haven't looked at your code closely, but your commit message describes what I think is the right approach.

I do worry that picking a p-value threshold is effectively picking how flaky we're willing to allow this test to be, but nevertheless I think this is worth a shot. It's at least a lot more principled. Repeating the significance test if it fails sort of defeats the purpose of a significance test, so I don't think we should do that. Increasing n might help to smooth out the results, but that's basically just baking in a mean over a larger sample period. In general, statistical analysis should do better if we give it more samples with less pre-filtering, so it has as much information as possible.
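A tiny numeric illustration of that retry point (my own hypothetical numbers, not from the thread): if a genuinely regressed configuration would still slip past a single significance check with probability q, then allowing up to k retries-until-pass inflates the miss rate to 1 - (1 - q)^k.

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	// Hypothetical: q is the chance one noisy trial fails to flag a real
	// regression; k is the number of retries allowed before giving up.
	q, k := 0.2, 5.0
	missRate := 1 - math.Pow(1-q, k)
	fmt.Printf("miss rate with %v retries: %.2f\n", k, missRate) // ≈ 0.67
}
```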
Change https://go.dev/cl/411396 mentions this issue: |
Change https://go.dev/cl/411395 mentions this issue: |
This CL copies code from github.com/aclements/go-moremath/stats and github.com/aclements/go-moremath/mathx for Welch's t-test. Several existing tests in the Go repository check performance and scalability, and this import is part of a move toward a more rigorous measurement of both.

Note that the copied code is already licensed to the Go Authors, so there's no need to worry about additional licensing considerations.

For #32986.

Change-Id: I058630fab7216d1a589bb182b69fa2231e6f5475
Reviewed-on: https://go-review.googlesource.com/c/go/+/411395
Reviewed-by: Michael Pratt <[email protected]>
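For reference, Welch's t-test compares two samples with possibly unequal variances; the statistic and the Welch–Satterthwaite degrees of freedom are:

$$ t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}, \qquad \nu \approx \frac{\left(s_1^2/n_1 + s_2^2/n_2\right)^2}{\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}} $$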
This change moves test/locklinear.go into the sync package tests, and adds a bit of infrastructure since there are other linearity-checking tests that could benefit from it too. This infrastructure is also different than what test/locklinear.go does: instead of trying really hard to get at least one success, we instead treat this like a performance test and look for a significant difference via a t-test. This makes the methodology behind the tests more rigorous, and should reduce flakiness, as transient noise should produce an insignificant result. A follow-up CL does more to make these tests even more robust.

For #32986.

Change-Id: I408c5f643962b70ea708930edb4ac9df1c6123ce
Reviewed-on: https://go-review.googlesource.com/c/go/+/411396
Reviewed-by: Michael Pratt <[email protected]>
From the illumos-amd64-joyent builder, in https://build.golang.org/log/b60463ada3a04bbfdb7f714e54d819176e60a833:

[…]

See previously #19276.