-
Notifications
You must be signed in to change notification settings - Fork 30.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Investigate flaky test-tick-processor #4427
Comments
Confirmed via https://ci.nodejs.org/job/node-stress-single-test/208/nodes=win-vs2013/console but indirectly. It's been hung for a long time now, which would timeout in regular CI. (/cc @joaocgreis Am I right that there's no timeout on the stress test? Would it be easy to add in Jenkins? Or should I use a wrapper in JS that runs the test with a timeout?) |
@Trott timeout only seems to be passed to |
Failure on Debian 8: https://ci.nodejs.org/job/node-test-commit-linux/1600/nodes=debian8-64/console |
I don't think extending the timeout would help after all. I suspect the test is actually hanging indefinitely. |
@Trott you're right, the stress-single-test job has no timeout. If someone wants to change |
@joaocgreis As it is currently written, it can run a single test like this:
Is that what you need? If not, can you explain a bit more? I'm totally up for hacking on |
@Trott I didn't know about that, thanks! Now it's using the test runner. |
@joaocgreis That's great! One tweak: If I'm understanding what's going on correctly, it now appears to take into account if the test is marked flaky. (At least, that is consistent with what I'm seeing trying to run stress tests...) For the stress test, we probably always want to ignore the flaky designation and fail if the test failed. Can we pass in the command line flag to tell it to fail if a flaky test fails? |
So back to the original problem here, with the test being flaky on Windows... /cc @matthewloring |
Stress test for good measure showing it still times out sometimes on Windows: https://ci.nodejs.org/job/node-stress-single-test/282/nodes=win2012r2/console Gotta be the first half of the test because the second half is skipped on Windows... |
@Trott I can look into this. The only part of that test that should take time is running the code to be profiled: function f() {
for (var i = 0; i < 1000000; i++) {
i++;
}
setImmediate(function() { f(); });
};
setTimeout(function() { process.exit(0); }, 2000);
f(); The stress test output has a lot of lines saying: |
@matthewloring It's a measurement in seconds. (Yeah, I really dislike the |
So anyway, yeah, something funky is happening because the test takes about 3 seconds to run most times, but then once in a while, it times out (which means taking more than 60 seconds). |
Seems to fail on the Rackspace hosts but I wonder if it ever fails on the Microsoft Azure hosts. Let's find out. Stress test on Microsoft Azure host: I believe the main difference between our Rackspace hosts and Azure hosts is that the Rackspace ones are multiprocessor and the Azure ones are single processor. EDIT: Looks like it fails on both. |
@Trott re |
@joaocgreis My mistake, I guess! It sure looks like it works the way you say. Thanks! |
ping/bump/whatever @matthewloring and @nodejs/platform-windows Thanks in advance for any help anyone can provide in moving this forward. |
@Trott I ran the test in a loop for 4000 iterations (I stopped it after a few hours) and couldn't reproduce the timeout in my local environment. I'm testing on a Windows 7 vm (single core, 2GB ram). Do you have any suggestions on environment changes that should be made to faithfully model the test environment? |
@matthewloring I'm not sure how that compares to our CI hosts. Help, @nodejs/build? (It fails on both the Rackspace and the Azure Windows hosts.) I do know that on CI it fails on Windows 2008 and Windows 2012 servers. I don't think we have any Windows 7 servers in CI, so that's one difference there... |
@Trott this failed me a couple of times yesterday in |
@santigimeno When you say it failed, did it hang/timeout? Or did it report something? |
@Trott Sorry, it timed out |
@matthewloring What's the reason that instead of this:
...we can't just do this:
I'm sure there's a reason, and I've probably had this explained to me before. (I'll try to summarize it in a code comment or something this time.) Something to do with preventing optimization or something like that? |
@Trott The goal of the |
PR-URL: nodejs#4809 Reviewed-By: Colin Ihrig <[email protected]> Reviewed-By: Johan Bergström <[email protected]> Reviewed-By: James M Snell <[email protected]> Ref: nodejs#4427
I ran a stress test on this on every readily-available platform on CI: https://ci.nodejs.org/job/node-stress-single-test/762/ Failing platforms: armv7-ubuntu1404, osx1010, ubuntu1404-32, win10-1p, win2008r2, win2008r2-p1, win2012r2, win2012r2-p1 A few more may fail--not all have finished running as of right now. Seems that the problem is especially pronounced on Windows. Here's a sample failure on win10-p1:
|
PR in for preventing the hang. The question I have is really about what this test should be testing? The name suggests that we're testing the tick processor, but what its really testing is that the test function f() is compiled by V8 in the normal execution mode, and when the VM has the Debug context enabled (on only some OSs). If anyone has any guidance on what exactly this should be testing, I'd be happy to continue improving it. |
This test is designed test correctness of the integration of the V8 tick processor with Node.js. Despite sharing the name, |
Ahh, thanks. That makes sense. |
Typical problem is a timeout:
Like, the |
Small sample size, but it sure seems like the test is failing a lot more since #8480 landed earlier today. Maybe that provides some clue as to how it might be fixed for good? Looping in stats-ish people along with usual suspects because |
Actually, it looks like it's failing in a different way now, and more frequently. :-/ Is it possible we've broken the v8 profiling integration on some platforms now? Some sample failures: https://ci.nodejs.org/job/node-test-commit-linuxone/1136/nodes=rhel72-s390x/console
|
The likelihood that the test will pass is proportional to the amount of time the test function spends with the |
Perhaps we can add a computationally intensive built-in for the express purpose of being able to test that it shows up on the stack. WDYT? |
I also wonder if we should split this into three separate tests so that:
|
I was thinking about it recently. What if we would wait until the symbol will appear in the profile log? |
(Not that it is going to be trivial, though. It will involve finding the symbol's address in the output of |
Wait for a sought-for symbol to appear instead of just hard-killing subprocesses at 2s timeout. Fix: nodejs#4427
I'm going to remove the windows label from this, as we're seeing failures on lots of platforms, and I'm going to add the |
Removing |
Wait for a sought-for symbol to appear instead of just hard-killing subprocesses at 2s timeout. Fix: #4427 PR-URL: #8542 Reviewed-By: Rich Trott <[email protected]>
Just spotted on
https://ci.nodejs.org/job/node-test-commit-arm/11208/nodes=armv7-wheezy/ |
FWIW, the test is inherently non-deterministic and thus kind of perma-flaky. Ways to make it more robust would be welcome, of course. |
Failure on windows CI: https://ci.nodejs.org/job/node-test-binary-windows/395/RUN_SUBSET=1,VS_VERSION=vs2013,label=win2008r2/tapTestReport/test.tap-189/
The text was updated successfully, but these errors were encountered: