[CI] FullClusterRestartIT.testRollupIDSchemeAfterRestart failed #32773
Comments
Pinging @elastic/es-search-aggs
@polyfractal could you take a look at this one?
Hm, I'm not positive what's going on here, but I have a guess. Will open a PR with a test fix and get Jim's opinion.
We only upgrade the ID when the state is saved, which happens in one of four scenarios:

- when we reach a checkpoint (every 50 pages)
- when we run out of data
- when explicitly stopped
- on failure

The test was relying on the pre-upgrade cluster to finish and save state, and then on the post-upgrade cluster to start, hit the end of the data, and upgrade the ID; only then would it index the new doc and apply the new ID. But I think this is vulnerable to timing issues. If the pre-upgrade portion shut down before it saved its state, the restarted job would run through all the data from the beginning with the old ID, meaning both docs would still have the old scheme.

This change makes the pre-upgrade test wait for the job to go back to STARTED so that we know it has persisted its end point. Post-upgrade, it stops and restarts the job to ensure the state was persisted and the ID upgraded. That _should_ rule out the timing issue above. Closes elastic#32773
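(Purely as an illustration of the kind of wait that fix describes, not the actual test code: a minimal Java sketch assuming an `ESRestTestCase`-style REST test. The job id, timeout, and the exact response field checked are assumptions.)

```java
import java.util.concurrent.TimeUnit;

import org.apache.http.util.EntityUtils;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;

import static org.hamcrest.Matchers.containsString;

// Hypothetical helper inside an ESRestTestCase subclass: poll the rollup job
// until it reports "started" (idle), which only happens after the indexer has
// run out of data and persisted its position -- the point where the ID scheme
// is upgraded. assertBusy/assertThat are inherited from the test base class.
private void waitForRollupJobToBeStarted(String jobId) throws Exception {
    assertBusy(() -> {
        Response response = client().performRequest(
                new Request("GET", "/_xpack/rollup/job/" + jobId));
        String body = EntityUtils.toString(response.getEntity());
        // "job_state" is the status field name assumed here for illustration.
        assertThat(body, containsString("\"job_state\":\"started\""));
    }, 30, TimeUnit.SECONDS);
}
```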
I'm not sure if this is strictly related, but this failure on 6.x is from one of the lines added in b32fbbe. The full log is too large to upload, sorry. It doesn't reproduce for me.
Grr, I think it's trying to stop the job before the task is fully initialized. I should have left the awaitBusy for the persistent task. I'll take care of this shortly... if it keeps failing before I get there, feel free to mute. Sorry for the noise!
I just AwaitFix'ed this test. |
We need to wait for the job to fully initialize and start before we can attempt to stop it. If we don't, it's possible for the stop API to be called before the persistent task is fully loaded and it'll throw an exception. Closes #32773
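(For illustration only, a hedged sketch of that ordering in the same style as above: wait for the job's task to be visible before calling the stop API. The endpoint path and field names are assumptions, not the test's actual code; imports are the same as the previous sketch.)

```java
// Hypothetical ordering sketch inside an ESRestTestCase subclass: the stop API is
// only called once the persistent task is visible and reporting a state, to avoid
// racing the task's initialization.
private void stopRollupJobOnceInitialized(String jobId) throws Exception {
    // 1. Wait until the job's task exists and reports any job_state at all.
    assertBusy(() -> {
        Response response = client().performRequest(
                new Request("GET", "/_xpack/rollup/job/" + jobId));
        assertThat(EntityUtils.toString(response.getEntity()),
                containsString("job_state"));
    }, 30, TimeUnit.SECONDS);

    // 2. Only now issue the stop request; before this point it could throw.
    client().performRequest(
            new Request("POST", "/_xpack/rollup/job/" + jobId + "/_stop"));
}
```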
Thanks Adrien. I think this test is fixed now. But if it fails again, I'll re-evaluate how we're doing things in the test and see if there's a way to rewrite it to be less flaky. Sorry for the noise everyone!
This test is a gift that keeps on giving 😁 https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+bwc-tests/3/console failed as follows:
Allegedly, but not for me:
I wonder if perhaps the following line should have an Line 392 in d725416
Gift that keeps on giving indeed :(
Unfortunately, it does that in Line 634 in d725416
The failure seems to be during the assertion of the job details, as provided by the Tasks API. This occurs after we have confirmed the job exists in the GetRollupJob API (which means the persistent task exists). I wonder if we need another await in there, to account for the allocated task taking slightly longer to "exist" after the persistent task is created. Not sure, will poke around some today. Ugh. :(
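(A hedged sketch of what such an extra await might look like, assuming the allocated task becomes visible through the Tasks API under a rollup-specific action name; the action string, query parameters, and timeout are all assumptions.)

```java
// Hypothetical extra wait: the persistent task can appear in cluster state slightly
// before its allocated task is visible through the Tasks API, so poll _tasks until
// a rollup task shows up. The "xpack/rollup/job" action filter is an assumption.
private void waitForAllocatedRollupTask() throws Exception {
    assertBusy(() -> {
        Response response = client().performRequest(
                new Request("GET", "/_tasks?detailed=true&actions=xpack/rollup/job*"));
        String body = EntityUtils.toString(response.getEntity());
        assertThat(body, containsString("xpack/rollup/job"));
    }, 30, TimeUnit.SECONDS);
}
```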
So it does, sorry, I must have misread something.
No worries, if that'd been the case I would have been overjoyed at a simple fix :)
Another one: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+bwc-tests/61/console
@polyfractal if something were to wipe all the persistent tasks from the cluster state in between the upgraded cluster starting up and |
@droberts195 Oh! Yes, I bet that would/could have caused some of these failures. This test:
Step 3 would definitely break if the state were wiped in between 1 and 3, since the job wouldn't exist and the assertions would fail (which is what the failures in this issue are showing). I don't think shifting assignments would cause failures in this case, since we're just asserting that the job exists somewhere and has been upgraded correctly. But if the state were totally wiped away, it certainly would.
Oops, closed via that test fix which predated the ^^ comments. I think the fix in #38218 probably contributed to some of the failures, but the snapshot/restore could easily have contributed to the others. |
This has reappeared in the 6.6 BWC tests
Thanks for the ping. Looks like it might be a missed backport. Investigating |
Ok, backported to 6.6. I believe that particular fix has been fully backported, so if it fails again it may be the issue @droberts195 mentioned... or something else. Closing for now. 🤞 |
This test failed in CI, but I was unable to reproduce it on macOS or Linux.
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+multijob-unix-compatibility/os=amazon/1231/console
Also note that there are still running tasks according to the build output.