roachtest: add one more version upgrade step #47235
Conversation
We need to add an intermediate upgrade step to `19.2.5` to prevent the KV store layer from complaining that the `HEAD` binary is too new. This PR also adds some logging that will help with debugging test failure logs. Release note: none.
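For illustration only, here is a minimal Go sketch of the idea behind the change, using entirely hypothetical names rather than the actual roachtest harness API: the upgrade sequence walks through an intermediate released binary before jumping to `HEAD`, so no store ever sees a binary far ahead of its persisted cluster version.

```go
// Hypothetical sketch of the step sequence; none of these names come from
// the real roachtest code.
package main

import "fmt"

// upgradeStep restarts the cluster into the named binary; an empty string
// stands for the HEAD binary built from the current branch.
type upgradeStep struct {
	version string
}

func main() {
	steps := []upgradeStep{
		{version: "19.1.5"}, // fixture version the cluster starts from
		{version: "19.2.5"}, // intermediate step this PR adds
		{version: ""},       // HEAD
	}
	for i, s := range steps {
		name := s.version
		if name == "" {
			name = "HEAD"
		}
		fmt.Printf("step %d: restart all nodes into binary %s\n", i+1, name)
	}
}
```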
❌ The GitHub CI (Cockroach) build has failed on feddb936. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. I am experimental - my owner is otan.
Thanks for the attempt! I don't think this is the right fix. In fact, I don't fully understand what's broken yet.

The error we're seeing in the test is that even though all nodes in the cluster at one point acknowledged a cluster version of 19.2, they still have 19.1 persisted on at least one engine. There is an invariant that the cluster version gets persisted on the engine before it is ever acked by a node. Previously, we neglected to check one of the nodes (the last one), but this got fixed in #47201, which was present when https://teamcity.cockroachdb.com/viewLog.html?buildId=1861955&_focus=3423 failed. What's suspicious is that the failure mode is exactly the same - it's the last node that didn't upgrade.

I will first stress this test to see if the flake is really there or if we're perhaps dealing with a buildsystem-sponsored stale roachtest build.

Regarding your fix: version upgrades only care about the
indicates that the engine never persisted the cluster version v19.2, even though the test claimed it had checked that every node had upgraded to it.
BTW, I think this problem here isn't new (whatever the exact problem is). What's new is that while rewriting the test, I optimized it and sped up the feature tests by a lot. As a result, nodes get started/stopped much faster than before, which likely first exposed this problem.
I just ran this 20x against real clusters (i.e. the same thing CI does). No failures (ok, one failure, but it had nothing to do with the test). I still strongly suspect that we had not picked up #47201, but I have checked twice that we ought to have done so in that build. Going to spin up another 20.
Oh wait, CI runs this locally. Maybe that's the secret? Going to try that as well.
I'm getting repros running locally. That's nice, I will take a look. What sucks is that I can't easily instrument the CRDB binary, as I would need to instrument the old versions, not HEAD. So I hope to find an "obvious" problem in the roachtest.
Every repro I get has n4 as the offender. I'm sure I messed something up when changing this test.
Ok, so my new theory is that the invariant I mentioned above (Gossip acks a cluster version only when it's on disk) is not upheld at startup. That is, there is a moment in time at which the cluster version comes in via Gossip but the callback that writes it to disk (before putting it into Gossip) is not registered yet. In that case, the node will reflect the new cluster version, but if it is promptly shut down, the version won't be on disk.

I can't explain everything going wrong in this test, but I think the fact that it always affects n4 is because the 19.1.5 fixture for n4 doesn't reflect the 19.1 cluster version yet. I have definitely seen failures because of that (going from 2.1 to 19.2); what I understand less is the failures where n4 reflects 19.1 and we're going into HEAD. I looked at the 19.1 code and it's inscrutable (thanks @andreimatei for rewriting it, much better now).

But the bottom line is that there's always a hole where a cluster version bump could sneak in before we set the callback that is supposed to fire before every bump. This "hole" can be plugged and I will do so. This won't be easy on the old versions, so I won't try there. However, I'll make sure the test doesn't hit it any more.
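To make the "hole" concrete, here is a minimal, self-contained Go model of the race, with hypothetical types that are not CockroachDB code: an update delivered before the persistence callback is registered becomes visible in memory but never reaches the engines, so a prompt shutdown loses it.

```go
// Toy model of the startup race described above (hypothetical, not the
// real gossip package).
package main

import "fmt"

type gossip struct {
	callbacks []func(version string)
}

func (g *gossip) register(cb func(version string)) { g.callbacks = append(g.callbacks, cb) }

func (g *gossip) deliver(version string) {
	for _, cb := range g.callbacks {
		cb(version)
	}
}

func main() {
	var persisted string // stands in for the version written to the engines
	var inMemory string  // stands in for the version exposed via the setting

	g := &gossip{}

	// Buggy ordering: gossip is already delivering updates...
	g.register(func(v string) { inMemory = v }) // exposes the version
	g.deliver("19.2")                           // update arrives in the window

	// ...and only now is the persistence callback registered. Too late for 19.2.
	g.register(func(v string) { persisted = v })

	fmt.Printf("in-memory: %q, persisted: %q\n", inMemory, persisted) // "19.2" vs ""
}
```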
The recent changes to this test seem to have uncovered a latent [bug], as a result of which the test was skipped. This commit lets the test start from v19.2 fixtures (instead of v19.1 before), which were experimentally verified to not exhibit the bug, so the test is unskipped.

While I was there, I automated the base version discovery of this test. On a branch at version X, the test will now use the fixtures for the last major release preceding X. We'll occasionally have to bake new fixtures and update the mapping in PredecessorVersion(), but this is better than having the test rot on older branches whenever we update it for a new dev cycle.

The root cause of the bug will be fixed separately; this is tracked in issue cockroachdb#44732.

[bug]: cockroachdb#47235 (comment)

Release note: None
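Roughly, the mapping the commit message refers to might look like the sketch below. The version pairs are illustrative assumptions, not the authoritative contents of the real `PredecessorVersion()` helper.

```go
// Sketch of a release-to-predecessor lookup; the map contents are
// assumptions for illustration and would need updating whenever new
// fixtures are baked.
package main

import (
	"fmt"
	"log"
)

func predecessorVersion(buildSeries string) (string, error) {
	// Keyed by the major.minor series of the binary under test.
	verMap := map[string]string{
		"20.1": "19.2.1",
		"19.2": "19.1.5",
		"19.1": "2.1.9",
	}
	v, ok := verMap[buildSeries]
	if !ok {
		return "", fmt.Errorf("no known predecessor for %q; bake fixtures and update the map", buildSeries)
	}
	return v, nil
}

func main() {
	pred, err := predecessorVersion("19.2")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("starting the upgrade test from fixtures for", pred)
}
```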
47232: opt: corrects function doc for canMaybeConstrainIndex r=mgartner a=mgartner

#### opt: corrects function doc for canMaybeConstrainIndex

This commit corrects the function documentation for canMaybeConstrainIndex. Prior to this change, it did not mention a third check, for tight filter constraints, that is performed to determine the possibility that an index could be constrained by a filter. Also, the documentation now correctly maps to the logic of the function. Previously it falsely claimed that if any of the checks were false then the function would return false. Now it correctly states that if any of the checks are true, then the function returns true.

Release note: None

47268: roachtest: improve and de-flake version-upgrade r=spaskob a=tbg

roachtest: automate and de-flake version-upgrade

The recent changes to this test seem to have uncovered a latent [bug], as a result of which the test was skipped. This commit lets the test start from v19.2 fixtures (instead of v19.1 before), which were experimentally verified to not exhibit the bug, so the test is unskipped.

While I was there, I automated the base version discovery of this test. On a branch at version X, the test will now use the fixtures for the last major release preceding X. We'll occasionally have to bake new fixtures and update the mapping in PredecessorVersion(), but this is better than having the test rot on older branches whenever we update it for a new dev cycle.

The root cause of the bug will be fixed separately; this is tracked in issue #44732.

[bug]: #47235 (comment)

Release note: None

Co-authored-by: Marcus Gartner <[email protected]>
Co-authored-by: Tobias Schottdorf <[email protected]>
A central invariant around cluster versions is that when an update arrives, we need to persist it before exposing it through the version setting. This was not true during server start time, as described in: cockroachdb#47235 (comment)

In short, we were registering the callback to persist to the engines *after* Gossip had already connected, opening up a window during which a cluster version bump simply would not be persisted. In the acceptance/version-upgrade test, this would manifest as nodes refusing to start because their binary version had proceeded too far beyond the persisted version. At the time of writing and before this commit, this would happen in perhaps 1-10% of local runs on a linux machine (rarely on OSX).

The buggy code was originally written when the startup sequence was a lot more intricate and we were, as I recall, under pressure to deliver. Now, with recent refactors around the `initServer`, we're in a really good place to solve this while also simplifying the whole story. In this commit,

- persist the cluster version on *all* engines, not just the initialized ones; this completely decouples the timing of when the engines get initialized from when we can set up the persistence callback and makes everything *much simpler*.
- stop opportunistically backfilling the cluster version. It is now done once at the beginning, in the right place, without FUD later on.
- remove the cluster version persistence from engine initialization. Anyone initializing an engine must put a cluster version on it first. In a running server, this happens during initServer creation time, way before there are any moving parts in the system. In tests, extra writes were added as needed.
- set up the callback with Gossip before starting Gossip, and make sure (via an assertion) that this property does not rot. By setting up the callback before Gossip starts, we make sure there isn't a period during which Gossip receives an update but doesn't have the callback yet.
- as a result of all of the above, take all knowledge of cluster version init away from `*Node` and `*Stores`.

As a last note, we are planning to stop using Gossip for this version business in 20.1, though this too will be facilitated by this change.

Release note (bug fix): Avoid a condition during rapid version upgrades where a node would refuse to start, claiming "[a store is] too old for running version". Before this bug fix, the workaround is to decommission the node, delete the store directory, and re-add it to the cluster as a new node.
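A condensed Go sketch of the ordering the bullet points describe, with hypothetical types rather than the real `server` package: write a cluster version to every engine up front, register the persistence callback, and only then start Gossip, with an assertion guarding against the ordering regressing.

```go
// Sketch of the fixed startup ordering (hypothetical, not CockroachDB code).
package main

import (
	"fmt"
	"log"
)

type engine struct{ clusterVersion string }

type gossip struct {
	started  bool
	callback func(version string)
}

func (g *gossip) setCallback(cb func(string)) {
	// Assertion from the commit message: the callback must be in place
	// before gossip ever starts delivering updates.
	if g.started {
		log.Fatal("programming error: callback registered after gossip started")
	}
	g.callback = cb
}

func (g *gossip) start() { g.started = true }

func main() {
	engines := []*engine{{}, {}}

	// 1. Persist an initial cluster version to *all* engines, initialized or not.
	for _, e := range engines {
		e.clusterVersion = "19.2"
	}

	g := &gossip{}

	// 2. Register the persistence callback before gossip starts, so no
	//    update can slip through unpersisted.
	g.setCallback(func(v string) {
		for _, e := range engines {
			e.clusterVersion = v
		}
	})

	// 3. Only now start gossip.
	g.start()

	fmt.Println("engines at", engines[0].clusterVersion, engines[1].clusterVersion)
}
```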
47358: server: rework cluster version initialization r=irfansharif a=tbg

A central invariant around cluster versions is that when an update arrives, we need to persist it before exposing it through the version setting. This was not true during server start time, as described in: #47235 (comment)

In short, we were registering the callback to persist to the engines *after* Gossip had already connected, opening up a window during which a cluster version bump simply would not be persisted. In the acceptance/version-upgrade test, this would manifest as nodes refusing to start because their binary version had proceeded too far beyond the persisted version. At the time of writing and before this commit, this would happen in perhaps 1-10% of local runs on a linux machine (rarely on OSX).

The buggy code was originally written when the startup sequence was a lot more intricate and we were, as I recall, under pressure to deliver. Now, with recent refactors around the `initServer`, we're in a really good place to solve this while also simplifying the whole story. In this commit,

- persist the cluster version on *all* engines, not just the initialized ones; this completely decouples the timing of when the engines get initialized from when we can set up the persistence callback and makes everything *much simpler*.
- stop opportunistically backfilling the cluster version. It is now done once at the beginning, in the right place, without FUD later on.
- remove the cluster version persistence from engine initialization. Anyone initializing an engine must put a cluster version on it first. In a running server, this happens during initServer creation time, way before there are any moving parts in the system. In tests, extra writes were added as needed.
- set up the callback with Gossip before starting Gossip, and make sure (via an assertion) that this property does not rot. By setting up the callback before Gossip starts, we make sure there isn't a period during which Gossip receives an update but doesn't have the callback yet.
- as a result of all of the above, take all knowledge of cluster version init away from `*Node` and `*Stores`.

As a last note, we are planning to stop using Gossip for this version business in 20.1, though this too will be facilitated by this change.

Release note (bug fix): Avoid a condition during rapid version upgrades where a node would refuse to start, claiming "[a store is] too old for running version". Before this bug fix, the workaround is to decommission the node, delete the store directory, and re-add it to the cluster as a new node.

47457: clusterversion: introduce 20.2 development versions r=dt a=dt

Release note: none.

Co-authored-by: Tobias Schottdorf <[email protected]>
Co-authored-by: David Taylor <[email protected]>
We need to add an intermediate upgrade step to `19.2.5` to prevent the KV store layer from complaining that the `HEAD` binary is too new. This PR also adds some logging that will help with debugging test failure logs.

Fixes CI failure: https://teamcity.cockroachdb.com/viewLog.html?buildId=1861955&_focus=3423

Release note: none.