roachtest: add one more version upgrade step #47235
Conversation
We need to add an intermediate upgrade step to `19.2.5` to prevent the KV store layer from complaining that the `HEAD` binary is too new. This PR also adds some logging that will help with debugging test failure logs. Release note: none.
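For illustration only, here is a minimal Go sketch of the idea behind the change, using entirely hypothetical names rather than the actual roachtest harness API: the upgrade sequence walks through an intermediate released binary before jumping to `HEAD`, so no store ever sees a binary far ahead of its persisted cluster version.

```go
// Hypothetical sketch of the step sequence; none of these names come from
// the real roachtest code.
package main

import "fmt"

// upgradeStep restarts the cluster into the named binary; an empty string
// stands for the HEAD binary built from the current branch.
type upgradeStep struct {
	version string
}

func main() {
	steps := []upgradeStep{
		{version: "19.1.5"}, // fixture version the cluster starts from
		{version: "19.2.5"}, // intermediate step this PR adds
		{version: ""},       // HEAD
	}
	for i, s := range steps {
		name := s.version
		if name == "" {
			name = "HEAD"
		}
		fmt.Printf("step %d: restart all nodes into binary %s\n", i+1, name)
	}
}
```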
❌ The GitHub CI (Cockroach) build has failed on feddb936. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. I am experimental - my owner is otan.
Thanks for the attempt! I don't think this is the right fix. In fact, I don't fully understand what's broken yet.

The error we're seeing in the test is that even though all nodes in the cluster at one point acknowledged a cluster version of 19.2, they still have 19.1 persisted on at least one engine. There is an invariant that the cluster version gets persisted on the engine before it is ever acked by a node. Previously, we neglected to check one of the nodes (the last one), but this got fixed in #47201, which was present when https://teamcity.cockroachdb.com/viewLog.html?buildId=1861955&_focus=3423 failed. What's suspicious is that the failure mode is exactly the same - it's the last node that didn't upgrade.

I will first stress this test to see if the flake is really there or if we're perhaps dealing with a buildsystem-sponsored stale roachtest build.

Regarding your fix: version upgrades only care about the
indicates that the engine never persisted the cluster version v19.2, even though the test claimed it had checked that every node had upgraded to it.
BTW, I think this problem here isn't new (whatever the exact problem is). What's new is that while rewriting the test, I optimized it and sped up the feature tests by a lot. As a result, nodes get started/stopped much faster than before, which likely first exposed this problem.
I just ran this 20x against real clusters (i.e. the same thing CI does). No failures (ok, one failure, but it had nothing to do with the test). I still strongly suspect that we had not picked up #47201, but I have checked twice that we ought to have done so in that build. Going to spin up another 20.
Oh wait, CI runs this locally. Maybe that's the secret? Going to try that as well.
I'm getting repros running locally. That's nice, I will take a look. What sucks is that I can't easily instrument the CRDB binary, as I would need to instrument the old versions, not HEAD. So I hope to find an "obvious" problem in the roachtest.
Every repro I get has n4 as the offender. I'm sure I messed something up when changing this test.
Ok, so my new theory is that the invariant I mentioned above (Gossip acks a cluster version only when it's on disk) is not upheld at startup. That is, there is a moment in time at which the cluster version comes in via Gossip but the callback that writes it to disk (before putting it into Gossip) is not registered yet. In that case, the node will reflect the new cluster version, but if it is promptly shut down, the version won't be on disk.

I can't explain everything going wrong in this test, but I think the fact that it always affects n4 is because the 19.1.5 fixture for n4 doesn't reflect the 19.1 cluster version yet. I have definitely seen failures because of that (going from 2.1 to 19.2); what I understand less is the failures where n4 reflects 19.1 and we're going into HEAD. I looked at the 19.1 code and it's inscrutable (thanks @andreimatei for rewriting it, much better now).

But the bottom line is that there's always a hole where a cluster version bump could sneak in before we set the callback that is supposed to fire before every bump. This "hole" can be plugged and I will do so. This won't be easy on the old versions, so I won't try there. However, I'll make sure the test doesn't hit it any more.
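To make the "hole" concrete, here is a minimal, self-contained Go model of the race, with hypothetical types that are not CockroachDB code: an update delivered before the persistence callback is registered becomes visible in memory but never reaches the engines, so a prompt shutdown loses it.

```go
// Toy model of the startup race described above (hypothetical, not the
// real gossip package).
package main

import "fmt"

type gossip struct {
	callbacks []func(version string)
}

func (g *gossip) register(cb func(version string)) { g.callbacks = append(g.callbacks, cb) }

func (g *gossip) deliver(version string) {
	for _, cb := range g.callbacks {
		cb(version)
	}
}

func main() {
	var persisted string // stands in for the version written to the engines
	var inMemory string  // stands in for the version exposed via the setting

	g := &gossip{}

	// Buggy ordering: gossip is already delivering updates...
	g.register(func(v string) { inMemory = v }) // exposes the version
	g.deliver("19.2")                           // update arrives in the window

	// ...and only now is the persistence callback registered. Too late for 19.2.
	g.register(func(v string) { persisted = v })

	fmt.Printf("in-memory: %q, persisted: %q\n", inMemory, persisted) // "19.2" vs ""
}
```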
The recent changes to this test seem to have uncovered a latent [bug], as a result of which the test was skipped. This commit lets the test start from v19.2 fixtures (instead of v19.1 before), which were experimentally verified to not exhibit the bug, so the test is unskipped.

While I was there, I automated the base version discovery of this test. On a branch at version X, the test will now use the fixtures for the last major release preceding X. We'll occasionally have to bake new fixtures and update the mapping in PredecessorVersion(), but this is better than having the test rot on older branches whenever we update it for a new dev cycle.

The root cause of the bug will be fixed separately; this is tracked in issue cockroachdb#44732.

[bug]: cockroachdb#47235 (comment)

Release note: None
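Roughly, the mapping the commit message refers to might look like the sketch below. The version pairs are illustrative assumptions, not the authoritative contents of the real `PredecessorVersion()` helper.

```go
// Sketch of a release-to-predecessor lookup; the map contents are
// assumptions for illustration and would need updating whenever new
// fixtures are baked.
package main

import (
	"fmt"
	"log"
)

func predecessorVersion(buildSeries string) (string, error) {
	// Keyed by the major.minor series of the binary under test.
	verMap := map[string]string{
		"20.1": "19.2.1",
		"19.2": "19.1.5",
		"19.1": "2.1.9",
	}
	v, ok := verMap[buildSeries]
	if !ok {
		return "", fmt.Errorf("no known predecessor for %q; bake fixtures and update the map", buildSeries)
	}
	return v, nil
}

func main() {
	pred, err := predecessorVersion("19.2")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("starting the upgrade test from fixtures for", pred)
}
```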
47232: opt: corrects function doc for canMaybeConstrainIndex r=mgartner a=mgartner

#### opt: corrects function doc for canMaybeConstrainIndex

This commit corrects the function documentation for canMaybeConstrainIndex. Prior to this change, it did not mention a third check, for tight filter constraints, that is performed to determine the possibility that an index could be constrained by a filter. Also, the documentation now correctly maps to the logic of the function. Previously it falsely claimed that if any of the checks were false then the function would return false. Now it correctly states that if any of the checks are true, then the function returns true.

Release note: None

47268: roachtest: improve and de-flake version-upgrade r=spaskob a=tbg

roachtest: automate and de-flake version-upgrade

The recent changes to this test seem to have uncovered a latent [bug], as a result of which the test was skipped. This commit lets the test start from v19.2 fixtures (instead of v19.1 before), which were experimentally verified to not exhibit the bug, so the test is unskipped.

While I was there, I automated the base version discovery of this test. On a branch at version X, the test will now use the fixtures for the last major release preceding X. We'll occasionally have to bake new fixtures and update the mapping in PredecessorVersion(), but this is better than having the test rot on older branches whenever we update it for a new dev cycle.

The root cause of the bug will be fixed separately; this is tracked in issue #44732.

[bug]: #47235 (comment)

Release note: None

Co-authored-by: Marcus Gartner <[email protected]>
Co-authored-by: Tobias Schottdorf <[email protected]>
A central invariant around cluster versions is that when an update arrives, we need to persist it before exposing it through the version setting. This was not true during server start time, as described in: cockroachdb#47235 (comment)

In short, we were registering the callback to persist to the engines *after* Gossip had already connected, opening up a window during which a cluster version bump simply would not be persisted. In the acceptance/version-upgrade test, this would manifest as nodes refusing to start because their binary version had proceeded too far beyond the persisted version. At the time of writing and before this commit, this would happen in perhaps 1-10% of local runs on a linux machine (rarely on OSX).

The buggy code was originally written when the startup sequence was a lot more intricate and we were, as I recall, under pressure to deliver. Now, with recent refactors around the `initServer`, we're in a really good place to solve this while also simplifying the whole story. In this commit,

- persist the cluster version on *all* engines, not just the initialized ones; this completely decouples the timing of when the engines get initialized from when we can set up the persistence callback and makes everything *much simpler*.
- stop opportunistically backfilling the cluster version. It is now done once at the beginning, in the right place, without FUD later on.
- remove the cluster version persistence from engine initialization. Anyone initializing an engine must put a cluster version on it first. In a running server, this happens during initServer creation time, way before there are any moving parts in the system. In tests, extra writes were added as needed.
- set up the callback with Gossip before starting Gossip, and make sure (via an assertion) that this property does not rot. By setting up the callback before Gossip starts, we make sure there isn't a period during which Gossip receives an update but doesn't have the callback yet.
- as a result of all of the above, take all knowledge of cluster version init away from `*Node` and `*Stores`.

As a last note, we are planning to stop using Gossip for this version business in 20.1, though this too will be facilitated by this change.

Release note (bug fix): Avoid a condition during rapid version upgrades where a node would refuse to start, claiming "[a store is] too old for running version". Before this bug fix, the workaround is to decommission the node, delete the store directory, and re-add it to the cluster as a new node.
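A condensed Go sketch of the ordering the bullet points describe, with hypothetical types rather than the real `server` package: write a cluster version to every engine up front, register the persistence callback, and only then start Gossip, with an assertion guarding against the ordering regressing.

```go
// Sketch of the fixed startup ordering (hypothetical, not CockroachDB code).
package main

import (
	"fmt"
	"log"
)

type engine struct{ clusterVersion string }

type gossip struct {
	started  bool
	callback func(version string)
}

func (g *gossip) setCallback(cb func(string)) {
	// Assertion from the commit message: the callback must be in place
	// before gossip ever starts delivering updates.
	if g.started {
		log.Fatal("programming error: callback registered after gossip started")
	}
	g.callback = cb
}

func (g *gossip) start() { g.started = true }

func main() {
	engines := []*engine{{}, {}}

	// 1. Persist an initial cluster version to *all* engines, initialized or not.
	for _, e := range engines {
		e.clusterVersion = "19.2"
	}

	g := &gossip{}

	// 2. Register the persistence callback before gossip starts, so no
	//    update can slip through unpersisted.
	g.setCallback(func(v string) {
		for _, e := range engines {
			e.clusterVersion = v
		}
	})

	// 3. Only now start gossip.
	g.start()

	fmt.Println("engines at", engines[0].clusterVersion, engines[1].clusterVersion)
}
```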
47358: server: rework cluster version initialization r=irfansharif a=tbg

A central invariant around cluster versions is that when an update arrives, we need to persist it before exposing it through the version setting. This was not true during server start time, as described in: #47235 (comment)

In short, we were registering the callback to persist to the engines *after* Gossip had already connected, opening up a window during which a cluster version bump simply would not be persisted. In the acceptance/version-upgrade test, this would manifest as nodes refusing to start because their binary version had proceeded too far beyond the persisted version. At the time of writing and before this commit, this would happen in perhaps 1-10% of local runs on a linux machine (rarely on OSX).

The buggy code was originally written when the startup sequence was a lot more intricate and we were, as I recall, under pressure to deliver. Now, with recent refactors around the `initServer`, we're in a really good place to solve this while also simplifying the whole story. In this commit,

- persist the cluster version on *all* engines, not just the initialized ones; this completely decouples the timing of when the engines get initialized from when we can set up the persistence callback and makes everything *much simpler*.
- stop opportunistically backfilling the cluster version. It is now done once at the beginning, in the right place, without FUD later on.
- remove the cluster version persistence from engine initialization. Anyone initializing an engine must put a cluster version on it first. In a running server, this happens during initServer creation time, way before there are any moving parts in the system. In tests, extra writes were added as needed.
- set up the callback with Gossip before starting Gossip, and make sure (via an assertion) that this property does not rot. By setting up the callback before Gossip starts, we make sure there isn't a period during which Gossip receives an update but doesn't have the callback yet.
- as a result of all of the above, take all knowledge of cluster version init away from `*Node` and `*Stores`.

As a last note, we are planning to stop using Gossip for this version business in 20.1, though this too will be facilitated by this change.

Release note (bug fix): Avoid a condition during rapid version upgrades where a node would refuse to start, claiming "[a store is] too old for running version". Before this bug fix, the workaround is to decommission the node, delete the store directory, and re-add it to the cluster as a new node.

47457: clusterversion: introduce 20.2 development versions r=dt a=dt

Release note: none.

Co-authored-by: Tobias Schottdorf <[email protected]>
Co-authored-by: David Taylor <[email protected]>
We need to add an intermediate upgrade step to `19.2.5` to prevent the KV store layer from complaining that the `HEAD` binary is too new. This PR also adds some logging that will help with debugging test failure logs.

Fixes CI failure: https://teamcity.cockroachdb.com/viewLog.html?buildId=1861955&_focus=3423

Release note: none.