
roachtest: add one more version upgrade step #47235

Closed
wants to merge 3 commits

Conversation

spaskob
Contributor

@spaskob spaskob commented Apr 8, 2020

We need to add an intermediate upgrade step to 19.2.5 to prevent the
KV store layer from complaining that the HEAD binary is too new.
This PR also adds some logging that will help with debugging test failures.

Fixes CI failure:
https://teamcity.cockroachdb.com/viewLog.html?buildId=1861955&_focus=3423

Release note: none.
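For illustration, a minimal Go sketch of what an upgrade sequence with the extra intermediate step looks like; `step` and `upgradeStep` are hypothetical stand-ins, not the actual roachtest helpers.

```go
package main

import "fmt"

// step is a stand-in for one action in the version-upgrade test.
type step struct {
	name string
	run  func() error
}

// upgradeStep is a hypothetical helper: restart every node with the binary
// at version v and wait for the cluster version to settle.
func upgradeStep(v string) step {
	return step{
		name: "upgrade to " + v,
		run: func() error {
			fmt.Println("upgrading all nodes to", v)
			return nil
		},
	}
}

func main() {
	// Starting from the checked-in fixtures, upgrade in two hops instead of
	// jumping straight from the fixture version to HEAD.
	steps := []step{
		upgradeStep("v19.2.5"), // intermediate step added by this PR
		upgradeStep("HEAD"),    // current development binary
	}
	for _, s := range steps {
		fmt.Println("running:", s.name)
		if err := s.run(); err != nil {
			panic(err)
		}
	}
}
```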

@cockroach-teamcity
Member

This change is Reviewable

@spaskob spaskob requested review from andreimatei and tbg April 8, 2020 23:35
spaskob added 3 commits April 8, 2020 22:15
We need to add an intermediate upgrade step to `19.2.5` to prevent the
KV store layer from complaining that the `HEAD` binary is too new.

This PR also adds some logging that will help with debugging test failures.

Release note: none.
@blathers-crl

blathers-crl bot commented Apr 9, 2020

❌ The GitHub CI (Cockroach) build has failed on feddb936.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. I am experimental - my owner is otan.

@tbg
Member

tbg commented Apr 9, 2020

Thanks for the attempt! I don't think this is the right fix. In fact, I don't fully understand what's broken yet. The error we're seeing in the test is that even though all nodes in the cluster at one point acknowledged a cluster version of 19.2, they still have 19.1 persisted on at least one engine. There is an invariant that the cluster version gets persisted on the engine before it is ever acked by a node. Previously, we neglected to check one of the nodes (the last one), but this got fixed in #47201 which was present when https://teamcity.cockroachdb.com/viewLog.html?buildId=1861955&_focus=3423 failed.

What's suspicious is that the failure mode is exactly the same - it's the last node that didn't upgrade.

I will first stress this test to see if the flake is really there or if we're perhaps dealing with a buildsystem-sponsored stale roachtest build.

Regarding your fix: version upgrades only care about the major.minor. You can upgrade from any patch release to any patch release of the next major release, so going from 19.1.x to 19.2.x to HEAD is fine. The error message from the test,

store =/home/roach/local/4/data, last used with cockroach version v19.1, is too old for running version v19.2-17 (which requires data from v19.1-1 or later)

indicates that the engine never persisted the cluster version v19.2, even though the test claimed it had verified that every node had upgraded to it.
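As a rough illustration of the check behind that message (types and names here are made up, not CockroachDB's internals): a binary refuses to open a store whose persisted cluster version is older than the binary's minimum supported version.

```go
package main

import "fmt"

// clusterVersion mirrors the shape of versions in the error above:
// major.minor plus an internal ("unstable") component, e.g. v19.1-1.
type clusterVersion struct {
	major, minor, internal int
}

func (v clusterVersion) less(o clusterVersion) bool {
	if v.major != o.major {
		return v.major < o.major
	}
	if v.minor != o.minor {
		return v.minor < o.minor
	}
	return v.internal < o.internal
}

// canUseStore mimics the startup check implied by the quoted error: the
// store's persisted cluster version must not be older than the minimum
// version the running binary supports.
func canUseStore(persisted, minSupported clusterVersion) error {
	if persisted.less(minSupported) {
		return fmt.Errorf(
			"store last used with v%d.%d-%d is too old; this binary requires v%d.%d-%d or later",
			persisted.major, persisted.minor, persisted.internal,
			minSupported.major, minSupported.minor, minSupported.internal)
	}
	return nil
}

func main() {
	// The engine still says v19.1 (no internal component), but the HEAD
	// binary needs at least v19.1-1, so the node refuses to start.
	fmt.Println(canUseStore(
		clusterVersion{19, 1, 0},
		clusterVersion{19, 1, 1},
	))
}
```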

@tbg
Member

tbg commented Apr 9, 2020

BTW, I think this problem here isn't new (whatever the exact problem is). What's new is that while rewriting the test, I optimized it and sped up the feature tests by a lot. As a result, nodes get started/stopped much faster than before, which likely first exposed this problem.

@tbg
Member

tbg commented Apr 9, 2020

I just ran this 20x against real clusters (i.e. the same thing CI does). No failures (well, one failure, but it had nothing to do with the test).

I still strongly suspect that we had not picked up #47201, but I have checked twice and that build ought to have included it. Going to spin up another 20.

@tbg
Member

tbg commented Apr 9, 2020

Oh wait, CI runs this locally. Maybe that's the secret? Going to try that as well.

@tbg
Member

tbg commented Apr 9, 2020

I'm getting repros running locally. That's nice, I will take a look.

What sucks is that I can't easily instrument the CRDB binary, as I would need to instrument the old versions, not HEAD. So I hope to find an "obvious" problem in the roachtest.

@tbg
Member

tbg commented Apr 9, 2020

Every repro I get has n4 as the offender. I'm sure I messed something up when changing this test. SHOW CLUSTER SETTING version is returning the right thing, so I must be running it against the wrong node. I just added caching of *gosql.DBs, surely I managed to get that wrong.
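For context, a hedged sketch of per-node `*gosql.DB` caching of the kind described; the `conns` type, port scheme, and query plumbing are illustrative, not the test's actual code.

```go
package main

import (
	"database/sql"
	"fmt"

	_ "github.com/lib/pq" // Postgres-wire driver; CockroachDB speaks pgwire.
)

// conns caches one *sql.DB per (1-based) node index. If the cache is ever
// keyed or populated with the wrong index, a query such as
// SHOW CLUSTER SETTING version silently runs against the wrong node.
type conns struct {
	dbs map[int]*sql.DB
}

func (c *conns) db(node int) (*sql.DB, error) {
	if db, ok := c.dbs[node]; ok {
		return db, nil
	}
	// Hypothetical port scheme: node i listens on 26257 + (i - 1).
	url := fmt.Sprintf("postgresql://root@localhost:%d?sslmode=disable", 26257+node-1)
	db, err := sql.Open("postgres", url)
	if err != nil {
		return nil, err
	}
	c.dbs[node] = db
	return db, nil
}

func main() {
	c := &conns{dbs: map[int]*sql.DB{}}
	db, err := c.db(4)
	if err != nil {
		panic(err)
	}
	var v string
	// Would fail without a running cluster; shown only to illustrate the query.
	_ = db.QueryRow("SHOW CLUSTER SETTING version").Scan(&v)
	fmt.Println("node 4 reports cluster version:", v)
}
```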

@tbg
Member

tbg commented Apr 9, 2020

Ok so my new theory is that the invariant I mentioned above (Gossip acks a cluster version only when it's on disk) is not upheld at startup. That is, there is a moment in time at which the cluster version comes in via gossip but the callback that writes it to disk (before putting it into gossip) is not registered yet. In that case, the node will reflect the new cluster version, but if it is promptly shut down, it won't be on disk. I can't explain everything going wrong in this test but I think the fact that it always affects n4 is because the 19.1.5 fixture for n4 doesn't reflect the 19.1 cluster version yet. I have definitely seen failures because of that (going from 2.1 to 19.2), what I understand less are the failures where n4 reflects 19.1 and we're going into HEAD.

I looked at the 19.1 code and it's inscrutable (thanks @andreimatei for rewriting it, it's much better now). But the bottom line is that there's always a hole where a cluster version bump could sneak in before we set the callback that is supposed to fire before every bump. This "hole" can be plugged and I will do so.

This won't be easy on the old versions, so I won't try. However, I'll make sure the test doesn't hit it any more.
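A toy sketch of the window described above (made-up types, not CockroachDB code): if the persist-to-disk callback is registered only after gossip has connected, a version bump that arrives in between is exposed without ever reaching disk.

```go
package main

import "fmt"

// gossip is a toy stand-in for the real Gossip object.
type gossip struct {
	onVersion func(v string) // callback that must persist v before exposing it
	exposed   string         // what SHOW CLUSTER SETTING version would report
}

func (g *gossip) receive(v string) {
	if g.onVersion != nil {
		g.onVersion(v) // persist first...
	}
	g.exposed = v // ...then expose. With no callback, we expose without persisting.
}

func main() {
	persisted := "v19.1" // what the engine has on disk
	g := &gossip{}

	// Buggy order: gossip delivers an update before the callback exists.
	g.receive("v19.2")
	fmt.Println("exposed:", g.exposed, "persisted:", persisted) // exposed: v19.2 persisted: v19.1

	// If the node restarts now, the engine still says v19.1 even though the
	// cluster already acked v19.2 -- the failure seen in this test.
}
```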

tbg added a commit to tbg/cockroach that referenced this pull request Apr 9, 2020
The recent changes to this test seem to have uncovered a latent [bug],
as a result of which the test was skipped. This commit lets the test
start from v19.2 fixtures (instead of v19.1 as before), which were
experimentally verified not to exhibit the bug, so the test is unskipped.

While I was there, I automated the base version discovery of this test.
On a branch at version X, the test will now use the fixtures for the
last major release preceding X. We'll occasionally have to bake new
fixtures and update the mapping in PredecessorVersion() but this is
better than having the test rot on older branches whenever we update
it for a new dev cycle.

The root cause of the bug will be fixed separately; this is tracked in
issue cockroachdb#44732.

[bug]: cockroachdb#47235 (comment)

Release note: None
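For illustration, a minimal sketch of the kind of mapping `PredecessorVersion()` maintains; the entries are examples only and the real table lives in the roachtest code.

```go
package main

import "fmt"

// predecessorVersion returns the last release whose fixtures the test
// should start from, given the binary's major version. Illustrative only.
func predecessorVersion(binaryMajor string) (string, error) {
	m := map[string]string{
		"v20.1": "v19.2.1",
		"v19.2": "v19.1.5",
	}
	v, ok := m[binaryMajor]
	if !ok {
		return "", fmt.Errorf("no fixtures known for %s; bake some and extend this map", binaryMajor)
	}
	return v, nil
}

func main() {
	v, err := predecessorVersion("v20.1")
	if err != nil {
		panic(err)
	}
	fmt.Println("start from fixtures at", v)
}
```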
@spaskob spaskob closed this Apr 9, 2020
craig bot pushed a commit that referenced this pull request Apr 9, 2020
47232: opt: corrects function doc for canMaybeConstrainIndex r=mgartner a=mgartner

#### opt: corrects function doc for canMaybeConstrainIndex

This commit corrects the function documentation for
canMaybeConstrainIndex.

Prior to this change, it did not mention a third check, for tight filter
constraints, that is performed to determine whether an index could be
constrained by a filter.

Also, the documentation now correctly maps to the logic of the function.
Previously it falsely claimed that if any of the checks were false then
the function would return false. Now it correctly states that if any of
the checks are true, then the function returns true.

Release note: None



47268: roachtest: improve and de-flake version-upgrade r=spaskob a=tbg

roachtest: automate and de-flake version-upgrade



Co-authored-by: Marcus Gartner <[email protected]>
Co-authored-by: Tobias Schottdorf <[email protected]>
tbg added a commit to tbg/cockroach that referenced this pull request Apr 10, 2020
A central invariant around cluster versions is that when an update
arrives, we need to persist it before exposing it through the version
setting. This was not true during server start time, as described in:

cockroachdb#47235 (comment)

In short, we were registering the callback to persist to the engines
*after* Gossip had already connected, opening up a window during which
a cluster version bump simply would not be persisted.

In the acceptance/version-upgrade test, this would manifest as nodes
refusing to start because their binary version had proceeded too far
beyond the persisted version. At the time of writing and before this
commit, this would happen in perhaps 1-10% of local runs on a Linux
machine (rarely on OSX).

The buggy code was originally written when the startup sequence was a
lot more intricate and we were, as I recall, under pressure to deliver.
Now, with recent refactors around the `initServer`, we're in a
really good place to solve this while also simplifying the whole
story. In this commit,

- persist the cluster version on *all* engines, not just the initialized
  ones; this completely decouples the timing of when the engines get
  initialized from when we can set up the persistence callback and
  makes everything *much simpler*.
- stop opportunistically backfilling the cluster version. It is now
  done once at the beginning, in the right place, without FUD later
  on.
- remove the cluster version persistence from engine initialization.
  Anyone initializing an engine must put a cluster version on it
  first. In a running server, this happens during initServer creation
  time, way before there are any moving parts in the system. In tests,
  extra writes were added as needed.
- set up the callback with Gossip before starting Gossip, and make sure
  (via an assertion) that this property does not rot.
  By setting up the callback before Gossip starts, we make sure there
  isn't a period during which Gossip receives an update but doesn't have
  the callback yet.
- as a result of all of the above, take all knowledge of cluster version
  init away from `*Node` and `*Stores`.

As a last note, we are planning to stop using Gossip for this version
business in 20.1, though this too will be facilitated by this change.

Release note (bug fix): Avoid a condition during rapid version upgrades
where a node would refuse to start, claiming "[a store is] too old for
running version". Before this bug fix, the workaround is to decommission
the node, delete the store directory, and re-add it to the cluster as
a new node.
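A hedged sketch of the ordering the commit message above describes, with toy types: write a version to every engine up front, register the gossip callback, and only then start gossip, with an assertion that keeps the order from rotting.

```go
package main

import "fmt"

type engine struct{ version string }

type gossip struct {
	started   bool
	onVersion func(v string)
}

// registerVersionCallback must run before start; the assertion keeps the
// property from rotting as the startup sequence evolves.
func (g *gossip) registerVersionCallback(cb func(string)) {
	if g.started {
		panic("version callback registered after gossip started")
	}
	g.onVersion = cb
}

func (g *gossip) start() { g.started = true }

func main() {
	engines := []*engine{{}, {}} // all engines, initialized or not
	boot := "v19.2"

	// 1. Persist the bootstrap cluster version on *all* engines first.
	for _, e := range engines {
		e.version = boot
	}
	// 2. Register the persistence callback...
	g := &gossip{}
	g.registerVersionCallback(func(v string) {
		for _, e := range engines {
			e.version = v // write to disk before the bump becomes visible
		}
	})
	// 3. ...and only then start gossip.
	g.start()
	// 4. A later bump delivered via gossip now always hits disk first.
	g.onVersion("v20.1")
	fmt.Println("engines at", engines[0].version, engines[1].version)
}
```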
tbg added a commit to tbg/cockroach that referenced this pull request Apr 14, 2020
craig bot pushed a commit that referenced this pull request Apr 14, 2020
47358: server: rework cluster version initialization r=irfansharif a=tbg


47457: clusterversion: introduce 20.2 development versions r=dt a=dt

Release note: none.

Co-authored-by: Tobias Schottdorf <[email protected]>
Co-authored-by: David Taylor <[email protected]>