storage: support rolling upgrades of storage format changes #1780
From what I understood, it's not clear that we're going to implement the StorageKey solution. @bdarnell is dusting off some of his WIP so that we can examine the issues again and weigh our options once more.
@tschottdorf a little bit of recent discussion has put the StorageKey proposal back in vogue. The gist of the new thinking is that we're all having a really hard time convincing ourselves that the variously proposed measures actually handle the races we know about, much less the ones we don't know about. Performance is the main concern with implementing the proposal; on the other hand, correctness issues are what we most urgently need to address.
Even if we don't implement StorageKey, we should think before beta about how we're going to handle backwards-incompatible changes to our lowest-level storage systems, since they'll surely happen one way or another (see #1772).
In Gmail, backwards-incompatible changes were performed via the replication mechanism and rolling migrations of users. I think something similar would work here and would be both general and lightweight in the code, though with a bit more work for administrators. The gist is that a node only knows how to read and write a single format. We store that format version inside the RocksDB engine and refuse to start if the on-disk format differs from what the software supports. To change the format we do a rolling upgrade of the servers in the cluster: drain the data off a server, delete the data so the server will start up empty, upgrade the software (or flip a flag), and then bring the server back up. We can add various complexities to this, but I much prefer this scheme to one that tries to simultaneously support multiple versions of our lowest-level formats.
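A minimal sketch (in Go, with a hypothetical key name, version constant, and Engine interface, not CockroachDB's actual storage API) of the startup check this scheme implies: the store is stamped with a single format version, and the node refuses to start on a mismatch:

```go
package storage

import "fmt"

// supportedFormatVersion is the single on-disk format this binary can
// read and write.
const supportedFormatVersion byte = 2

// formatVersionKey is an illustrative engine-local key for the stamp.
var formatVersionKey = []byte("\x00\x00storage-format-version")

type Engine interface {
	Get(key []byte) ([]byte, error)
	Put(key, value []byte) error
}

// checkFormatVersion runs at node startup: an empty store is stamped with
// the current format; a store in any other format refuses to start, which
// forces the drain/wipe/upgrade cycle described above.
func checkFormatVersion(e Engine) error {
	v, err := e.Get(formatVersionKey)
	if err != nil {
		return err
	}
	if len(v) == 0 {
		// Empty store: stamp it with the format this binary writes.
		return e.Put(formatVersionKey, []byte{supportedFormatVersion})
	}
	if v[0] != supportedFormatVersion {
		return fmt.Errorf("on-disk format %d differs from supported format %d; drain, wipe, and restart this store",
			v[0], supportedFormatVersion)
	}
	return nil
}
```

The wipe-and-rejoin step in the rolling upgrade is what keeps this check simple: a node never has to translate between formats, only to detect a mismatch and bail.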
@petermattis The situation is more complicated than it was with Gmail because we will have a mix of versions within a Raft replica group (this is necessarily true for any zero-downtime upgrade). A rolling update system also implies certain work that must be done in the version before the incompatible change: we must be able to administratively drain a node and ensure that new ranges will not be rebalanced onto it; the rebalancer should avoid adding new load to old-version nodes as much as possible; and so on. Bigtable maintained support for its old on-disk formats and lazily converted everything to the new format as part of the regular compaction cycle. This was also how schema changes related to locality groups were made.
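For contrast, a rough sketch of the Bigtable-style alternative mentioned above, assuming values carry a version-byte prefix (the prefix scheme and names are illustrative, not a real format): old-format values are rewritten in the current format whenever compaction touches them, so the store converges on the new format without a drain/wipe cycle:

```go
package storage

import "fmt"

const currentVersion byte = 2

// upgradeValueForCompaction re-encodes an old-format value in the current
// format as compaction rewrites it; over time the whole store converges on
// the new format without a stop-the-world migration.
func upgradeValueForCompaction(raw []byte) ([]byte, error) {
	if len(raw) == 0 {
		return raw, nil
	}
	switch v := raw[0]; v {
	case currentVersion:
		return raw, nil // already current: pass through unchanged
	case 1:
		// Rewrite the v1 payload under the current version byte. A real
		// conversion would also transform the payload as needed.
		return append([]byte{currentVersion}, raw[1:]...), nil
	default:
		return nil, fmt.Errorf("unrecognized value format version %d", v)
	}
}
```

The trade-off is the one named in the comments above: this avoids the administrative drain cycle but requires every binary to carry read support for all old formats indefinitely.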
Yes, we must be able to administratively drain a node. That functionality is necessary for administration independent of handling backwards-incompatible on-disk format changes.
The StorageKey proposal was rejected, so this issue is moot.
Does it matter that the StorageKey proposal was rejected? This issue is about allowing changes to our in-RocksDB data representation; it just happened to piggyback on the StorageKey name.
If we remove the StorageKey part then there's some overlap with #2718, although I guess there are parts here that are reasonably independent (e.g. the drain functionality). I'll reopen and edit the title and first message. |
I'm closing this issue due to advanced age and the advent of proposer-evaluated KV (although that's clearly not a complete fix for the issues raised here). Let's not have open issues for things which aren't currently a problem.
In order to support (hypothetical) storage format changes, we need to be able to drain a node, let it rewrite its database, and undrain it. We need an administrative tool to put a node in the draining state, the rebalancer must respect this state (and probably move data off a draining node faster than it otherwise would), and the admin tool needs to be able to watch for the drain to be completed.
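As a concrete illustration, a minimal sketch of the rebalancer-side check, assuming a hypothetical Draining flag on the node descriptor (the types and helpers below are illustrative, not CockroachDB's actual rebalancer API): draining nodes are excluded as rebalance targets, and the admin tool polls until the node holds no replicas:

```go
package storage

type nodeDescriptor struct {
	NodeID       int
	Draining     bool // set by the admin tool; shared via gossiped node state
	ReplicaCount int
}

// filterRebalanceTargets excludes draining nodes so no new replicas are
// placed on a node that is being emptied for an upgrade.
func filterRebalanceTargets(candidates []nodeDescriptor) []nodeDescriptor {
	var targets []nodeDescriptor
	for _, n := range candidates {
		if !n.Draining {
			targets = append(targets, n)
		}
	}
	return targets
}

// drainComplete is what the admin tool polls for: the drain is done once
// the node holds no replicas and can be shut down and wiped.
func drainComplete(n nodeDescriptor) bool {
	return n.Draining && n.ReplicaCount == 0
}
```

A natural extension, as noted above, is for the rebalancer to actively prefer moving replicas off a draining node rather than merely refusing to add new ones.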
This feature complements the version number to be added in #2718.
Original message follows:
After we implement the current StorageKey proposal meant to address the Raft races, we should figure out a general system that will allow us to change the StorageKey encoding without breaking existing deployments.
The benefits here can't be overstated; together with #629, this will allow us to provide a 1.0-esque stability guarantee without appreciably constraining our ability to make rather invasive modifications.
Tagging for beta.