Add schema version number to MVCC encoded key #1772

Closed
spencerkimball opened this issue Jul 22, 2015 · 8 comments
Comments

@spencerkimball
Member

Problem:

  • After a schema change, extant queries for older schemas may run into unexpected data (e.g. a schema change that converts column 'foo' from type 'string' to type 'int').
  • After a schema change, extant updates for older schemas may overwrite data in bad ways (e.g. schema change to column type, dropping a column, etc.).

Proposal:

Currently we append the timestamp value to encoded keys. We could instead append <schema version><timestamp>. All operations would be augmented to include a schema version, set at the gateway which processed the SQL. Reads would be serviced by any version with a less-than-or-equal schema version and, of course, a less-than-or-equal timestamp. Reads with old schema versions would simply ignore newer schema versions. Writes would fail if they encounter a key containing a schema version newer than the proposed one.
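A minimal sketch of the proposed key layout and visibility rules, in Go. The encoding details here (the suffix layout, the function names, and inverting the bytes so newer versions sort first) are illustrative assumptions for this proposal, not CockroachDB's actual MVCC encoding:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encodeMVCCKey appends a hypothetical <schemaVersion><timestamp> suffix
// to the raw key. Both fields are big-endian and bit-inverted so that
// newer versions sort before older ones, matching how MVCC timestamps
// are typically ordered.
func encodeMVCCKey(raw []byte, schemaVersion uint32, ts uint64) []byte {
	k := append([]byte(nil), raw...)
	var buf [12]byte
	binary.BigEndian.PutUint32(buf[0:4], ^schemaVersion)
	binary.BigEndian.PutUint64(buf[4:12], ^ts)
	return append(k, buf[:]...)
}

// decodeSuffix recovers the schema version and timestamp from an encoded key.
func decodeSuffix(k []byte) (schemaVersion uint32, ts uint64) {
	n := len(k)
	schemaVersion = ^binary.BigEndian.Uint32(k[n-12 : n-8])
	ts = ^binary.BigEndian.Uint64(k[n-8:])
	return
}

// visibleToRead implements the read rule: a stored version is visible only
// if both its schema version and timestamp are less than or equal to the
// read's; versions written under a newer schema are simply ignored.
func visibleToRead(storedVer, readVer uint32, storedTS, readTS uint64) bool {
	return storedVer <= readVer && storedTS <= readTS
}

// checkWrite implements the write rule: fail if the newest existing version
// of the key carries a schema version newer than the proposed write's.
func checkWrite(existingVer, writeVer uint32) error {
	if existingVer > writeVer {
		return fmt.Errorf("write at schema version %d conflicts with existing version %d",
			writeVer, existingVer)
	}
	return nil
}

func main() {
	k := encodeMVCCKey([]byte("foo"), 2, 100)
	ver, ts := decodeSuffix(k)
	fmt.Println(ver, ts)                       // 2 100
	fmt.Println(visibleToRead(2, 1, 100, 200)) // false: newer schema is ignored
	fmt.Println(checkWrite(3, 2) != nil)       // true: the write must fail
}
```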

This solution allows old-schema queries to proceed unhindered by potentially destructive schema changes and protects us from overwriting tuples inserted under newer schemas with data for older schemas. Additionally, the schema version on the key provides an elegant way to track and revert all changes made under each successive schema revision in turn.

@tbg
Member

tbg commented Jul 23, 2015

Aren't online schema changes going to take care of that? If we have a registry of which node is using which schema (some table somewhere) we should be able to cut everything into a series of schema changes which can be done phase by phase, with all nodes required to move into the next phase before going further (guaranteeing that nothing incompatible with that current schema is executed any more).

We could even seamlessly allow clients to use both the old and the new schema in some situations: for instance, when converting a column type, we could re-write the AST for queries and do the conversion for requests on the old schema, while really using the new one. That would allow clients to roll over asynchronously.
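The phase-by-phase coordination described above could be sketched roughly as follows; the `coordinator` type and its methods are hypothetical stand-ins for whatever registry table and gossip machinery would actually drive the transitions:

```go
package main

import "fmt"

// coordinator tracks which phase of a schema change each node has
// acknowledged; the change only advances once every node has moved,
// guaranteeing nothing incompatible with the current phase still runs.
type coordinator struct {
	phase int            // current phase of the schema change
	acked map[string]int // node → highest phase acknowledged
}

func newCoordinator(nodes []string) *coordinator {
	c := &coordinator{acked: make(map[string]int)}
	for _, n := range nodes {
		c.acked[n] = -1 // nothing acknowledged yet
	}
	return c
}

// ack records that a node has reached the given phase.
func (c *coordinator) ack(node string, phase int) {
	if phase > c.acked[node] {
		c.acked[node] = phase
	}
}

// tryAdvance moves to the next phase only once every node has
// acknowledged the current one.
func (c *coordinator) tryAdvance() bool {
	for _, p := range c.acked {
		if p < c.phase {
			return false
		}
	}
	c.phase++
	return true
}

func main() {
	c := newCoordinator([]string{"n1", "n2"})
	c.ack("n1", 0)
	fmt.Println(c.tryAdvance()) // false: n2 has not acknowledged phase 0
	c.ack("n2", 0)
	fmt.Println(c.tryAdvance()) // true: all nodes moved, phase is now 1
}
```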

@tamird
Contributor

tamird commented Jul 23, 2015

which node is using which schema

What do you mean by node? A gateway? A node containing the data range? They are almost guaranteed to not be the same nodes.

@tbg
Member

tbg commented Jul 23, 2015

You're right, the tricky bit is coordinating those two sides. During each phase, both schemas are valid, but switching to the newer version may require a synchronous rewrite of all the data, which is not something the gateway can be in charge of.
If that rewrite is (hopefully) idempotent, it should be OK for the gateways and nodes to simply enforce the latest schema. So if, for example, an index is added (in which case the first step will probably be something like making sure the index is there and up to date, but not actually using it for anything), and the gateway knows about that already but the range it writes to doesn't, it'll just update the index anyway on a write. Once the range does the migration, it'll have to go through all of its data and update the index, but no biggie.
I'm sure there are a lot more complex cases, if you have something in mind let's discuss those specifically.
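A toy illustration of why idempotence makes the index example above safe: both the write path and the backfill derive the same deterministic index entry from the primary data, so replays and reorderings converge to the same state. The `store` type and its map-backed secondary index are stand-ins, not real storage code:

```go
package main

import "fmt"

// store holds primary data plus a hypothetical secondary index
// mapping each value back to its key.
type store struct {
	primary map[string]string // key → value
	index   map[string]string // value → key
}

// put writes the primary row and eagerly maintains the index entry,
// as a gateway aware of the new schema would. Writing the same entry
// twice is a no-op, so replays are harmless.
func (s *store) put(k, v string) {
	s.primary[k] = v
	s.index[v] = k
}

// backfill rebuilds index entries from the primary data, as a range
// performing the migration would. Because entries are derived
// deterministically, it is safe to run before or after any put.
func (s *store) backfill() {
	for k, v := range s.primary {
		s.index[v] = k
	}
}

func main() {
	s := &store{primary: map[string]string{}, index: map[string]string{}}
	s.put("a", "1")
	s.backfill()
	s.put("a", "1") // replayed write: no change
	fmt.Println(s.index["1"]) // a
}
```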

@tamird
Contributor

tamird commented Jul 23, 2015

@tschottdorf the idea of using such a table is pretty onerous. Remember that there is not just one schema in the system. If we're multitenant we will need to maintain a registry of size len(schemas) * len(nodes) in the kv map which will have the same problem as the schemas in terms of distribution and wanting to avoid consistent reads.

I think this discussion is a bit hand-wavy altogether - let's revisit this when we're ready to discuss how to implement schema changes (before that, we should discuss what types of schema changes we're going to allow). We should be prepared for the possibility of a storage-format-breaking change when we do, though.

@spencerkimball
Member Author

I agree we can "table" this discussion for now. But if the Raft race issues have taught us anything, it's that it's helpful to version the data.

@tbg
Member

tbg commented Jul 23, 2015

@tamird You'll persist that data anyway, so it's really just a question of where to put it. Schema is per-tenant, and a list of nodes per tenant seems fine. The table actually wouldn't have to be accessed or updated much in normal operation. A schema change could be signaled via Gossip, and each node would update its entry with a single CPut. If you insist that all nodes which hold data for a certain tenant register themselves in that table before doing anything, you can atomically switch to the next phase (you need to deal with nodes going down during a schema change, but that'll always be an issue and a Gossip-based timeout may actually do the trick).
In any case, I also think we'll need more time to get anything serious out of this.
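The registry update described above might look roughly like this; `kv`, `onGossip`, and `canSwitch` are hypothetical stand-ins for the KV store, the node's gossip callback, and the coordinator's phase check:

```go
package main

import "fmt"

// kv is a stand-in for the KV map holding the per-tenant registry.
type kv struct {
	m map[string]int
}

// cput mimics a conditional put: set key to newVal only if its
// current value equals expVal.
func (s *kv) cput(key string, expVal, newVal int) bool {
	if s.m[key] != expVal {
		return false
	}
	s.m[key] = newVal
	return true
}

// onGossip is what a node might run when a schema change is gossiped:
// bump its own registry entry from the old phase to the new one with
// a single CPut.
func onGossip(s *kv, node string, oldPhase, newPhase int) bool {
	return s.cput("registry/"+node, oldPhase, newPhase)
}

// canSwitch checks whether every registered node has reached the given
// phase, i.e. whether the coordinator may atomically move to the next one.
func canSwitch(s *kv, nodes []string, phase int) bool {
	for _, n := range nodes {
		if s.m["registry/"+n] < phase {
			return false
		}
	}
	return true
}

func main() {
	s := &kv{m: map[string]int{"registry/n1": 0, "registry/n2": 0}}
	onGossip(s, "n1", 0, 1)
	fmt.Println(canSwitch(s, []string{"n1", "n2"}, 1)) // false: n2 lags
	onGossip(s, "n2", 0, 1)
	fmt.Println(canSwitch(s, []string{"n1", "n2"}, 1)) // true
}
```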

@spencerkimball
Member Author

@tschottdorf there's currently no way to list nodes, and no plans to change that. Even if we did list them, that would be manual and you can imagine that a failing node would have to be manually (and somehow synchronously to all tables) delisted. This would introduce very non-trivial latency to any kind of schema changes in node failure scenarios.

@petermattis
Collaborator

This issue was filed before we had our SQL story in place. Our SQL implementation now supports asynchronous schema changes. There is no work currently planned here, though see #1780 for discussion of handling low-level data format changes.
