-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
opt: Support consistent, low-latency reads anywhere in the world, by making the optimizer latency-aware #31955
Comments
CC: @awoods187 |
Related: #26301. I also thought the one-index-per-zone idea had been discussed there, but now I don't see it. |
Robert Prior Mention cockroachdb/docs#3575 (comment)
|
@jordanlewis can you weigh in here in the context of #26301 ? @andy-kimball we've put #26301 on the SQL execution roadmap for 2.2. Do you think we need both? |
It seems to me that the "multiple indexes" idea is just papering over the lack of quorum reads (or some more flexible read-leases system where multiple nodes can serve reads on a single index. Doesn't it seem that we really ought to be solving this problem at a lower, infrastructe level rather than have SQL work around it? |
@andreimatei, there are a couple of difficult problems that come to mind:
Also, the question here is not whether to use multiple indexes in this way; they're already being used this way in the field. The question is whether we recognize what's already happening, and take steps to make it easier. Furthermore, I think we eventually want to make the optimizer latency-aware independent of this issue, so I don't believe that would be wasted work, even if we implemented quorum reads. Having recently done some experimentation with the geo-distributed features in CockroachDB, I think we're in danger of reaching the "abstraction breaking point", where the abstractions have become complex enough that the average developer can't understand how to effectively use them. If we reach that breaking point, only big companies with specially trained DBAs or specialized consulting teams can put together solutions. This is another reason I don't like "immutable tables", because it requires adding yet another abstraction on top of an already complex system. Instead, we should more fully exploit the surface area of the abstractions we already have. |
I hear some of the things you're saying, and it might come down to a judgement choice, but the more I think about it, the more I personally lean towards saying that we'd be well advised to address this at the KV level. I think that would result in a more usable, understandable and easy to describe/model product. Some responses:
I see this as an argument against starting to push people on a high scale towards multiple indexes like you're proposing. Yes, this proposal doesn't require technically introducing any new concepts, but it sure does introduce a bunch of complexity for users. Multiple indexes with the same data? That's a mind-blowing concept. So I need to create the indexes, then muck with some new zone configs, then somehow ensure that all the pieces to the puzzle (index, zone config, optimizer using the right index) fell into place. And then if one of my regions goes away, what then? I need to remember to delete an index, do something with the zone configs, etc. To make this kind of stuff semi-seamless, I think we'd need some more serious integration of the feature into SQL (e.g. find a way to tie all indexes and their zone config into a table zone config, or something). Contrast with having a KV-level feature that, one way or another, allows multiple nodes to serve reads for a range. You might call this a "new abstraction", but I'd call it a natural extension of the existing model - where people need to understand that one node is special. Multiple nodes serving reads makes the model simpler, not more complicated. One already needs to understand what nodes are involved in a write; there's not much new stuff to understand.
Agreed. If the only thing being discussed here is whether the optimizer should understand index data placement, than I definitely wouldn't argue against that. But if that's all there is to this issue, then we might want to rename it more narrowly :P
This I tend to agree with, at least w.r.t. the form of that proposal that I remember. |
As you know, I've always been an advocate of enabling multiple read replicas at the consensus level. I agree that we can overcome the challenges I list (significant changes to core and reconciliation with existing abstractions). However, I also believe this will be a major effort that requires multiple releases. Do you agree or disagree with that? I don't believe there needs to be an "either-or" choice here, as you seem to be arguing. Longer-term, we should support multiple read replicas at the consensus level. But I think it's a mistake to not simultaneously look at the shorter-term, specifically at how the product is already being used, and how we might better support that usage at a much lower development cost than would be required with the longer-term approach. Furthermore, if this alternate approach allowed us to avoid a problematic stop-gap feature (immutable tables), that makes an even stronger argument we should consider it. I also modified the title of the issue to make it more clear that this issue is about adding latency awareness in the optimizer. |
I don't know, to be honest; haven't thought about it enough to have a good opinion. I'm thinking about it now for 3 minutes and I don't know exactly how I would do it, so it certainly doesn't appear to be the easiest thing. "Multiple releases" still sounds high to me.
I think we're not exactly in disagreement. If the only thing being discussed is whether the optimizer should be aware of index locality, then that's unequivocally a good thing. If, on the other hand, we were to do more for making this multiple-indexes trick a first class citizen (e.g. adding dedicated support in index creation syntax or in table zone config language or even perhaps certain types of documentation), there'd probably be a line after which I, for one, would probably get off the bus.
Well, here I start feeling uneasy; yours is a political, not a technical statement, and it suggests moving these indexes into "first class citizen" territory. The particular problem of small, read-mostly, non-partitionable tables used as FKs by partitioned tables I think is fairly wide-spread and acute, and it deserves a documented solution in a relatively short term. I'm not convinced the solution we advertise for distributing these tables is the definition of multiple indexes on them. I also don't particularly like the proposal in #26301. I have my own controversial idea on the topic, but I'll discuss it separately. |
TLDR: I've done a 180 and now love this idea. Except that I think that the indexes in question should be seen as an implementation detail and hidden from the user. Read requirements should be expressed in a table's zone config and indexes for all the columns created automatically (and hidden).
After talking about it a little over dinner, I now think that this would indeed be a bigger deal than I was inclined to believe before. I now think that Raft, with its majority vote focus, is fundamentally unsuitable for what we'd need. So then I did some soul searching to figure out what exactly about this present proposal rubs me the wrong way. I believe the root of the problem, as I see it, is that multiple indexes containing the same data is not something one should have to do. An index, as a first class SQL concept, is defined by its data. How can one have multiple indexes with the same data? That can't be what one wants. Plus a million other problems that fall from this awkward construct: what happens when the cluster's localities change? What happens when one wants to add a column to the table whose indexes are supposed to contain all the columns? But then I learned to stop worrying and love this idea, but with a twist. Hear me out. So basically before I was expressing concerns that this proposal is a slippery slope and we might go to far. But now I think I see the light - it wasn't going far enough! Asking the user to create a bunch of identical indexes is the uncanny valley. If these indexes are seen just as an implementation detail of the goal of distributing writes, and so hidden from the user instead of in the user's face, then conceptually this makes perfect sense. |
One thing that occured to me is that, besides teaching the optimizer what index to use for |
In 2.2, the optimizer will be planning DML statements so this should fall at naturally (🤞 ). |
I was discussing these issues with @knz and he brought up a thought that I've been having as well: we'd probably be well-advised to move the SQL module away from considering a PK special in any way. Currently, the distinction between a table's PK and its indexes is pervasive in SQL code. The PK has a different data layout than the indexes (e.g. it alone supports column families), and it alone is considered for FK checks and for things like index joins (unless this changed recently) but there's no particular reason for any of that other than path-dependent evolution. |
This is part of the 19.1 release for both read-only and mutation statements. We still require the user to create manual identical indexes, so in a future release we should consider how to make this more intuitive/automatic. I'm closing this issue, since the original goal has been reached. See our docs for more information. |
I think @RaduBerinde was just telling me that FK checks do not yet choose what index to use. No? |
This was mostly about reads which don't involve any FK work. |
This idea was suggested by @drewdeally, who is already using it to address real customer scenarios. However, it currently requires application changes and manual effort. A much better experience could be made possible by adding optimizer support.
App Scenario
Most applications have write-rarely, read-mostly tables that contain "reference data". Reference data is often needed globally, and yet must have low read latency, as it is frequently joined with other tables that are more dynamic. Examples of reference data are a zip code table, a product catalog table, or a holiday calendar table.
Problem
In CockroachDB, a table can be replicated any number of times. However, for any given range of data within the table, only one node (the leaseholder) can serve consistent data to readers. This presents a difficult problem in geo-distributed scenarios; while nodes located near the leaseholder have low read latency, nodes further away can have latencies >100ms (e.g. US to Asia latency). This means that if a query joins to a reference table with a remote leaseholder, the query's latency can never be less than the RTT to that leaseholder.
Solution
CockroachDB already allows table indexes to be pinned to different geo localities (i.e. Europe, US, etc). Therefore, multiple identical indexes can be defined on the same table, where each index is pinned to a different locality.
Then, each index is pinned to nodes in the corresponding locality, using replication zone configuration (requires Enterprise license to create per-index zones).
Now queries that are run in the US can join to
products@us_idx
, while queries run in Europe can join toproducts@eu_idx
. Both queries experience local latencies. Furthermore, the reads are consistent, meaning that they always see the latest updates to the products table.The tradeoff for the low-latency reads are more secondary indexes to store and update when writes occur.
Optimizer Support
The manual version of this capability requires the application to explicitly specify the right index to join against. This requires the application to know what region it's running in, and then to generate a different version of the query based on that.
The optimizer could be upgraded to incorporate latencies into its cost model. When multiple identical indexes are available, the optimizer would select the index having the nearest leaseholder. So the same query could use a different index depending on which Gateway received it. No changes to the application would be required; it can remain locality-agnostic.
CC: @petermattis, @RaduBerinde
The text was updated successfully, but these errors were encountered: