backupccl: reduce over-splitting during RESTORE #86496
Conversation
wow, beat me to it! I've got another impl that seems slightly more involved here. I'll plan to dig a little deeper here next week. |
@msbutler took a peek at your version, but I think I prefer just widening the rightmost cover span as we go, as I do here, vs needing to mess with the covering after we make it. I think I'd suggest you take this PR over instead, but up to you. I think the main thing I'd like to do here to feel better about merging it is to add another test case where the target size isn't zero, and make what restore passes a byte-size cluster setting. I'd also like to enable as-we-go splits in the batcher, but I think my roachtest run shows that that isn't a strict requirement to merge this as is. On the level merging iter PR: I'm not really sure this diff changes things w.r.t. that? This is putting more |
ah rats, Bilal's PR cannot get applied to RESTORE. I was conflating |
The high level difference between David's implementation and mine is:
Given all this, I think it makes sense to go with David's lighter-weight implementation, as long as we also implement range splitting in the batcher, which we were planning to do anyway. Given how late we are in the release, I don't think any of this should be enabled by default. |
@dt per weekly standup, I'm handing this back to you. |
I'm tempted to enable it, and potentially even backport too, to fix the bad behavior of over-splitting, at least to 22.1. I'm not that scared of it, since we're just changing how we divide the spans, not what we do with them or how we do it really. |
You can make the call, I’m ok with it if the tests are good (including the cluster testing we plan to do). |
I have two high level worries about this diff (that are perhaps off base):
First, this diff makes it easier for one node to end up processing significantly more data, relative to other nodes in the flow. This has a couple risky consequences:
- the restore runtime can get bottlenecked by one node
- kv may feel compelled to move full ranges from the hot node, which is probably more expensive than merging small ranges
Here's one way this could happen, even with range splitting enabled:
- suppose the incremental backups add significantly more data to a single `restoreSpanEntry` that `makeSimpleImportSpans` has grown, as the full backup contained relatively less data in this span. Also suppose that before this diff, the data from the incremental backups would have been divided into several `restoreSpanEntry`'s that also got divided over several import span chunks.
- before this diff, the incremental backup data was divided across chunks, and therefore, work was distributed across the cluster.
- with this diff, it is more likely that all the incremental backup data will end up in fewer chunks, and therefore get processed by fewer nodes.
Second, given that this diff does not change our checkpointing procedure (something we said should happen after this work), the larger spans this work creates make checkpointing more imprecise, which could make restore job pausing/resuming more expensive (i.e. #81116, though `disallowShadowing` may have been disabled?).
I thought a little about this initially, but came away thinking this is true anyway, even without merging base spans at all, so we shouldn't let it stop us from fixing the issue at hand. If we think about when we might hit this case -- where the base backup has very little data in some span but later layers have much more in that span -- it is going to be most common when you create a table, back it up, then load data into it, i.e. many ranges of data, and then back up incrementally; all those big, full ranges in the incremental backup fall within the one span the base backup produced. This however is all true without merging -- it's a single span per table when it is empty -- so the only difference with merging is magnitude, and my thesis here is that unfortunate magnitudes are reachable anyway. If we decide that we care about this -- I'm not sure we really do right now -- then I think we'd want to do something more general that post-processes the covering if any one chunk is more than X% (probably 1/nodes*2 or something), but I don't really see it as coupled to merging small base spans.
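If we did decide to care, a purely hypothetical sketch of that kind of post-processing pass might look like the following; the `coverEntry` type and the halving placeholder are made up for illustration, not anything in the tree:

```go
package main

import "fmt"

// coverEntry is a simplified stand-in for one entry in the restore covering.
type coverEntry struct {
	start, end string
	sizeBytes  int64
}

// rebalanceCover caps any one chunk at roughly 2x its fair share of the total
// (i.e. 1/nodes*2). Picking real split keys would need per-file size info from
// the entry, so this placeholder just halves the byte count repeatedly.
func rebalanceCover(cover []coverEntry, totalBytes int64, numNodes int) []coverEntry {
	maxChunk := 2 * totalBytes / int64(numNodes)
	var out []coverEntry
	for _, e := range cover {
		for maxChunk > 0 && e.sizeBytes > maxChunk {
			half := e.sizeBytes / 2
			out = append(out, coverEntry{e.start, e.end, half})
			e.sizeBytes -= half
		}
		out = append(out, e)
	}
	return out
}

func main() {
	// One 10gb chunk and one 1gb chunk across a 10-node cluster: the big
	// chunk exceeds 2*(11gb/10) and gets subdivided.
	cover := []coverEntry{{"a", "m", 10 << 30}, {"m", "z", 1 << 30}}
	fmt.Println(rebalanceCover(cover, 11<<30, 10))
}
```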
I don't think we should make extra spans here just for the sake of progress checkpoints, since we already know that doing so does not actually reduce lost progress that much; our losses are already due to out-of-order span completion, not individual span sizes, and even post-merging, our span sizes here are still, in absolute numbers, generally small. I don't really find the progress effects that compelling: this was always intended to have ~range-sized chunks, is currently dramatically undershooting that, and 8mb or 64mb or 364mb or even 3gb are all tiny numbers in TB restores, where we really care about progress. In a tiny restore where the difference here is actually a real % of the total size, the total size is small enough that we don't really care much about progress anyway (and there's a minimum time between progress updates too). |
Cool, I'm no longer worried about this diff creating an uneven distribution of work. Sorry, I should have been clearer on why I worry about checkpointing. Like you, I don't care that much about the additional lost work this diff would create. I worry that the larger spans will cause restore to slow to a crawl more frequently due to CheckForSSTCollisions. In 22.1 and 21.2, we set |
The lost-work and the |
Yes, both are caused by our naive ordering-sensitive checkpointing strategy, which, as you say, causes us to reingest a bunch of chunks again. My point is: if the avg span entry size is significantly bigger and the number of span entries retried remains constant (not sure if this is correct), then this diff means we're more susceptible to the check for key collisions problem. Regardless, I think we should set |
Previously sstBatcher had a boolean to control whether it would split and then scatter a range it was filling when it could determine that it was about to overfill it, i.e. if it had previously reported its remaining capacity to be less than what the batcher was about to add. RESTORE set this boolean to disable this splitting and scattering, as it pre-split and pre-scattered all of its ranges and thus did not expect the batcher to need to do any as-it-fills splitting and scattering.

However, blanket disabling this makes little sense: if the batcher knows that it is about to overfill a range, regardless of whether or not that range was supposed to be pre-split to avoid this, it always makes sense to split it before adding to it rather than overfill it. Thus this change makes as-we-fill splits unconditional, and moves the boolean to be more narrowly scoped to just disabling the scatters, as RESTORE still pre-scatters its pre-chunked restore spans. This paves the way for RESTORE to make and pre-split larger chunks, without worrying if doing so will cause overfilling of ranges.

Release note: none.
Release justification: bug fix.
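A minimal sketch of the control flow that commit message describes; names like `batcher`, `beforeOverfill`, `splitAt`, and `scatterAt` are illustrative stand-ins, not the real sstBatcher API:

```go
package main

import "fmt"

type batcher struct {
	// Previously one flag disabled both the as-we-fill split and the
	// scatter; now only the scatter is optional.
	disableScattering bool
}

// beforeOverfill runs when the batcher knows the data it is about to add
// would overfill the range it is currently filling.
func (b *batcher) beforeOverfill(splitKey string) error {
	// Always split rather than overfill, even if the caller pre-split.
	if err := splitAt(splitKey); err != nil {
		return err
	}
	// RESTORE pre-scatters its pre-chunked spans, so it may still skip
	// scattering the new right-hand range.
	if b.disableScattering {
		return nil
	}
	return scatterAt(splitKey)
}

// splitAt and scatterAt stand in for admin split/scatter requests.
func splitAt(key string) error   { fmt.Println("split at", key); return nil }
func scatterAt(key string) error { fmt.Println("scatter at", key); return nil }

func main() {
	restoreBatcher := &batcher{disableScattering: true}
	_ = restoreBatcher.beforeOverfill("/Table/52/1/1000")
}
```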
5fef88b to 7b55fc4
When constructing the spans to restore, we can produce fewer, wider spans that each contain more files, to avoid splitting as much. Previously we split at the bounds of every file in the base backup, then added incremental layer files to any span they overlapped. This changes that slightly: a file that would create a new span at the right-hand side of the span covering instead extends the current rightmost entry in the covering, if the total size of what has been added to that entry is still below our target size per span (384mb).

Release justification: bug fix.
Release note (bug fix): reduce the amount that RESTORE over-splits.
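A rough sketch of that merging rule, as a standalone illustration with simplified stand-in types rather than the actual `makeSimpleImportSpans` code:

```go
package main

import "fmt"

// fileSpan is a simplified stand-in for a base-backup file's key span and size.
type fileSpan struct {
	start, end string
	sizeBytes  int64
}

// coverEntry is one entry in the resulting restore span covering.
type coverEntry struct {
	start, end string
	sizeBytes  int64 // total size of data assigned to this entry so far
}

// makeCover would previously start a new covering entry at every base file's
// bounds; here, a file that would start a new rightmost entry instead extends
// the current rightmost entry while that entry is still below targetSize.
// A targetSize of zero reproduces the old one-entry-per-file behavior.
func makeCover(baseFiles []fileSpan, targetSize int64) []coverEntry {
	var cover []coverEntry
	for _, f := range baseFiles {
		if n := len(cover); n > 0 && targetSize > 0 && cover[n-1].sizeBytes < targetSize {
			// Widen the rightmost entry rather than splitting again.
			cover[n-1].end = f.end
			cover[n-1].sizeBytes += f.sizeBytes
			continue
		}
		cover = append(cover, coverEntry{start: f.start, end: f.end, sizeBytes: f.sizeBytes})
	}
	return cover
}

func main() {
	files := []fileSpan{{"a", "c", 100 << 20}, {"c", "f", 200 << 20}, {"f", "k", 300 << 20}}
	fmt.Println("no merging:  ", len(makeCover(files, 0)), "spans")
	fmt.Println("384mb target:", len(makeCover(files, 384<<20)), "spans")
}
```

With a zero target this produces one span per file (three here); with the 384mb target the three files collapse into a single wider span.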
The non-test code looks good to me. Last question: how confident are you that the test cases you put in exercise every nook and cranny of the new `makeSimpleImportSpans`? Before backporting, I wonder if we ought to write a few more specific test cases to ensure this. |
The random test does a good job of ensuring correctness (it caught a bug here). It doesn't do much in terms of proving we merged into any specific number of spans for perf, but it proves we have the right files in the right spans I think, which makes me feel better about this. |
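For reference, the shape of that kind of property check (a standalone sketch assuming a simplified integer keyspace, not the actual test in backupccl):

```go
package main

import (
	"fmt"
	"math/rand"
)

// fileSpan and coverEntry use simplified integer keys for illustration.
type fileSpan struct{ start, end int }

type coverEntry struct {
	start, end int
	files      []fileSpan
}

// checkCover asserts the property the randomized test is really after: every
// file must be present in every covering entry its span overlaps, regardless
// of how the entries were sized or merged.
func checkCover(files []fileSpan, cover []coverEntry) error {
	for _, f := range files {
		for _, e := range cover {
			if !(f.start < e.end && e.start < f.end) {
				continue // no overlap
			}
			found := false
			for _, ef := range e.files {
				if ef == f {
					found = true
					break
				}
			}
			if !found {
				return fmt.Errorf("file %v missing from overlapping entry [%d, %d)", f, e.start, e.end)
			}
		}
	}
	return nil
}

// buildCover stands in for the covering code under test; here it trivially
// places every file in one wide entry, which satisfies the property above.
func buildCover(files []fileSpan) []coverEntry {
	return []coverEntry{{start: 0, end: 1 << 20, files: files}}
}

func main() {
	for i := 0; i < 100; i++ {
		var files []fileSpan
		for j := 0; j < 10; j++ {
			s := rand.Intn(100)
			files = append(files, fileSpan{s, s + 1 + rand.Intn(10)})
		}
		if err := checkCover(files, buildCover(files)); err != nil {
			panic(err)
		}
	}
	fmt.Println("ok")
}
```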
TFTR! bors r+ |
I tagged it for backport; we can see how long we want to bake on 22.2 before we merge it |
Build failed (retrying...): |
Build succeeded: |
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool.
error creating merge commit from 0ee547a to blathers/backport-release-22.1-86496: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []
You may need to manually resolve merge conflicts with the backport tool.
Backport to branch 22.1.x failed. See errors above.
🦉 Hoot! I am Blathers, a bot for CockroachDB. My owner is otan. |
@dt I have been waiting eagerly for this change as I'm still blocked on restoring, per #86470, so thanks for cranking it out. I'm currently on v22.1.5. I'll either need to wait until this gets backported to 22.1, deploy the next alpha of 22.2, or build it myself. If you have any suggestions on which is best, or on the timeline of when these will happen, I'd appreciate it. |