
backupccl: reduce over-splitting during RESTORE #86496

Merged
merged 2 commits into cockroachdb:master on Sep 2, 2022

Conversation

dt
Member

@dt dt commented Aug 19, 2022

When constructing the spans to restore, we can produce fewer, wider spans
that each contain more files to avoid splitting as much.

Previously we split at the bounds of every file in the base backup, then
added incremental layer files to any span they overlapped. This changes
that slightly: a file that would have created a new span at the
right-hand side of the covering instead extends the current rightmost
entry in the covering, as long as the total size of what has been added
to it is still below our target size per span (384mb).

Release justification: bug fix.

Release note (bug fix): reduce the amount that RESTORE over-splits.
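
To make the mechanism concrete, here is a minimal, self-contained sketch of the idea described above -- not the actual backupccl code. Instead of starting a new covering entry at every base-backup file boundary, the loop keeps extending the rightmost entry until the size attributed to it reaches the target. The types and the makeBaseCovering helper are hypothetical stand-ins.

```go
// Hypothetical sketch of the merging described in this PR's description;
// names and types are illustrative, not the real backupccl code.
package main

import "fmt"

type span struct{ start, end string }

type fileSpan struct {
	sp        span
	sizeBytes int64
}

type coverEntry struct {
	sp        span
	files     []fileSpan
	sizeBytes int64
}

const targetSpanSize = 384 << 20 // 384 MiB target per restore span

// makeBaseCovering walks the sorted, non-overlapping base-backup files and
// either extends the rightmost covering entry or starts a new one.
func makeBaseCovering(files []fileSpan) []coverEntry {
	var cover []coverEntry
	for _, f := range files {
		if n := len(cover); n > 0 && cover[n-1].sizeBytes+f.sizeBytes <= targetSpanSize {
			// Extend the rightmost entry to include this file rather than
			// splitting the covering at the file's start key.
			last := &cover[n-1]
			last.sp.end = f.sp.end
			last.files = append(last.files, f)
			last.sizeBytes += f.sizeBytes
			continue
		}
		cover = append(cover, coverEntry{sp: f.sp, files: []fileSpan{f}, sizeBytes: f.sizeBytes})
	}
	return cover
}

func main() {
	files := []fileSpan{
		{span{"a", "b"}, 100 << 20},
		{span{"b", "c"}, 100 << 20},
		{span{"c", "d"}, 300 << 20}, // would exceed the target, so it starts a new entry
	}
	for _, e := range makeBaseCovering(files) {
		fmt.Printf("span [%s,%s) files=%d size=%dMiB\n", e.sp.start, e.sp.end, len(e.files), e.sizeBytes>>20)
	}
}
```

Incremental-layer files would then be assigned to whichever of these (now wider) entries they overlap, as before.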

@dt dt requested review from stevendanna, msbutler and a team August 19, 2022 22:33
@dt dt requested a review from a team as a code owner August 19, 2022 22:33

@dt
Member Author

dt commented Aug 19, 2022

Here's a run of the same backup we use in the restore2tb test -- which is roughly 660GB data size -- being restored with and without (an earlier version of) this patch. Note the axis labels -- the master cluster created 21600 ranges, while the merging-spans cluster created 2200 ranges (and later added another ~90 from background splits, after the restore, of ranges we'd overfilled):

master: [screenshot: range count over time]

patched: [screenshot: range count over time]

I think I was using a slightly different target size and lower guess sizes for that run (I believe they were 448 and 8mb, but I could be wrong) than are in the diff now, so a re-run could be good, but I'm about out of time on this for the week so figured I'd post the initial result.

I did notice some rebalancing traffic on the merged-spans cluster however, indicating we were not splitting quite enough to avoid any post-ingest rebalancing:
[screenshot: rebalancing traffic]
I suspect we'd benefit from enabling as-we-fill splits in the sst batcher to avoid these post-ingest rebalances.
The total amount of rebalancing however isn't all that concerning and didn't have a noticeable impact on top-line restore time: 41:00 on the control master, and 40:17 on the experimental merging build.

The merging build's rebalancing did mean that final disk usage on that cluster was 2.1% higher than the master one where everything went straight to L6 and never moved, but again, not a big enough difference to worry me and it'll compact away with time.

@msbutler
Collaborator

wow, beat me to it! I've got another impl that seems slightly more involved here. I'll plan to dig a little deeper here next week.

@msbutler
Collaborator

msbutler commented Aug 19, 2022

will also consider how this jibes with @itsbilal's PR

@dt
Member Author

dt commented Aug 19, 2022

@msbutler took a peek at your version, but I think I prefer just widening the rightmost cover span as we go, as I do here, vs needing to mess with the covering after we make it. I think I'd suggest you take this PR over instead, but up to you.

I think the main thing I'd like to do here to feel better about merging it is to add another test case where the target size isn't zero, and to make what restore passes a byte-size cluster setting.

I'd also like to enable as-we-go-splits in the batcher, but I think my roachtest run shows that that isn't a strict requirement to merge this as is.

On the level-merging iter PR: I'm not really sure this diff changes things w.r.t. that? This is putting more files into each restore span, sure, but there were already no real guarantees about sstables -- as opposed to files -- not overlapping within a level of a restore span (indeed they overlap all the time within each incremental layer). The only difference is that previously, in the first level, we'd never assign multiple files to the same span, but now we will; we could and often would already assign multiple files to the same span in incremental levels, so there was already the potential for overlapping sstables (since even though the files are non-overlapping, they can be in different sstables which do overlap), right?

@msbutler
Collaborator

msbutler commented Aug 22, 2022

ah rats, Bilal's PR cannot get applied to RESTORE. I was conflating files and SSTables in makeSimpleImportSpans. It's a bit confusing that file != sstable; this has tripped me up several times.

@msbutler
Collaborator

msbutler commented Aug 22, 2022

The high level difference between David's implementation and mine is:

  • my implementation expands the spans after the restoreSpanEntry has formed, which provides an upper bound on the amount of work in a span. The upper bound prevents mid-/post-RESTORE splits of ranges that are already full of data.
  • David's expands the spans during restoreSpanEntry formation, and only when evaluating the 0th layer (the full backup spans). This impl cannot guarantee an upper bound on the work in a given span. So, if incremental backups add significantly more work to a span, this implementation could cause mid-/post-restore range splits, which are much more expensive than the mid-/post-restore range merges that we're currently dealing with.

Given all this, I think it makes sense to go with David's lighter-weight implementation, as long as we also implement range splitting in the batcher, which we were planning to do anyway. Given how late we are in the release, I don't think any of this should be enabled by default.

@msbutler msbutler self-assigned this Aug 22, 2022
@msbutler msbutler assigned dt and unassigned msbutler Aug 30, 2022
@msbutler
Collaborator

@dt per weekly standup, I'm handing this back to you.

@dt dt requested a review from a team as a code owner August 30, 2022 23:59
@dt
Member Author

dt commented Aug 31, 2022

Added another commit to enable as-we-fill splits of full ranges, narrowing the previous blanket disabling of the batcher's splits during restore to just a disabling of scatters.
Results look promising (note the axis labels, i.e. the magnitudes of these are not comparable):

master, all pre-split and very over-split (35min, 21602 ranges):
[screenshot]

just bigger spans, no as-we-go splits (32min, 1976 ranges):
[screenshot]

bigger spans plus enabling as-we-go splits (29min, 1953 ranges):
[screenshot]

@dt
Member Author

dt commented Aug 31, 2022

Given how late we are in the release, I don't think any of this should be enabled by default.

I'm tempted to enable it, and potentially even backport too, to fix the bad behavior of over-splitting, at least to 22.1. I'm not that scared of it, since we're just changing how we divide the spans, not what we do with them or how we do it really.

@shermanCRL
Contributor

Given how late we are in the release, I don't think any of this should be enabled by default.

I'm tempted to enable it, and potentially even backport too, to fix the bad behavior of over-splitting, at least to 22.1. I'm not that scared of it, since we're just changing how we divide the spans, not what we do with them or how we do it really.

You can make the call, I’m ok with it if the tests are good (including the cluster testing we plan to do).

Collaborator

@msbutler msbutler left a comment


I have two high level worries about this diff (that are perhaps off base):
First, this diff makes it easier for one node to end up processing significantly more data, relative to other nodes in the flow. This has a couple risky consequences:

  1. the restore runtime can get bottlenecked by one node
  2. kv may feel compelled to move full ranges from the hot node, which is probably more expensive than merging small ranges

Here's one way this could happen, even with range splitting enabled:

  • suppose the incremental backups add significantly more data to a single restoreSpanEntry that makeSimpleImportSpans has grown, as the full backup contained relatively less data in this span. Also suppose that before this diff, the data from the incremental backups would have been divided into several restoreSpanEntry's that also got divided over several import span chunks.
  • before this diff, the incremental backup data was divided across chunks, and therefore, work was distributed across the cluster.
  • with this diff, it is more likely that all the incremental backup data will end up in fewer chunks, and therefore get processed by fewer nodes.

Second, given that this diff does not change our checkpointing procedure (something we said should happen after this work), the larger spans this work creates make checkpointing more imprecise, which could make restore job pausing/resuming more expensive (i.e. #81116, though disallowShadowing may have been disabled?).

(resolved review threads on pkg/ccl/backupccl/restore_span_covering.go)
@dt
Member Author

dt commented Aug 31, 2022

suppose the incremental backups add significantly more data to a single restoreSpanEntry that makeSimpleImportSpans has grown, as the full backup contained relatively less data in this span

I thought a little about this initially, but came away thinking it is true anyway, even without merging base spans at all, so we shouldn't let it stop us from fixing the issue at hand. If we think about when we might hit this case -- where the base backup has very little data in some span but later layers have much more in that span -- it is going to be most common when you create a table, back it up, then load data into it, i.e. many ranges of data, and then back up incrementally; all those big, full ranges of data in the incremental backup fall within the one span from the base backup. This however is all true without merging -- it's a single span per table when it is empty -- so the only difference with merging is magnitude, and my thesis here is that unfortunate magnitudes are reachable anyway.

If we decide that we care about this -- I'm not sure we really do right now -- then I think we'd want to do something more general, that post-processes the covering if any one chunk is more than X% (probably 1/nodes*2 or something), but I don't really see it as coupled to merging small base spans.
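
For illustration only, here is a hypothetical sketch of the more general post-processing floated above (explicitly not part of this PR): after the covering is built, re-split any entry whose size exceeds some fraction of the total, at file boundaries, so no single chunk dominates the work. The fraction, types, and names (chunk, rebalanceCovering) are illustrative stand-ins, not existing code.

```go
// Hypothetical post-processing sketch: cap any one covering entry at a
// fraction of the total restore size by re-splitting it at file boundaries.
package main

import "fmt"

type chunk struct {
	table     string
	fileSizes []int64 // sizes of the files assigned to this chunk
}

func (c chunk) size() int64 {
	var s int64
	for _, sz := range c.fileSizes {
		s += sz
	}
	return s
}

// rebalanceCovering re-splits chunks larger than maxFraction of the total.
func rebalanceCovering(cover []chunk, maxFraction float64) []chunk {
	var total int64
	for _, c := range cover {
		total += c.size()
	}
	maxBytes := int64(float64(total) * maxFraction)
	var out []chunk
	for _, c := range cover {
		if c.size() <= maxBytes || len(c.fileSizes) <= 1 {
			out = append(out, c)
			continue
		}
		// Greedily re-chunk this entry's files into pieces of at most maxBytes
		// (a single file larger than maxBytes still becomes its own piece).
		piece := chunk{table: c.table}
		for _, sz := range c.fileSizes {
			if len(piece.fileSizes) > 0 && piece.size()+sz > maxBytes {
				out = append(out, piece)
				piece = chunk{table: c.table}
			}
			piece.fileSizes = append(piece.fileSizes, sz)
		}
		out = append(out, piece)
	}
	return out
}

func main() {
	cover := []chunk{
		{table: "small", fileSizes: []int64{10 << 20}},
		{table: "big", fileSizes: []int64{300 << 20, 300 << 20, 300 << 20}},
	}
	// 0.25 is an arbitrary example fraction, standing in for the "X%" above.
	for _, c := range rebalanceCovering(cover, 0.25) {
		fmt.Printf("%s: %d MiB across %d files\n", c.table, c.size()>>20, len(c.fileSizes))
	}
}
```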

does not change our checkpointing procedure (something we said should happen after this work)

I don't think we should make extra spans here just for the sake of progress checkpoints, since we already know that doing so does not actually make progress loss that much smaller; our losses are already due to out-of-order span completion, not individual span sizes, and even post-merging our span sizes here are still, in absolute numbers, generally small. I don't really find the progress effects that compelling: this was always intended to have ~range-sized chunks, is currently dramatically undershooting that, and 8mb or 64mb or 384mb or even 3gb are all tiny numbers in TB restores, where we really care about progress. In a tiny restore where the difference here is actually a real % of the total size, the total size is small enough that we don't really care much about progress anyway (and there's a minimum time between progress updates too).
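
As a small, hypothetical illustration of the out-of-order-completion point above (assuming, as the discussion implies, that progress is persisted roughly as a high-water mark over the ordered spans; highWater and the bookkeeping here are made up for the example):

```go
// Illustration: with prefix-style checkpointing, spans finished beyond the
// first unfinished span are redone on resume, regardless of span size.
package main

import "fmt"

// highWater returns the index of the first incomplete span; everything at or
// after it is re-done on resume, even if it already completed out of order.
func highWater(completed []bool) int {
	for i, done := range completed {
		if !done {
			return i
		}
	}
	return len(completed)
}

func main() {
	// Spans 0 and 1 finished, span 2 did not, spans 3-5 finished out of order.
	completed := []bool{true, true, false, true, true, true}
	hw := highWater(completed)
	redone := 0
	for _, done := range completed[hw:] {
		if done {
			redone++
		}
	}
	fmt.Printf("checkpoint at span %d; %d already-completed spans will be redone\n", hw, redone)
}
```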

@msbutler
Collaborator

msbutler commented Aug 31, 2022

Cool, I'm no longer worried about this diff creating an uneven distribution of work.

Sorry, I should have been clearer on why I worry about checkpointing. Like you, I don't care that much about the additional lost work this diff would create. I worry that the larger spans will cause RESTORE to slow to a crawl more frequently due to CheckForSSTCollisions. In 22.1 and 21.2, we set disallowShadowing to false. Perhaps in this PR or after the branch cut, we should set disallowShadowing to false in 22.2 as well.

@dt
Member Author

dt commented Aug 31, 2022

The lost work and the disallowShadowing overlaps are one and the same though, aren't they? The spans where disallowShadowing gets you are the non-empty spans that you previously filled, i.e. the lost work. My point above is that the main driver of spans we'd fill a second time isn't the chunk size, but the ordering-sensitive checkpoint, so I don't care about growing the chunk size that much. If it were just one 384mb span per proc that had to do expensive disallowShadowing checks on resume, that'd still be completely fine with me. The problem today, as I understand it, is that we redo hundreds of chunks, don't we?

@msbutler
Collaborator

msbutler commented Aug 31, 2022

The lost work and the disallowShadowing overlaps are one and the same though, aren't they?

Yes, both are caused by our naive ordering-sensitive checkpointing strategy, which, as you say, causes us to reingest a bunch of chunks. My point is: if the average span entry size is significantly bigger and the number of span entries retried remains constant (not sure if this is correct), then this diff makes us more susceptible to the check-for-key-collisions problem. Regardless, I think we should set disallowShadowing == false for 22.2, as we have not yet updated our checkpointing scheme, which makes my point moot.

Previously sstBatcher had a boolean to control whether it would split
and then scatter a range it was filling when it could determine that it
was about to overfill it, i.e. if it had previously reported its
remaining capacity to be less than what the batcher was about to add.
RESTORE set this boolean to disable this splitting and scattering, as it
pre-split and pre-scattered all of its ranges and thus did not expect
the batcher to need to do any as-it-fills splitting and scattering.

However blanket disabling this makes little sense: if the batcher knows
that it is about to overfill a range, regardless of whether or not that
range was supposed to be pre-split to avoid this, it always makes sense
to split it before adding to it, rather than overfill it.

Thus this change makes as-we-fill splits unconditional, and moves the
boolean to be more narrowly scoped to just disabling the scatters, as
RESTORE still pre-scatters its pre-chunked restore spans. This paves the
way for RESTORE to make and pre-split larger chunks, without worrying if
doing so will cause overfilling of ranges.

Release note: none.

Release justification: bug fix.
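
The following is a minimal, hypothetical sketch of the as-we-fill behavior this commit message describes; the real sstBatcher is far more involved, and the batcher type, addSST, and the capacity bookkeeping here are stand-ins for illustration only.

```go
// Sketch: if the batcher learns a range has less remaining capacity than what
// it is about to add, it splits first, and only scatters when the caller has
// not disabled scattering (RESTORE pre-scatters, so it disables scatters).
package main

import "fmt"

const rangeMaxBytes = 512 << 20 // illustrative fresh-range capacity

type batcher struct {
	disableScatters bool
	// remainingCap is the last-reported free capacity of the range being
	// filled, in bytes; a real implementation learns this from KV responses.
	remainingCap int64
}

func (b *batcher) addSST(splitKey string, sstBytes int64) {
	if b.remainingCap > 0 && sstBytes > b.remainingCap {
		// About to overfill the range: split before adding, unconditionally.
		fmt.Printf("splitting at %q before adding %d bytes\n", splitKey, sstBytes)
		if !b.disableScatters {
			fmt.Printf("scattering new right-hand range at %q\n", splitKey)
		}
		b.remainingCap = rangeMaxBytes // now filling a fresh right-hand range
	}
	b.remainingCap -= sstBytes
	fmt.Printf("ingested %d bytes, ~%d bytes of capacity left\n", sstBytes, b.remainingCap)
}

func main() {
	b := &batcher{disableScatters: true, remainingCap: 64 << 20}
	b.addSST("k1", 32<<20)  // fits in the current range
	b.addSST("k2", 128<<20) // would overfill: split (but do not scatter) first
}
```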
@dt dt force-pushed the restore-spans branch 2 times, most recently from 5fef88b to 7b55fc4 Compare September 1, 2022 12:49
@msbutler
Collaborator

msbutler commented Sep 1, 2022

The non-test code looks good to me. Last question: how confident are you that the test cases you put in exercise every nook and cranny of the new makeSimpleImportSpans? Before backporting, I wonder if we ought to write a few more specific test cases to ensure this. makeSimpleImportSpans is no longer so simple :)

@dt
Member Author

dt commented Sep 1, 2022

how confident are you that the test cases you put in exercise every nook and cranny of the new makeSimpleImportSpans

The random test does a good job of ensuring correctness (it caught a bug here). It doesn't do much in terms of proving we merged into any specific number of spans for perf, but it proves we have the right files in the right spans I think, which makes me feel better about this.
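
For context, here is a sketch of the kind of assignment property such a randomized test can check, with hypothetical types (span, fileSpan, coverEntry) and a made-up checkCovering helper -- not the actual test code: regardless of how entries were merged, every file should appear in exactly the covering entries whose spans it overlaps.

```go
// Property-check sketch: a file is assigned to a covering entry iff the
// file's span overlaps the entry's span.
package main

import "fmt"

type span struct{ start, end string }

func overlaps(a, b span) bool { return a.start < b.end && b.start < a.end }

type fileSpan struct {
	name string
	sp   span
}

type coverEntry struct {
	sp    span
	files []fileSpan
}

// checkCovering returns an error describing the first violation it finds.
func checkCovering(files []fileSpan, cover []coverEntry) error {
	for _, f := range files {
		for _, e := range cover {
			assigned := false
			for _, ef := range e.files {
				if ef.name == f.name {
					assigned = true
					break
				}
			}
			if overlaps(f.sp, e.sp) != assigned {
				return fmt.Errorf("file %s [%s,%s) vs entry [%s,%s): overlap=%t, assigned=%t",
					f.name, f.sp.start, f.sp.end, e.sp.start, e.sp.end, overlaps(f.sp, e.sp), assigned)
			}
		}
	}
	return nil
}

func main() {
	files := []fileSpan{{"f1", span{"a", "c"}}, {"f2", span{"c", "e"}}}
	cover := []coverEntry{{span{"a", "e"}, files}}
	fmt.Println(checkCovering(files, cover)) // <nil>
}
```

A randomized driver would generate random file layouts and target sizes, build the covering, and assert this property holds, which checks correctness without pinning down how many spans the merge produces.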

@dt
Member Author

dt commented Sep 1, 2022

TFTR!

bors r+

@dt
Member Author

dt commented Sep 1, 2022

I tagged it for backport; we can see how long we want to bake on 22.2 before we merge it

@craig
Contributor

craig bot commented Sep 1, 2022

Build failed (retrying...):

@craig
Contributor

craig bot commented Sep 2, 2022

Build succeeded:

@craig craig bot merged commit c080f7a into cockroachdb:master Sep 2, 2022
@blathers-crl

blathers-crl bot commented Sep 2, 2022

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 0ee547a to blathers/backport-release-22.1-86496: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.1.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@dt dt deleted the restore-spans branch September 2, 2022 12:40
@dankinder

@dt I have been waiting eagerly for this change as I'm still blocked on restoring, per #86470 so thanks for cranking it out.

I'm currently on V22.1.5. I'll either need to wait until this gets backported to 22.1, or deploy the next alpha of 22.2, or build it myself. If you have any suggestion on which is best or the timeline of when these will happen I'd appreciate it.
