
backupccl: reduce over-splitting during RESTORE #86496

Merged
merged 2 commits into cockroachdb:master on Sep 2, 2022

Conversation

dt
Member

@dt dt commented Aug 19, 2022

When constructing the spans to restore, we can produce fewer, wider spans
that each contain more files to avoid splitting as much.

Previously we split at the bounds of every file in the base backup, then
added incremental layer files to any span they overlapped. This changes
that slightly: a file that would have created a new span at the
right-hand side of the covering instead extends the current rightmost
entry in the covering, as long as the total size of what has been added
to it is still below our target size per span (384mb).

Release justification: bug fix.

Release note (bug fix): reduce the amount that RESTORE over-splits.
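
To make the mechanism concrete, here is a minimal, self-contained sketch of the idea described above -- not the actual backupccl code. Instead of starting a new covering entry at every base-backup file boundary, the loop keeps extending the rightmost entry until the size attributed to it reaches the target. The types and the makeBaseCovering helper are hypothetical stand-ins.

```go
// Hypothetical sketch of the merging described in this PR's description;
// names and types are illustrative, not the real backupccl code.
package main

import "fmt"

type span struct{ start, end string }

type fileSpan struct {
	sp        span
	sizeBytes int64
}

type coverEntry struct {
	sp        span
	files     []fileSpan
	sizeBytes int64
}

const targetSpanSize = 384 << 20 // 384 MiB target per restore span

// makeBaseCovering walks the sorted, non-overlapping base-backup files and
// either extends the rightmost covering entry or starts a new one.
func makeBaseCovering(files []fileSpan) []coverEntry {
	var cover []coverEntry
	for _, f := range files {
		if n := len(cover); n > 0 && cover[n-1].sizeBytes+f.sizeBytes <= targetSpanSize {
			// Extend the rightmost entry to include this file rather than
			// splitting the covering at the file's start key.
			last := &cover[n-1]
			last.sp.end = f.sp.end
			last.files = append(last.files, f)
			last.sizeBytes += f.sizeBytes
			continue
		}
		cover = append(cover, coverEntry{sp: f.sp, files: []fileSpan{f}, sizeBytes: f.sizeBytes})
	}
	return cover
}

func main() {
	files := []fileSpan{
		{span{"a", "b"}, 100 << 20},
		{span{"b", "c"}, 100 << 20},
		{span{"c", "d"}, 300 << 20}, // would exceed the target, so it starts a new entry
	}
	for _, e := range makeBaseCovering(files) {
		fmt.Printf("span [%s,%s) files=%d size=%dMiB\n", e.sp.start, e.sp.end, len(e.files), e.sizeBytes>>20)
	}
}
```

Incremental-layer files would then be assigned to whichever of these (now wider) entries they overlap, as before.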

@dt dt requested review from stevendanna, msbutler and a team August 19, 2022 22:33
@dt dt requested a review from a team as a code owner August 19, 2022 22:33

@dt
Member Author

dt commented Aug 19, 2022

Here's a run of the same backup we use in the restore2tb test -- which is roughly 660GB data size -- being restored with and without (an earlier version of) this patch. Note the axis labels -- the master cluster created 21600 ranges, while the merging-spans cluster created 2200 ranges (and later added another ~90 from background splits, after the restore, of ranges we'd overfilled):

master: [screenshot: range count over time]

patched: [screenshot: range count over time]

I think I was using a slightly different target size and lower guess sizes for that run (I believe they were 448 and 8mb, but I could be wrong) than are in the diff now, so a re-run could be good, but I'm about out of time on this for the week so figured I'd post the initial result.

I did notice some rebalancing traffic on the merged-spans cluster however, indicating we were not splitting quite enough to avoid any post-ingest rebalancing:
[screenshot: rebalancing traffic]
I suspect we'd benefit from enabling as-we-fill splits in the sst batcher to avoid these post-ingest rebalances.
The total amount of rebalancing however isn't all that concerning and didn't have a noticeable impact on top-line restore time: 41:00 on the control master, and 40:17 on the experimental merging build.

The merging build's rebalancing did mean that final disk usage on that cluster was 2.1% higher than the master one where everything went straight to L6 and never moved, but again, not a big enough difference to worry me and it'll compact away with time.

@msbutler
Collaborator

wow, beat me to it! I've got another impl that seems slightly more involved here. I'll plan to dig a little deeper here next week.

@msbutler
Collaborator

msbutler commented Aug 19, 2022

will also consider how this jibes with @itsbilal's PR

@dt
Member Author

dt commented Aug 19, 2022

@msbutler took a peek at your version, but I think I prefer just widening the rightmost cover span as we go, as I do here, vs needing to mess with the covering after we make it. I think I'd suggest you take this PR over instead, but up to you.

I think the main thing I'd like to do here to feel better about merging it is to add another test case where the target size isn't zero, and to make what restore passes a byte-size cluster setting.

I'd also like to enable as-we-go-splits in the batcher, but I think my roachtest run shows that that isn't a strict requirement to merge this as is.

On the level-merging iter PR: I'm not really sure this diff changes things w.r.t. that? This is putting more files into each restore span, sure, but there were already no real guarantees about sstables -- as opposed to files -- not overlapping within a level of a restore span (indeed they overlap all the time within each incremental layer). The only difference is that previously, in the first level, we'd never assign multiple files to the same span, but now we will; we could and often would already assign multiple files to the same span in incremental levels, so there was already the potential for overlapping sstables (since even though the files are non-overlapping, they can be in different sstables which do overlap), right?

@msbutler
Collaborator

msbutler commented Aug 22, 2022

ah rats, Bilal's PR cannot get applied to RESTORE. I was conflating files and SSTables in makeSimpleImportSpans. It's a bit confusing that file != sstable; this has tripped me up several times.

@msbutler
Collaborator

msbutler commented Aug 22, 2022

The high level difference between David's implementation and mine is:

  • my implementation expands the spans after the restoreSpanEntry has formed, which provides an upper bound on the amount of work in a span. The upper bound prevents mid-/post-RESTORE splits of ranges that are already full of data.
  • David's expands the spans during restoreSpanEntry formation, and only when evaluating the 0th layer (the full backup spans). This impl cannot guarantee an upper bound on the work in a given span. So, if incremental backups add significantly more work to a span, this implementation could cause mid-/post-restore range splits, which are much more expensive than the mid-/post-restore range merges that we're currently dealing with.

Given all this, I think it makes sense to go with David's lighter-weight implementation, as long as we also implement range splitting in the batcher, which we were planning to do anyway. Given how late we are in the release, I don't think any of this should be enabled by default.

@msbutler msbutler self-assigned this Aug 22, 2022
@msbutler msbutler assigned dt and unassigned msbutler Aug 30, 2022
@msbutler
Collaborator

@dt per weekly standup, I'm handing this back to you.

@dt dt requested a review from a team as a code owner August 30, 2022 23:59
@dt
Member Author

dt commented Aug 31, 2022

Added another commit to enable as-we-fill splits of full ranges, narrowing the previous blanket disabling of the batcher's splits during restore to just a disabling of scatters.
Results look promising (note the axis labels, i.e. the magnitudes of these are not comparable):

master, all pre-split and very over-split (35min, 21602 ranges):
[screenshot]

just bigger spans, no as-we-go splits (32min, 1976 ranges):
[screenshot]

bigger spans plus enabling as-we-go splits (29min, 1953 ranges):
[screenshot]

@dt
Member Author

dt commented Aug 31, 2022

Given how late we are in the release, I don't think any of this should be enabled by default.

I'm tempted to enable it, and potentially even backport too, to fix the bad behavior of over-splitting, at least to 22.1. I'm not that scared of it, since we're just changing how we divide the spans, not what we do with them or how we do it really.

@shermanCRL
Contributor

Given how late we are in the release, I don't think any of this should be enabled by default.

I'm tempted to enable it, and potentially even backport too, to fix the bad behavior of over-splitting, at least to 22.1. I'm not that scared of it, since we're just changing how we divide the spans, not what we do with them or how we do it really.

You can make the call, I’m ok with it if the tests are good (including the cluster testing we plan to do).

Collaborator

@msbutler msbutler left a comment


I have two high level worries about this diff (that are perhaps off base):
First, this diff makes it easier for one node to end up processing significantly more data, relative to other nodes in the flow. This has a couple risky consequences:

  1. the restore runtime can get bottlenecked by one node
  2. kv may feel compelled to move full ranges from the hot node, which is probably more expensive than merging small ranges

Here's one way this could happen, even with range splitting enabled:

  • suppose the incremental backups add significantly more data to a single restoreSpanEntry that makeSimpleImportSpans has grown, as the full backup contained relatively less data in this span. Also suppose that before this diff, the data from the incremental backups would have been divided into several restoreSpanEntry's that also got divided over several import span chunks.
  • before this diff, the incremental backup data was divided across chunks, and therefore, work was distributed across the cluster.
  • with this diff, it is more likely that all the incremental backup data will end up in fewer chunks, and therefore get processed by fewer nodes.

Second, given that this diff does not change our checkpointing procedure (something we said should happen after this work), the larger spans this work creates make checkpointing more imprecise, which could make restore job pausing/resuming more expensive (i.e. #81116, though disallowShadowing may have been disabled?).

(resolved review threads on pkg/ccl/backupccl/restore_span_covering.go)
@dt
Member Author

dt commented Aug 31, 2022

suppose the incremental backups add significantly more data to a single restoreSpanEntry that makeSimpleImportSpans has grown, as the full backup contained relatively less data in this span

I thought a little about this initially, but came away thinking it is true anyway, even without merging base spans at all, so we shouldn't let it stop us from fixing the issue at hand. If we think about when we might hit this case -- where the base backup has very little data in some span but later layers have much more in that span -- it is going to be most common when you create a table, back it up, then load data into it, i.e. many ranges of data, and then back up incrementally; all those big, full ranges of data in the incremental backup fall within the one span from the base backup. This however is all true without merging -- it's a single span per table when it is empty -- so the only difference with merging is magnitude, and my thesis here is that unfortunate magnitudes are reachable anyway.

If we decide that we care about this -- I'm not sure we really do right now -- then I think we'd want to do something more general, that post-processes the covering if any one chunk is more than X% (probably 1/nodes*2 or something), but I don't really see it as coupled to merging small base spans.
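
For illustration only, here is a hypothetical sketch of the more general post-processing floated above (explicitly not part of this PR): after the covering is built, re-split any entry whose size exceeds some fraction of the total, at file boundaries, so no single chunk dominates the work. The fraction, types, and names (chunk, rebalanceCovering) are illustrative stand-ins, not existing code.

```go
// Hypothetical post-processing sketch: cap any one covering entry at a
// fraction of the total restore size by re-splitting it at file boundaries.
package main

import "fmt"

type chunk struct {
	table     string
	fileSizes []int64 // sizes of the files assigned to this chunk
}

func (c chunk) size() int64 {
	var s int64
	for _, sz := range c.fileSizes {
		s += sz
	}
	return s
}

// rebalanceCovering re-splits chunks larger than maxFraction of the total.
func rebalanceCovering(cover []chunk, maxFraction float64) []chunk {
	var total int64
	for _, c := range cover {
		total += c.size()
	}
	maxBytes := int64(float64(total) * maxFraction)
	var out []chunk
	for _, c := range cover {
		if c.size() <= maxBytes || len(c.fileSizes) <= 1 {
			out = append(out, c)
			continue
		}
		// Greedily re-chunk this entry's files into pieces of at most maxBytes
		// (a single file larger than maxBytes still becomes its own piece).
		piece := chunk{table: c.table}
		for _, sz := range c.fileSizes {
			if len(piece.fileSizes) > 0 && piece.size()+sz > maxBytes {
				out = append(out, piece)
				piece = chunk{table: c.table}
			}
			piece.fileSizes = append(piece.fileSizes, sz)
		}
		out = append(out, piece)
	}
	return out
}

func main() {
	cover := []chunk{
		{table: "small", fileSizes: []int64{10 << 20}},
		{table: "big", fileSizes: []int64{300 << 20, 300 << 20, 300 << 20}},
	}
	// 0.25 is an arbitrary example fraction, standing in for the "X%" above.
	for _, c := range rebalanceCovering(cover, 0.25) {
		fmt.Printf("%s: %d MiB across %d files\n", c.table, c.size()>>20, len(c.fileSizes))
	}
}
```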

does not change our checkpointing procedure (something we said should happen after this work)

I don't think we should make extra spans here just for the sake of progress checkpoints, since we already know that doing so does not actually make progress loss that much smaller; our losses are already due to out-of-order span completion, not individual span sizes, and even post-merging our span sizes here are still, in absolute numbers, generally small. I don't really find the progress effects that compelling: this was always intended to have ~range-sized chunks, is currently dramatically undershooting that, and 8mb or 64mb or 384mb or even 3gb are all tiny numbers in TB restores, where we really care about progress. In a tiny restore where the difference here is actually a real % of the total size, the total size is small enough that we don't really care much about progress anyway (and there's a minimum time between progress updates too).
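
As a small, hypothetical illustration of the out-of-order-completion point above (assuming, as the discussion implies, that progress is persisted roughly as a high-water mark over the ordered spans; highWater and the bookkeeping here are made up for the example):

```go
// Illustration: with prefix-style checkpointing, spans finished beyond the
// first unfinished span are redone on resume, regardless of span size.
package main

import "fmt"

// highWater returns the index of the first incomplete span; everything at or
// after it is re-done on resume, even if it already completed out of order.
func highWater(completed []bool) int {
	for i, done := range completed {
		if !done {
			return i
		}
	}
	return len(completed)
}

func main() {
	// Spans 0 and 1 finished, span 2 did not, spans 3-5 finished out of order.
	completed := []bool{true, true, false, true, true, true}
	hw := highWater(completed)
	redone := 0
	for _, done := range completed[hw:] {
		if done {
			redone++
		}
	}
	fmt.Printf("checkpoint at span %d; %d already-completed spans will be redone\n", hw, redone)
}
```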

@msbutler
Collaborator

msbutler commented Aug 31, 2022

Cool, I'm no longer worried about this diff creating an uneven distribution of work.

Sorry, I should have been clearer on why I worry about checkpointing. Like you, I don't care that much about the additional lost work this diff would create. I worry that the larger spans will cause RESTORE to slow to a crawl more frequently due to CheckForSSTCollisions. In 22.1 and 21.2, we set disallowShadowing to false. Perhaps in this PR or after the branch cut, we should set disallowShadowing to false in 22.2 as well.

@dt
Member Author

dt commented Aug 31, 2022

The lost work and the disallowShadowing overlaps are one and the same though, aren't they? The spans where disallowShadowing gets you are the non-empty spans that you previously filled, i.e. the lost work. My point above is that the main driver of spans we'd fill a second time isn't the chunk size, but the ordering-sensitive checkpoint, so I don't care about growing the chunk size that much. If it were just one 384mb span per proc that had to do expensive disallowShadowing checks on resume, that'd still be completely fine with me. The problem today, as I understand it, is that we redo hundreds of chunks, don't we?

@msbutler
Collaborator

msbutler commented Aug 31, 2022

The lost work and the disallowShadowing overlaps are one and the same though, aren't they?

Yes, both are caused by our naive ordering-sensitive checkpointing strategy, which, as you say, causes us to reingest a bunch of chunks. My point is: if the average span entry size is significantly bigger and the number of span entries retried remains constant (not sure if this is correct), then this diff makes us more susceptible to the check-for-key-collisions problem. Regardless, I think we should set disallowShadowing == false for 22.2, as we have not yet updated our checkpointing scheme, which makes my point moot.

Previously sstBatcher had a boolean to control whether it would split
and then scatter a range it was filling when it could determine that it
was about to overfill it, i.e. if it had previously reported its
remaining capacity to be less than what the batcher was about to add.
RESTORE set this boolean to disable this splitting and scattering, as it
pre-split and pre-scattered all of its ranges and thus did not expect
the batcher to need to do any as-it-fills splitting and scattering.

However blanket disabling this makes little sense: if the batcher knows
that it is about to overfill a range, regardless of whether or not that
range was supposed to be pre-split to avoid this, it always makes sense
to split it before adding to it, rather than overfill it.

Thus this change makes as-we-fill splits unconditional, and moves the
boolean to be more narrowly scoped to just disabling the scatters, as
RESTORE still pre-scatters its pre-chunked restore spans. This paves the
way for RESTORE to make and pre-split larger chunks, without worrying if
doing so will cause overfilling of ranges.

Release note: none.

Release justification: bug fix.
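
The following is a minimal, hypothetical sketch of the as-we-fill behavior this commit message describes; the real sstBatcher is far more involved, and the batcher type, addSST, and the capacity bookkeeping here are stand-ins for illustration only.

```go
// Sketch: if the batcher learns a range has less remaining capacity than what
// it is about to add, it splits first, and only scatters when the caller has
// not disabled scattering (RESTORE pre-scatters, so it disables scatters).
package main

import "fmt"

const rangeMaxBytes = 512 << 20 // illustrative fresh-range capacity

type batcher struct {
	disableScatters bool
	// remainingCap is the last-reported free capacity of the range being
	// filled, in bytes; a real implementation learns this from KV responses.
	remainingCap int64
}

func (b *batcher) addSST(splitKey string, sstBytes int64) {
	if b.remainingCap > 0 && sstBytes > b.remainingCap {
		// About to overfill the range: split before adding, unconditionally.
		fmt.Printf("splitting at %q before adding %d bytes\n", splitKey, sstBytes)
		if !b.disableScatters {
			fmt.Printf("scattering new right-hand range at %q\n", splitKey)
		}
		b.remainingCap = rangeMaxBytes // now filling a fresh right-hand range
	}
	b.remainingCap -= sstBytes
	fmt.Printf("ingested %d bytes, ~%d bytes of capacity left\n", sstBytes, b.remainingCap)
}

func main() {
	b := &batcher{disableScatters: true, remainingCap: 64 << 20}
	b.addSST("k1", 32<<20)  // fits in the current range
	b.addSST("k2", 128<<20) // would overfill: split (but do not scatter) first
}
```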
@dt dt force-pushed the restore-spans branch 2 times, most recently from 5fef88b to 7b55fc4 Compare September 1, 2022 12:49
@msbutler
Collaborator

msbutler commented Sep 1, 2022

The non-test code looks good to me. Last question: how confident are you that the test cases you put in exercise every nook and cranny of the new makeSimpleImportSpans? Before backporting, I wonder if we ought to write a few more specific test cases to ensure this. makeSimpleImportSpans is no longer so simple :)

@dt
Member Author

dt commented Sep 1, 2022

how confident are you that the test cases you put in exercise every nook and cranny of the new makeSimpleImportSpans

The random test does a good job of ensuring correctness (it caught a bug here). It doesn't do much in terms of proving we merged into any specific number of spans for perf, but it proves we have the right files in the right spans I think, which makes me feel better about this.
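
For context, here is a sketch of the kind of assignment property such a randomized test can check, with hypothetical types (span, fileSpan, coverEntry) and a made-up checkCovering helper -- not the actual test code: regardless of how entries were merged, every file should appear in exactly the covering entries whose spans it overlaps.

```go
// Property-check sketch: a file is assigned to a covering entry iff the
// file's span overlaps the entry's span.
package main

import "fmt"

type span struct{ start, end string }

func overlaps(a, b span) bool { return a.start < b.end && b.start < a.end }

type fileSpan struct {
	name string
	sp   span
}

type coverEntry struct {
	sp    span
	files []fileSpan
}

// checkCovering returns an error describing the first violation it finds.
func checkCovering(files []fileSpan, cover []coverEntry) error {
	for _, f := range files {
		for _, e := range cover {
			assigned := false
			for _, ef := range e.files {
				if ef.name == f.name {
					assigned = true
					break
				}
			}
			if overlaps(f.sp, e.sp) != assigned {
				return fmt.Errorf("file %s [%s,%s) vs entry [%s,%s): overlap=%t, assigned=%t",
					f.name, f.sp.start, f.sp.end, e.sp.start, e.sp.end, overlaps(f.sp, e.sp), assigned)
			}
		}
	}
	return nil
}

func main() {
	files := []fileSpan{{"f1", span{"a", "c"}}, {"f2", span{"c", "e"}}}
	cover := []coverEntry{{span{"a", "e"}, files}}
	fmt.Println(checkCovering(files, cover)) // <nil>
}
```

A randomized driver would generate random file layouts and target sizes, build the covering, and assert this property holds, which checks correctness without pinning down how many spans the merge produces.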

@dt
Member Author

dt commented Sep 1, 2022

TFTR!

bors r+

@dt
Member Author

dt commented Sep 1, 2022

I tagged it for backport; we can see how long we want to bake on 22.2 before we merge it

@craig
Contributor

craig bot commented Sep 1, 2022

Build failed (retrying...):

@craig
Contributor

craig bot commented Sep 2, 2022

Build succeeded:

@craig craig bot merged commit c080f7a into cockroachdb:master Sep 2, 2022
@blathers-crl

blathers-crl bot commented Sep 2, 2022

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 0ee547a to blathers/backport-release-22.1-86496: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.1.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@dt dt deleted the restore-spans branch September 2, 2022 12:40
@dankinder

@dt I have been waiting eagerly for this change as I'm still blocked on restoring, per #86470 so thanks for cranking it out.

I'm currently on V22.1.5. I'll either need to wait until this gets backported to 22.1, or deploy the next alpha of 22.2, or build it myself. If you have any suggestion on which is best or the timeline of when these will happen I'd appreciate it.
