kvserver: oom due to high replica count post-restore #86470

Closed
dankinder opened this issue Aug 19, 2022 · 16 comments
Labels
A-disaster-recovery · C-bug (Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.) · O-community (Originated from the community) · T-kv (KV Team) · X-blathers-triaged (blathers was able to find an owner)

Comments

@dankinder

dankinder commented Aug 19, 2022

Describe the problem

I'm attempting to restore a large database from an S3 backup, and if I naively let it run, nodes repeatedly run out of memory and crash. It's a 40-node cluster (each with 64 vCPUs, 64 GB RAM, and 6 × 1 TB drives). The backup is 29.5 TB (the original DB is ~120 TB), and the BACKUP_MANIFEST is 518.9 MB.

When I try to run it, it starts creating millions of ranges, even though the pre-backup cluster only had a few hundred thousand. A lot of CRDB memory seems to get eaten up by having this many ranges, so when one node adopts the job to run the restore, it runs out of memory pretty fast.

The situation then gets worse: the job gets automatically retried a bunch of times, and every time it creates more ranges. When it does finally fail, the GC seems to take forever, so I didn't even wait for it; I had to blow away the cluster to really try again.

Here is some of the configuration I've tried so far:

-- Settings to slow it down as much as possible
set cluster setting bulkio.ingest.sender_concurrency_limit = 1;
-- Based on noticing https://github.com/cockroachdb/cockroach/pull/76907
set cluster setting kv.bulk_io_write.restore_node_concurrency = 1;
set cluster setting cloudstorage.s3.read.node_rate_limit = '10 MiB'; -- extremely low to see if it helps

-- And ensure ranges get rebalanced
set cluster setting kv.snapshot_recovery.max_rate = '250 MiB';
set cluster setting kv.snapshot_rebalance.max_rate = '250 MiB';

-- Make sure we don't time out on the system.jobs table, something I learned from running very large imports in the past
ALTER TABLE system.jobs CONFIGURE ZONE USING range_max_bytes = 2<<30;
ALTER TABLE system.jobs CONFIGURE ZONE USING gc.ttlseconds = 300;

-- Make ranges 4x bigger than the default in the hope that it creates fewer up front
alter range default configure zone using range_min_bytes = 536870912, range_max_bytes = 2147483648;

For now I'm trying to shepherd it through by pausing the job when I see a node almost out of memory, and then resuming it after letting the node recover or restarting it. I'm also seeing that ranges are gradually being merged and removed while the job is paused, which gives me hope that as long as I keep the range count down somewhat, I'll have enough memory to complete the restore. But it hasn't succeeded yet.
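
Concretely, the manual pause/resume loop looks roughly like this (a sketch; the job ID below is a placeholder):

-- Find the restore job and check its progress
SELECT job_id, status, fraction_completed
  FROM [SHOW JOBS] WHERE job_type = 'RESTORE';

-- Pause when a node is close to running out of memory...
PAUSE JOB 792538278223233025;
-- ...let ranges merge and/or restart the node, then continue
RESUME JOB 792538278223233025;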

Additional data / screenshots

Heap dumps from 2 nodes that were near max mem:
[Screenshot: Screen Shot 2022-08-19 at 2 24 13 PM]
[Screenshot: Screen Shot 2022-08-19 at 2 24 23 PM]

Environment:

  • CockroachDB version: v22.1.5

Jira issue: CRDB-18775

@dankinder dankinder added the C-bug label Aug 19, 2022
@blathers-crl

blathers-crl bot commented Aug 19, 2022

Hello, I am Blathers. I am here to help you get the issue triaged.

Hoot - a bug! Though bugs are the bane of my existence, rest assured the wretched thing will get the best of care here.

I have CC'd a few people who may be able to assist you:

  • @cockroachdb/sql-schema (found keywords: ALTER TABLE)
  • @cockroachdb/kv (found keywords: kv)
  • @cockroachdb/bulk-io (found keywords: backup,restore)
  • @cockroachdb/sql-experience (found keywords: import)

If we have not gotten back to your issue within a few business days, you can try the following:

  • Join our community slack channel and ask on #cockroachdb.
  • Try to find someone from here if you know they worked closely on the area and CC them.

🦉 Hoot! I am Blathers, a bot for CockroachDB. My owner is otan.

@blathers-crl blathers-crl bot added the A-disaster-recovery, O-community, X-blathers-triaged, and T-disaster-recovery labels Aug 19, 2022
@blathers-crl

blathers-crl bot commented Aug 19, 2022

cc @cockroachdb/bulk-io

@irfansharif
Contributor

Could you attach the memory profiles you captured? GitHub should let you paste them in as comments (you might have to change the file extension).

@blathers-crl blathers-crl bot added the T-kv KV Team label Aug 19, 2022
@dankinder
Author

@irfansharif irfansharif changed the title Restore on large database fails with OOM kvserver: oom due to high replica count post-restore Aug 19, 2022
@dankinder
Author

Btw, I should mention that the restore seemed to make it to about 5 or 10%, but no further, before flat-out failing.

@irfansharif
Contributor

One other thing that would help is the replica count on these stores pre-OOM. Our UI dashboards should have that data, but even logs from those stores pre-OOM would be helpful.
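
For reference, one rough way to pull per-store replica counts from SQL (a sketch; the UI replication dashboard reports the same store-level stats):

SELECT node_id, store_id, range_count
  FROM crdb_internal.kv_store_status
ORDER BY range_count DESC;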

@dankinder
Author

At this moment there are 3,243,458 replicas. It got up to close to 6M.

I could give you a debug zip of the cluster right now, with the job paused, if you want it. I'd want to use secure upload for that.

@dankinder
Author

One more note: even though those heap dumps only report ~6 GB of heap, the processes being inspected are using closer to 60 GB, nearly the full memory of the machine.

@irfansharif irfansharif self-assigned this Aug 19, 2022
@dankinder
Author

This one is from right when I resumed the job and seconds before the node OOMed:
profile (2).pb.gz

@irfansharif
Contributor

irfansharif commented Aug 19, 2022

I’ll try taking a look on Monday or later tonight. The debug zip via the secure upload would be great (I’m actually not familiar with how that works for GitHub support issues). I’m not surprised that things are OOMing with those replica counts (this is unfortunate), and I will try hunting for quick wins or, in the interim, ways to keep the replica count from getting that high for the restore you’re running.

@dt
Member

dt commented Aug 19, 2022

I see a couple of somewhat-known issues here:

a) the large manifest memory usage in the first screenshot.

The backup manifest is a file that contains a single protobuf message with a list of all of the files, and their key ranges, in the backup; bigger backup with more files -> more metadata in that list -> bigger manifest. That said, we don't usually see issues this severe (6 GB of usage) until there are many, many incremental layers, so our current advice, while we transition the metadata representation to an iterable file format, is to avoid large numbers of incremental layers on bigger backups. Is this an incremental backup, and if so, do you know how many layers it has? Of course, that advice is only useful when producing the backups; now we have no choice but to restore this one.
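
For what it's worth, a hedged way to check the layer count from SQL (a sketch; this assumes the SHOW BACKUP output in this version exposes end_time and backup_type columns, and the backup URI is a placeholder):

-- Each distinct end_time in the output is one layer; backup_type distinguishes
-- the full layer from the incrementals.
SELECT DISTINCT end_time, backup_type
  FROM [SHOW BACKUP 's3://bucket/backup-path?AUTH=implicit'];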

b) restore splits at every file boundary in the base backup, which can often over-split.
We've seen this before, but I don't think we've seen such an extreme example, and we haven't seen it cause issues beyond a "huh, that's a lot of splits". If this is OOM'ing in the innards of the kvserver, that's interesting.

Unfortunately I'm skeptical there is a "quick fix" for either of these, like a setting we can just turn. The manifest size is being worked on but will probably not land until the next release, and the over-split we could probably fix in a patch. If you need something to get the RESTORE unblocked right now, your best bet might be to temporarily provision instances/pods/whatever with more RAM until the restore completes.

@dankinder
Author

It's a full backup, not incremental.

Thanks for the explanation. For the moment I'll see if it works out to just let the number of ranges come down while the restore is paused and hope that frees enough memory; it's coming down at about ~100k/hour. Unfortunately spinning up a bigger instance isn't so easy since these hosts are on-prem.

I just realized I can probably buy a bit of memory too by reducing the cache/SQL memory, so I'll try that too.

Even if I succeed, I'd like to prevent this in the future if possible. Is it possible to make BACKUP store larger SSTs? Would it mean configuring Pebble to use larger SSTs?

@dt
Member

dt commented Aug 19, 2022

Is it possible to make BACKUP store larger SSTs? Would it mean configuring Pebble to use larger SSTs?

The backup SSTs are wholly separate from the underlying Pebble SSTs. Unfortunately there isn't an easy setting to flip or anything here, as the sizes are mostly determined by the order in which concurrent scans complete (since scanned keys need to go into a backup file in order, and we only have so much buffer to reorder scans that complete out of order before we flush that file and open a new one). Fortunately @stevendanna is actively (like, we were actually talking about this a couple hours ago) looking for ways to make this process smarter, with the explicit aim of making fewer, larger SSTs during backup (ideally in a way that is possible to backport in a patch) as a way to both a) reduce the metadata size and, indirectly, b) reduce that initial over-split. But that unfortunately is going to take a code change, since it is an algorithmic fix, not something you can just tune in an existing cluster.

@dt
Member

dt commented Aug 19, 2022

I took a quick and dirty pass at a patch to see if I could mitigate the restore over-splitting: #86496

@dankinder
Author

Thanks @dt and @irfansharif, you guys are awesome.

FWIW I realized I could buy myself a bit more memory by reducing --cache, but even doing that, I'm still not getting it to succeed (yet). So as soon as #86496 gets approved I will try to get it deployed somehow.

@dankinder
Author

dankinder commented Aug 22, 2022

The latest restore, with reduced cache size, still failed, but it was a different failure than before. Instead of the coordinator OOMing, I got this:

importing 7148259 ranges: splitting key /Table/250/2/"028xjj.com"/"www"/"/"/"tag=400500.%E7%AE%A1%E5%AE%B6%E5%A9%86%E7%8E%84%E6%9C%BA"/"http"/2019-05-26T19:14:42.152Z: operation "pendingLeaseRequest: requesting lease" timed out after 22.349s (given timeout 6s): aborted during Replica.Send: context deadline exceeded

Not sure why it's importing 7148259 ranges; I did an ls on the backup bucket and there are 1780626 total objects, so shouldn't it be creating ~1780626 ranges?
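
One rough way to watch the total range count come back down while the job is paused (a sketch; this counts every range in the cluster, not just the ones belonging to the restore):

SELECT count(*) FROM crdb_internal.ranges_no_leases;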

I also saw instability in the cluster while it was trying to proceed, including liveness problems and errors like:

E220822 17:52:23.999864 118992429 kv/kvserver/pkg/kv/kvserver/replica_range_lease.go:486 ⋮ [n3,s17,r7076589/3:‹/Table/250/2/"170{cpw.…-desi…}›] 3832966  failed to increment leaseholder's epoch: mismatch incrementing epoch for ‹liveness(nid:15 epo:8 exp:1661190743.146179391,0)›; actual is ‹liveness(nid:15 epo:8 exp:1661190752.145861678,0)›

I'm assuming that's just an overload symptom...

adityamaru added a commit to adityamaru/cockroach that referenced this issue Oct 4, 2022
This setting was previously disabled because of timeouts being
observed when restoring our TPCCInc fixtures. The cause of those
timeouts has been identified as
cockroachdb#88329 making it safe
to re-enable merging of spans during restore. This setting prevents
restore from over-splitting and leaving the cluster with a merge hangover
post restore.

Informs: cockroachdb#86470

Release note (sql change): Sets `backup.restore_span.target_size`
to default to 384 MiB so that restore merges up to that size of spans
when reading from the backup before actually ingesting data. This should
reduce the number of ranges created during restore and thereby reduce
the merging of ranges that needs to occur post restore.
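
For reference, on versions where the backup.restore_span.target_size setting described above is available, it can be inspected or adjusted like any other cluster setting (a sketch using the default from the release note):

SHOW CLUSTER SETTING backup.restore_span.target_size;
SET CLUSTER SETTING backup.restore_span.target_size = '384 MiB';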
craig bot pushed a commit that referenced this issue Oct 4, 2022
87449: workload,ttl: add TTL workload for benchmarking time to finish r=rafiss a=ecwall

fixes #88172

Measures the time the row-level TTL job takes to run on a table:
1) Drop TTL table IF EXISTS.
2) Create a table without TTL.
3) Insert initialRowCount number of rows.
4) Gets number of rows that should expire.
5) Wait for table ranges to stabilize after scattering.
6) Enable TTL on table.
7) Poll table until TTL job is complete.
Note: Ops is a no-op and no histograms are used.
Benchmarking is done inside Hooks and details are logged.

Adds useDistSQL field to TTL job progress protobuf for
visibility into which version was run during cluster
upgrades.

Release justification: Added TTL workload.

Release note: None

89317: sql,tree: improve function resolution efficiency r=ajwerner a=ajwerner

#### sql: prevent allocations by avoiding some name pointers

We don't need pointers for these names. They generally won't escape.

#### sql,tree: change SearchPath to avoid allocations
The closure-oriented interface was forcing the closures and the variables they
referenced to escape to the heap. This change, while not beautiful, ends up being
much more efficient.

```
name                                         old time/op    new time/op    delta
SQL/MultinodeCockroach/Upsert/count=1000-16    20.4ms ±11%    18.9ms ± 8%   -7.47%  (p=0.000 n=20+19)

name                                         old alloc/op   new alloc/op   delta
SQL/MultinodeCockroach/Upsert/count=1000-16    10.1MB ±29%     9.8MB ±29%     ~     (p=0.231 n=20+20)

name                                         old allocs/op  new allocs/op  delta
SQL/MultinodeCockroach/Upsert/count=1000-16     56.3k ± 7%     50.2k ±10%  -10.81%  (p=0.000 n=19+19)
```

Release note: None

89333: backupccl: enable `restore_span.target_size` r=dt,stevendanna a=adityamaru

This setting was previously disabled because of timeouts being observed when restoring our TPCCInc fixtures. The cause of those timeouts has been identified as
#88329 making it safe to re-enable merging of spans during restore. This setting prevents restore from over-splitting and leaving the cluster with a merge hangover post restore.

Informs: #86470

Release note (sql change): Sets `backup.restore_span.target_size` to default to 384 MiB so that restore merges up to that size of spans when reading from the backup before actually ingesting data. This should reduce the number of ranges created during restore and thereby reduce the merging of ranges that needs to occur post restore.

Co-authored-by: Evan Wall <[email protected]>
Co-authored-by: Andrew Werner <[email protected]>
Co-authored-by: adityamaru <[email protected]>
blathers-crl bot pushed a commit that referenced this issue Oct 4, 2022
blathers-crl bot pushed a commit that referenced this issue Oct 24, 2022
blathers-crl bot pushed a commit that referenced this issue Oct 25, 2022