kvserver: oom due to high replica count post-restore #86470

Closed
dankinder opened this issue Aug 19, 2022 · 16 comments
Labels
A-disaster-recovery · C-bug (Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.) · O-community (Originated from the community) · T-kv (KV Team) · X-blathers-triaged (blathers was able to find an owner)

Comments

@dankinder

dankinder commented Aug 19, 2022

Describe the problem

I'm attempting to restore a large database from an S3 backup, and if I naively let it run, nodes repeatedly run out of memory and crash. It's a 40-node cluster (each with 64 vCPUs, 64 GB RAM, and 6 × 1 TB drives). The backup is 29.5 TB (the original DB is ~120 TB), and the BACKUP_MANIFEST is 518.9 MB.

When I try to run it, it starts creating millions of ranges, even though the pre-backup cluster only had a few hundred thousand. A lot of CRDB memory seems to get eaten up by having this many ranges, so when one node adopts the job to run the restore, it runs out of memory pretty fast.

The situation then gets worse: the job gets automatically retried a bunch of times, and every time it creates more ranges. When it does finally fail, the GC seems to take forever, so I didn't even wait for it; I had to blow away the cluster to really try again.

Here is some of the configuration I've tried so far:

-- Settings to slow it down as much as possible
set cluster setting bulkio.ingest.sender_concurrency_limit = 1;
-- Based on noticing https://github.com/cockroachdb/cockroach/pull/76907
set cluster setting kv.bulk_io_write.restore_node_concurrency = 1;
set cluster setting cloudstorage.s3.read.node_rate_limit = '10 MiB'; -- extremely low to see if it helps

-- And ensure ranges get rebalanced
set cluster setting kv.snapshot_recovery.max_rate = '250 MiB';
set cluster setting kv.snapshot_rebalance.max_rate = '250 MiB';

-- Make sure we don't time out on the system.jobs table, something I learned from running very large imports in the past
ALTER TABLE system.jobs CONFIGURE ZONE USING range_max_bytes = 2<<30;
ALTER TABLE system.jobs CONFIGURE ZONE USING gc.ttlseconds = 300;

-- Make ranges 4x bigger than the default in the hope that it creates fewer up front
alter range default configure zone using range_min_bytes = 536870912, range_max_bytes = 2147483648;

For now I'm trying to shepherd it through by pausing the job when I see a node almost out of memory, and then resuming it after letting the node recover or restarting it. I'm also seeing that ranges are gradually being merged and removed while the job is paused, which gives me hope that as long as I keep the range count down somewhat, I'll have enough memory to complete the restore. But it hasn't succeeded yet.
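
Concretely, the manual pause/resume loop looks roughly like this (a sketch; the job ID below is a placeholder):

-- Find the restore job and check its progress
SELECT job_id, status, fraction_completed
  FROM [SHOW JOBS] WHERE job_type = 'RESTORE';

-- Pause when a node is close to running out of memory...
PAUSE JOB 792538278223233025;
-- ...let ranges merge and/or restart the node, then continue
RESUME JOB 792538278223233025;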

Additional data / screenshots

Heap dumps from 2 nodes that were near max mem:
[Screenshot: Screen Shot 2022-08-19 at 2 24 13 PM]
[Screenshot: Screen Shot 2022-08-19 at 2 24 23 PM]

Environment:

  • CockroachDB version: v22.1.5

Jira issue: CRDB-18775

@dankinder dankinder added the C-bug label Aug 19, 2022
@blathers-crl

blathers-crl bot commented Aug 19, 2022

Hello, I am Blathers. I am here to help you get the issue triaged.

Hoot - a bug! Though bugs are the bane of my existence, rest assured the wretched thing will get the best of care here.

I have CC'd a few people who may be able to assist you:

  • @cockroachdb/sql-schema (found keywords: ALTER TABLE)
  • @cockroachdb/kv (found keywords: kv)
  • @cockroachdb/bulk-io (found keywords: backup,restore)
  • @cockroachdb/sql-experience (found keywords: import)

If we have not gotten back to your issue within a few business days, you can try the following:

  • Join our community slack channel and ask on #cockroachdb.
  • Try to find someone from here if you know they worked closely on the area and CC them.

🦉 Hoot! I am Blathers, a bot for CockroachDB. My owner is otan.

@blathers-crl blathers-crl bot added the A-disaster-recovery, O-community, X-blathers-triaged, and T-disaster-recovery labels Aug 19, 2022
@blathers-crl

blathers-crl bot commented Aug 19, 2022

cc @cockroachdb/bulk-io

@irfansharif
Contributor

Could you attach the memory profiles you captured? GitHub should let you paste them in as comments (you might have to change the file extension).

@blathers-crl blathers-crl bot added the T-kv KV Team label Aug 19, 2022
@dankinder
Author

@irfansharif irfansharif changed the title Restore on large database fails with OOM kvserver: oom due to high replica count post-restore Aug 19, 2022
@dankinder
Author

Btw, I should mention that the restore seemed to make it to about 5 or 10%, but no further, before flat-out failing.

@irfansharif
Contributor

One other thing that would help is the replica count on these stores pre-OOM. Our UI dashboards should have that data, but even logs from those stores pre-OOM would be helpful.
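
For reference, one rough way to pull per-store replica counts from SQL (a sketch; the UI replication dashboard reports the same store-level stats):

SELECT node_id, store_id, range_count
  FROM crdb_internal.kv_store_status
ORDER BY range_count DESC;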

@dankinder
Author

At this moment there are 3,243,458 replicas. It got up to close to 6M.

I could give you a debug zip of the cluster right now, with the job paused, if you want it. I'd want to use secure upload for that.

@dankinder
Author

One more note: even though those heap dumps only report ~6 GB of heap, the processes being inspected are using closer to 60 GB, nearly the full memory of the machine.

@irfansharif irfansharif self-assigned this Aug 19, 2022
@dankinder
Author

This one is from right when I resumed the job and seconds before the node OOMed:
profile (2).pb.gz

@irfansharif
Contributor

irfansharif commented Aug 19, 2022

I’ll try taking a look on Monday or later tonight. The debug zip via the secure upload would be great (I’m actually not familiar with how that works for GitHub support issues). I’m not surprised that things are OOMing with those replica counts (this is unfortunate), and I will try hunting for quick wins or, in the interim, ways to keep the replica count from getting that high for the restore you’re running.

@dt
Member

dt commented Aug 19, 2022

I see a couple of somewhat-known issues here:

a) the large manifest memory usage in the first screenshot.

The backup manifest is a file that contains a single protobuf message with a list of all of the files, and their key ranges, in the backup; bigger backup with more files -> more metadata in that list -> bigger manifest. That said, we don't usually see issues this severe (6 GB of usage) until there are many, many incremental layers, so our current advice, while we transition the metadata representation to an iterable file format, is to avoid large numbers of incremental layers on bigger backups. Is this an incremental backup, and if so, do you know how many layers it has? Of course, that advice is only useful when producing the backups; now we have no choice but to restore this one.
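
For what it's worth, a hedged way to check the layer count from SQL (a sketch; this assumes the SHOW BACKUP output in this version exposes end_time and backup_type columns, and the backup URI is a placeholder):

-- Each distinct end_time in the output is one layer; backup_type distinguishes
-- the full layer from the incrementals.
SELECT DISTINCT end_time, backup_type
  FROM [SHOW BACKUP 's3://bucket/backup-path?AUTH=implicit'];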

b) restore splits at every file boundary in the base backup, which can often over-split.
We've seen this before, but I don't think we've seen such an extreme example, and we haven't seen it cause issues beyond a "huh, that's a lot of splits". If this is OOM'ing in the innards of the kvserver, that's interesting.

Unfortunately I'm skeptical there is a "quick fix" for either of these, like a setting we can just turn. The manifest size is being worked on but will probably not land until the next release, and the over-split we could probably fix in a patch. If you need something to get the RESTORE unblocked right now, your best bet might be to temporarily provision instances/pods/whatever with more RAM until the restore completes.

@dankinder
Author

It's a full backup, not incremental.

Thanks for the explanation. For the moment I'll see if it works out to just let the number of ranges come down while the restore is paused and hope that frees enough memory; it's coming down at about ~100k/hour. Unfortunately spinning up a bigger instance isn't so easy since these hosts are on-prem.

I just realized I can probably buy a bit of memory too by reducing the cache/SQL memory, so I'll try that too.

Even if I succeed, I'd like to prevent this in the future if possible. Is it possible to make BACKUP store larger SSTs? Would it mean configuring Pebble to use larger SSTs?

@dt
Member

dt commented Aug 19, 2022

Is it possible to make BACKUP store larger SSTs? Would it mean configuring Pebble to use larger SSTs?

The backup SSTs are wholly separate from the underlying Pebble SSTs. Unfortunately there isn't an easy setting to flip or anything here, as the sizes are mostly determined by the order in which concurrent scans complete (since scanned keys need to go into a backup file in order, and we only have so much buffer to reorder scans that complete out of order before we flush that file and open a new one). Fortunately @stevendanna is actively (like, we were actually talking about this a couple hours ago) looking for ways to make this process smarter, with the explicit aim of making fewer, larger SSTs during backup (ideally in a way that is possible to backport in a patch) as a way to both a) reduce the metadata size and, indirectly, b) reduce that initial over-split. But that unfortunately is going to take a code change, since it is an algorithmic fix, not something you can just tune in an existing cluster.

@dt
Member

dt commented Aug 19, 2022

I took a quick and dirty pass at a patch to see if I could mitigate the restore over-splitting: #86496

@dankinder
Author

Thanks @dt and @irfansharif, you guys are awesome.

FWIW I realized I could buy myself a bit more memory by reducing --cache, but even doing that, I'm still not getting it to succeed (yet). So as soon as #86496 gets approved I will try to get it deployed somehow.

@dankinder
Author

dankinder commented Aug 22, 2022

The latest restore, with reduced cache size, still failed, but it was a different failure than before. Instead of the coordinator OOMing, I got this:

importing 7148259 ranges: splitting key /Table/250/2/"028xjj.com"/"www"/"/"/"tag=400500.%E7%AE%A1%E5%AE%B6%E5%A9%86%E7%8E%84%E6%9C%BA"/"http"/2019-05-26T19:14:42.152Z: operation "pendingLeaseRequest: requesting lease" timed out after 22.349s (given timeout 6s): aborted during Replica.Send: context deadline exceeded

Not sure why it's importing 7148259 ranges; I did an ls on the backup bucket and there are 1780626 total objects, so shouldn't it be creating ~1780626 ranges?
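
One rough way to watch the total range count come back down while the job is paused (a sketch; this counts every range in the cluster, not just the ones belonging to the restore):

SELECT count(*) FROM crdb_internal.ranges_no_leases;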

I also saw instability in the cluster while it was trying to proceed, including liveness problems and errors like:

E220822 17:52:23.999864 118992429 kv/kvserver/pkg/kv/kvserver/replica_range_lease.go:486 ⋮ [n3,s17,r7076589/3:‹/Table/250/2/"170{cpw.…-desi…}›] 3832966  failed to increment leaseholder's epoch: mismatch incrementing epoch for ‹liveness(nid:15 epo:8 exp:1661190743.146179391,0)›; actual is ‹liveness(nid:15 epo:8 exp:1661190752.145861678,0)›

I'm assuming that's just an overload symptom...

adityamaru added a commit to adityamaru/cockroach that referenced this issue Oct 4, 2022
This setting was previously disabled because of timeouts being
observed when restoring our TPCCInc fixtures. The cause of those
timeouts has been identified as
cockroachdb#88329 making it safe
to re-enable merging of spans during restore. This setting prevents
restore from over-splitting and leaving the cluster with a merge hangover
post restore.

Informs: cockroachdb#86470

Release note (sql change): Sets `backup.restore_span.target_size`
to default to 384 MiB so that restore merges up to that size of spans
when reading from the backup before actually ingesting data. This should
reduce the number of ranges created during restore and thereby reduce
the merging of ranges that needs to occur post restore.
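
For reference, on versions where the backup.restore_span.target_size setting described above is available, it can be inspected or adjusted like any other cluster setting (a sketch using the default from the release note):

SHOW CLUSTER SETTING backup.restore_span.target_size;
SET CLUSTER SETTING backup.restore_span.target_size = '384 MiB';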
craig bot pushed a commit that referenced this issue Oct 4, 2022
87449: workload,ttl: add TTL workload for benchmarking time to finish r=rafiss a=ecwall

fixes #88172

Measures the time the row-level TTL job takes to run on a table:
1) Drop TTL table IF EXISTS.
2) Create a table without TTL.
3) Insert initialRowCount number of rows.
4) Gets number of rows that should expire.
5) Wait for table ranges to stabilize after scattering.
6) Enable TTL on table.
7) Poll table until TTL job is complete.
Note: Ops is a no-op and no histograms are used.
Benchmarking is done inside Hooks and details are logged.

Adds useDistSQL field to TTL job progress protobuf for
visibility into which version was run during cluster
upgrades.

Release justification: Added TTL workload.

Release note: None

89317: sql,tree: improve function resolution efficiency r=ajwerner a=ajwerner

#### sql: prevent allocations by avoiding some name pointers

We don't need pointers for these names. They generally won't escape.

#### sql,tree: change SearchPath to avoid allocations
The closure-oriented interface was forcing the closures and the variables they
referenced to escape to the heap. This change, while not beautiful, ends up being
much more efficient.

```
name                                         old time/op    new time/op    delta
SQL/MultinodeCockroach/Upsert/count=1000-16    20.4ms ±11%    18.9ms ± 8%   -7.47%  (p=0.000 n=20+19)

name                                         old alloc/op   new alloc/op   delta
SQL/MultinodeCockroach/Upsert/count=1000-16    10.1MB ±29%     9.8MB ±29%     ~     (p=0.231 n=20+20)

name                                         old allocs/op  new allocs/op  delta
SQL/MultinodeCockroach/Upsert/count=1000-16     56.3k ± 7%     50.2k ±10%  -10.81%  (p=0.000 n=19+19)
```

Release note: None

89333: backupccl: enable `restore_span.target_size` r=dt,stevendanna a=adityamaru

This setting was previously disabled because of timeouts being observed when restoring our TPCCInc fixtures. The cause of those timeouts has been identified as
#88329 making it safe to re-enable merging of spans during restore. This setting prevents restore from over-splitting and leaving the cluster with a merge hangover post restore.

Informs: #86470

Release note (sql change): Sets `backup.restore_span.target_size` to default to 384 MiB so that restore merges up to that size of spans when reading from the backup before actually ingesting data. This should reduce the number of ranges created during restore and thereby reduce the merging of ranges that needs to occur post restore.

Co-authored-by: Evan Wall <[email protected]>
Co-authored-by: Andrew Werner <[email protected]>
Co-authored-by: adityamaru <[email protected]>
blathers-crl bot pushed a commit that referenced this issue Oct 4, 2022
blathers-crl bot pushed a commit that referenced this issue Oct 24, 2022
blathers-crl bot pushed a commit that referenced this issue Oct 25, 2022