kvserver: oom due to high replica count post-restore #86470
Hello, I am Blathers. I am here to help you get the issue triaged. Hoot - a bug! Though bugs are the bane of my existence, rest assured the wretched thing will get the best of care here. I have CC'd a few people who may be able to assist you:
If we have not gotten back to your issue within a few business days, you can try the following:
🦉 Hoot! I am Blathers, a bot for CockroachDB. My owner is otan.
cc @cockroachdb/bulk-io
Could you attach the memory profiles you captured? GitHub should let you paste them in as comments (you might have to change the file extension).
Btw, I should mention the restore seemed to make it to about 5 or 10%, but no further, before flat-out failing.
One other thing that would help is the replica count on these stores pre-OOM. Our UI dashboards should have such data, but even logs from those stores pre-OOM will be helpful.
At this moment there are 3243458 replicas. It got up close to 6M. I could give you a debug zip of the cluster right now, with the job paused, if you want it. I'd want to use secure upload for that.
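For reference, a hedged sketch of how those counts can be pulled via SQL (the exact `crdb_internal` columns may differ across versions):

```sql
-- Total ranges in the cluster (no lease lookups, so it stays cheap).
SELECT count(*) AS total_ranges FROM crdb_internal.ranges_no_leases;

-- Per-store replica counts, to spot which stores are closest to trouble.
SELECT node_id, store_id, range_count
FROM crdb_internal.kv_store_status
ORDER BY range_count DESC;
```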
One more note: even though those heap dumps only report ~6GB of heap, the processes being inspected there are using closer to 60GB, nearly the full memory of the machine.
This one is from right when I resumed the job and seconds before the node OOMed:
I'll try taking a look on Monday or later tonight. The debug zip via the secure upload would be great (I'm actually not familiar with how it works for GitHub support issues). I'm not surprised that things are OOMing with those replica counts (which is unfortunate), and I'll try hunting for quick wins or, in the interim, ways to keep the count from getting that high for the restore you're running.
I see a couple of somewhat known issues here:

a) The large manifest memory usage in the first screenshot. The backup manifest is a file containing a single protobuf message with a list of all of the files in the backup and their key ranges; a bigger backup with more files means more metadata in that list and a bigger manifest. That said, we don't usually see issues this severe (6GB of usage) until there are many, many incremental layers, so our current advice, while we transition the metadata representation to an iterable file format, is to avoid large numbers of incremental layers on bigger backups. Is this an incremental backup, and if so, do you know how many layers it has? Of course, that advice is only useful when producing the backups; now we have no choice but to restore this one.

b) Restore splits at every file boundary in the base backup, which can often over-split.

Unfortunately I'm skeptical there is a "quick fix" for either of these, like a setting or a knob we can just turn; the manifest size is being worked on but will probably not land until the next release, and the over-split we could probably fix in a patch. But if you need something to get the RESTORE unblocked right now, your best bet might be to temporarily provision instances/pods/whatever with more RAM until the restore completes.
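(For anyone else hitting this: a hedged way to count the layers of an existing backup from SQL; the S3 URI below is just a placeholder.)

```sql
-- Each distinct end_time reported by SHOW BACKUP corresponds to one layer:
-- the base full backup plus any incrementals on top of it.
SELECT count(DISTINCT end_time) AS layers
FROM [SHOW BACKUP 's3://bucket/backup-path?AUTH=implicit'];
```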
It's a full backup, not incremental. Thanks for the explanation. For the moment I'll see if it works out to just let the number of ranges come down while the restore is paused and hope that frees enough memory; it's coming down at about ~100k/hour. Unfortunately spinning up a bigger instance isn't so easy since these hosts are on-prem. I just realized I can probably buy a bit of memory too by reducing cache/SQL mem, so I'll try that as well. Even if I succeed, I would like to prevent this in the future if possible. Is it possible to make BACKUP store larger SSTs? Would it mean configuring Pebble to use larger SSTs?
The backup SSTs are wholly separate from the underlying Pebble SSTs. Unfortunately there isn't an easy setting to flip here: the sizes are mostly determined by the order in which concurrent scans happen to complete (since scanned keys need to go into a backup file in order, and we only have so much buffer to reorder scans that complete out of order before we flush that file and open a new one). Fortunately @stevendanna is actively (like, we were actually talking about this a couple of hours ago) looking for ways to make this process smarter, with the explicit aim of making fewer, larger SSTs during backup (ideally in a way that is possible to backport in a patch), as a way to both a) reduce the metadata size and, indirectly, b) reduce that initial over-split. But that unfortunately is going to take a code change since it is an algorithmic fix, not something you can just tune in an existing cluster.
I took a quick and dirty pass at a patch to see if I could mitigate the restore over-splitting: #86496
Thanks @dt and @irfansharif, you guys are awesome. FWIW I realized I could buy myself a bit more memory by reducing
The latest restore, with reduced cache size, still failed, but it was a different failure from before. Instead of the coordinator OOMing, I got this:
Not sure why it's importing 7148259 ranges; I did an ls on the backup bucket and there are 1780626 total objects, so shouldn't it be creating ~1780626 ranges? I also saw instability in the cluster while it was trying to proceed, including liveness problems and errors like:
I'm assuming that's just an overload symptom...
This setting was previously disabled because of timeouts observed when restoring our TPCCInc fixtures. The cause of those timeouts has been identified as #88329, making it safe to re-enable merging of spans during restore. This setting prevents restore from over-splitting and leaving the cluster with a merge hangover post-restore. Informs: #86470. Release note (sql change): Sets `backup.restore_span.target_size` to default to 384 MiB so that restore merges up to that size of spans when reading from the backup before actually ingesting data. This should reduce the number of ranges created during restore and thereby reduce the merging of ranges that needs to occur post-restore.
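A hedged example of applying this setting by hand on a cluster that predates the new default:

```sql
-- Merge adjacent restore spans up to ~384 MiB before ingesting,
-- so the restore creates far fewer ranges up front.
SET CLUSTER SETTING backup.restore_span.target_size = '384MiB';
```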
Describe the problem
I'm attempting to restore a large database from an S3 backup, and if I naively let it run, nodes repeatedly run out of memory and crash. It's a 40-node cluster (each node has 64 vCPUs, 64 GB RAM, and six 1 TB drives). The backup is 29.5 TB (the original DB is ~120 TB), and the BACKUP_MANIFEST is 518.9 MB.
When I try to run it, it starts creating millions of ranges, even though the pre-backup cluster only had a few hundred thousand. A lot of CRDB memory seems to get eaten up by having this many ranges, so when one node adopts the job to run the restore, it runs out of memory pretty fast.
The situation then gets worse: the job gets automatically retried a bunch of times, and every time it creates more ranges. When it does finally fail, the GC seems to take forever, so I didn't even wait for that; I had to blow away the cluster to really try again.
Here is some of the configuration I've tried so far:
For now I'm trying to shepherd it through by pausing the job when I see a node almost out of memory, and then resuming it after letting the node recover or restarting it. I'm also seeing that ranges are gradually being merged and removed while the job is paused, which gives me hope that as long as I keep the range count down somewhat, I'll have enough memory to complete the restore. But it hasn't succeeded yet.
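As a reference point, a hedged sketch of the pause/resume loop described above, assuming the restore is the only RESTORE job in flight:

```sql
-- Pause the restore when a node gets close to its memory limit...
PAUSE JOBS (SELECT job_id FROM [SHOW JOBS]
            WHERE job_type = 'RESTORE' AND status = 'running');

-- ...and resume it once range merging has brought the count back down.
RESUME JOBS (SELECT job_id FROM [SHOW JOBS]
             WHERE job_type = 'RESTORE' AND status = 'paused');
```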
Additional data / screenshots
Heap dumps from 2 nodes that were near max mem:
![Screen Shot 2022-08-19 at 2 24 13 PM](https://user-images.githubusercontent.com/5198575/185683328-b0f9be11-cf4f-4be9-89d7-7cb663900547.png)
![Screen Shot 2022-08-19 at 2 24 23 PM](https://user-images.githubusercontent.com/5198575/185683356-2d71754b-0b4a-4a65-8854-b352f07ea308.png)
Environment:
Jira issue: CRDB-18775