backup: support split big region into small backup files (#9283) #9448
Signed-off-by: ti-srebot <[email protected]>
Signed-off-by: Chunzhu Li <[email protected]>
/lgtm
@kennytm: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/ti-community-prow repository.
/lgtm
/label do-not-merge/cherry-pick-not-approved
This cherry pick PR is for a release branch and has not yet been approved by the release team. To merge this cherry pick, it must first be approved (/lgtm + /merge) by the collaborators. After it has been approved by the collaborators, please send an email to the QA team requesting approval, and the QA team will help you merge the PR.
@overvenus PTAL
/label cherry-pick-approved
/lgtm
[REVIEW NOTIFICATION] This pull request has been approved by:
To complete the pull request process, please ask the reviewers in the list to review by filling The full list of commands accepted by this bot can be found here. Reviewer can indicate their review by writing |
/merge
@NingLin-P: It seems you want to merge this PR, I will help you trigger all the tests: /run-all-tests You only need to trigger If you have any questions about the PR merge process, please refer to pr process. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.
/lgtm
/run-all-tests
/merge
@overvenus: It seems you want to merge this PR, I will help you trigger all the tests: /run-all-tests
This pull request has been accepted and is ready to merge. Commit hash: 828c71b
@ti-srebot: Your PR was out of date, so I have automatically updated it for you. At the same time, I will also trigger all tests for you: /run-all-tests
…ikv#9448)

cherry-pick tikv#9283 to release-4.0

---

### What problem does this PR solve?

Issue Number: close tikv#9144

Problem Summary: BR reads all data of a region and fills it into an SST writer, which keeps that data in memory. If a region is huge, TiKV may crash with an OOM because all data of the region is held in memory at once.

### What is changed and how it works?

What's Changed: Record the size of the written txn entries. When it reaches `region_max_size`, we save the data cached in RocksDB to an SST file and then switch to the next file.

### Related changes

- Need to cherry-pick to the release branch

### Check List

Tests

- Unit test
- Integration test
- Manual test (add detailed scripts or steps below)

1. Set `sst-max-size` to 15MiB.

```
mysql> select * from CLUSTER_CONFIG where `TYPE`="tikv";
+------+-----------------+---------------------+-------+
| TYPE | INSTANCE        | KEY                 | VALUE |
+------+-----------------+---------------------+-------+
| tikv | 127.0.0.1:20160 | backup.batch-size   | 8     |
| tikv | 127.0.0.1:20160 | backup.num-threads  | 9     |
| tikv | 127.0.0.1:20160 | backup.sst-max-size | 15MiB |
...
```

2. Back up around 100MB of data (without compaction) successfully.

```
$ ./br backup full -s ./backup --pd http://127.0.0.1:2379
Full backup <--------------------------------------------------------------------------------------------------------------------------------------------------------------------> 100.00%
Checksum <-----------------------------------------------------------------------------------------------------------------------------------------------------------------------> 100.00%
[2020/12/31 14:39:12.534 +08:00] [INFO] [collector.go:60] ["Full backup Success summary: total backup ranges: 2, total success: 2, total failed: 0, total take(Full backup time): 4.273097395s, total take(real time): 8.133315406s, total kv: 8000000, total size(MB): 361.27, avg speed(MB/s): 84.55"] ["backup checksum"=901.754111ms] ["backup fast checksum"=6.09384ms] ["backup total regions"=10] [BackupTS=421893700168974340] [Size=48023090]
```

3. The big region can be split into several files:

```
-rw-r--r-- 1 * * 1.5M Dec 31 14:39 1_60_28_74219326eeb0a4ae3a0f5190f7784132bb0e44791391547ef66862aaeb668579_1609396745730_write.sst
-rw-r--r-- 1 * * 1.2M Dec 31 14:39 1_60_28_b7a5509d9912c66a21589d614cfc8828acd4051a7eeea3f24f5a7b337b5a389e_1609396746062_write.sst
-rw-r--r-- 1 * * 1.5M Dec 31 14:39 1_60_28_cdcc2ce1c18a30a2b779b574f64de9f0e3be81c2d8720d5af0a9ef9633f8fbb7_1609396745429_write.sst
-rw-r--r-- 1 * * 2.4M Dec 31 14:39 1_62_28_4259e616a6e7b70c33ee64af60230f3e4160af9ac7aac723f033cddf6681826a_1609396747038_write.sst
-rw-r--r-- 1 * * 2.4M Dec 31 14:39 1_62_28_5d0de44b65fb805e45c93278661edd39792308c8ce90855b54118c4959ec9f16_1609396746731_write.sst
-rw-r--r-- 1 * * 2.4M Dec 31 14:39 1_62_28_ef7ab4b5471b088ee909870e316d926f31f4f6ec771754690eac61af76e8782c_1609396747374_write.sst
-rw-r--r-- 1 * * 1.5M Dec 31 14:39 1_64_29_74211aae8215fe9cde8bd7ceb8494afdcc18e5c6a8c5830292a577a9859d38e1_1609396746671_write.sst
-rw-r--r-- 1 * * 1.2M Dec 31 14:39 1_64_29_81e152c98742938c1662241fac1c841319029e800da6881d799a16723cb42888_1609396747010_write.sst
-rw-r--r-- 1 * * 1.5M Dec 31 14:39 1_64_29_ce0dde9826aee9e5ccac0a516f18b9871d3897effd559ff7450b8e56ac449bbd_1609396746349_write.sst
-rw-r--r-- 1 * * 78 Dec 31 14:39 backup.lock
-rw-r--r-- 1 * * 229K Dec 31 14:39 backupmeta
```

4. Restore the backed-up data. The restore succeeds and passes the manual check.

```
./br restore full -s ./backup --pd http://127.0.0.1:2379
Full restore <-------------------------------------------------------------------------------------------------------------------------------------------------------------------> 100.00%
[2020/12/31 14:42:49.983 +08:00] [INFO] [collector.go:60] ["Full restore Success summary: total restore files: 27, total success: 27, total failed: 0, total take(Full restore time): 5.063048828s, total take(real time): 7.84620924s, total kv: 8000000, total size(MB): 361.27, avg speed(MB/s): 71.36"] ["split region"=26.217737ms] ["restore checksum"=4.10792638s] ["restore ranges"=26] [Size=48023090]
```

### Release note

- Fix the problem that TiKV runs out of memory (OOM) when backing up a huge region.
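
For readers who want a picture of the size-based splitting described under "What is changed and how it works?", here is a minimal sketch in Rust. It is not the code from this PR: `MemSstWriter`, `FinishedFile`, and `SplitWriter` are hypothetical names, the "SST writer" is just an in-memory buffer, and the limit is compared against raw key and value lengths, which is an assumption about how entry size is counted.

```rust
/// Hypothetical stand-in for an SST writer: it only buffers entries in memory.
#[derive(Default)]
struct MemSstWriter {
    entries: Vec<(Vec<u8>, Vec<u8>)>,
    bytes: u64,
}

impl MemSstWriter {
    /// Buffer one entry and record its size (key length + value length).
    fn put(&mut self, key: &[u8], value: &[u8]) {
        self.bytes += (key.len() + value.len()) as u64;
        self.entries.push((key.to_vec(), value.to_vec()));
    }
}

/// Summary of one finished "file" (a real backup would persist an SST here).
#[derive(Debug, PartialEq)]
struct FinishedFile {
    entries: usize,
    bytes: u64,
}

/// Splits a stream of entries into files of roughly `sst_max_size` bytes.
struct SplitWriter {
    sst_max_size: u64,
    current: MemSstWriter,
    finished: Vec<FinishedFile>,
}

impl SplitWriter {
    fn new(sst_max_size: u64) -> Self {
        SplitWriter {
            sst_max_size,
            current: MemSstWriter::default(),
            finished: Vec::new(),
        }
    }

    /// Write one entry; once the recorded size reaches the limit,
    /// save the current file and switch to the next one.
    fn write(&mut self, key: &[u8], value: &[u8]) {
        self.current.put(key, value);
        if self.current.bytes >= self.sst_max_size {
            self.flush();
        }
    }

    /// Finish the current file (if it holds anything) and start a fresh one.
    fn flush(&mut self) {
        if self.current.entries.is_empty() {
            return;
        }
        let done = std::mem::take(&mut self.current);
        self.finished.push(FinishedFile {
            entries: done.entries.len(),
            bytes: done.bytes,
        });
    }

    /// Flush whatever is left and return all finished files.
    fn finish(mut self) -> Vec<FinishedFile> {
        self.flush();
        self.finished
    }
}
```

In this sketch the limit is checked after each write, so a file can exceed the threshold by at most one entry, and memory use is bounded by roughly one file's worth of data rather than a whole region. Calling `write` for every entry of a region and then `finish` yields several small files, similar in spirit to the multiple `*_write.sst` files per region in the manual test above.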