Improve registry file migration performance #20717
The registry uses a checkpoint predicate that, when true, writes all state to disk and calls fsync. The checkpoint operation is supposed to be disabled when migrating the old registry file. During migration the old state is copied directly (after cleanup and schema changes are applied). The old state is only deleted after the migration of all states is complete. Unfortunately the checkpoint predicate returned true instead of false, which triggered one checkpoint operation per migrated state. The fix disables fsync during migration and finalizes the migration by calling Checkpoint directly.
Pinging @elastic/integrations-services (Team:Services)
💔 Tests Failed
CI seems quite unstable the last few days. All registry-related and registry migration tests passed.
## What does this PR do?

Ensure that fsync is called only once after the migration of old state entries is complete.

## Why is it important?

The registry uses a checkpoint predicate that, when true, writes all state to disk and calls fsync. The checkpoint operation is supposed to be disabled when migrating the old registry file.

During migration, the old state is copied directly (after cleanup and schema changes are applied). The old state is only deleted after the migration of all states is complete.

Unfortunately, the checkpoint predicate returned true instead of false, which triggered one checkpoint operation per migrated state. The fix disables fsync and finalizes the migration by calling Checkpoint directly.

The PR also provides a benchmark (each "op" is one migration attempt). Before this fix (go test killed the run after 10min for 10k entries):

```
BenchmarkMigration0To1/1-32         286        4203353 ns/op
BenchmarkMigration0To1/10-32         34       35730680 ns/op
BenchmarkMigration0To1/100-32         2      720890839 ns/op
BenchmarkMigration0To1/1000-32        1    31633569085 ns/op
...
test timed out after 10min
```

Benchmark results with the fix (migrating 100k entries took ~7.6s):

```
BenchmarkMigration0To1/1-32         274        4371400 ns/op
BenchmarkMigration0To1/10-32        259        4639209 ns/op
BenchmarkMigration0To1/100-32       100       13374147 ns/op
BenchmarkMigration0To1/1000-32       13      104220944 ns/op
BenchmarkMigration0To1/10000-32       2      916656798 ns/op
BenchmarkMigration0To1/100000-32      1     7616648790 ns/op
PASS
```

Closes elastic#20705

(cherry picked from commit 03748b3)
The same description is repeated in later commits that reference this PR, including the backport elastic#20769 (cherry picked from commits 03748b3 and c2be4a7), which additionally carries a "fix changelog" change. Co-authored-by: Steffen Siering
What does this PR do?
Ensure that fsync is called only once after the migration of old state entries is complete.
Why is it important?
The registry uses a checkpoint predicate that, when true, writes all state to
disk and calls fsync. The checkpoint operation is supposed to be disabled
when migrating the old registry file.
During migration the old state is copied directly (after cleanup
and schema changes are applied). The old state is only deleted
after the migration of all states is complete.
Unfortunately the checkpoint predicate returned true instead of
false, which triggered one checkpoint operation per migrated
state. The fix disables fsync during migration and finalizes the
migration by calling Checkpoint directly.
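To make the effect of the predicate concrete, here is a minimal sketch. It is not the registry code from this PR: `store`, `Set`, `migrate`, and `oldEntries` are hypothetical names, and `Checkpoint` only counts calls instead of writing to disk and calling fsync. It shows how a predicate that always returns true turns a migration of N entries into N+1 checkpoints, while a predicate that returns false leaves only the final explicit one.

```go
package main

import "fmt"

// store is a hypothetical stand-in for the registry's on-disk store.
// It only counts how often a checkpoint (full write + fsync) would run.
type store struct {
	states      map[string]string
	pendingOps  int
	checkpoints int
	// shouldCheckpoint is the checkpoint predicate. It is consulted after
	// every update and triggers a checkpoint when it returns true.
	shouldCheckpoint func(pendingOps int) bool
}

func newStore(pred func(int) bool) *store {
	return &store{states: map[string]string{}, shouldCheckpoint: pred}
}

// Set records one state and checkpoints if the predicate asks for it.
func (s *store) Set(key, value string) {
	s.states[key] = value
	s.pendingOps++
	if s.shouldCheckpoint(s.pendingOps) {
		s.Checkpoint()
	}
}

// Checkpoint stands in for "write all state to disk and call fsync".
func (s *store) Checkpoint() {
	s.checkpoints++
	s.pendingOps = 0
}

// migrate copies all old entries into the store and finalizes the
// migration with a single explicit checkpoint.
func migrate(s *store, old map[string]string) {
	for k, v := range old {
		s.Set(k, v)
	}
	s.Checkpoint()
}

// oldEntries builds n made-up entries of an old registry file.
func oldEntries(n int) map[string]string {
	old := make(map[string]string, n)
	for i := 0; i < n; i++ {
		old[fmt.Sprintf("file-%d", i)] = "offset"
	}
	return old
}

func main() {
	buggy := newStore(func(int) bool { return true })  // behaviour before the fix
	fixed := newStore(func(int) bool { return false }) // behaviour after the fix
	migrate(buggy, oldEntries(1000))
	migrate(fixed, oldEntries(1000))
	fmt.Println("checkpoints, predicate returns true: ", buggy.checkpoints)
	fmt.Println("checkpoints, predicate returns false:", fixed.checkpoints)
}
```

Since every real checkpoint rewrites all state accumulated so far, the per-entry variant does work that grows with the number of entries already migrated, which is consistent with the steep growth in the "before" benchmark numbers.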
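The PR also provides a benchmark (each "op" is one migration attempt). Before this fix, go test killed the run after 10min at 10k entries; the largest size that completed, 1,000 entries, took about 31.6s per op (31633569085 ns/op).
Benchmark results with the fix: migrating 100k entries took ~7.6s (7616648790 ns/op), and 1,000 entries took about 0.1s (104220944 ns/op).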
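The ns/op figures reported for `BenchmarkMigration0To1` can be turned into seconds and speedup factors with a few lines of Go. This is only arithmetic on the numbers quoted in the PR; the `result` type and the helper names `lookup` and `speedup` are made up for this sketch.

```go
package main

import "fmt"

// result holds one line of benchmark output: the number of entries
// migrated per op and the reported ns/op.
type result struct {
	entries int
	nsPerOp int64
}

// Figures copied from the benchmark output quoted in the PR.
var before = []result{
	{1, 4203353},
	{10, 35730680},
	{100, 720890839},
	{1000, 31633569085},
}

var after = []result{
	{1, 4371400},
	{10, 4639209},
	{100, 13374147},
	{1000, 104220944},
	{10000, 916656798},
	{100000, 7616648790},
}

// lookup returns the ns/op for the given entry count, if present.
func lookup(rs []result, entries int) (int64, bool) {
	for _, r := range rs {
		if r.entries == entries {
			return r.nsPerOp, true
		}
	}
	return 0, false
}

// speedup returns before/after for an entry count that has both figures.
func speedup(entries int) (float64, bool) {
	b, okB := lookup(before, entries)
	a, okA := lookup(after, entries)
	if !okB || !okA {
		return 0, false
	}
	return float64(b) / float64(a), true
}

func main() {
	for _, r := range before {
		a, _ := lookup(after, r.entries)
		s, _ := speedup(r.entries)
		fmt.Printf("%5d entries: %8.3fs -> %6.3fs (%.1fx)\n",
			r.entries, float64(r.nsPerOp)/1e9, float64(a)/1e9, s)
	}
}
```

For a single entry the two runs are roughly equal, as expected, since both perform about one checkpoint; the gap widens as the entry count grows.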
Checklist
- [ ] I have made corresponding changes to the documentation
- [ ] I have made corresponding change to the default configuration files
CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.
Related issues
- Closes #20705