Sort L0 files by newly introduced epoch_num #10922
Conversation
Finally finished thinking about all the parts I hoped to think about. Great work on this very difficult problem.
Update: Addressed the comments and going through compatibility + stress tests again. Will notify here when it's ready for review again.
Thanks for addressing all the comments. LGTM!!
Thanks for the review!!! The compatibility and stress tests finished well, so I will land this once signals are clear.
…rash test (#11063)

Summary:
**Context/Summary:**
#10777 was reverted (#10999) due to an internal blocker and replaced with a better fix, #10922. However, the revert also reverted the `Options::CompactionOptionsFIFO::allow_compaction` stress/crash coverage added by that PR.

It's useful coverage because setting `Options::CompactionOptionsFIFO::allow_compaction=true` will [increase](https://github.com/facebook/rocksdb/blob/7.8.fb/db/version_set.cc#L3255) the compaction score of L0 files for FIFO and then trigger more FIFO compaction. This speeds up discovery of bugs related to FIFO compaction, like #10955. To see the speedup, compare the failure occurrence of the following command with `Options::CompactionOptionsFIFO::allow_compaction=true/false`:

```
--fifo_allow_compaction=1 --acquire_snapshot_one_in=10000 --adaptive_readahead=0 --allow_concurrent_memtable_write=0 --allow_data_in_errors=True --async_io=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=1 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=8.869062094789008 --bottommost_compression_type=none --bytes_per_sync=0 --cache_index_and_filter_blocks=1 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=1 --charge_filter_construction=0 --charge_table_reader=1 --checkpoint_one_in=1000000 --checksum_type=kxxHash --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=3 --compaction_style=2 --compaction_ttl=0 --compression_max_dict_buffer_bytes=8589934591 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=xpress --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox --db_write_buffer_size=1048576 --delpercent=4 --delrangepercent=1 --destroy_db_initially=1 --detect_filter_construct_corruption=0 --disable_wal=0 --enable_compaction_filter=0 --enable_pipelined_write=1 --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --fail_if_options_file_error=1 --file_checksum_impl=xxh64 --flush_one_in=1000000 --format_version=4 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=10 --index_type=2 --ingest_external_file_one_in=100 --initial_auto_readahead_size=16384 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=False --log2_keys_per_lock=10 --long_running_snapshots=0 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=524288 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=25000000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=1048576 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=0 --mock_direct_io=True --nooverwritepercent=1 --num_file_reads_for_auto_readahead=2 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=40000 --optimize_filters_for_memory=0 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=3
--pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefix_size=7 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=3600 --progress_reports=0 --read_fault_one_in=1000 --readahead_size=0 --readpercent=15 --recycle_log_file_num=1 --reopen=0 --ribbon_starting_level=999 --secondary_cache_fault_one_in=0 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=0 --subcompactions=2 --sync=0 --sync_fault_injection=0 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=1 --unpartitioned_pinning=1 --use_direct_io_for_flush_and_compaction=1 --use_direct_reads=1 --use_full_merge_v1=1 --use_merge=0 --use_multiget=0 --use_put_entity_one_in=0 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=33554432 --write_dbid_to_manifest=1 --writepercent=65
```

Therefore this PR adds it back to the stress/crash test.

Pull Request resolved: #11063

Test Plan: Rehearsal stress test to make sure the stress/crash test is stable

Reviewed By: ajkr

Differential Revision: D42283650

Pulled By: hx235

fbshipit-source-id: 132e6396ab6e24d8dcb8fe51c62dd5211cdf53ef
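For reference, the option being exercised maps to the following RocksDB setup. This is a minimal sketch, not part of the PR; the db path and the size cap are arbitrary illustrative choices:

```
#include <rocksdb/db.h>
#include <rocksdb/options.h>

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  // FIFO compaction with compaction of small L0 files enabled -- the
  // coverage this PR re-adds (db_stress flag --fifo_allow_compaction=1).
  options.compaction_style = rocksdb::kCompactionStyleFIFO;
  options.compaction_options_fifo.allow_compaction = true;
  options.compaction_options_fifo.max_table_files_size = 1 << 30;  // 1 GB cap (arbitrary)

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/fifo_demo", &db);
  if (!s.ok()) {
    return 1;
  }
  delete db;
  return 0;
}
```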
Context:
Sorting L0 files by `largest_seqno` has at least two inconveniences:
- File ingestion, and compaction involving ingested files, can create files whose seqno ranges overlap with existing files', and `force_consistency_check=true` will catch such overlapping seqno ranges even when the overlap is harmless. For example: insert k1 to memtable m1; ingest file s1, then file s2; insert more keys to m1; compact s1 and s2 into a new file s3; then flush m1 into a new file s4. The seqno ranges of s3 and s4 overlap, so `force_consistency_check=true` will think s4 and s3 have a file reordering corruption that might cause returning an old value of k1. This is a false positive, since the ingested files could not have overlapped in key range with the memtable's keys at ingestion time.
- The `largest_seqno` assigned to a compaction output file can introduce a wrong ordering and hence a real file reordering corruption. Note that with `force_consistency_check=true`, there isn't a good way to prevent this from happening if ordering by `largest_seqno`.

Therefore, we are redesigning the sorting criteria of L0 files to avoid the above inconveniences. Credit to @ajkr, we now introduce `epoch_num`, which describes the order of a file being flushed or ingested/imported (a compaction output file is assigned the minimum `epoch_num` among its input files'). This avoids the above inconveniences in the following ways:
- There is no longer an overlapping-seqno-range check under `force_consistency_check=true`, but an `epoch_number` ordering check instead. In the example above, this results in the file ordering s1 < s2 < s4 (pre-compaction) and s3 < s4 (post-compaction), which won't trigger a false positive corruption. See test class `DBCompactionTestL0FilesMisorderCorruption*` for more.
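To make the new ordering concrete, here is a minimal sketch of sorting L0 files newest-first by epoch number with a seqno tie-break. The `FileMeta` struct and function names are illustrative assumptions, not RocksDB's actual types or comparator:

```
#include <algorithm>
#include <cstdint>
#include <vector>

// Illustrative per-file metadata (not RocksDB's FileMetaData).
struct FileMeta {
  uint64_t epoch_number;   // flush/ingestion order; compaction outputs
                           // inherit the minimum of their inputs'
  uint64_t largest_seqno;  // only used to break epoch_number ties
};

// Newest first: a larger epoch_number sorts earlier; ties (e.g. multiple
// output files of one compaction) fall back to the old seqno-based order.
bool NewestFirst(const FileMeta& a, const FileMeta& b) {
  if (a.epoch_number != b.epoch_number) {
    return a.epoch_number > b.epoch_number;
  }
  return a.largest_seqno > b.largest_seqno;
}

void SortL0(std::vector<FileMeta>& l0_files) {
  std::sort(l0_files.begin(), l0_files.end(), NewestFirst);
}
```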
Summary:
- Introduce `epoch_number`, stored per `ColumnFamilyData`, and sort a CF's L0 files by their assigned `epoch_number` instead of by `largest_seqno`:
  - `epoch_number` is increased and assigned upon `VersionEdit::AddFile()` for flush (or similarly for WriteLevel0TableForRecovery) and for file ingestion (except for ingestion with allow_behind=true, which always gets assigned `kReservedEpochNumberForFileIngestedBehind`).
  - A compaction output file is assigned the minimum `epoch_number` among its input files'.
  - Files with equal `epoch_number` keep their relative order under `NewestFirstBySeqNo`, so the new ordering stays consistent with `NewestFirstBySeqNo`.
  - The assignment is race-free: `epoch_number` is assigned in `VersionEdit::AddFile()`, where the version edit is applied to the LSM tree shape right after, either by holding the db mutex (e.g., flush, file ingestion, import column family) or because there is only 1 ongoing edit per CF (e.g., WriteLevel0TableForRecovery, Repair).
  - Assigning a compaction output the minimum input `epoch_number` is sound, even for `Refit(target_level=0)`: for every key "k" in the input range, a legit compaction will cover a continuous epoch number range of that key. As long as we assign key "k" the minimum input epoch number, it won't become newer or older than the versions of this key that aren't included in this compaction, hence no misorder (a toy sketch of this bookkeeping follows).
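Here is that toy sketch of the epoch bookkeeping. All names and values (`ColumnFamilyEpochs`, the reserved constant's value) are assumptions for illustration, not RocksDB internals:

```
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical reserved epoch for ingest-behind files. The real constant
// is kReservedEpochNumberForFileIngestedBehind; its value here is assumed.
constexpr uint64_t kReservedEpochForIngestBehind = 0;

// Hypothetical per-column-family epoch bookkeeping.
struct ColumnFamilyEpochs {
  uint64_t next_epoch_number = 1;

  // Flush / normal file ingestion: each new L0 file gets the next epoch.
  uint64_t AssignForNewFile() { return next_epoch_number++; }

  // Ingestion with allow_behind=true always gets the reserved epoch.
  uint64_t AssignForIngestBehind() const { return kReservedEpochForIngestBehind; }
};

// A compaction output inherits the minimum epoch among its inputs, so it
// never sorts newer than versions of its keys left outside the compaction.
// Assumes a non-empty input list.
uint64_t AssignForCompactionOutput(const std::vector<uint64_t>& input_epochs) {
  return *std::min_element(input_epochs.begin(), input_epochs.end());
}
```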
- Persist the `epoch_number` of each file in the manifest and recover `epoch_number` on db recovery:
  - Backward compatibility with an old db without `epoch_number` support is guaranteed by assigning `epoch_number` to the recovered files in `NewestFirstBySeqno` order; see `VersionStorageInfo::RecoverEpochNumbers()` for more (a sketch of this inference follows this list).
  - Forward compatibility of the manifest is guaranteed by the flexibility of `NewFileCustomTag`.
- Replace the `force_consistency_check` on L0 with an `epoch_number` ordering check, and remove false positive checks like case 1 with `largest_seqno` above.
  - For files that don't have an `epoch_number` yet, we still use the old ordering (`NewestFirstBySeqno`) to check/sort them till we infer their epoch number; see usages of `EpochNumberRequirement`.
- Update existing tests with `epoch_number` so make check will pass.
- Update tests with `epoch_number` and cover universal/FIFO compaction, CompactRange and CompactFile cases.
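And the recovery-time inference sketch. This is illustrative only (the struct and the exact ordering details are assumptions); `VersionStorageInfo::RecoverEpochNumbers()` is the real logic:

```
#include <algorithm>
#include <cstdint>
#include <vector>

// Illustrative metadata; only what the sketch needs.
struct FileMeta {
  uint64_t largest_seqno;
  uint64_t epoch_number = 0;  // 0 = not yet inferred
};

// For files recovered from a db written before epoch_number existed:
// order them by the legacy seqno-based rule (oldest first here), then
// hand out increasing epoch numbers so that sorting by epoch_number
// reproduces the old NewestFirstBySeqno order.
void RecoverEpochNumbers(std::vector<FileMeta>& l0_files) {
  std::sort(l0_files.begin(), l0_files.end(),
            [](const FileMeta& a, const FileMeta& b) {
              return a.largest_seqno < b.largest_seqno;  // oldest first
            });
  uint64_t epoch = 1;
  for (FileMeta& f : l0_files) {
    f.epoch_number = epoch++;
  }
}
```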
Test:
- `make check`
- New unit tests under `db/db_compaction_test.cc`, `db/db_test2.cc`, `db/version_builder_test.cc`, `db/repair_test.cc`
- Updated unit tests (`DBCompactionTestL0FilesMisorderCorruption*`) under "Fix corruption with intra-L0 on ingested files" #5958 (comment)
- Compatibility test: ran the crash test (with file ingestion turned off when running the `.orig` binary, to prevent this bug from affecting the upgrade/downgrade formality checking) for 1 hour on `simple black/white box` and `cf_consistency/txn/enable_ts with whitebox + test_best_efforts_recovery with blackbox`