forked from apache/kudu
[catalog_manager] KUDU-3344 clean up deleted tables and tablets #3
Merged
zhangyifan27 commented on Dec 30, 2021
1a45cd8 to ef3171c
zhangyifan27 commented on Jan 7, 2022:
Change-Id: Idefa2ee2f5108ba913fe0057a4061c3c28351547
zhangyifan27 pushed a commit that referenced this pull request on May 4, 2023:

Since the original implementation stored the random integer used for replica selection in a statically initialized variable, the corresponding calls into the libstdc++/libc++ runtime were issued before the process entered main(). That means some SSE4.2-specific instructions might be executed, since libkudu_client is unconditionally compiled with the -msse4.2 flag, and there had been no chance to call KuduClientBuilder::Build(), which verifies that the required CPU features are present by calling CheckCPUFlags(). As a result, an attempt to run an application linked with the kudu_client library on a machine lacking SSE4.2 support would crash with SIGILL and a stack trace like the one below:

  #0 0x00007fc4b1b58162 in std::mersenne_twister_engine<...>::_M_gen_rand at include/c++/7.5.0/bits/random.tcc:408
  #1 std::mersenne_twister_engine<...>::operator() at include/c++/7.5.0/bits/random.tcc:459
  #2 0x00007fc4b1b1d65d in kudu::client::(anonymous namespace)::InitRandomSelectionInt at ../../../../../src/kudu/client/client-internal.cc:196
  #3 0x00007fc4b1b1d6ef in __static_initialization_and_destruction_0 at ../../../../../src/kudu/client/client-internal.cc:198
  #4 _GLOBAL__sub_I_client_internal.cc(void) at ../../../../../src/kudu/client/client-internal.cc:871

This patch addresses that deficiency: instead of crashing unexpectedly, the application now gets an error upon an attempt to create an instance of the KuduClient object.

This is a follow-up to ccbbfb3.

Change-Id: I11c2a29ef69a8c97c68330d261fdff64accebb0b
Reviewed-on: http://gerrit.cloudera.org:8080/19828
Reviewed-by: Abhishek Chennaka <[email protected]>
Reviewed-by: Wenzhe Zhou <[email protected]>
Tested-by: Alexey Serbin <[email protected]>
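The ordering problem described above can be sketched in isolation. This is a minimal illustration under stated assumptions, not the actual patch: `CpuSupportsRequiredFeatures()`, `ReplicaSelectionInt()`, `BuildClient()` and the `g_cpu_ok` toggle are hypothetical names. The idea is that a namespace-scope initializer runs before main(), whereas a function-local static is initialized on its first use, so the random draw can be deferred until after a CPU feature check has passed.

```cpp
#include <cassert>
#include <random>
#include <string>

namespace {

// Hypothetical stand-in for a real CPU check such as Kudu's CheckCPUFlags();
// here it is just a toggle so both code paths can be exercised anywhere.
bool g_cpu_ok = true;

bool CpuSupportsRequiredFeatures() { return g_cpu_ok; }

// BEFORE (problematic pattern): a namespace-scope initializer such as
//   const int kSelectionInt = InitRandomSelectionInt();
// runs during static initialization, i.e. before main() and before any
// feature check has had a chance to run.

// AFTER: a function-local static is initialized on the first call only
// (and thread-safely since C++11), so the RNG code runs on demand.
int ReplicaSelectionInt() {
  static const int value = [] {
    std::mt19937 gen(std::random_device{}());
    std::uniform_int_distribution<int> dist(0, 1 << 20);
    return dist(gen);
  }();
  return value;
}

}  // namespace

// Hypothetical builder: check CPU features first, then touch the lazily
// initialized value. Reports an error instead of crashing.
std::string BuildClient(int* selection_out) {
  if (!CpuSupportsRequiredFeatures()) {
    return "error: CPU lacks required features (e.g. SSE4.2)";
  }
  *selection_out = ReplicaSelectionInt();
  return "ok";
}
```

Every call after the first returns the same cached value. Note that the sketch only addresses initialization order; it does not change which instructions the compiler emits elsewhere in a translation unit built with -msse4.2.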
zhangyifan27 pushed a commit that referenced this pull request on Oct 25, 2023:

This update helps prevent a SIGSEGV in libunwind when running Kudu on aarch64 (in particular, on Graviton3 instances in EC2). An example stack trace looked like the one below; it is similar to the stack mentioned in [1]:

  #0 access_mem (as=0x3304418 <local_addr_space>, addr=7745970402396146688, val=0xfffff325ca18, write=0, arg=0xfffff325ce70) at thirdparty/src/libunwind-1.6.2/src/aarch64/Ginit.c:337
  #1 0x0000000000a97ac0 in is_plt_entry (c=0xfffff325ce70) at thirdparty/src/libunwind-1.6.2/src/aarch64/Gstep.c:43
  #2 0x0000000000a97fdc in _ULaarch64_step (cursor=0xfffff325ce70) at thirdparty/src/libunwind-1.6.2/src/aarch64/Gstep.c:171
  #3 0x00000000025050c8 in kudu::StackTrace::Collect (this=this@entry=0xfffff325d7d8, skip_frames=skip_frames@entry=0) at src/kudu/util/debug-util.cc:612
  #4 0x0000000002507f64 in kudu::StackTrace::Collect (this=this@entry=0xfffff325d7d8, skip_frames=skip_frames@entry=0) at src/kudu/util/debug-util.cc:579

[1] libunwind/libunwind#260

Change-Id: Ie34dc56f78abba537aa15dd3d9c0540157d9afa3
Reviewed-on: http://gerrit.cloudera.org:8080/20540
Tested-by: Kudu Jenkins
Reviewed-by: Michael Smith <[email protected]>
Reviewed-by: Mahesh Reddy <[email protected]>
Reviewed-by: Abhishek Chennaka <[email protected]>
zhangyifan27 pushed a commit that referenced this pull request on Jun 12, 2024:

It turned out that the auto leader rebalancing task wasn't explicitly shut down when the catalog manager was shut down. That led to race conditions reported by TSAN, at least in test scenarios (see below). This patch addresses the issue.

WARNING: ThreadSanitizer: data race (pid=23827)
  Write of size 1 at 0x7b4000008208 by main thread:
    #0 AnnotateRWLockDestroy thirdparty/src/llvm-11.0.0.src/projects/compiler-rt/lib/tsan/rtl/tsan_interface_ann.cpp:264 (auto_rebalancer-test+0x33575e)
    #1 kudu::rw_spinlock::~rw_spinlock() src/kudu/util/locks.h:89:5 (libmaster.so+0x359376)
    #2 kudu::master::TSManager::~TSManager() src/kudu/master/ts_manager.cc:108:1 (libmaster.so+0x4ad201)
    #3 kudu::master::TSManager::~TSManager() src/kudu/master/ts_manager.cc:107:25 (libmaster.so+0x4ad229)
    #4 std::__1::default_delete<kudu::master::TSManager>::operator()(kudu::master::TSManager*) const thirdparty/installed/tsan/include/c++/v1/memory:2262:5 (libmaster.so+0x407ce7)
    #5 std::__1::unique_ptr<kudu::master::TSManager, std::__1::default_delete<kudu::master::TSManager> >::reset(kudu::master::TSManager*) thirdparty/installed/tsan/include/c++/v1/memory:2517:7 (libmaster.so+0x40157d)
    #6 std::__1::unique_ptr<kudu::master::TSManager, std::__1::default_delete<kudu::master::TSManager> >::~unique_ptr() thirdparty/installed/tsan/include/c++/v1/memory:2471:19 (libmaster.so+0x4015eb)
    #7 kudu::master::Master::~Master() src/kudu/master/master.cc:263:1 (libmaster.so+0x3f7a4a)
    #8 kudu::master::Master::~Master() src/kudu/master/master.cc:261:19 (libmaster.so+0x3f7dc9)
    #9 std::__1::default_delete<kudu::master::Master>::operator()(kudu::master::Master*) const thirdparty/installed/tsan/include/c++/v1/memory:2262:5 (libmaster.so+0x435627)
    #10 std::__1::unique_ptr<kudu::master::Master, std::__1::default_delete<kudu::master::Master> >::reset(kudu::master::Master*) thirdparty/installed/tsan/include/c++/v1/memory:2517:7 (libmaster.so+0x42e6ed)
    #11 kudu::master::MiniMaster::Shutdown() src/kudu/master/mini_master.cc:120:13 (libmaster.so+0x4c2612)
    ...
  Previous atomic write of size 4 at 0x7b4000008208 by thread T439 (mutexes: write M1141235379631443968):
    #0 __tsan_atomic32_compare_exchange_strong thirdparty/src/llvm-11.0.0.src/projects/compiler-rt/lib/tsan/rtl/tsan_interface_atomic.cpp:780 (auto_rebalancer-test+0x33eb60)
    #1 base::subtle::Release_CompareAndSwap(int volatile*, int, int) /src/kudu/gutil/atomicops-internals-tsan.h:88:3 (libmaster.so+0x2e2b34)
    #2 kudu::rw_semaphore::unlock_shared() src/kudu/util/rw_semaphore.h:91:19 (libmaster.so+0x2e29c8)
    #3 kudu::rw_spinlock::unlock_shared() src/kudu/util/locks.h:99:10 (libmaster.so+0x2e28ef)
    #4 std::__1::shared_lock<kudu::rw_spinlock>::~shared_lock() /thirdparty/installed/tsan/include/c++/v1/shared_mutex:369:19 (libmaster.so+0x2e23e0)
    #5 kudu::master::TSManager::GetAllDescriptors(std::__1::vector<std::__1::shared_ptr<kudu::master::TSDescriptor>, std::__1::allocator<std::__1::shared_ptr<kudu::master::TSDescriptor> > >*) const src/kudu/master/ts_manager.cc:206:1 (libmaster.so+0x4adeb6)
    #6 kudu::master::AutoLeaderRebalancerTask::RunLeaderRebalancer() src/kudu/master/auto_leader_rebalancer.cc:405:16 (libmaster.so+0x2fb51b)
    #7 kudu::master::AutoLeaderRebalancerTask::RunLoop() src/kudu/master/auto_leader_rebalancer.cc:445:7 (libmaster.so+0x2fbaa9)

This is a follow-up to 10efaf2.

Change-Id: Iccd66d00280d22b37386230874937e5260f07f3b
Reviewed-on: http://gerrit.cloudera.org:8080/21417
Reviewed-by: Wang Xixu <[email protected]>
Tested-by: Alexey Serbin <[email protected]>
Reviewed-by: Yifan Zhang <[email protected]>
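The shutdown-ordering bug in the report above (a background task still reading TSManager while the master destroys it) follows a general pattern. The sketch below is a hypothetical simplification, not Kudu's code: `Registry` stands in for TSManager, `PeriodicTask` for AutoLeaderRebalancerTask, and `Owner` for the component that owns both. The owner stops and joins the task before destroying the state the task reads.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <memory>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for state owned elsewhere (cf. TSManager).
struct Registry {
  std::mutex mu;
  std::vector<int> servers{1, 2, 3};
  size_t Count() {
    std::lock_guard<std::mutex> l(mu);
    return servers.size();
  }
};

// Hypothetical periodic background task (cf. AutoLeaderRebalancerTask).
class PeriodicTask {
 public:
  explicit PeriodicTask(Registry* registry) : registry_(registry) {}
  ~PeriodicTask() { Shutdown(); }

  void Start() { thread_ = std::thread([this] { RunLoop(); }); }

  // Stops the loop and joins the thread; once this returns, the task no
  // longer touches *registry_. Safe to call repeatedly from one thread.
  void Shutdown() {
    {
      std::lock_guard<std::mutex> l(mu_);
      stopped_ = true;
    }
    cv_.notify_all();
    if (thread_.joinable()) thread_.join();
  }

  int iterations() const { return iterations_.load(); }

 private:
  void RunLoop() {
    std::unique_lock<std::mutex> l(mu_);
    while (!stopped_) {
      l.unlock();
      registry_->Count();  // reads state owned by someone else
      iterations_.fetch_add(1);
      l.lock();
      cv_.wait_for(l, std::chrono::milliseconds(1), [this] { return stopped_; });
    }
  }

  Registry* registry_;
  std::mutex mu_;
  std::condition_variable cv_;
  bool stopped_ = false;
  std::atomic<int> iterations_{0};
  std::thread thread_;
};

// Hypothetical owner: its shutdown path stops the task explicitly before
// tearing down the registry the task reads.
class Owner {
 public:
  Owner() : registry_(std::make_unique<Registry>()), task_(registry_.get()) {
    task_.Start();
  }
  ~Owner() { Shutdown(); }
  void Shutdown() {
    task_.Shutdown();   // 1. stop the background reader first
    registry_.reset();  // 2. only then destroy the shared state
  }
  const PeriodicTask& task() const { return task_; }

 private:
  std::unique_ptr<Registry> registry_;
  PeriodicTask task_;
};
```

Without step 1, the registry (and the lock inside it) could be destroyed while the loop thread is still releasing that lock, which is the shape of the race TSAN reported.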
zhangyifan27 pushed a commit that referenced this pull request on Jun 12, 2024:

This patch addresses a race reported by TSAN with traces like the one below:

WARNING: ThreadSanitizer: data race (pid=11024)
  Write of size 8 at 0x7b580011f260 by thread T174:
    #0 kudu::tablet::OpState::set_start_time(kudu::MonoTime) src/kudu/tablet/ops/op.h:274:58
    #1 kudu::tablet::WriteOp::Start() src/kudu/tablet/ops/write_op.cc:273:11
    #2 kudu::tablet::OpDriver::Prepare() src/kudu/tablet/ops/op_driver.cc:329:7
    #3 kudu::tablet::OpDriver::PrepareTask() src/kudu/tablet/ops/op_driver.cc:249:31
    ...
  Previous read of size 8 at 0x7b580011f260 by thread T5 (mutexes: write M835553159786377312):
    #0 kudu::tablet::OpState::start_time() const src/kudu/tablet/ops/op.h:272:40
    #1 kudu::tablet::WriteOp::ToString() const src/kudu/tablet/ops/write_op.cc:378:36
    #2 kudu::tablet::OpDriver::ToStringUnlocked() const src/kudu/tablet/ops/op_driver.cc:209:23
    #3 kudu::tablet::OpDriver::ToString() const src/kudu/tablet/ops/op_driver.cc:203:10
    #4 kudu::tablet::TabletReplica::GetInFlightOps(...) const src/kudu/tablet/tablet_replica.cc:728:41
    #5 kudu::tserver::TabletServerPathHandlers::HandleTransactionsPage(...) src/kudu/tserver/tserver_path_handlers.cc:286:14
    ...

Change-Id: I52de0840aa20f64cf15c7a9da2d553257c7e85e7
Reviewed-on: http://gerrit.cloudera.org:8080/21427
Tested-by: Kudu Jenkins
Reviewed-by: Abhishek Chennaka <[email protected]>
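The report above shows an op's start time being written by the thread preparing the op while the web UI thread reads it through ToString(). One common remedy for a single scalar field like this is to make it atomic. The sketch below is a hypothetical simplification (`OpStateSketch` and a plain nanosecond counter instead of kudu::MonoTime), and the commit message does not say whether the patch uses an atomic or a lock.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <string>
#include <thread>

// Hypothetical, simplified stand-in for OpState. The start time is written
// by the preparing thread and read concurrently by a diagnostics thread;
// std::atomic makes those concurrent accesses well-defined.
class OpStateSketch {
 public:
  void set_start_time_nanos(int64_t t) {
    // Relaxed ordering suffices: only this one value is published.
    start_time_nanos_.store(t, std::memory_order_relaxed);
  }
  int64_t start_time_nanos() const {
    return start_time_nanos_.load(std::memory_order_relaxed);
  }
  // Safe to call from any thread, e.g. a web page rendering in-flight ops.
  std::string ToString() const {
    return "start_time=" + std::to_string(start_time_nanos());
  }

 private:
  std::atomic<int64_t> start_time_nanos_{0};
};
```

A reader racing with the writer sees either the old or the new value, never a torn one; a lock would be the better choice if several fields had to be read as a consistent snapshot.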
zhangyifan27 pushed a commit that referenced this pull request on Oct 8, 2024:

Since the original implementation stored the random integer used for replica selection in a statically initialized variable, the corresponding calls into the libstdc++/libc++ runtime were issued before the process entered main(). That means some SSE4.2-specific instructions might be executed, since libkudu_client is unconditionally compiled with the -msse4.2 flag, and there had been no chance to call KuduClientBuilder::Build(), which verifies that the required CPU features are present by calling CheckCPUFlags(). As a result, an attempt to run an application linked with the kudu_client library on a machine lacking SSE4.2 support would crash with SIGILL and a stack trace like the one below:

  #0 0x00007fc4b1b58162 in std::mersenne_twister_engine<...>::_M_gen_rand at include/c++/7.5.0/bits/random.tcc:408
  #1 std::mersenne_twister_engine<...>::operator() at include/c++/7.5.0/bits/random.tcc:459
  #2 0x00007fc4b1b1d65d in kudu::client::(anonymous namespace)::InitRandomSelectionInt at ../../../../../src/kudu/client/client-internal.cc:196
  #3 0x00007fc4b1b1d6ef in __static_initialization_and_destruction_0 at ../../../../../src/kudu/client/client-internal.cc:198
  #4 _GLOBAL__sub_I_client_internal.cc(void) at ../../../../../src/kudu/client/client-internal.cc:871

This patch addresses that deficiency: instead of crashing unexpectedly, the application now gets an error upon an attempt to create an instance of the KuduClient object.

This is a follow-up to ccbbfb3.

Change-Id: I11c2a29ef69a8c97c68330d261fdff64accebb0b
Reviewed-on: http://gerrit.cloudera.org:8080/19828
Reviewed-by: Abhishek Chennaka <[email protected]>
Reviewed-by: Wenzhe Zhou <[email protected]>
Tested-by: Alexey Serbin <[email protected]>
Reviewed-on: http://gerrit.cloudera.org:8080/19948
Reviewed-by: Yingchun Lai <[email protected]>
Tested-by: Kudu Jenkins
Reviewed-by: Yuqi Du <[email protected]>
Reviewed-by: Yifan Zhang <[email protected]>
zhangyifan27 pushed a commit that referenced this pull request on Oct 8, 2024:

This update helps prevent a SIGSEGV in libunwind when running Kudu on aarch64 (in particular, on Graviton3 instances in EC2). An example stack trace looked like the one below; it is similar to the stack mentioned in [1]:

  #0 access_mem (as=0x3304418 <local_addr_space>, addr=7745970402396146688, val=0xfffff325ca18, write=0, arg=0xfffff325ce70) at thirdparty/src/libunwind-1.6.2/src/aarch64/Ginit.c:337
  #1 0x0000000000a97ac0 in is_plt_entry (c=0xfffff325ce70) at thirdparty/src/libunwind-1.6.2/src/aarch64/Gstep.c:43
  #2 0x0000000000a97fdc in _ULaarch64_step (cursor=0xfffff325ce70) at thirdparty/src/libunwind-1.6.2/src/aarch64/Gstep.c:171
  #3 0x00000000025050c8 in kudu::StackTrace::Collect (this=this@entry=0xfffff325d7d8, skip_frames=skip_frames@entry=0) at src/kudu/util/debug-util.cc:612
  #4 0x0000000002507f64 in kudu::StackTrace::Collect (this=this@entry=0xfffff325d7d8, skip_frames=skip_frames@entry=0) at src/kudu/util/debug-util.cc:579

[1] libunwind/libunwind#260

Change-Id: Ie34dc56f78abba537aa15dd3d9c0540157d9afa3
Reviewed-on: http://gerrit.cloudera.org:8080/20540
Tested-by: Kudu Jenkins
Reviewed-by: Michael Smith <[email protected]>
Reviewed-by: Mahesh Reddy <[email protected]>
Reviewed-by: Abhishek Chennaka <[email protected]>
(cherry picked from commit dd5fd45)
Reviewed-on: http://gerrit.cloudera.org:8080/20542
zhangyifan27 pushed a commit that referenced this pull request on Oct 11, 2024:

The race condition was reported by TSAN as follows (some information omitted):

WARNING: ThreadSanitizer: data race (pid=1924273)
  Write of size 8 at 0x7b30002fe7c0 by thread T6 (mutexes: write M247597861, write M247597860, write M247597300):
    #0 std::__1::enable_if<(...), void>::type std::__1::swap<kudu::BlockId*>(...) thirdparty/installed/tsan/include/c++/v1/type_traits:4076:9
    ...
    #4 kudu::tablet::RowSetMetadata::CommitRedoDeltaDataBlock(...) src/kudu/tablet/rowset_metadata.cc:197:22
    #5 kudu::tablet::DeltaTracker::FlushDMS(...) src/kudu/tablet/delta_tracker.cc:826:23
    #6 kudu::tablet::DeltaTracker::Flush(...) src/kudu/tablet/delta_tracker.cc:877:14
    #7 kudu::tablet::DiskRowSet::FlushDeltas(...) src/kudu/tablet/diskrowset.cc:552:26
    ...
  Previous read of size 8 at 0x7b30002fe7c0 by thread T34 (mutexes: write M247598319, write M919714229363433616, write M303002710007881612):
    #0 std::__1::vector<...>::size() const thirdparty/installed/tsan/include/c++/v1/vector:658:61
    #1 kudu::tablet::RowSetMetadata::GetAllBlocks() const src/kudu/tablet/rowset_metadata.cc:306:37
    #2 kudu::tablet::TabletMetadata::UpdateUnlocked(...) src/kudu/tablet/tablet_metadata.cc:677:40
    #3 kudu::tablet::TabletMetadata::UpdateAndFlush(...) src/kudu/tablet/tablet_metadata.cc:549:5
    #4 kudu::tablet::Tablet::FlushMetadata(...) src/kudu/tablet/tablet.cc:1992:21
    #5 kudu::tablet::Tablet::HandleEmptyCompactionOrFlush() src/kudu/tablet/tablet.cc:2308:3
    #6 kudu::tablet::Tablet::DeleteAncientDeletedRowsets() src/kudu/tablet/tablet.cc:3084:3
    ...

Change-Id: I07103269526d0ee98b0bb19e76e11f7d47a5b217
Reviewed-on: http://gerrit.cloudera.org:8080/21799
Reviewed-by: Abhishek Chennaka <[email protected]>
Tested-by: Alexey Serbin <[email protected]>
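In the report above, the writer swaps the redo block vector while another thread, holding a disjoint set of mutexes, reads its size. The general remedy is to have both sides take one common lock. The sketch below is a hypothetical simplification (`RowSetMetadataSketch`, `BlockId` as a plain integer); which lock the actual patch uses is not stated in the commit message.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

using BlockId = uint64_t;  // hypothetical simplification of kudu::BlockId

class RowSetMetadataSketch {
 public:
  // Writer (cf. CommitRedoDeltaDataBlock): build the new list, then swap it
  // in while holding the same lock that readers take.
  void CommitRedoDeltaBlock(BlockId id) {
    std::lock_guard<std::mutex> l(lock_);
    std::vector<BlockId> updated(redo_blocks_);
    updated.push_back(id);
    redo_blocks_.swap(updated);
  }

  // Reader (cf. GetAllBlocks): copies under the lock, so it never observes
  // a vector in the middle of a swap.
  std::vector<BlockId> GetAllBlocks() const {
    std::lock_guard<std::mutex> l(lock_);
    std::vector<BlockId> all(base_blocks_);
    all.insert(all.end(), redo_blocks_.begin(), redo_blocks_.end());
    return all;
  }

 private:
  mutable std::mutex lock_;
  std::vector<BlockId> base_blocks_{1, 2};
  std::vector<BlockId> redo_blocks_;
};
```

Returning a copy keeps the critical section short and lets callers iterate the result without holding the lock.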
zhangyifan27 pushed a commit that referenced this pull request on Oct 11, 2024:

This patch fixes a race in access to the RowSetMetadata::id_ field in the rollback scenario of the DiskRowSet::MajorCompactDeltaStoresWithColumnIds() method. Before this patch, TSAN would report warnings like the one below when running the MultiThreadedHybridClockTabletTest.UpdateNoMergeCompaction scenario of mt-tablet-test:

  Read of size 8 at 0x7b3400014780 by thread T30 (mutexes: write M762932787594459152, write M7098002):
    #0 kudu::tablet::RowSetMetadata::id() const src/kudu/tablet/rowset_metadata.h:100:31 (libtablet.so+0x346faa)
    #1 kudu::tablet::RowSetTree::Reset(...) src/kudu/tablet/rowset_tree.cc:190:48 (libtablet.so+0x4bf666)
    #2 kudu::tablet::Tablet::ModifyRowSetTree(...) src/kudu/tablet/tablet.cc:1490:3 (libtablet.so+0x323755)
    #3 kudu::tablet::Tablet::AtomicSwapRowSetsUnlocked(...) src/kudu/tablet/tablet.cc:1504:3 (libtablet.so+0x3239bc)
    #4 kudu::tablet::Tablet::AtomicSwapRowSets(...) src/kudu/tablet/tablet.cc:1496:3 (libtablet.so+0x3238f9)
    ...
  Previous write of size 8 at 0x7b3400014780 by thread T12 (mutexes: write M625572878699880144, write M530715863088620288, write M525367769810683784):
    #0 kudu::tablet::RowSetMetadata::LoadFromPB(...) src/kudu/tablet/rowset_metadata.cc:77:7 (libtablet.so+0x4f9f03)
    #1 kudu::tablet::DiskRowSet::MajorCompactDeltaStoresWithColumnIds(...)::$_0::operator()() const src/kudu/tablet/diskrowset.cc:603:23 (libtablet.so+0x46eddf)
    #2 kudu::ScopedCleanup<kudu::tablet::DiskRowSet::MajorCompactDeltaStoresWithColumnIds(...)::$_0>::~ScopedCleanup() src/kudu/util/scoped_cleanup.h:51:7 (libtablet.so+0x46cc5a)
    #3 kudu::tablet::DiskRowSet::MajorCompactDeltaStoresWithColumnIds(...) src/kudu/tablet/diskrowset.cc:636:1 (libtablet.so+0x46c5c9)
    #4 kudu::tablet::DiskRowSet::MajorCompactDeltaStores(...) src/kudu/tablet/diskrowset.cc:570:10 (libtablet.so+0x46c013)
    ...
SUMMARY: ThreadSanitizer: data race src/kudu/tablet/rowset_metadata.h:100:31 in kudu::tablet::RowSetMetadata::id() const

Change-Id: I4b09575616e754b7dbb24586293f128e361b9360
Reviewed-on: http://gerrit.cloudera.org:8080/21779
Reviewed-by: Mahesh Reddy <[email protected]>
Tested-by: Alexey Serbin <[email protected]>
Reviewed-by: Yingchun Lai <[email protected]>
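In the trace above, the rollback runs from a scope-exit cleanup and writes `id_` without the synchronization used by the reader. The sketch below shows the general pattern under stated assumptions: `ScopeGuard` is a hypothetical minimal reimplementation in the spirit of kudu::ScopedCleanup (not Kudu's header), and `RowSetMetaSketch` and `CompactWithRollback()` are hypothetical simplifications. The key point is that the rollback write goes through the same lock as every read.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <mutex>
#include <utility>

// Hypothetical minimal scope guard: runs the callback on scope exit
// unless cancel() was called first.
class ScopeGuard {
 public:
  explicit ScopeGuard(std::function<void()> f) : f_(std::move(f)) {}
  ~ScopeGuard() {
    if (f_) f_();
  }
  void cancel() { f_ = nullptr; }
  ScopeGuard(const ScopeGuard&) = delete;
  ScopeGuard& operator=(const ScopeGuard&) = delete;

 private:
  std::function<void()> f_;
};

// Hypothetical, simplified rowset metadata: every access to id_,
// including the rollback write, takes the same lock.
class RowSetMetaSketch {
 public:
  int64_t id() const {
    std::lock_guard<std::mutex> l(lock_);
    return id_;
  }
  void set_id(int64_t id) {
    std::lock_guard<std::mutex> l(lock_);
    id_ = id;
  }

 private:
  mutable std::mutex lock_;
  int64_t id_ = 7;
};

// Hypothetical compaction step: on failure the guard restores the old id.
// Because set_id() locks, a concurrent id() reader cannot race with it.
bool CompactWithRollback(RowSetMetaSketch* meta, bool fail) {
  const int64_t saved = meta->id();
  ScopeGuard rollback([meta, saved] { meta->set_id(saved); });
  meta->set_id(saved + 100);
  if (fail) return false;  // guard fires on scope exit: id restored
  rollback.cancel();       // success: keep the new id
  return true;
}
```

A failed call leaves the id unchanged, and a successful one keeps the new value.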
zhangyifan27 pushed a commit that referenced this pull request on Feb 5, 2025:

The thread pool of the DNS resolver should be shut down along with the messenger in ServerBase to prevent retries of RPCs that failed as collateral damage of the shutdown in progress. Those RPCs might be retried by invoking rpc::Proxy::RefreshDnsAndEnqueueRequest(), etc.

On a related note, I also added a guard to protect ThreadPool::tokens_ in the destructor of the ThreadPool class, as is done elsewhere. I also snuck in an update to call DCHECK() in a loop only when the DCHECK_IS_ON() macro evaluates to 'true'.

This addresses flakiness reported in at least one of the RemoteKsckTest scenarios (e.g., TestFilterOnNotabletTable in [1]). One of the related TSAN reports looked like the one below:

RemoteKsckTest.TestFilterOnNotabletTable:
WARNING: ThreadSanitizer: data race
  Read of size 8 at 0x7b54001e5118 by main thread:
    #0 std::__1::__hash_table<kudu::ThreadPoolToken*, ...>::size() const
    #1 std::__1::unordered_set<kudu::ThreadPoolToken*, ...>::size() const
    #2 kudu::ThreadPool::~ThreadPool()
    ...
    #6 kudu::kserver::KuduServer::~KuduServer()
    #7 kudu::tserver::TabletServer::~TabletServer()
    ...
  Previous write of size 8 at 0x7b54001e5118 by thread T262 ...:
    #0 std::__1::__hash_table<kudu::ThreadPoolToken*, ...>::remove(...)
    ...
    #4 kudu::ThreadPool::ReleaseToken(...)
    #5 kudu::ThreadPoolToken::~ThreadPoolToken()
    ...
    #24 kudu::consensus::LeaderElection::~LeaderElection()
    ...
    #35 kudu::rpc::Proxy::RefreshDnsAndEnqueueRequest(...)
    ...
    #41 kudu::DnsResolver::RefreshAddressesAsync()
    ...
  Thread T262 'dns-resolver [w' (tid=29102, running) created by thread T182 at:
    #0 pthread_create
    #1 kudu::Thread::StartThread(...)
    #2 kudu::Thread::Create(...)
    #3 kudu::ThreadPool::CreateThread()
    #4 kudu::ThreadPool::DoSubmit(..., kudu::ThreadPoolToken*)
    #5 kudu::ThreadPool::Submit(...)
    #6 kudu::DnsResolver::RefreshAddressesAsync(..)
    #7 kudu::rpc::Proxy::RefreshDnsAndEnqueueRequest(...)
    #8 kudu::rpc::Proxy::AsyncRequest(...)
    ...
    #15 kudu::rpc::OutboundCall::CallCallback()
    #16 kudu::rpc::OutboundCall::SetFailed()
    #17 kudu::rpc::Connection::Shutdown()
    #18 kudu::rpc::ReactorThread::ShutdownInternal()
    ...
    #25 kudu::rpc::ReactorThread::RunThread()
    ...

[1] http://dist-test.cloudera.org:8080/test_drilldown?test_name=ksck_remote-test

Change-Id: I525f1078a349dbd2926938bb4fcc3e80888dfbb4
Reviewed-on: http://gerrit.cloudera.org:8080/22434
Tested-by: Alexey Serbin <[email protected]>
Reviewed-by: Abhishek Chennaka <[email protected]>
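The two smaller changes mentioned in this commit (guarding the token set in the pool's destructor, and running a check-only loop only in debug builds) can be sketched as follows. This is a hypothetical simplification, not Kudu's ThreadPool: `PoolSketch` and `TokenSketch` are illustrative names, and `#ifndef NDEBUG` with `assert()` stands in for glog's DCHECK_IS_ON() and DCHECK().

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <unordered_set>
#include <vector>

struct TokenSketch {};  // hypothetical stand-in for ThreadPoolToken

class PoolSketch {
 public:
  ~PoolSketch() {
    // Take the same lock as every other access to tokens_; an unguarded
    // read of the set here is the kind of access TSAN flagged.
    std::lock_guard<std::mutex> l(lock_);
#ifndef NDEBUG
    // The loop exists only to run debug checks, so it is compiled out of
    // release builds instead of iterating with no-op checks.
    for (const TokenSketch* t : tokens_) {
      assert(t != nullptr);
    }
#endif
  }

  void AddToken(TokenSketch* t) {
    std::lock_guard<std::mutex> l(lock_);
    tokens_.insert(t);
  }

  // May run on another thread, e.g. from a token's destructor.
  void ReleaseToken(TokenSketch* t) {
    std::lock_guard<std::mutex> l(lock_);
    tokens_.erase(t);
  }

  size_t num_tokens() const {
    std::lock_guard<std::mutex> l(lock_);
    return tokens_.size();
  }

 private:
  mutable std::mutex lock_;
  std::unordered_set<TokenSketch*> tokens_;
};
```

The lock only makes the destructor's read well-defined; it cannot make a ReleaseToken() call arriving after destruction safe. Preventing such late calls is the job of the ordering fix described above, which shuts down the DNS resolver's pool along with the messenger.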
No description provided.