[catalog_manager] KUDU-3344 clean up deleted tables and tablets #3

Merged
merged 1 commit into from
Jan 7, 2022

Conversation

zhangyifan27
Owner

No description provided.

zhangyifan27 force-pushed the dev branch 2 times, most recently from 1a45cd8 to ef3171c on January 7, 2022 10:15
Change-Id: Idefa2ee2f5108ba913fe0057a4061c3c28351547
zhangyifan27 changed the title from "[catalog_manager] KUDU-3344 add a task to cleanup deleted entries" to "[catalog_manager] KUDU-3344 clean up deleted tables and tablets" on Jan 7, 2022
zhangyifan27 merged commit abe4ecf into KUDU-3344 on Jan 7, 2022
zhangyifan27 pushed a commit that referenced this pull request May 4, 2023
Since the original implementation stored the random integer used for
replica selection in a variable that was initialized statically, the
corresponding calls to libstdc++/libc++ runtime had been issued before
the process called the main() function.  That means some SSE4.2-specific
instructions might be called since libkudu_client is unconditionally
compiled with -msse4.2 flag, and there'd been no chance to call
KuduClientBuilder::Build() that would verify the required features are
present by calling CheckCPUFlags().  As a result, an attempt to run
an application linked with the kudu_client library on a machine lacking
SSE4.2 support would result in a crash with a SIGILL signal and a stack
trace like the one below:

  #0  0x00007fc4b1b58162 in std::mersenne_twister_engine<...>::_M_gen_rand
      at include/c++/7.5.0/bits/random.tcc:408
  #1  std::mersenne_twister_engine<...>::operator()
      at include/c++/7.5.0/bits/random.tcc:459
  #2  0x00007fc4b1b1d65d in kudu::client::(anonymous namespace)::InitRandomSelectionInt
      at ../../../../../src/kudu/client/client-internal.cc:196
  #3  0x00007fc4b1b1d6ef in __static_initialization_and_destruction_0
      at ../../../../../src/kudu/client/client-internal.cc:198
  #4  _GLOBAL__sub_I_client_internal.cc(void)
      at ../../../../../src/kudu/client/client-internal.cc:871

This patch addresses that deficiency, so now instead of unexpectedly
crashing, the application would return an error upon an attempt to
create an instance of the KuduClient object.
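
The difference between static initialization and deferred initialization
can be sketched as follows.  This is a minimal illustration, not the actual
Kudu change: CpuFeaturesPresent(), GetRandomSelectionInt() and BuildClient()
are hypothetical names, and the CPU check is a settable stub rather than a
real CPUID query.  The idea is that a function-local static is initialized
on the first call instead of before main(), so the feature check gets
a chance to run first.

```cpp
#include <cassert>
#include <random>
#include <string>

// Hypothetical stand-in for a CPU feature check such as CheckCPUFlags().
// A real check would query the CPU; here it is a settable flag.
inline bool& CpuFeaturesPresent() {
  static bool present = true;
  return present;
}

// Problematic pattern (shown only as a comment): a namespace-scope variable
// whose initializer runs before main(), i.e. before any check can happen.
//   static const int kRandomSelectionInt = InitRandomSelectionInt();

// Deferred alternative: the function-local static is initialized on the
// first call, which the caller makes only after verifying the CPU.
inline int GetRandomSelectionInt() {
  static const int value = [] {
    std::mt19937 gen(std::random_device{}());
    return static_cast<int>(gen() & 0x7fffffff);
  }();
  return value;
}

// Hypothetical builder entry point: returns an error message instead of
// letting the process crash.  An empty string means success.
inline std::string BuildClient(int* selection_out) {
  if (!CpuFeaturesPresent()) {
    return "required CPU features are not supported";
  }
  *selection_out = GetRandomSelectionInt();
  return "";
}
```

With this shape, a caller on an unsupported machine gets a non-empty error
from BuildClient() and the random engine is never touched.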

This is a follow-up to ccbbfb3.

Change-Id: I11c2a29ef69a8c97c68330d261fdff64accebb0b
Reviewed-on: http://gerrit.cloudera.org:8080/19828
Reviewed-by: Abhishek Chennaka <[email protected]>
Reviewed-by: Wenzhe Zhou <[email protected]>
Tested-by: Alexey Serbin <[email protected]>
zhangyifan27 pushed a commit that referenced this pull request Oct 25, 2023
This update helps to prevent SIGSEGV in libunwind when running Kudu on
aarch64 (in particular, Graviton3 instances in EC2).  An example stack
trace is shown below; it is similar to the stack mentioned in [1]:

  #0  access_mem (as=0x3304418 <local_addr_space>, addr=7745970402396146688,
      val=0xfffff325ca18, write=0, arg=0xfffff325ce70)
      at thirdparty/src/libunwind-1.6.2/src/aarch64/Ginit.c:337
  #1  0x0000000000a97ac0 in is_plt_entry (c=0xfffff325ce70)
      at thirdparty/src/libunwind-1.6.2/src/aarch64/Gstep.c:43
  #2  0x0000000000a97fdc in _ULaarch64_step (cursor=0xfffff325ce70)
      at thirdparty/src/libunwind-1.6.2/src/aarch64/Gstep.c:171
  #3  0x00000000025050c8 in kudu::StackTrace::Collect (
      this=this@entry=0xfffff325d7d8, skip_frames=skip_frames@entry=0)
      at src/kudu/util/debug-util.cc:612
  #4  0x0000000002507f64 in kudu::StackTrace::Collect (
      this=this@entry=0xfffff325d7d8, skip_frames=skip_frames@entry=0)
      at src/kudu/util/debug-util.cc:579

[1] libunwind/libunwind#260

Change-Id: Ie34dc56f78abba537aa15dd3d9c0540157d9afa3
Reviewed-on: http://gerrit.cloudera.org:8080/20540
Tested-by: Kudu Jenkins
Reviewed-by: Michael Smith <[email protected]>
Reviewed-by: Mahesh Reddy <[email protected]>
Reviewed-by: Abhishek Chennaka <[email protected]>
zhangyifan27 pushed a commit that referenced this pull request Jun 12, 2024
It turned out that the auto leader rebalancing task wasn't explicitly
shut down upon shutting down the catalog manager.  That led to race
conditions as reported by TSAN, at least in test scenarios (see below).
This patch addresses the issue.

  WARNING: ThreadSanitizer: data race (pid=23827)
    Write of size 1 at 0x7b4000008208 by main thread:
      #0 AnnotateRWLockDestroy thirdparty/src/llvm-11.0.0.src/projects/compiler-rt/lib/tsan/rtl/tsan_interface_ann.cpp:264 (auto_rebalancer-test+0x33575e)
      #1 kudu::rw_spinlock::~rw_spinlock() src/kudu/util/locks.h:89:5 (libmaster.so+0x359376)
      #2 kudu::master::TSManager::~TSManager() src/kudu/master/ts_manager.cc:108:1 (libmaster.so+0x4ad201)
      #3 kudu::master::TSManager::~TSManager() src/kudu/master/ts_manager.cc:107:25 (libmaster.so+0x4ad229)
      #4 std::__1::default_delete<kudu::master::TSManager>::operator()(kudu::master::TSManager*) const thirdparty/installed/tsan/include/c++/v1/memory:2262:5 (libmaster.so+0x407ce7)
      #5 std::__1::unique_ptr<kudu::master::TSManager, std::__1::default_delete<kudu::master::TSManager> >::reset(kudu::master::TSManager*) thirdparty/installed/tsan/include/c++/v1/memory:2517:7 (libmaster.so+0x40157d)
      #6 std::__1::unique_ptr<kudu::master::TSManager, std::__1::default_delete<kudu::master::TSManager> >::~unique_ptr() thirdparty/installed/tsan/include/c++/v1/memory:2471:19 (libmaster.so+0x4015eb)
      #7 kudu::master::Master::~Master() src/kudu/master/master.cc:263:1 (libmaster.so+0x3f7a4a)
      #8 kudu::master::Master::~Master() src/kudu/master/master.cc:261:19 (libmaster.so+0x3f7dc9)
      #9 std::__1::default_delete<kudu::master::Master>::operator()(kudu::master::Master*) const thirdparty/installed/tsan/include/c++/v1/memory:2262:5 (libmaster.so+0x435627)
      #10 std::__1::unique_ptr<kudu::master::Master, std::__1::default_delete<kudu::master::Master> >::reset(kudu::master::Master*) thirdparty/installed/tsan/include/c++/v1/memory:2517:7 (libmaster.so+0x42e6ed)
      #11 kudu::master::MiniMaster::Shutdown() src/kudu/master/mini_master.cc:120:13 (libmaster.so+0x4c2612)
    ...
    Previous atomic write of size 4 at 0x7b4000008208 by thread T439 (mutexes: write M1141235379631443968):
      #0 __tsan_atomic32_compare_exchange_strong thirdparty/src/llvm-11.0.0.src/projects/compiler-rt/lib/tsan/rtl/tsan_interface_atomic.cpp:780 (auto_rebalancer-test+0x33eb60)
      #1 base::subtle::Release_CompareAndSwap(int volatile*, int, int) /src/kudu/gutil/atomicops-internals-tsan.h:88:3 (libmaster.so+0x2e2b34)
      #2 kudu::rw_semaphore::unlock_shared() src/kudu/util/rw_semaphore.h:91:19 (libmaster.so+0x2e29c8)
      #3 kudu::rw_spinlock::unlock_shared() src/kudu/util/locks.h:99:10 (libmaster.so+0x2e28ef)
      #4 std::__1::shared_lock<kudu::rw_spinlock>::~shared_lock() /thirdparty/installed/tsan/include/c++/v1/shared_mutex:369:19 (libmaster.so+0x2e23e0)
      #5 kudu::master::TSManager::GetAllDescriptors(std::__1::vector<std::__1::shared_ptr<kudu::master::TSDescriptor>, std::__1::allocator<std::__1::shared_ptr<kudu::master::TSDescriptor> > >*) const src/kudu/master/ts_manager.cc:206:1 (libmaster.so+0x4adeb6)
      #6 kudu::master::AutoLeaderRebalancerTask::RunLeaderRebalancer() src/kudu/master/auto_leader_rebalancer.cc:405:16 (libmaster.so+0x2fb51b)
      #7 kudu::master::AutoLeaderRebalancerTask::RunLoop() src/kudu/master/auto_leader_rebalancer.cc:445:7 (libmaster.so+0x2fbaa9)

This is a follow-up to 10efaf2.
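
The general shape of such a fix can be sketched as follows.  This is an
illustration under assumptions, not the code from the patch: PeriodicTask
is a hypothetical class, and the shared atomic stands in for state owned
by another component (TSManager in the report above).  Shutdown() signals
the loop to stop and joins the thread, so the task cannot touch the shared
state once Shutdown() has returned.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical periodic task that reads state owned by someone else.
class PeriodicTask {
 public:
  explicit PeriodicTask(const std::atomic<int>* shared) : shared_(shared) {}
  ~PeriodicTask() { Shutdown(); }

  void Start() {
    thread_ = std::thread([this] { RunLoop(); });
  }

  // Idempotent: asks the loop to stop and joins the thread, so no access
  // to *shared_ can happen after this returns.
  void Shutdown() {
    {
      std::lock_guard<std::mutex> l(mu_);
      stop_ = true;
    }
    cv_.notify_all();
    if (thread_.joinable()) thread_.join();
  }

  int runs() const { return runs_.load(); }

 private:
  void RunLoop() {
    std::unique_lock<std::mutex> l(mu_);
    while (!stop_) {
      l.unlock();
      (void)shared_->load();  // stands in for reading the shared state
      runs_.fetch_add(1);
      l.lock();
      // Sleep until the next round or until Shutdown() is called.
      cv_.wait_for(l, std::chrono::milliseconds(1), [this] { return stop_; });
    }
  }

  const std::atomic<int>* shared_;
  std::mutex mu_;
  std::condition_variable cv_;
  bool stop_ = false;
  std::atomic<int> runs_{0};
  std::thread thread_;
};
```

An owner would call Shutdown() on the task before destroying anything the
task reads, rather than relying on member destruction order.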

Change-Id: Iccd66d00280d22b37386230874937e5260f07f3b
Reviewed-on: http://gerrit.cloudera.org:8080/21417
Reviewed-by: Wang Xixu <[email protected]>
Tested-by: Alexey Serbin <[email protected]>
Reviewed-by: Yifan Zhang <[email protected]>
zhangyifan27 pushed a commit that referenced this pull request Jun 12, 2024
This patch addresses a race reported by TSAN with traces like below:

WARNING: ThreadSanitizer: data race (pid=11024)
  Write of size 8 at 0x7b580011f260 by thread T174:
    #0 kudu::tablet::OpState::set_start_time(kudu::MonoTime) src/kudu/tablet/ops/op.h:274:58
    #1 kudu::tablet::WriteOp::Start() src/kudu/tablet/ops/write_op.cc:273:11
    #2 kudu::tablet::OpDriver::Prepare() src/kudu/tablet/ops/op_driver.cc:329:7
    #3 kudu::tablet::OpDriver::PrepareTask() src/kudu/tablet/ops/op_driver.cc:249:31
    ...

  Previous read of size 8 at 0x7b580011f260 by thread T5 (mutexes: write M835553159786377312):
    #0 kudu::tablet::OpState::start_time() const src/kudu/tablet/ops/op.h:272:40
    #1 kudu::tablet::WriteOp::ToString() const src/kudu/tablet/ops/write_op.cc:378:36
    #2 kudu::tablet::OpDriver::ToStringUnlocked() const src/kudu/tablet/ops/op_driver.cc:209:23
    #3 kudu::tablet::OpDriver::ToString() const src/kudu/tablet/ops/op_driver.cc:203:10
    #4 kudu::tablet::TabletReplica::GetInFlightOps(...) const src/kudu/tablet/tablet_replica.cc:728:41
    #5 kudu::tserver::TabletServerPathHandlers::HandleTransactionsPage(...) src/kudu/tserver/tserver_path_handlers.cc:286:14
    ...
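
The report shows an unsynchronized setter racing with a getter on the same
8-byte field.  The commit message does not say how the patch resolves it;
one common remedy, sketched here under that assumption, is to make the
field atomic.  OpStateSketch and its microsecond timestamp are hypothetical
and only mirror the shape of the accessors in the trace.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical sketch: the start time is stored as an atomic integer so a
// reader on another thread never observes a torn or racy value.
class OpStateSketch {
 public:
  void set_start_time_us(int64_t t) {
    start_time_us_.store(t, std::memory_order_release);
  }

  int64_t start_time_us() const {
    return start_time_us_.load(std::memory_order_acquire);
  }

  // True once the setter has been called with a non-sentinel value.
  bool started() const { return start_time_us() != kUnset; }

 private:
  static constexpr int64_t kUnset = -1;
  std::atomic<int64_t> start_time_us_{kUnset};
};
```

A mutex shared by the writer and the reader would work as well; the atomic
keeps the read path lock-free.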

Change-Id: I52de0840aa20f64cf15c7a9da2d553257c7e85e7
Reviewed-on: http://gerrit.cloudera.org:8080/21427
Tested-by: Kudu Jenkins
Reviewed-by: Abhishek Chennaka <[email protected]>
zhangyifan27 pushed a commit that referenced this pull request Oct 8, 2024
Since the original implementation stored the random integer used for
replica selection in a variable that was initialized statically, the
corresponding calls to libstdc++/libc++ runtime had been issued before
the process called the main() function.  That means some SSE4.2-specific
instructions might be called since libkudu_client is unconditionally
compiled with -msse4.2 flag, and there'd been no chance to call
KuduClientBuilder::Build() that would verify the required features are
present by calling CheckCPUFlags().  As a result, an attempt to run
an application linked with the kudu_client library on a machine lacking
SSE4.2 support would result in a crash with a SIGILL signal and a stack
trace like the one below:

  #0  0x00007fc4b1b58162 in std::mersenne_twister_engine<...>::_M_gen_rand
      at include/c++/7.5.0/bits/random.tcc:408
  #1  std::mersenne_twister_engine<...>::operator()
      at include/c++/7.5.0/bits/random.tcc:459
  #2  0x00007fc4b1b1d65d in kudu::client::(anonymous namespace)::InitRandomSelectionInt
      at ../../../../../src/kudu/client/client-internal.cc:196
  #3  0x00007fc4b1b1d6ef in __static_initialization_and_destruction_0
      at ../../../../../src/kudu/client/client-internal.cc:198
  #4  _GLOBAL__sub_I_client_internal.cc(void)
      at ../../../../../src/kudu/client/client-internal.cc:871

This patch addresses that deficiency, so now instead of unexpectedly
crashing, the application would return an error upon an attempt to
create an instance of the KuduClient object.

This is a follow-up to ccbbfb3.

Change-Id: I11c2a29ef69a8c97c68330d261fdff64accebb0b
Reviewed-on: http://gerrit.cloudera.org:8080/19828
Reviewed-by: Abhishek Chennaka <[email protected]>
Reviewed-by: Wenzhe Zhou <[email protected]>
Tested-by: Alexey Serbin <[email protected]>
Reviewed-on: http://gerrit.cloudera.org:8080/19948
Reviewed-by: Yingchun Lai <[email protected]>
Tested-by: Kudu Jenkins
Reviewed-by: Yuqi Du <[email protected]>
Reviewed-by: Yifan Zhang <[email protected]>
zhangyifan27 pushed a commit that referenced this pull request Oct 8, 2024
This update helps to prevent SIGSEGV in libunwind when running Kudu on
aarch64 (in particular, Graviton3 instances in EC2).  An example stack
trace is shown below; it is similar to the stack mentioned in [1]:

  #0  access_mem (as=0x3304418 <local_addr_space>, addr=7745970402396146688,
      val=0xfffff325ca18, write=0, arg=0xfffff325ce70)
      at thirdparty/src/libunwind-1.6.2/src/aarch64/Ginit.c:337
  #1  0x0000000000a97ac0 in is_plt_entry (c=0xfffff325ce70)
      at thirdparty/src/libunwind-1.6.2/src/aarch64/Gstep.c:43
  #2  0x0000000000a97fdc in _ULaarch64_step (cursor=0xfffff325ce70)
      at thirdparty/src/libunwind-1.6.2/src/aarch64/Gstep.c:171
  #3  0x00000000025050c8 in kudu::StackTrace::Collect (
      this=this@entry=0xfffff325d7d8, skip_frames=skip_frames@entry=0)
      at src/kudu/util/debug-util.cc:612
  #4  0x0000000002507f64 in kudu::StackTrace::Collect (
      this=this@entry=0xfffff325d7d8, skip_frames=skip_frames@entry=0)
      at src/kudu/util/debug-util.cc:579

[1] libunwind/libunwind#260

Change-Id: Ie34dc56f78abba537aa15dd3d9c0540157d9afa3
Reviewed-on: http://gerrit.cloudera.org:8080/20540
Tested-by: Kudu Jenkins
Reviewed-by: Michael Smith <[email protected]>
Reviewed-by: Mahesh Reddy <[email protected]>
Reviewed-by: Abhishek Chennaka <[email protected]>
(cherry picked from commit dd5fd45)
Reviewed-on: http://gerrit.cloudera.org:8080/20542
zhangyifan27 pushed a commit that referenced this pull request Oct 11, 2024
The race condition was reported by TSAN as follows
(with some information omitted):

  WARNING: ThreadSanitizer: data race (pid=1924273)
    Write of size 8 at 0x7b30002fe7c0 by thread T6 (mutexes: write M247597861, write M247597860, write M247597300):
      #0 std::__1::enable_if<(...), void>::type std::__1::swap<kudu::BlockId*>(...) thirdparty/installed/tsan/include/c++/v1/type_traits:4076:9
      ...
      #4 kudu::tablet::RowSetMetadata::CommitRedoDeltaDataBlock(...) src/kudu/tablet/rowset_metadata.cc:197:22
      #5 kudu::tablet::DeltaTracker::FlushDMS(...) src/kudu/tablet/delta_tracker.cc:826:23
      #6 kudu::tablet::DeltaTracker::Flush(...) src/kudu/tablet/delta_tracker.cc:877:14
      #7 kudu::tablet::DiskRowSet::FlushDeltas(...) src/kudu/tablet/diskrowset.cc:552:26
      ...

    Previous read of size 8 at 0x7b30002fe7c0 by thread T34 (mutexes: write M247598319, write M919714229363433616, write M303002710007881612):
      #0 std::__1::vector<...>::size() const thirdparty/installed/tsan/include/c++/v1/vector:658:61
      #1 kudu::tablet::RowSetMetadata::GetAllBlocks() const src/kudu/tablet/rowset_metadata.cc:306:37
      #2 kudu::tablet::TabletMetadata::UpdateUnlocked(...) src/kudu/tablet/tablet_metadata.cc:677:40
      #3 kudu::tablet::TabletMetadata::UpdateAndFlush(...) src/kudu/tablet/tablet_metadata.cc:549:5
      #4 kudu::tablet::Tablet::FlushMetadata(...) src/kudu/tablet/tablet.cc:1992:21
      #5 kudu::tablet::Tablet::HandleEmptyCompactionOrFlush() src/kudu/tablet/tablet.cc:2308:3
      #6 kudu::tablet::Tablet::DeleteAncientDeletedRowsets() src/kudu/tablet/tablet.cc:3084:3
      ...

Change-Id: I07103269526d0ee98b0bb19e76e11f7d47a5b217
Reviewed-on: http://gerrit.cloudera.org:8080/21799
Reviewed-by: Abhishek Chennaka <[email protected]>
Tested-by: Alexey Serbin <[email protected]>
zhangyifan27 pushed a commit that referenced this pull request Oct 11, 2024
This patch fixes a race in access to the RowSetMetadata::id_ field
in the rollback scenario in the MajorCompactDeltaStoresWithColumnIds()
method of the DiskRowSet class.

Before this patch, TSAN would report warnings like below when running
the MultiThreadedHybridClockTabletTest.UpdateNoMergeCompaction scenario
of the mt-tablet-test:

    Read of size 8 at 0x7b3400014780 by thread T30 (mutexes: write M762932787594459152, write M7098002):
      #0 kudu::tablet::RowSetMetadata::id() const src/kudu/tablet/rowset_metadata.h:100:31 (libtablet.so+0x346faa)
      #1 kudu::tablet::RowSetTree::Reset(...) src/kudu/tablet/rowset_tree.cc:190:48 (libtablet.so+0x4bf666)
      #2 kudu::tablet::Tablet::ModifyRowSetTree(...) src/kudu/tablet/tablet.cc:1490:3 (libtablet.so+0x323755)
      #3 kudu::tablet::Tablet::AtomicSwapRowSetsUnlocked(...) src/kudu/tablet/tablet.cc:1504:3 (libtablet.so+0x3239bc)
      #4 kudu::tablet::Tablet::AtomicSwapRowSets(...) src/kudu/tablet/tablet.cc:1496:3 (libtablet.so+0x3238f9)
      ...

    Previous write of size 8 at 0x7b3400014780 by thread T12 (mutexes: write M625572878699880144, write M530715863088620288, write M525367769810683784):
      #0 kudu::tablet::RowSetMetadata::LoadFromPB(...) src/kudu/tablet/rowset_metadata.cc:77:7 (libtablet.so+0x4f9f03)
      #1 kudu::tablet::DiskRowSet::MajorCompactDeltaStoresWithColumnIds(...)::$_0::operator()() const src/kudu/tablet/diskrowset.cc:603:23 (libtablet.so+0x46eddf)
      #2 kudu::ScopedCleanup<kudu::tablet::DiskRowSet::MajorCompactDeltaStoresWithColumnIds(...)::$_0>::~ScopedCleanup() src/kudu/util/scoped_cleanup.h:51:7 (libtablet.so+0x46cc5a)
      #3 kudu::tablet::DiskRowSet::MajorCompactDeltaStoresWithColumnIds(...) src/kudu/tablet/diskrowset.cc:636:1 (libtablet.so+0x46c5c9)
      #4 kudu::tablet::DiskRowSet::MajorCompactDeltaStores(...) src/kudu/tablet/diskrowset.cc:570:10 (libtablet.so+0x46c013)
      ...

  SUMMARY: ThreadSanitizer: data race src/kudu/tablet/rowset_metadata.h:100:31 in kudu::tablet::RowSetMetadata::id() const
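
In the report above, the write comes from a cleanup handler that runs on
the rollback path while another thread reads the same field.  The commit
message does not describe the fix itself; the sketch below shows one
possible approach under that caveat, where the rollback write and the
reader take the same lock.  ScopeGuard, MetadataSketch and TryUpdate() are
hypothetical names, not Kudu classes.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <mutex>
#include <utility>

// Minimal scope guard: runs the callback on scope exit unless cancelled.
class ScopeGuard {
 public:
  explicit ScopeGuard(std::function<void()> f) : f_(std::move(f)) {}
  ~ScopeGuard() { if (f_) f_(); }
  void cancel() { f_ = nullptr; }
  ScopeGuard(const ScopeGuard&) = delete;
  ScopeGuard& operator=(const ScopeGuard&) = delete;

 private:
  std::function<void()> f_;
};

// Hypothetical metadata object: every access to id_ goes through mu_,
// including the write performed during rollback.
class MetadataSketch {
 public:
  explicit MetadataSketch(int64_t id) : id_(id) {}

  int64_t id() const {
    std::lock_guard<std::mutex> l(mu_);
    return id_;
  }

  void set_id(int64_t id) {
    std::lock_guard<std::mutex> l(mu_);
    id_ = id;
  }

 private:
  mutable std::mutex mu_;
  int64_t id_;
};

// Hypothetical operation: applies a change and rolls it back on failure.
inline bool TryUpdate(MetadataSketch* m, int64_t new_id, bool succeed) {
  const int64_t saved = m->id();
  ScopeGuard rollback([m, saved] { m->set_id(saved); });
  m->set_id(new_id);
  if (!succeed) return false;  // the guard restores the saved id
  rollback.cancel();
  return true;
}
```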

Change-Id: I4b09575616e754b7dbb24586293f128e361b9360
Reviewed-on: http://gerrit.cloudera.org:8080/21779
Reviewed-by: Mahesh Reddy <[email protected]>
Tested-by: Alexey Serbin <[email protected]>
Reviewed-by: Yingchun Lai <[email protected]>
zhangyifan27 pushed a commit that referenced this pull request Feb 5, 2025
The thread pool of the DNS resolver should be shut down along with the
messenger in ServerBase to prevent retrying of RPCs that failed as a
collateral of the shutdown process in progress.  Those RPCs might be
retried by invoking rpc::Proxy::RefreshDnsAndEnqueueRequest(), etc.

On a related note, I also added a guard to protect ThreadPool::tokens_
in the destructor of the ThreadPool class, as elsewhere.  I also snuck
in an update to call DCHECK() in a loop only when DCHECK_IS_ON()
macro evaluates to 'true'.
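
The two smaller changes mentioned in this paragraph can be illustrated with
a sketch.  It is not the ThreadPool code: TokenPoolSketch is a hypothetical
class, and SKETCH_DCHECK_IS_ON() is a local stand-in keyed off NDEBUG,
since the real DCHECK_IS_ON() macro comes from the logging library.  The
destructor takes the same lock as the other accessors, and the checking
loop is compiled only when debug checks are enabled.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <unordered_set>

// Local stand-in for DCHECK_IS_ON(): 1 when assert() is active.
#ifdef NDEBUG
#define SKETCH_DCHECK_IS_ON() 0
#else
#define SKETCH_DCHECK_IS_ON() 1
#endif

class TokenPoolSketch {
 public:
  ~TokenPoolSketch() {
    // Take the lock in the destructor too, as everywhere else that
    // touches tokens_.
    std::lock_guard<std::mutex> l(mu_);
#if SKETCH_DCHECK_IS_ON()
    // Debug-only loop: compiled out entirely in release builds instead of
    // iterating over the set just to evaluate no-op checks.
    for (int id : tokens_) {
      assert(id >= 0);
    }
#endif
  }

  int AddToken() {
    std::lock_guard<std::mutex> l(mu_);
    const int id = next_id_++;
    tokens_.insert(id);
    return id;
  }

  void ReleaseToken(int id) {
    std::lock_guard<std::mutex> l(mu_);
    tokens_.erase(id);
  }

  std::size_t num_tokens() const {
    std::lock_guard<std::mutex> l(mu_);
    return tokens_.size();
  }

 private:
  mutable std::mutex mu_;
  std::unordered_set<int> tokens_;
  int next_id_ = 0;
};
```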

This addresses flakiness reported in at least one of the RemoteKsckTest
scenarios (e.g., TestFilterOnNotabletTable in [1]).  One of the related
TSAN reports looked like the one below:

RemoteKsckTest.TestFilterOnNotabletTable: WARNING: ThreadSanitizer: data race
  Read of size 8 at 0x7b54001e5118 by main thread:
    #0 std::__1::__hash_table<kudu::ThreadPoolToken*, ...>::size() const
    #1 std::__1::unordered_set<kudu::ThreadPoolToken*, ...>::size() const
    #2 kudu::ThreadPool::~ThreadPool()
    ...
    #6 kudu::kserver::KuduServer::~KuduServer()
    #7 kudu::tserver::TabletServer::~TabletServer()
    ...

  Previous write of size 8 at 0x7b54001e5118 by thread T262 ...:
    #0 std::__1::__hash_table<kudu::ThreadPoolToken*, ...>::remove(...)
    ...
    #4 kudu::ThreadPool::ReleaseToken(...)
    #5 kudu::ThreadPoolToken::~ThreadPoolToken()
    ...
    #24 kudu::consensus::LeaderElection::~LeaderElection()
    ...
    #35 kudu::rpc::Proxy::RefreshDnsAndEnqueueRequest(...)
    ...
    #41 kudu::DnsResolver::RefreshAddressesAsync()
    ...

  Thread T262 'dns-resolver [w' (tid=29102, running) created by thread T182 at:
    #0 pthread_create
    #1 kudu::Thread::StartThread(...)
    #2 kudu::Thread::Create(...)
    #3 kudu::ThreadPool::CreateThread()
    #4 kudu::ThreadPool::DoSubmit(..., kudu::ThreadPoolToken*)
    #5 kudu::ThreadPool::Submit(...)
    #6 kudu::DnsResolver::RefreshAddressesAsync(..)
    #7 kudu::rpc::Proxy::RefreshDnsAndEnqueueRequest(...)
    #8 kudu::rpc::Proxy::AsyncRequest(...)
    ...
    #15 kudu::rpc::OutboundCall::CallCallback()
    #16 kudu::rpc::OutboundCall::SetFailed()
    #17 kudu::rpc::Connection::Shutdown()
    #18 kudu::rpc::ReactorThread::ShutdownInternal()
    ...
    #25 kudu::rpc::ReactorThread::RunThread()
    ...

[1] http://dist-test.cloudera.org:8080/test_drilldown?test_name=ksck_remote-test

Change-Id: I525f1078a349dbd2926938bb4fcc3e80888dfbb4
Reviewed-on: http://gerrit.cloudera.org:8080/22434
Tested-by: Alexey Serbin <[email protected]>
Reviewed-by: Abhishek Chennaka <[email protected]>