[CDCSDK] Address performance impact of increasing retention period of intents #21580
Labels
2024.1 Backport Required
2024.1.1_blocker
area/docdb
YugabyteDB core features
jira-originated
kind/enhancement
This is an enhancement of an existing feature
priority/medium
Medium priority issue
Comments
yugabyte-ci added the area/docdb, jira-originated, kind/enhancement, and priority/low labels on Mar 19, 2024
rthallamko3 added the priority/medium label and removed the priority/low label on Mar 19, 2024
1 task
es1024 added a commit that referenced this issue on Apr 5, 2024
Summary:
When CDC is lagging behind, there may be many SST files in intentsdb which only consist of applied transactions, but which we cannot yet delete, since CDC has not streamed the changes yet. These SST files impact performance of reading from intentsdb, even though we don't actually care about them in most cases (since all changes in them have already been applied).
This diff adds a hybrid time filter on intent iterators for read path, conflict resolution, and intent apply, to skip all SST files before min running hybrid time. This is gated behind the newly added `docdb_ht_filter_intents` gflag (default on in debug).
`docdb_ht_filter_intents` to be set to default on after CDC stress tests with D31900 / 559b2b0 changes enabled as well.
Jira: DB-10466
Test Plan: Jenkins.
Reviewers: sergei, mbautin
Reviewed By: sergei, mbautin
Subscribers: yql, ybase, bogdan, rthallam
Differential Revision: https://phorge.dev.yugabyte.com/D33131
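The filtering idea in this summary can be illustrated with a minimal sketch. It is not the YugabyteDB implementation: `SstFileMeta`, `FilesToScan`, and the plain `uint64_t` hybrid times are hypothetical stand-ins, and it assumes that "before min running hybrid time" means the largest hybrid time recorded in a file is strictly below `min_running_ht`. The `filter_enabled` parameter plays the role of the `docdb_ht_filter_intents` gflag.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical per-file metadata; not a real YugabyteDB type.
struct SstFileMeta {
  std::string name;
  uint64_t min_ht;  // smallest hybrid time of any record in the file
  uint64_t max_ht;  // largest hybrid time of any record in the file
};

// Returns the files an intent iterator would still have to open.
// With the filter enabled, a file is skipped only when every record in it
// is older than min_running_ht (assumed meaning of "before").
std::vector<SstFileMeta> FilesToScan(const std::vector<SstFileMeta>& files,
                                     uint64_t min_running_ht,
                                     bool filter_enabled) {
  std::vector<SstFileMeta> result;
  for (const auto& file : files) {
    if (filter_enabled && file.max_ht < min_running_ht) {
      continue;  // retained on disk (e.g. for CDC), but not read
    }
    result.push_back(file);
  }
  return result;
}
```

In this sketch a file that straddles `min_running_ht` is still scanned, and disabling the filter returns every file unchanged, which mirrors the behavior without the gflag.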
yusong-yan added a commit that referenced this issue on Apr 18, 2024
…retained for CDC"
Summary:
D33131 introduced a segmentation fault which was identified in multiple tests.
```
* thread #1, name = 'yb-tserver', stop reason = signal SIGSEGV
  * frame #0: 0x00007f4d2b6f3a84 libpthread.so.0`__pthread_mutex_lock + 4
    frame #1: 0x000055d6d1e1190b yb-tserver`yb::tablet::MvccManager::SafeTimeForFollower(yb::HybridTime, std::__1::chrono::time_point<yb::CoarseMonoClock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l>>>) const [inlined] std::__1::unique_lock<std::__1::mutex>::unique_lock[abi:v170002](this=0x00007f4ccb6feaa0, __m=0x0000000000000110) at unique_lock.h:41:11
    frame #2: 0x000055d6d1e118f5 yb-tserver`yb::tablet::MvccManager::SafeTimeForFollower(this=0x00000000000000f0, min_allowed=<unavailable>, deadline=yb::CoarseTimePoint @ 0x00007f4ccb6feb08) const at mvcc.cc:500:32
    frame #3: 0x000055d6d1ef58e3 yb-tserver`yb::tablet::TransactionParticipant::Impl::ProcessRemoveQueueUnlocked(this=0x000037e27d26fb00, min_running_notifier=0x00007f4ccb6fef28) at transaction_participant.cc:1537:45
    frame #4: 0x000055d6d1efc11a yb-tserver`yb::tablet::TransactionParticipant::Impl::EnqueueRemoveUnlocked(this=0x000037e27d26fb00, id=<unavailable>, reason=<unavailable>, min_running_notifier=0x00007f4ccb6fef28, expected_deadlock_status=<unavailable>) at transaction_participant.cc:1516:5
    frame #5: 0x000055d6d1e3afbe yb-tserver`yb::tablet::RunningTransaction::DoStatusReceived(this=0x000037e2679b5218, status_tablet="d5922c26c9704f298d6812aff8f615f6", status=<unavailable>, response=<unavailable>, serial_no=56986, shared_self=std::__1::shared_ptr<yb::tablet::RunningTransaction>::element_type @ 0x000037e2679b5218) at running_transaction.cc:424:16
    frame #6: 0x000055d6d0d7db5f yb-tserver`yb::client::(anonymous namespace)::TransactionRpcBase::Finished(this=0x000037e29c80b420, status=<unavailable>) at transaction_rpc.cc:67:7
```
This diff reverts the change to unblock the tests.
The proper fix for this problem is WIP.
Jira: DB-10780, DB-10466
Test Plan: Jenkins: urgent
Reviewers: rthallam
Reviewed By: rthallam
Subscribers: ybase, yql
Differential Revision: https://phorge.dev.yugabyte.com/D34245
es1024 added a commit that referenced this issue on May 6, 2024
…er intent SST files only retained for CDC""
Summary:
This reverts commit D34245 / 89316bd, which reverted D33131 / fb7c86c because of a segmentation fault caused by `min_running_ht` being initialized too early; this issue is now fixed with D34389 / 138b81a.
Jira: DB-10466, DB-10780
Test Plan: Jenkins
Reviewers: yyan, sergei
Reviewed By: yyan
Subscribers: rthallam, ybase, yql
Differential Revision: https://phorge.dev.yugabyte.com/D34745
es1024 added a commit that referenced this issue on May 15, 2024
…d for CDC
Summary:
Original commit: fb7c86c / D33131
When CDC is lagging behind, there may be many SST files in intentsdb which only consist of applied transactions, but which we cannot yet delete, since CDC has not streamed the changes yet. These SST files impact performance of reading from intentsdb, even though we don't actually care about them in most cases (since all changes in them have already been applied).
This diff adds a hybrid time filter on intent iterators for read path, conflict resolution, and intent apply, to skip all SST files before min running hybrid time. This is gated behind the newly added `docdb_ht_filter_intents` gflag (default on in debug).
`docdb_ht_filter_intents` to be set to default on after CDC stress tests with D31900 / 559b2b0 changes enabled as well.
Jira: DB-10466
Test Plan: Jenkins.
Reviewers: sergei, mbautin, rthallam
Reviewed By: rthallam
Subscribers: rthallam, bogdan, ybase, yql
Tags: #jenkins-ready
Differential Revision: https://phorge.dev.yugabyte.com/D34746
svarnau pushed a commit that referenced this issue on May 25, 2024
…retained for CDC"
Summary:
D33131 introduced a segmentation fault which was identified in multiple tests.
```
* thread #1, name = 'yb-tserver', stop reason = signal SIGSEGV
  * frame #0: 0x00007f4d2b6f3a84 libpthread.so.0`__pthread_mutex_lock + 4
    frame #1: 0x000055d6d1e1190b yb-tserver`yb::tablet::MvccManager::SafeTimeForFollower(yb::HybridTime, std::__1::chrono::time_point<yb::CoarseMonoClock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l>>>) const [inlined] std::__1::unique_lock<std::__1::mutex>::unique_lock[abi:v170002](this=0x00007f4ccb6feaa0, __m=0x0000000000000110) at unique_lock.h:41:11
    frame #2: 0x000055d6d1e118f5 yb-tserver`yb::tablet::MvccManager::SafeTimeForFollower(this=0x00000000000000f0, min_allowed=<unavailable>, deadline=yb::CoarseTimePoint @ 0x00007f4ccb6feb08) const at mvcc.cc:500:32
    frame #3: 0x000055d6d1ef58e3 yb-tserver`yb::tablet::TransactionParticipant::Impl::ProcessRemoveQueueUnlocked(this=0x000037e27d26fb00, min_running_notifier=0x00007f4ccb6fef28) at transaction_participant.cc:1537:45
    frame #4: 0x000055d6d1efc11a yb-tserver`yb::tablet::TransactionParticipant::Impl::EnqueueRemoveUnlocked(this=0x000037e27d26fb00, id=<unavailable>, reason=<unavailable>, min_running_notifier=0x00007f4ccb6fef28, expected_deadlock_status=<unavailable>) at transaction_participant.cc:1516:5
    frame #5: 0x000055d6d1e3afbe yb-tserver`yb::tablet::RunningTransaction::DoStatusReceived(this=0x000037e2679b5218, status_tablet="d5922c26c9704f298d6812aff8f615f6", status=<unavailable>, response=<unavailable>, serial_no=56986, shared_self=std::__1::shared_ptr<yb::tablet::RunningTransaction>::element_type @ 0x000037e2679b5218) at running_transaction.cc:424:16
    frame #6: 0x000055d6d0d7db5f yb-tserver`yb::client::(anonymous namespace)::TransactionRpcBase::Finished(this=0x000037e29c80b420, status=<unavailable>) at transaction_rpc.cc:67:7
```
This diff reverts the change to unblock the tests.
The proper fix for this problem is WIP.
Jira: DB-10780, DB-10466
Test Plan: Jenkins: urgent
Reviewers: rthallam
Reviewed By: rthallam
Subscribers: ybase, yql
Differential Revision: https://phorge.dev.yugabyte.com/D34245
svarnau pushed a commit that referenced this issue on May 25, 2024
…er intent SST files only retained for CDC""
Summary:
This reverts commit D34245 / 89316bd, which reverted D33131 / fb7c86c because of a segmentation fault caused by `min_running_ht` being initialized too early; this issue is now fixed with D34389 / 138b81a.
Jira: DB-10466, DB-10780
Test Plan: Jenkins
Reviewers: yyan, sergei
Reviewed By: yyan
Subscribers: rthallam, ybase, yql
Differential Revision: https://phorge.dev.yugabyte.com/D34745
svarnau pushed a commit that referenced this issue on May 29, 2024
…d for CDC
Summary:
Original commit: fb7c86c / D33131
When CDC is lagging behind, there may be many SST files in intentsdb which only consist of applied transactions, but which we cannot yet delete, since CDC has not streamed the changes yet. These SST files impact performance of reading from intentsdb, even though we don't actually care about them in most cases (since all changes in them have already been applied).
This diff adds a hybrid time filter on intent iterators for read path, conflict resolution, and intent apply, to skip all SST files before min running hybrid time. This is gated behind the newly added `docdb_ht_filter_intents` gflag (default on in debug).
`docdb_ht_filter_intents` to be set to default on after CDC stress tests with D31900 / 559b2b0 changes enabled as well.
Jira: DB-10466
Test Plan: Jenkins.
Reviewers: sergei, mbautin, rthallam
Reviewed By: rthallam
Subscribers: rthallam, bogdan, ybase, yql
Tags: #jenkins-ready
Differential Revision: https://phorge.dev.yugabyte.com/D34746
Jira Link: DB-10466
Setup: Two RF3 3x c5.4xlarge universes with provisioned IOPS set to 10000 and 4 million rows, each taking a constant stream of single-row updates in full transactions from 6 threads. One universe has no CDC streams (non-CDC universe); the other has a CDC stream that is lagging behind, with no changes being sent (CDC universe). This performance issue was observed after the memory issue #21290 was resolved.
The following graph (from Prometheus) shows the throughput of the non-CDC universe (blue) and the CDC universe (green).
This issue tracks the fixes needed to address the above performance issue.