[YSQL] Add rollback mechanism for failed CreateCDCStream requests #18934
Labels
area/cdcsdk
CDC SDK
area/ysql
Yugabyte SQL (YSQL)
kind/new-feature
This is a request for a completely new feature
priority/medium
Medium priority issue
Comments
dr0pdb added the kind/enhancement and area/ysql labels on Aug 31, 2023
dr0pdb changed the title from "[YSQL] Support DDL atomicity for all CDCSDK stream commands in YSQL" to "[YSQL] Support DDL atomicity for all publication/replication slot commands in YSQL" on Sep 20, 2023
yugabyte-ci added the kind/new-feature label and removed the kind/enhancement label on Dec 22, 2023
dr0pdb added a commit that referenced this issue on Jan 12, 2024
… streams

Summary:
The creation of a CDCSDK stream involves multiple phases. If a failure occurs partway through, we need to roll back the creation. This revision implements that by using a `ScopeExit` to mark the CDCSDK stream for deletion in case of failures.

The overall logic of the rollback is as follows:
1. Upon failure, the stream is marked for deletion (the state in the stream metadata is set to `DELETING`).
2. The `CatalogManager::RunXClusterBgTasks` function, called from CatalogManagerBgTasks, deletes the sys-catalog entry for the stream marked for deletion and updates the CDC state table checkpoints to `OpId::Max`.
3. The `UpdatePeersAndMetrics` thread running in cdc_service looks up entries in the CDC state table with `OpId::Max()`, releases the retention barriers, and deletes the rows from the CDC state table.

Steps 2 and 3 are existing code.

##### Limitations
If the create stream operation fails, there are some scenarios, such as a leadership change or a crash, where the cleanup may not be complete. This can be checked by listing the streams and checking whether a new stream-id was created. In such cases, the user should delete the stream manually using yb-admin.
1. If the stream can actually be manually deleted, the retention barriers will be released shortly thereafter.
2. If the above manual step is not done, the retention barriers will be released after the timeout of 4 hrs.

Eventually, the rollback should be done in line with how we have implemented DDL atomicity for other statements such as `AlterTable`, which would remove these limitations. This has been left as a future exercise with a TODO comment.

Jira: DB-7785

Test Plan: New tests
```
./yb_build.sh --cxx-test cdcsdk_consistent_snapshot-test --gtest_filter CDCSDKConsistentSnapshotTest.TestCSStreamFailureRollbackFailureBeforeSysCatalogEntry
./yb_build.sh --cxx-test cdcsdk_consistent_snapshot-test --gtest_filter CDCSDKConsistentSnapshotTest.TestCSStreamFailureRollbackFailureBeforeInMemoryCommit
./yb_build.sh --cxx-test cdcsdk_consistent_snapshot-test --gtest_filter CDCSDKConsistentSnapshotTest.TestCSStreamFailureRollbackFailureAfterInMemoryStateCommit
./yb_build.sh --cxx-test cdcsdk_consistent_snapshot-test --gtest_filter CDCSDKConsistentSnapshotTest.TestCSStreamFailureRollbackFailureAfterDummy
./yb_build.sh --cxx-test cdcsdk_consistent_snapshot-test --gtest_filter CDCSDKConsistentSnapshotTest.TestCSStreamFailureRollbackFailureAfterRetentionBarriers
./yb_build.sh --cxx-test cdcsdk_consistent_snapshot-test --gtest_filter CDCSDKConsistentSnapshotTest.TestCSStreamFailureRollbackFailureWhileStoringConsistentSnapshot
./yb_build.sh --cxx-test cdcsdk_consistent_snapshot-test --gtest_filter CDCSDKConsistentSnapshotTest.TestCSStreamFailureRollbackFailureAfterStoringConsistentSnapshot
```

Reviewers: asrinivasan, skumar, xCluster, hsunder
Reviewed By: asrinivasan, hsunder
Subscribers: ybase, ycdcxcluster, bogdan
Differential Revision: https://phorge.dev.yugabyte.com/D30918
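The scope-exit rollback in step 1 can be illustrated with a minimal, self-contained sketch. This is not the YugabyteDB code: the `ScopeExit` class, the `Catalog` alias, the `StreamState` enum, and the `fail_at_phase` parameter are all hypothetical stand-ins written for this example. It only shows the general pattern of arming a guard before the multi-phase work and cancelling it once every phase has succeeded.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for the stream states; not the real YugabyteDB enum.
enum class StreamState { kInitiated, kActive, kDeleting };

// Minimal scope guard (an assumption for this sketch): runs the callback
// when it goes out of scope unless Cancel() was called first.
class ScopeExit {
 public:
  explicit ScopeExit(std::function<void()> fn) : fn_(std::move(fn)) {}
  ~ScopeExit() {
    if (fn_) fn_();
  }
  void Cancel() { fn_ = nullptr; }

  ScopeExit(const ScopeExit&) = delete;
  ScopeExit& operator=(const ScopeExit&) = delete;

 private:
  std::function<void()> fn_;
};

// Hypothetical in-memory catalog: stream id -> state.
using Catalog = std::map<std::string, StreamState>;

// Creates a stream in three illustrative phases. `fail_at_phase` is a
// hypothetical test hook: the phase with that number fails; 0 means success.
bool CreateStream(Catalog& catalog, const std::string& id, int fail_at_phase) {
  catalog[id] = StreamState::kInitiated;

  // Armed before any later phase runs, so every early return marks the
  // stream for deletion and leaves the actual cleanup to background tasks.
  ScopeExit rollback([&catalog, &id] { catalog[id] = StreamState::kDeleting; });

  for (int phase = 1; phase <= 3; ++phase) {
    if (phase == fail_at_phase) {
      return false;  // The guard fires here.
    }
  }

  catalog[id] = StreamState::kActive;
  rollback.Cancel();  // All phases succeeded, so nothing is rolled back.
  return true;
}
```

Calling `CreateStream(catalog, "s2", 2)` in this sketch returns `false` and leaves `"s2"` in the `kDeleting` state, while a call with `fail_at_phase` set to 0 leaves the stream `kActive`. The guard only marks the stream; as in the commit description, the remaining cleanup is deferred to other threads.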
dr0pdb added a commit that referenced this issue on Jan 15, 2024
…ts test

Summary:
In https://phorge.dev.yugabyte.com/D30918, we introduced support for rolling back a failed CDC stream. As part of the rollback revision, we enabled the UpdatePeersAndMetrics thread earlier in the flow (SetCDCServiceEnabled()).

The test TestReleaseResourcesOnUnpolledSplitTablets started failing intermittently. The failure happens because the UpdatePeersAndMetrics thread was caching the stream metadata without the stream creation time.

The fix is to make the UpdatePeersAndMetrics thread refresh the cached stream metadata if the stream is in the Initialized state.

Jira: DB-7785

Test Plan:
Jenkins: test regex: .*CDCSDK.*
./yb_build.sh --cxx-test cdcsdk_consistent_snapshot-test --gtest_filter CDCSDKConsistentSnapshotTest.TestReleaseResourcesOnUnpolledSplitTablets -n 5

Reviewers: asrinivasan, skumar
Reviewed By: asrinivasan
Subscribers: ycdcxcluster
Differential Revision: https://phorge.dev.yugabyte.com/D31702
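The caching fix described above can be sketched as a cache that treats entries in the Initialized state as provisional and reloads them on every lookup. Everything here is an assumption made for illustration: `StreamMetadataCache`, `StreamMetadata`, the `Loader` callback, and the convention that a creation time of 0 means "not yet set" are hypothetical and do not mirror the actual cdc_service code.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stream states for this sketch.
enum class StreamState { kInitialized, kActive };

// Hypothetical metadata record; 0 means the creation time is not yet set.
struct StreamMetadata {
  StreamState state = StreamState::kInitialized;
  int64_t creation_time_micros = 0;
};

class StreamMetadataCache {
 public:
  // Loader fetches the authoritative metadata for a stream id.
  using Loader = std::function<StreamMetadata(const std::string&)>;

  explicit StreamMetadataCache(Loader loader) : loader_(std::move(loader)) {}

  // Returns cached metadata, reloading it when the entry is missing or was
  // cached while the stream was still in the Initialized state.
  const StreamMetadata& Get(const std::string& stream_id) {
    auto it = cache_.find(stream_id);
    if (it == cache_.end() || it->second.state == StreamState::kInitialized) {
      cache_[stream_id] = loader_(stream_id);
    }
    return cache_[stream_id];
  }

 private:
  Loader loader_;
  std::unordered_map<std::string, StreamMetadata> cache_;
};
```

With this shape, a lookup that happens before stream creation finishes cannot pin an incomplete record: the next lookup reloads it, and only once the stream has left the Initialized state does the cache stop calling the loader.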
dr0pdb added a commit that referenced this issue on Jan 19, 2024
…ream for CDCSDK streams

Summary:
**Backport Description:**
Had minor merge conflicts in tests.

**Original Description:**
Original commit: 0cc9693 / D30918

The creation of a CDCSDK stream involves multiple phases. If a failure occurs partway through, we need to roll back the creation. This revision implements that by using a `ScopeExit` to mark the CDCSDK stream for deletion in case of failures.

The overall logic of the rollback is as follows:
1. Upon failure, the stream is marked for deletion (the state in the stream metadata is set to `DELETING`).
2. The `CatalogManager::RunXClusterBgTasks` function, called from CatalogManagerBgTasks, deletes the sys-catalog entry for the stream marked for deletion and updates the CDC state table checkpoints to `OpId::Max`.
3. The `UpdatePeersAndMetrics` thread running in cdc_service looks up entries in the CDC state table with `OpId::Max()`, releases the retention barriers, and deletes the rows from the CDC state table.

Steps 2 and 3 are existing code.

If the create stream operation fails, there are some scenarios, such as a leadership change or a crash, where the cleanup may not be complete. This can be checked by listing the streams and checking whether a new stream-id was created. In such cases, the user should delete the stream manually using yb-admin.
1. If the stream can actually be manually deleted, the retention barriers will be released shortly thereafter.
2. If the above manual step is not done, the retention barriers will be released after the timeout of 4 hrs.

Eventually, the rollback should be done in line with how we have implemented DDL atomicity for other statements such as `AlterTable`, which would remove these limitations. This has been left as a future exercise with a TODO comment.

Jira: DB-7785

Test Plan: New tests
```
./yb_build.sh --cxx-test cdcsdk_consistent_snapshot-test --gtest_filter CDCSDKConsistentSnapshotTest.TestCSStreamFailureRollbackFailureBeforeSysCatalogEntry
./yb_build.sh --cxx-test cdcsdk_consistent_snapshot-test --gtest_filter CDCSDKConsistentSnapshotTest.TestCSStreamFailureRollbackFailureBeforeInMemoryCommit
./yb_build.sh --cxx-test cdcsdk_consistent_snapshot-test --gtest_filter CDCSDKConsistentSnapshotTest.TestCSStreamFailureRollbackFailureAfterInMemoryStateCommit
./yb_build.sh --cxx-test cdcsdk_consistent_snapshot-test --gtest_filter CDCSDKConsistentSnapshotTest.TestCSStreamFailureRollbackFailureAfterDummy
./yb_build.sh --cxx-test cdcsdk_consistent_snapshot-test --gtest_filter CDCSDKConsistentSnapshotTest.TestCSStreamFailureRollbackFailureAfterRetentionBarriers
./yb_build.sh --cxx-test cdcsdk_consistent_snapshot-test --gtest_filter CDCSDKConsistentSnapshotTest.TestCSStreamFailureRollbackFailureWhileStoringConsistentSnapshot
./yb_build.sh --cxx-test cdcsdk_consistent_snapshot-test --gtest_filter CDCSDKConsistentSnapshotTest.TestCSStreamFailureRollbackFailureAfterStoringConsistentSnapshot
```

Reviewers: asrinivasan, skumar, xCluster, hsunder
Reviewed By: hsunder
Subscribers: bogdan, ycdcxcluster, ybase
Tags: #jenkins-ready
Differential Revision: https://phorge.dev.yugabyte.com/D31759
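Steps 2 and 3 of the rollback logic form a two-stage handoff between a master background task and a tserver-side thread, with the maximum checkpoint acting as the signal between them. The toy model below sketches that handoff under loud assumptions: `Cluster`, `RunMasterBgTask`, `RunUpdatePeersAndMetrics`, and `kMaxCheckpoint` are hypothetical names, a single integer checkpoint per stream stands in for the per-tablet rows keyed by `OpId`, and a plain set stands in for retention barriers.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <map>
#include <set>
#include <string>

// Stand-in for OpId::Max in this toy model (an assumption, not the real type).
constexpr int64_t kMaxCheckpoint = std::numeric_limits<int64_t>::max();

enum class StreamState { kActive, kDeleting };

// Hypothetical in-memory model of the three pieces of state involved.
struct Cluster {
  std::map<std::string, StreamState> sys_catalog;  // stream id -> state
  std::map<std::string, int64_t> cdc_state;        // stream id -> checkpoint
  std::set<std::string> retention_barriers;        // streams holding barriers
};

// Step 2 (modeled): for each stream marked for deletion, set its CDC state
// checkpoint to the maximum value and drop its sys-catalog entry.
void RunMasterBgTask(Cluster& c) {
  for (auto it = c.sys_catalog.begin(); it != c.sys_catalog.end();) {
    if (it->second == StreamState::kDeleting) {
      auto row = c.cdc_state.find(it->first);
      if (row != c.cdc_state.end()) row->second = kMaxCheckpoint;
      it = c.sys_catalog.erase(it);
    } else {
      ++it;
    }
  }
}

// Step 3 (modeled): for each CDC state row at the maximum checkpoint,
// release the retention barrier and delete the row.
void RunUpdatePeersAndMetrics(Cluster& c) {
  for (auto it = c.cdc_state.begin(); it != c.cdc_state.end();) {
    if (it->second == kMaxCheckpoint) {
      c.retention_barriers.erase(it->first);
      it = c.cdc_state.erase(it);
    } else {
      ++it;
    }
  }
}
```

The model also hints at why the limitations above exist: if the stream is never marked `kDeleting` (for example, because the process that would have marked it crashed), neither stage picks it up, and the barrier stays in place until something else removes it.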
dr0pdb added a commit that referenced this issue on Jan 19, 2024
…olledSplitTablets test

Summary:
Original commit: 9c370a0 / D31702

In https://phorge.dev.yugabyte.com/D30918, we introduced support for rolling back a failed CDC stream. As part of the rollback revision, we enabled the UpdatePeersAndMetrics thread earlier in the flow (SetCDCServiceEnabled()).

The test TestReleaseResourcesOnUnpolledSplitTablets started failing intermittently. The failure happens because the UpdatePeersAndMetrics thread was caching the stream metadata without the stream creation time.

The fix is to make the UpdatePeersAndMetrics thread refresh the cached stream metadata if the stream is in the Initialized state.

Jira: DB-7785

Test Plan:
Jenkins: test regex: .*CDCSDK.*
./yb_build.sh --cxx-test cdcsdk_consistent_snapshot-test --gtest_filter CDCSDKConsistentSnapshotTest.TestReleaseResourcesOnUnpolledSplitTablets -n 5

Reviewers: asrinivasan, skumar
Reviewed By: asrinivasan
Subscribers: ycdcxcluster
Tags: #jenkins-ready
Differential Revision: https://phorge.dev.yugabyte.com/D31806
yugabyte-ci changed the title from "[YSQL] Support DDL atomicity for all publication/replication slot commands in YSQL" to "[YSQL] Add rollback mechanism for failed CreateCDCStream requests" on Mar 8, 2024
1 task
Follow-ups moved to a new issue: #22675
Jira Link: DB-7785
Description
Support DDL atomicity for all publication/replication slot commands in YSQL