[YSQL] Add rollback mechanism for failed CreateCDCStream requests #18934

Closed
1 task done
dr0pdb opened this issue Aug 31, 2023 · 1 comment
Assignees
Labels
area/cdcsdk CDC SDK area/ysql Yugabyte SQL (YSQL) kind/new-feature This is a request for a completely new feature priority/medium Medium priority issue

Comments

@dr0pdb
Contributor

dr0pdb commented Aug 31, 2023

Jira Link: DB-7785

Description

Support DDL atomicity for all publication/replication slot commands in YSQL

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
@dr0pdb dr0pdb added kind/enhancement This is an enhancement of an existing feature area/ysql Yugabyte SQL (YSQL) labels Aug 31, 2023
@dr0pdb dr0pdb self-assigned this Aug 31, 2023
@yugabyte-ci yugabyte-ci added the priority/medium Medium priority issue label Aug 31, 2023
@dr0pdb dr0pdb changed the title [YSQL] Support DDL atomicity for all CDCSDK stream commands in YSQL [YSQL] Support DDL atomicity for all publication/replication slot commands in YSQL Sep 20, 2023
@dr0pdb dr0pdb added the area/cdcsdk CDC SDK label Sep 22, 2023
@yugabyte-ci yugabyte-ci added kind/new-feature This is a request for a completely new feature and removed kind/enhancement This is an enhancement of an existing feature labels Dec 22, 2023
dr0pdb added a commit that referenced this issue Jan 12, 2024
… streams

Summary:
The creation of a CDCSDK stream involves multiple phases. If a failure occurs partway through, the creation must be rolled back.

This revision implements that by using a `ScopeExit` to mark the CDCSDK stream for deletion in case of failures.

The overall logic of rollback is as follows:
1. Upon failure, the stream is marked for deletion (state in stream metadata is set to `DELETING`)
2. The `CatalogManager::RunXClusterBgTasks` function, called from CatalogManagerBgTasks, deletes the sys-catalog entry for the stream marked for deletion and updates the CDC state table checkpoints to `OpId::Max()`
3. The `UpdatePeersAndMetrics` thread running in cdc_service looks up CDC state table entries whose checkpoint is `OpId::Max()`, releases the retention barriers, and deletes those rows from the CDC state table.

Steps 2 and 3 are existing code.
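
Step 1 can be pictured with a small standalone sketch. This is not the YugabyteDB implementation: `ScopeGuard`, `StreamMetadata`, `CreateStream`, and the `fail_at_phase` parameter are hypothetical names made up for illustration, and the three phases are placeholders. It only shows the pattern described above, where a guard that runs at scope exit marks the stream as `DELETING` unless creation ran to completion.

```cpp
#include <functional>
#include <string>
#include <utility>

// Hypothetical scope guard; a stand-in for the ScopeExit utility, not its real API.
class ScopeGuard {
 public:
  explicit ScopeGuard(std::function<void()> fn) : fn_(std::move(fn)) {}
  ~ScopeGuard() { if (fn_) fn_(); }
  ScopeGuard(const ScopeGuard&) = delete;
  ScopeGuard& operator=(const ScopeGuard&) = delete;

 private:
  std::function<void()> fn_;
};

// Simplified stream states (assumed names).
enum class StreamState { kInitiated, kActive, kDeleting };

struct StreamMetadata {
  std::string id;
  StreamState state = StreamState::kInitiated;
};

// Runs the creation phases in order. `fail_at_phase` (hypothetical) injects a
// failure at the given 1-based phase; 0 means no failure.
bool CreateStream(StreamMetadata* stream, int fail_at_phase) {
  bool success = false;
  // Runs on every exit path; marks the stream for deletion unless we succeeded.
  ScopeGuard rollback([&] {
    if (!success) stream->state = StreamState::kDeleting;
  });

  constexpr int kNumPhases = 3;  // placeholder for the real creation phases
  for (int phase = 1; phase <= kNumPhases; ++phase) {
    if (phase == fail_at_phase) return false;  // early return triggers the guard
  }

  stream->state = StreamState::kActive;
  success = true;
  return true;
}
```

Calling `CreateStream(&stream, 2)` on a fresh `StreamMetadata` returns `false` and leaves `stream.state` as `kDeleting`; with `fail_at_phase` set to 0 the state ends up as `kActive`.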

##### Limitations
If the create stream operation fails, the cleanup may be incomplete in some scenarios, such as a leadership change or a crash. To check for this, list the streams and look for a newly created stream ID. If one exists, the user should delete that stream manually using yb-admin.

1. If the stream is deleted manually, the retention barriers are released shortly thereafter
2. If the stream is not deleted manually, the retention barriers are released after the 4-hour timeout
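
The two outcomes above amount to a simple rule, sketched here under loud assumptions: `LeftoverStream`, `BarriersReleasable`, and `kRetentionTimeout` are hypothetical names, the 4-hour value is taken from the text rather than from any named setting, and the source does not say from which event the timeout is measured. Manual deletion is modeled as making the barriers releasable at once, whereas the text only says "shortly thereafter".

```cpp
#include <chrono>

using std::chrono::hours;
using std::chrono::seconds;

// Assumed timeout, copied from the description; the real setting is not named there.
constexpr hours kRetentionTimeout{4};

// Hypothetical view of a stream left behind by an incomplete cleanup.
struct LeftoverStream {
  bool manually_deleted;  // the user removed the stream with yb-admin
  seconds elapsed;        // time counted toward the timeout (start event is an assumption)
};

// True if the retention barriers held for this stream may be released.
bool BarriersReleasable(const LeftoverStream& s) {
  return s.manually_deleted || s.elapsed >= kRetentionTimeout;
}
```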

Eventually, the rollback should be done in line with how DDL atomicity is implemented for other statements such as `AlterTable`, which would remove these limitations. This has been left as future work with a TODO comment.
Jira: DB-7785

Test Plan:
New tests

```
./yb_build.sh --cxx-test cdcsdk_consistent_snapshot-test --gtest_filter CDCSDKConsistentSnapshotTest.TestCSStreamFailureRollbackFailureBeforeSysCatalogEntry
./yb_build.sh --cxx-test cdcsdk_consistent_snapshot-test --gtest_filter CDCSDKConsistentSnapshotTest.TestCSStreamFailureRollbackFailureBeforeInMemoryCommit
./yb_build.sh --cxx-test cdcsdk_consistent_snapshot-test --gtest_filter CDCSDKConsistentSnapshotTest.TestCSStreamFailureRollbackFailureAfterInMemoryStateCommit
./yb_build.sh --cxx-test cdcsdk_consistent_snapshot-test --gtest_filter CDCSDKConsistentSnapshotTest.TestCSStreamFailureRollbackFailureAfterDummy
./yb_build.sh --cxx-test cdcsdk_consistent_snapshot-test --gtest_filter CDCSDKConsistentSnapshotTest.TestCSStreamFailureRollbackFailureAfterRetentionBarriers
./yb_build.sh --cxx-test cdcsdk_consistent_snapshot-test --gtest_filter CDCSDKConsistentSnapshotTest.TestCSStreamFailureRollbackFailureWhileStoringConsistentSnapshot
./yb_build.sh --cxx-test cdcsdk_consistent_snapshot-test --gtest_filter CDCSDKConsistentSnapshotTest.TestCSStreamFailureRollbackFailureAfterStoringConsistentSnapshot
```

Reviewers: asrinivasan, skumar, xCluster, hsunder

Reviewed By: asrinivasan, hsunder

Subscribers: ybase, ycdcxcluster, bogdan

Differential Revision: https://phorge.dev.yugabyte.com/D30918
dr0pdb added a commit that referenced this issue Jan 15, 2024
…ts test

Summary:
In https://phorge.dev.yugabyte.com/D30918, we introduced support for rolling back a failed CDC stream.

As part of the rollback revision, we enabled the UpdatePeersAndMetrics thread earlier in the flow (SetCDCServiceEnabled()). The test TestReleaseResourcesOnUnpolledSplitTablets then started failing intermittently because the UpdatePeersAndMetrics thread was caching the stream metadata without the stream creation time.

The fix is to make the UpdatePeersAndMetrics thread refresh the cached stream metadata if the stream is in the Initialized state.
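
The idea behind the fix can be sketched as a cache that treats entries in the Initialized state as provisional and reloads them on every lookup. This is an illustration only: `StreamCache`, `StreamInfo`, and the `Loader` callback are hypothetical and do not mirror the cdc_service code.

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

enum class StreamState { kInitialized, kActive };  // simplified, assumed names

struct StreamInfo {
  StreamState state;
  uint64_t creation_time;  // 0 stands for "not set yet" in this sketch
};

// Hypothetical metadata cache keyed by stream id.
class StreamCache {
 public:
  using Loader = std::function<StreamInfo(const std::string&)>;
  explicit StreamCache(Loader loader) : loader_(std::move(loader)) {}

  // Returns cached metadata, reloading it when the entry is missing or still in
  // the Initialized state (its fields, such as creation time, may be incomplete).
  const StreamInfo& Get(const std::string& id) {
    auto it = cache_.find(id);
    if (it == cache_.end()) {
      it = cache_.emplace(id, loader_(id)).first;
    } else if (it->second.state == StreamState::kInitialized) {
      it->second = loader_(id);
    }
    return it->second;
  }

 private:
  Loader loader_;
  std::unordered_map<std::string, StreamInfo> cache_;
};
```

Once an entry is loaded in a state other than Initialized, later lookups are served from the cache without calling the loader again.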
Jira: DB-7785

Test Plan:
Jenkins: test regex: .*CDCSDK.*

./yb_build.sh --cxx-test cdcsdk_consistent_snapshot-test --gtest_filter CDCSDKConsistentSnapshotTest.TestReleaseResourcesOnUnpolledSplitTablets -n 5

Reviewers: asrinivasan, skumar

Reviewed By: asrinivasan

Subscribers: ycdcxcluster

Differential Revision: https://phorge.dev.yugabyte.com/D31702
dr0pdb added a commit that referenced this issue Jan 19, 2024
…ream for CDCSDK streams

Summary:
**Backport Description:**
Had minor merge conflicts in tests.

**Original Description:**
Original commit: 0cc9693 / D30918
The creation of a CDCSDK stream involves multiple phases. If a failure occurs partway through, the creation must be rolled back.

This revision implements that by using a `ScopeExit` to mark the CDCSDK stream for deletion in case of failures.

The overall logic of rollback is as follows:
1. Upon failure, the stream is marked for deletion (state in stream metadata is set to `DELETING`)
2. The `CatalogManager::RunXClusterBgTasks` function, called from CatalogManagerBgTasks, deletes the sys-catalog entry for the stream marked for deletion and updates the CDC state table checkpoints to `OpId::Max()`
3. The `UpdatePeersAndMetrics` thread running in cdc_service looks up CDC state table entries whose checkpoint is `OpId::Max()`, releases the retention barriers, and deletes those rows from the CDC state table.

Steps 2 and 3 are existing code.
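
Steps 2 and 3 form a two-stage handoff, which the toy model below tries to make concrete. Everything in it is an assumption for illustration: `Cluster`, `MasterBgTask`, `CdcServiceBgTask`, and `kMaxCheckpoint` (standing in for `OpId::Max()`) are invented names, and plain in-memory maps replace the sys-catalog, the CDC state table, and the retention barriers.

```cpp
#include <cstdint>
#include <limits>
#include <map>
#include <set>
#include <string>
#include <utility>

// Stand-in for OpId::Max(); a real OpId is not a single integer.
constexpr int64_t kMaxCheckpoint = std::numeric_limits<int64_t>::max();

enum class StreamState { kActive, kDeleting };

using Key = std::pair<std::string, std::string>;  // (stream id, tablet id)

// Toy in-memory model of the three pieces of state involved.
struct Cluster {
  std::map<std::string, StreamState> sys_catalog;  // stream id -> state
  std::map<Key, int64_t> cdc_state;                // row -> checkpoint
  std::set<Key> retention_barriers;                // rows holding a barrier
};

// Step 2: for each stream marked DELETING, set its checkpoints to the max
// value and delete its sys-catalog entry.
void MasterBgTask(Cluster* c) {
  for (auto it = c->sys_catalog.begin(); it != c->sys_catalog.end();) {
    if (it->second != StreamState::kDeleting) { ++it; continue; }
    for (auto& row : c->cdc_state) {
      if (row.first.first == it->first) row.second = kMaxCheckpoint;
    }
    it = c->sys_catalog.erase(it);
  }
}

// Step 3: for each row with the max checkpoint, release the retention
// barrier and delete the row.
void CdcServiceBgTask(Cluster* c) {
  for (auto it = c->cdc_state.begin(); it != c->cdc_state.end();) {
    if (it->second != kMaxCheckpoint) { ++it; continue; }
    c->retention_barriers.erase(it->first);
    it = c->cdc_state.erase(it);
  }
}
```

The max checkpoint acts as a tombstone: the first task never touches barriers, and the second never reads the sys-catalog, so the two only communicate through the state table.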

If the create stream operation fails, the cleanup may be incomplete in some scenarios, such as a leadership change or a crash. To check for this, list the streams and look for a newly created stream ID. If one exists, the user should delete that stream manually using yb-admin.

1. If the stream is deleted manually, the retention barriers are released shortly thereafter
2. If the stream is not deleted manually, the retention barriers are released after the 4-hour timeout

Eventually, the rollback should be done in line with how DDL atomicity is implemented for other statements such as `AlterTable`, which would remove these limitations. This has been left as future work with a TODO comment.
Jira: DB-7785

Test Plan:
New tests

```
./yb_build.sh --cxx-test cdcsdk_consistent_snapshot-test --gtest_filter CDCSDKConsistentSnapshotTest.TestCSStreamFailureRollbackFailureBeforeSysCatalogEntry
./yb_build.sh --cxx-test cdcsdk_consistent_snapshot-test --gtest_filter CDCSDKConsistentSnapshotTest.TestCSStreamFailureRollbackFailureBeforeInMemoryCommit
./yb_build.sh --cxx-test cdcsdk_consistent_snapshot-test --gtest_filter CDCSDKConsistentSnapshotTest.TestCSStreamFailureRollbackFailureAfterInMemoryStateCommit
./yb_build.sh --cxx-test cdcsdk_consistent_snapshot-test --gtest_filter CDCSDKConsistentSnapshotTest.TestCSStreamFailureRollbackFailureAfterDummy
./yb_build.sh --cxx-test cdcsdk_consistent_snapshot-test --gtest_filter CDCSDKConsistentSnapshotTest.TestCSStreamFailureRollbackFailureAfterRetentionBarriers
./yb_build.sh --cxx-test cdcsdk_consistent_snapshot-test --gtest_filter CDCSDKConsistentSnapshotTest.TestCSStreamFailureRollbackFailureWhileStoringConsistentSnapshot
./yb_build.sh --cxx-test cdcsdk_consistent_snapshot-test --gtest_filter CDCSDKConsistentSnapshotTest.TestCSStreamFailureRollbackFailureAfterStoringConsistentSnapshot
```

Reviewers: asrinivasan, skumar, xCluster, hsunder

Reviewed By: hsunder

Subscribers: bogdan, ycdcxcluster, ybase

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D31759
dr0pdb added a commit that referenced this issue Jan 19, 2024
…olledSplitTablets test

Summary:
Original commit: 9c370a0 / D31702
In https://phorge.dev.yugabyte.com/D30918, we introduced support for rolling back a failed CDC stream.

As part of the rollback revision, we enabled the UpdatePeersAndMetrics thread earlier in the flow (SetCDCServiceEnabled()). The test TestReleaseResourcesOnUnpolledSplitTablets then started failing intermittently because the UpdatePeersAndMetrics thread was caching the stream metadata without the stream creation time.

The fix is to make the UpdatePeersAndMetrics thread refresh the cached stream metadata if the stream is in the Initialized state.
Jira: DB-7785

Test Plan:
Jenkins: test regex: .*CDCSDK.*

./yb_build.sh --cxx-test cdcsdk_consistent_snapshot-test --gtest_filter CDCSDKConsistentSnapshotTest.TestReleaseResourcesOnUnpolledSplitTablets -n 5

Reviewers: asrinivasan, skumar

Reviewed By: asrinivasan

Subscribers: ycdcxcluster

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D31806
@yugabyte-ci yugabyte-ci changed the title [YSQL] Support DDL atomicity for all publication/replication slot commands in YSQL [YSQL] Add rollback mechanism for failed CreateCDCStream requests Mar 8, 2024
@dr0pdb
Contributor Author

dr0pdb commented Jun 4, 2024

Follow ups moved to a new issue: #22675
