Full snapshot lease update retry on failure #711

Merged

Conversation

anveshreddy18
Contributor

@anveshreddy18 anveshreddy18 commented Jan 30, 2024

What this PR does / why we need it:

Currently, a Full Snapshot lease update is triggered only when a full snapshot (either scheduled or out-of-schedule) is taken. Because a full snapshot is taken every 24 hours (configurable), a failed lease update is not retried until the next full snapshot, which is a long time to wait. This creates a problem that is well documented in this issue by @unmarshall, thanks for that!

This PR periodically attempts to update the full snapshot lease, at an interval defined by FullSnapshotLeaseUpdateInterval in the snapshotter.healthConfig. The retry stops once the lease is up to date, so as not to make unnecessary calls to the API server. This keeps the full snapshot lease up to date most of the time.
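A minimal sketch of the retry-until-up-to-date idea described above; this is not the code from this PR. The names `leaseUpdater`, `renewPeriodically`, and `simulate` are hypothetical, the fake updater stands in for the real call to the API server, and the sketch models a single retry cycle only.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// leaseUpdater is a hypothetical stand-in for the function that updates
// the full snapshot lease through the API server.
type leaseUpdater func(ctx context.Context) error

// renewPeriodically calls update once per interval until it succeeds or
// ctx is cancelled. It returns the number of attempts made.
func renewPeriodically(ctx context.Context, interval time.Duration, update leaseUpdater) (int, error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	attempts := 0
	for {
		select {
		case <-ctx.Done():
			return attempts, ctx.Err()
		case <-ticker.C:
			attempts++
			if err := update(ctx); err != nil {
				fmt.Printf("attempt %d failed: %v\n", attempts, err)
				continue // try again on the next tick
			}
			// Lease is up to date: stop retrying to avoid needless API calls.
			return attempts, nil
		}
	}
}

// simulate runs the loop against a fake updater that fails the given
// number of times before succeeding (illustration only).
func simulate(failures int) (int, error) {
	update := func(ctx context.Context) error {
		if failures > 0 {
			failures--
			return errors.New("simulated lease update failure")
		}
		return nil
	}
	return renewPeriodically(context.Background(), 10*time.Millisecond, update)
}

func main() {
	attempts, err := simulate(2)
	fmt.Println(attempts, err)
}
```

With two simulated failures the loop stops on the third attempt. In the sketch the interval is shortened to 10 ms purely so the example finishes quickly.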

Which issue(s) this PR fixes:
Fixes #678

NOTE: With etcd-druid PR#764 merged, the fullSnapshotLeaseUpdateInterval can be configured from the Etcd YAML.

Special notes for your reviewer:
To test this with a kind setup, remove the get verb under lease from the role used by the etcd-test serviceaccount and trigger a full snapshot. The snapshotter won't be able to fetch the lease, so the lease update fails; check the backup-restore logs to confirm. Then add get back and watch the lease get updated on the next lease update attempt, after which the periodic retry stops. Tip: decrease FullSnapshotLeaseUpdateInterval to 1 minute to make this process faster.

Release note:

Introduces periodic updates to the Full Snapshot Lease, addressing delays in lease updates during failures.
Introduces a new flag `full-snapshot-lease-update-interval` that sets the periodic interval for the full snapshot lease update. If the flag is not set, a default interval of 3 minutes is used.
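
As an illustration of the flag and its default, here is a hedged sketch using Go's standard `flag` package. The project's actual flag wiring may differ; `parseInterval` is a hypothetical helper, the value is assumed to be a Go duration string, and the positivity check is an assumption rather than documented behavior.

```go
package main

import (
	"flag"
	"fmt"
	"time"
)

// Default taken from the release note: 3 minutes when the flag is not set.
const defaultFullSnapshotLeaseUpdateInterval = 3 * time.Minute

// parseInterval is a hypothetical helper that reads the interval from args.
func parseInterval(args []string) (time.Duration, error) {
	fs := flag.NewFlagSet("sketch", flag.ContinueOnError)
	interval := fs.Duration(
		"full-snapshot-lease-update-interval",
		defaultFullSnapshotLeaseUpdateInterval,
		"interval between full snapshot lease update attempts",
	)
	if err := fs.Parse(args); err != nil {
		return 0, err
	}
	// Assumption: a non-positive interval is rejected.
	if *interval <= 0 {
		return 0, fmt.Errorf("full-snapshot-lease-update-interval must be positive, got %s", *interval)
	}
	return *interval, nil
}

func main() {
	d, _ := parseInterval(nil)
	fmt.Println(d) // prints 3m0s

	d, _ = parseInterval([]string{"--full-snapshot-lease-update-interval=1m"})
	fmt.Println(d) // prints 1m0s
}
```

Setting the flag to `1m`, as suggested in the testing tip, shortens the wait between retries.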

@anveshreddy18 anveshreddy18 requested a review from a team as a code owner January 30, 2024 16:35
@gardener-robot gardener-robot added needs/review Needs review size/m Size of pull request is medium (see gardener-robot robot/bots/size.py) labels Jan 30, 2024
@anveshreddy18 anveshreddy18 added kind/bug Bug and removed size/m Size of pull request is medium (see gardener-robot robot/bots/size.py) labels Jan 30, 2024
@gardener-robot-ci-2 gardener-robot-ci-2 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Jan 30, 2024
@gardener-robot gardener-robot added the size/m Size of pull request is medium (see gardener-robot robot/bots/size.py) label Jan 31, 2024
@anveshreddy18 anveshreddy18 self-assigned this Jan 31, 2024
@shreyas-s-rao shreyas-s-rao self-assigned this Feb 1, 2024
@gardener-robot gardener-robot added size/l Size of pull request is large (see gardener-robot robot/bots/size.py) needs/second-opinion Needs second review by someone else and removed size/m Size of pull request is medium (see gardener-robot robot/bots/size.py) labels Feb 6, 2024
@anveshreddy18 anveshreddy18 force-pushed the bug/full-snapshot-lease-update branch from ba80618 to f696a32 Compare February 6, 2024 06:25
@anveshreddy18 anveshreddy18 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Feb 6, 2024
@gardener-robot-ci-2 gardener-robot-ci-2 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Feb 6, 2024
pkg/snapshot/snapshotter/snapshot_lease_update.go (outdated, resolved)
pkg/snapshot/snapshotter/snapshot_lease_update.go (outdated, resolved)
.ci/unit_test (outdated, resolved)
pkg/snapshot/snapshotter/snapshot_lease_update.go (outdated, resolved)
pkg/snapshot/snapshotter/snapshot_lease_update.go (outdated, resolved)
pkg/snapshot/snapshotter/snapshot_lease_update.go (outdated, resolved)
pkg/snapshot/snapshotter/snapshot_lease_update.go (outdated, resolved)
pkg/snapshot/snapshotter/snapshotter.go (outdated, resolved)
pkg/snapshot/snapshotter/snapshotter_test.go (outdated, resolved)
@gardener-robot-ci-2 gardener-robot-ci-2 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Mar 13, 2024
@gardener-robot-ci-3 gardener-robot-ci-3 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Mar 13, 2024
@gardener-robot-ci-2 gardener-robot-ci-2 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Mar 14, 2024
@gardener-robot-ci-1 gardener-robot-ci-1 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Mar 14, 2024
@anveshreddy18 anveshreddy18 force-pushed the bug/full-snapshot-lease-update branch from e133c02 to 4a797a1 Compare March 14, 2024 12:52
Member

@ishan16696 ishan16696 left a comment


Overall looks good, just a few nits.

pkg/snapshot/snapshotter/snapshotter_test.go (outdated, resolved)
pkg/snapshot/snapshotter/snapshotter_test.go (outdated, resolved)
@anveshreddy18 anveshreddy18 force-pushed the bug/full-snapshot-lease-update branch from 4a797a1 to 883a63c Compare March 15, 2024 03:43
@gardener-robot-ci-3 gardener-robot-ci-3 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Mar 15, 2024
@gardener-robot-ci-1 gardener-robot-ci-1 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Mar 15, 2024
Member

@ishan16696 ishan16696 left a comment


LGTM!!

@ishan16696 ishan16696 requested a review from renormalize March 15, 2024 10:50
@ishan16696 ishan16696 merged commit ef07e80 into gardener:master Mar 15, 2024
9 checks passed
@gardener-robot gardener-robot added the status/closed Issue is closed (either delivered or triaged) label Mar 15, 2024
@anveshreddy18 anveshreddy18 added this to the v0.29.0 milestone Mar 18, 2024
renormalize pushed a commit to renormalize/etcd-backup-restore that referenced this pull request Jul 4, 2024
* Full snapshot lease update retry on failure

* nit changes

* Address review comments by @ishan16696

* Added unit tests for RenewFullSnapshotLeasePeriodically() func

* check unit tests on prow

* Address review comments

* minor change in logs

* Add a snapshotter method to set lease update interval

* nit change

* Address review comments

* Resolve unit tests failure

* Improve interval time for unit tests

* nit change

* Address review comments by @ishan16696

* Address review comments by @ishan16696

* make tests pass

* nit change

* Address review comments by @ishan16696
Labels
kind/bug Bug needs/changes Needs (more) changes needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) needs/review Needs review needs/second-opinion Needs second review by someone else size/l Size of pull request is large (see gardener-robot robot/bots/size.py) status/closed Issue is closed (either delivered or triaged)
Development

Successfully merging this pull request may close these issues.

[BUG] Full Snapshot lease update should be retried and should not wait for the next full snapshot trigger
8 participants