Add a BSL controller to handle validation + update BSL status phase #2490

carlisia · 2020-05-01T01:59:41Z

add a controller-runtime controller that continuously validates BSLs and marks them as available (valid) or unavailable (invalid), and logs accordingly.
remove BSL validation from the server startup

Signed-off-by: Carlisia [email protected]

pkg/controller/backup_storage_location_controller.go

nrb

On the whole this looks good! The main question I have right now is around resync periods.

Signed-off-by: Carlisia <[email protected]>

carlisia · 2020-05-04T16:31:23Z

I added:

StoreValidationPeriod
LastValidationTime

Much in line with what's in the BackupSynchController.

I also added this, for the state between creation and validation:

BackupStorageLocationPhaseUnverified

However, although I have verified this new phase was included in the CRD, it does not mark it as the default. Not sure if I'm doing this wrong:
https://github.com/vmware-tanzu/velero/pull/2490/files#diff-7c4f4164f55721d9493c9824d4735f60R101

After this PR is merged, I'm going to add these columns (StoreValidationPeriod and LastValidationTime) to the BSL output: #2421

skriss

I still need to look at the body of the controller in a little more detail, but so far so good!

We also need to think about how other controllers/packages should incorporate this information. For example, AFAIK, even if a BSL is marked "Unavailable", the backup controller will still try to process backups targeting that location. Those backups will theoretically fail somewhere along the line, at least at the upload step, but ideally we'd check up-front if the BSL is valid or not, and fail validation if it's invalid.

This brings up another nuance, which is - if we add a validation check in the backup controller to ensure the BSL is "Available" - should we rely on the CR field, or should we actually try to connect to the object store (e.g. via the backupStore.IsValid() method)? Relying on the CR field is easier/more efficient, but it could theoretically be out-of-date.

skriss · 2020-05-04T17:29:14Z

pkg/controller/backup_storage_location_controller.go

+		newBackupStore:   persistence.NewObjectBackupStore,
+	}
+
+	c.resyncFunc = c.run


I think in addition to the periodic validation via this resync func, we should also add OnAdd/OnUpdate event handlers to the backup storage location informer so that whenever a BSL is created or updated, it immediately triggers a validation. WDYT?

Yes, good idea.

Would we need both the handlers and the BSL validation though?

I'm thinking add the validation to the handlers (create/update), and keep the BSL looping and checking the CR field just so it does the proper periodic logging. Would this work?

This brings up another nuance, which is - if we add a validation check in the backup controller to ensure the BSL is "Available" - should we rely on the CR field, or should we actually try to connect to the object store (e.g. via the backupStore.IsValid() method)? Relying on the CR field is easier/more efficient, but it could theoretically be out-of-date.

If we have the handler validating right on update then it will never be out-of-date. Right?

I think it makes sense to do the validation both when a BSL is added/updated, and periodically (via the resync func). That way, we immediately get an updated phase when the BSL is created or changed, but we are also regularly checking to ensure we can still connect to the object store.

It could still be out-of-date: the minio instance goes down, the S3 credentials are expired, the bucket is deleted, etc. None of those would involve a change to the BSL, but could still mean that it can't be connected to.

Yes, good point!

carlisia · 2020-05-04T19:01:18Z

Ok recapping, I will:

add handlers to do additional validation/update of a BSL status on BSL create/update
keep the validation on the BSL controller so it also validates/updates (and logs) it there
add a validation to backup creation to check if the corresponding BSL is valid

This brings up another nuance, which is - if we add a validation check in the backup controller to ensure the BSL is "Available" - should we rely on the CR field, or should we actually try to connect to the object store (e.g. via the backupStore.IsValid() method)? Relying on the CR field is easier/more efficient, but it could theoretically be out-of-date.

This question pertains to item # 3. My thinking below, taking into account @nrb's comment:

We may want to make this tuneable, as there can be monetary access costs associated with hitting services like S3. This change moves the validation from a one-time startup operation to a recurring one, so I don't want to surprise people with new bills :)

I think doing the validation on create/update + periodically could be good enough and checking for the BSL status only (as opposed to the store is valid) at backup creation is the way to go. Since there are potential costs associated with it, we should aim for less connections to the storage rather than more. Another factor is that each BSL can have its validation period decreased for more frequent checks: https://github.com/vmware-tanzu/velero/pull/2490/files#diff-0528e5cf4cf25ec346aa95f0bd1b2fd4R79 so the user has this level of control.

WDYT? We could go with this and add it to the backup creation later if we find out we need it.

Updated from the original

skriss · 2020-05-04T20:31:31Z

That all sounds reasonable to me! Just a couple clarifications/related comments:

for item # 1, those should go inside the same controller as item # 2 - so you'll have both event handlers on the informer, and a periodic resync function, in that one controller. The restic repository controller does something very similar: https://github.com/vmware-tanzu/velero/blob/master/pkg/controller/restic_repository_controller.go#L78-L87
there are probably other controllers beyond the backup controller where we want to check the phase of the BSL up-front. Probably need to just go through them one-by-one and see if it makes sense to add a check on that phase somewhere at the start of their processing

carlisia · 2020-05-04T20:41:14Z

Oh ok, I was putting the handlers somewhere else, haha.

Great heads up all around. I will comb the controllers. Thanks!

ashish-amarnath · 2020-05-04T20:41:31Z

pkg/cmd/cli/backuplocation/create.go

+	Bucket                string
+	Prefix                string
+	BackupSyncPeriod      time.Duration
+	StoreValidationPeriod time.Duration


would validationFrequency be a better name for this field?

Yes, yes indeed!

ashish-amarnath · 2020-05-04T20:43:00Z

pkg/cmd/cli/backuplocation/create.go

@@ -107,6 +109,10 @@ func (o *CreateOptions) Validate(c *cobra.Command, args []string, f client.Facto
 		return errors.New("--backup-sync-period must be non-negative")
 	}

+	if o.StoreValidationPeriod < 0 {


should we set this to default and continue?

Not sure. What if the user for some reason expects that value to take, and we give them another value. Could this be possible?

ashish-amarnath · 2020-05-04T20:44:28Z

pkg/controller/backup_storage_location_controller.go

+	newPluginManager func(logrus.FieldLogger) clientmgmt.Manager,
+	logger logrus.FieldLogger,
+) Interface {
+	if defaultStoreValidationPeriod <= 0 {


if defaultStoreValidationPeriod == 0 doesn't that mean validation is disabled?

Yes so, this is confusing because of two flags with the same name. Mainly I set this up following the backup-sync-controller. We can remove this or change the condition here.

This right here is a flag on the server. It is set so that IF location.Spec.StoreValidationPeriod further below is not set, we'll use this value. And so, if there wasn't a value passed thru the server either, it is set to 1min.

We could change this to match the BSL flag, in other words, if <=0 then don't ever validate. But I don't think we want to never validate. And I know that's not what you are saying either. Since we do have to validate at some interval, we might as well make it configurable thru the server.

Addendum: *we don't want to never validate a BSL >unless< the user specifically configures it that way.

ashish-amarnath · 2020-05-04T20:51:23Z

pkg/controller/backup_storage_location_controller.go

+	for _, location := range locations {
+		locationName := location.Name
+		log = c.logger.WithField("backupLocation", locationName)
+


Is location.Spec.StoreValidationPeriod == nil also a condition where we should skip validation?

No. Only a specific value of zero is meant to skip validation. Any other value will set the frequency to that value (obvs), and nil means they didn't set it, which then the frequency will be set to the server value, either one set by the user, or the default. Pls be sure to see my comment up above.

pkg/controller/backup_storage_location_controller.go

Signed-off-by: Carlisia <[email protected]>

carlisia · 2020-05-05T01:02:39Z

At any rate.. relying on len(unavailableErrors) == len(locations) no longer works because due to the validation only occurring based on the validation frequency, the validation might be skipped for that reason so we don't know at the end of the iteration what that number truly is.

I'm going to add a method to return all BSLs that are in each phase, or maybe just the count for each, so this logging logic will be much cleaner and the rules clearer.

It might be something we'll want to add to the server status anyway, the listing of each existing BSL, grouped by status/phase.

Signed-off-by: Carlisia <[email protected]>

skriss · 2020-05-08T18:18:00Z

@carlisia the kubebuilder convo reminded me that i do have a couple recommendations on how to structure this that (a) I think would make a potential migration to kubebuilder easier, and (b) would make testing easier, regardless of whether we move to kubebuilder or not. I'm not going to have time to write detailed notes until next week though, but here's a quick summary:

The main idea would be to keep the core "business logic" out of the controller package/struct, and extract it to its own package with a clearly defined interface. the controller should just handle responding to informer events, putting things on a work queue, taking things off the work queue, and then calling into the "business logic" package to actually handle the processing. This would be similar to (a) the server status request controller; and (b) the work I was doing in #2317. Having all the business logic in the controller package and structs IMHO is not the pattern we want to be following going fwd.

carlisia · 2020-05-08T18:34:43Z

Yes, I started thinking this very thing when I was thinking about the webhooks and, regardless, you're right that the structure needs to have the logic not in the controller. Glad for the nudge.

carlisia · 2020-05-28T18:37:26Z

Moved it to Draft because likely will be layering this work on top of #2561.

carlisia · 2020-06-03T14:11:23Z

Closing this PR. I'll be redoing this in a new PR using the kubebuilder/controller-runtime framekwork as per #2597.

carlisia requested review from skriss, nrb and ashish-amarnath May 1, 2020 02:01

carlisia force-pushed the c-1967-readiness branch from 9e158f5 to 595f930 Compare May 1, 2020 02:10

carlisia changed the title ~~Add BSL status phase based on validation and log according; remove validation from the server~~ Add BSL status phase based on validation and log accordingly; remove validation from the server May 1, 2020

carlisia mentioned this pull request May 1, 2020

Add status column to get BSL output #2493

Merged

carlisia self-assigned this May 1, 2020

nrb reviewed May 1, 2020

View reviewed changes

pkg/controller/backup_storage_location_controller.go Show resolved Hide resolved

nrb reviewed May 1, 2020

View reviewed changes

Carlisia added 4 commits May 4, 2020 08:54

Add BSL status phase based on validation

9cd0479

Signed-off-by: Carlisia <[email protected]>

Add generated files

3584d0e

Signed-off-by: Carlisia <[email protected]>

Add changelog

9166ef8

Signed-off-by: Carlisia <[email protected]>

Rebased

e4e7273

Signed-off-by: Carlisia <[email protected]>

carlisia force-pushed the c-1967-readiness branch from a7abe55 to e4e7273 Compare May 4, 2020 16:15

carlisia requested a review from nrb May 4, 2020 16:21

skriss suggested changes May 4, 2020

View reviewed changes