This repository has been archived by the owner on Nov 1, 2022. It is now read-only.

Allow automated image updates to partly fail #2619

Closed
HaveFun83 opened this issue Nov 18, 2019 · 6 comments
Labels
enhancement UX In pursuit of a delightful user experience

Comments

@HaveFun83

HaveFun83 commented Nov 18, 2019

Describe the bug
If one automated workload fails due to an error like #2618, all automated workloads will be skipped.

To Reproduce
Steps to reproduce the behaviour:

fluxcd - helmoperator

  1. Check in and automate two independent HelmRelease workloads.
    1.1 One of them should have an error like "fluxcd failed to automate helmrelease chart-image" #2618
  2. Update the image in the Docker registry for one of the workloads.
  3. See the Flux pod logs:
ts=2019-11-15T04:46:43.896903279Z caller=images.go:17 component=sync-loop msg="polling for new images for automated workloads"
ts=2019-11-15T04:46:44.631945883Z caller=images.go:111 component=sync-loop workload=test:helmrelease/truckdispo container=chart-image repo=private-docker-registry.net/test/app-dispo pattern=semver:~1.0 current=private-docker-registry.net/test/app-dispo:1.0.20191112025302 info="added update to automation run" new=private-docker-registry.net/test/app-dispo:1.0.20191115025305 reason="latest 1.0.20191115025305 (1970-01-01 00:00:00 +0000 UTC) > current 1.0.20191112025302 (1970-01-01 00:00:00 +0000 UTC)"
ts=2019-11-15T04:46:44.632758905Z caller=images.go:111 component=sync-loop workload=test:helmrelease/rabbitmq container=chart-image repo=private-docker-registry.net/public/bitnami/rabbitmq pattern=semver:~3.8 current=private-docker-registry.net/public/bitnami/rabbitmq:3.8.0-debian-9-r0 info="added update to automation run" new=private-docker-registry.net/public/bitnami/rabbitmq:3.8.1 reason="latest 3.8.1 (2019-11-11 09:54:51.753729911 +0000 UTC) > current 3.8.0-debian-9-r0 (2019-10-09 14:52:20.19194642 +0000 UTC)"
ts=2019-11-15T04:46:44.632938851Z caller=loop.go:135 component=sync-loop jobID=cbcffd5d-5f97-de26-2303-f46eed9d7492 state=in-progress
ts=2019-11-15T04:46:45.190432924Z caller=releaser.go:59 component=sync-loop jobID=cbcffd5d-5f97-de26-2303-f46eed9d7492 type=release updates=2
ts=2019-11-15T04:46:49.852120628Z caller=loop.go:145 component=sync-loop jobID=cbcffd5d-5f97-de26-2303-f46eed9d7492 state=done success=false err="verifying changes: failed to verify changes: the image for container \"chart-image\" in resource \"test:helmrelease/rabbitmq\" should be \"private-docker-registry.net/public/bitnami/rabbitmq:3.8.1\", but is \"private-docker-registry.net/bitnami/rabbitmq:3.8.1\""
ts=2019-11-15T04:47:45.09607582Z caller=images.go:17 component=sync-loop msg="polling for new images for automated workloads"
ts=2019-11-15T04:47:45.593071926Z caller=images.go:111 component=sync-loop workload=test:helmrelease/rabbitmq container=chart-image repo=private-docker-registry.net/public/bitnami/rabbitmq pattern=semver:~3.8 current=private-docker-registry.net/public/bitnami/rabbitmq:3.8.0-debian-9-r0 info="added update to automation run" new=private-docker-registry.net/public/bitnami/rabbitmq:3.8.1 reason="latest 3.8.1 (2019-11-11 09:54:51.753729911 +0000 UTC) > current 3.8.0-debian-9-r0 (2019-10-09 14:52:20.19194642 +0000 UTC)"
ts=2019-11-15T04:47:45.59353292Z caller=images.go:111 component=sync-loop workload=test:helmrelease/truckdispo container=chart-image repo=private-docker-registry.net/test/app-dispo pattern=semver:~1.0 current=private-docker-registry.net/test/app-dispo:1.0.20191112025302 info="added update to automation run" new=private-docker-registry.net/test/app-dispo:1.0.20191115025305 reason="latest 1.0.20191115025305 (1970-01-01 00:00:00 +0000 UTC) > current 1.0.20191112025302 (1970-01-01 00:00:00 +0000 UTC)"

The log line
ts=2019-11-15T04:46:45.190432924Z caller=releaser.go:59 component=sync-loop jobID=cbcffd5d-5f97-de26-2303-f46eed9d7492 type=release updates=2
is misleading, as neither of the two images will be updated through Flux.

Only after the failing chart-image was locked was Flux able to automate the other image again.

Expected behavior
Flux should properly log and skip failed automated releases, so that an error in one release does not affect all other automated releases.
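For illustration only, here is a minimal Go sketch of the difference between an all-or-nothing run and per-workload error handling. It is not the actual Flux sync-loop code; all types and function names below are made up:

```go
package main

import "log"

// update describes one pending image update. It is a stand-in for
// whatever structure the real sync loop uses (hypothetical).
type update struct {
	Workload string
	NewImage string
}

// applyUpdate stands in for committing the new image to git and
// verifying the change (hypothetical; always succeeds here).
func applyUpdate(u update) error { return nil }

// releaseAll applies each update independently: a failure for one
// workload is logged and skipped instead of aborting the whole run,
// which is the behaviour this issue asks for.
func releaseAll(updates []update) (applied int, failed []update) {
	for _, u := range updates {
		if err := applyUpdate(u); err != nil {
			log.Printf("skipping workload %s: %v", u.Workload, err)
			failed = append(failed, u)
			continue
		}
		applied++
	}
	return applied, failed
}

func main() {
	updates := []update{
		{Workload: "test:helmrelease/truckdispo", NewImage: "private-docker-registry.net/test/app-dispo:1.0.20191115025305"},
		{Workload: "test:helmrelease/rabbitmq", NewImage: "private-docker-registry.net/public/bitnami/rabbitmq:3.8.1"},
	}
	applied, failed := releaseAll(updates)
	log.Printf("applied=%d failed=%d", applied, len(failed))
}
```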

Logs

see above

Additional context

  • Flux version: 1.15.0
  • Helm Operator version: 1.0.0-rc3
  • Kubernetes version: 1.15.4
  • Git provider: private gitlab
  • Container registry provider: private harbor
@HaveFun83 HaveFun83 added blocked-needs-validation Issue is waiting to be validated before we can proceed bug labels Nov 18, 2019
@hiddeco
Member

hiddeco commented Nov 18, 2019

I think it would be better to make this behaviour possible by adding an additional flag to the fluxctl release command, e.g. --allow-failures (or the opposite, but this proposal would be backwards compatible).

This would make the behaviour and exit status code predictable based on the configured flags.
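Purely as a hypothetical sketch of that proposal (the --allow-failures flag does not exist in fluxctl; this only shows how an opt-in flag could be wired up with cobra):

```go
package main

import (
	"fmt"

	"github.com/spf13/cobra"
)

func main() {
	// Hypothetical: --allow-failures is only a proposal in this issue,
	// not an existing fluxctl flag.
	var allowFailures bool

	cmd := &cobra.Command{
		Use: "release",
		RunE: func(cmd *cobra.Command, args []string) error {
			// The release logic would consult allowFailures to decide
			// whether a failing workload aborts the whole run.
			fmt.Println("allow-failures:", allowFailures)
			return nil
		},
	}
	cmd.Flags().BoolVar(&allowFailures, "allow-failures", false,
		"apply the image updates that succeed and only report the ones that fail")

	_ = cmd.Execute()
}
```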

@hiddeco hiddeco added enhancement UX In pursuit of a delightful user experience and removed blocked-needs-validation Issue is waiting to be validated before we can proceed bug labels Nov 18, 2019
@hiddeco hiddeco changed the title fluxcd all automated releases failed if sync-loop failed Allow fluxctl release --update-all-images to partly fail Nov 18, 2019
@HaveFun83
Author

I think it would be better to make this behaviour possible by adding an additional flag to the fluxctl release command, e.g. --allow-failures (or the opposite, but this proposal would be backwards compatible).

This would make the behaviour and exit status code predictable based on the configured flags.

Cool, thanks for your quick response.
But we don't use fluxctl in this case, only the normal poll interval for new images and changes in the git repo.

@hiddeco hiddeco changed the title Allow fluxctl release --update-all-images to partly fail Allow image updates to partly fail Nov 18, 2019
@hiddeco
Member

hiddeco commented Nov 18, 2019

@squaremo given the above, is there a reason we currently have a 'succeed all or fail' policy for automated image updates?

@squaremo squaremo changed the title Allow image updates to partly fail Allow automated image updates to partly fail Nov 18, 2019
@squaremo
Member

I would grant that it's not always -- or even usually -- necessary to make updates atomic (all-or-nothing).

Given the nature of Kubernetes, no-one should be relying on atomic changes. Even if committed to git, and applied to the API at the same time, they won't take effect at the same time. But inconsistencies due to partial updates would persist indefinitely, and probably need human intervention to resolve. So at the least, the failing parts should be made quite visible.

@hiddeco
Member

hiddeco commented Nov 18, 2019

We could technically just add them to the commit note as 'attempted but failed', to make it observable outside the fluxd logs (and maybe even push them as a notification towards the upstream).

@squaremo
Member

We could technically just add them to the commit note as 'attempted but failed', to make it observable outside the fluxd logs (and maybe even push them as a notification towards the upstream).

Yes, and yes. And perhaps introduce (adapt?) a metric for the number of automated updates succeeding/failing.
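As a rough illustration of the metric idea, a sketch using prometheus/client_golang (the metric name and label are made up, and Flux's actual instrumentation may use a different library entirely):

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// automationUpdates counts automated image updates by outcome.
// The name and "success" label are illustrative only.
var automationUpdates = promauto.NewCounterVec(prometheus.CounterOpts{
	Name: "flux_automation_image_updates_total",
	Help: "Automated image updates attempted, partitioned by outcome.",
}, []string{"success"})

// recordUpdate would be called once per attempted workload update.
func recordUpdate(err error) {
	if err != nil {
		automationUpdates.WithLabelValues("false").Inc()
		return
	}
	automationUpdates.WithLabelValues("true").Inc()
}

func main() {
	recordUpdate(nil) // e.g. one successful update

	http.Handle("/metrics", promhttp.Handler())
	_ = http.ListenAndServe(":2112", nil)
}
```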
