This repository has been archived by the owner on Nov 1, 2022. It is now read-only.

Allow automated image updates to partly fail #2619

Closed
HaveFun83 opened this issue Nov 18, 2019 · 6 comments
Labels
enhancement UX In pursuit of a delightful user experience

Comments

@HaveFun83

HaveFun83 commented Nov 18, 2019

Describe the bug
If one automated workload fails due to an error like #2618, all automated workloads will be skipped.

To Reproduce
Steps to reproduce the behaviour:

fluxcd - helmoperator

  1. Check in and automate two independent HelmRelease workloads.
    1.1 One of them should have an error like "fluxcd failed to automate helmrelease chart-image" #2618
  2. Update the image in the Docker registry for one of the workloads.
  3. See the Flux pod logs:
ts=2019-11-15T04:46:43.896903279Z caller=images.go:17 component=sync-loop msg="polling for new images for automated workloads"
ts=2019-11-15T04:46:44.631945883Z caller=images.go:111 component=sync-loop workload=test:helmrelease/truckdispo container=chart-image repo=private-docker-registry.net/test/app-dispo pattern=semver:~1.0 current=private-docker-registry.net/test/app-dispo:1.0.20191112025302 info="added update to automation run" new=private-docker-registry.net/test/app-dispo:1.0.20191115025305 reason="latest 1.0.20191115025305 (1970-01-01 00:00:00 +0000 UTC) > current 1.0.20191112025302 (1970-01-01 00:00:00 +0000 UTC)"
ts=2019-11-15T04:46:44.632758905Z caller=images.go:111 component=sync-loop workload=test:helmrelease/rabbitmq container=chart-image repo=private-docker-registry.net/public/bitnami/rabbitmq pattern=semver:~3.8 current=private-docker-registry.net/public/bitnami/rabbitmq:3.8.0-debian-9-r0 info="added update to automation run" new=private-docker-registry.net/public/bitnami/rabbitmq:3.8.1 reason="latest 3.8.1 (2019-11-11 09:54:51.753729911 +0000 UTC) > current 3.8.0-debian-9-r0 (2019-10-09 14:52:20.19194642 +0000 UTC)"
ts=2019-11-15T04:46:44.632938851Z caller=loop.go:135 component=sync-loop jobID=cbcffd5d-5f97-de26-2303-f46eed9d7492 state=in-progress
ts=2019-11-15T04:46:45.190432924Z caller=releaser.go:59 component=sync-loop jobID=cbcffd5d-5f97-de26-2303-f46eed9d7492 type=release updates=2
ts=2019-11-15T04:46:49.852120628Z caller=loop.go:145 component=sync-loop jobID=cbcffd5d-5f97-de26-2303-f46eed9d7492 state=done success=false err="verifying changes: failed to verify changes: the image for container \"chart-image\" in resource \"test:helmrelease/rabbitmq\" should be \"private-docker-registry.net/public/bitnami/rabbitmq:3.8.1\", but is \"private-docker-registry.net/bitnami/rabbitmq:3.8.1\""
ts=2019-11-15T04:47:45.09607582Z caller=images.go:17 component=sync-loop msg="polling for new images for automated workloads"
ts=2019-11-15T04:47:45.593071926Z caller=images.go:111 component=sync-loop workload=test:helmrelease/rabbitmq container=chart-image repo=private-docker-registry.net/public/bitnami/rabbitmq pattern=semver:~3.8 current=private-docker-registry.net/public/bitnami/rabbitmq:3.8.0-debian-9-r0 info="added update to automation run" new=private-docker-registry.net/public/bitnami/rabbitmq:3.8.1 reason="latest 3.8.1 (2019-11-11 09:54:51.753729911 +0000 UTC) > current 3.8.0-debian-9-r0 (2019-10-09 14:52:20.19194642 +0000 UTC)"
ts=2019-11-15T04:47:45.59353292Z caller=images.go:111 component=sync-loop workload=test:helmrelease/truckdispo container=chart-image repo=private-docker-registry.net/test/app-dispo pattern=semver:~1.0 current=private-docker-registry.net/test/app-dispo:1.0.20191112025302 info="added update to automation run" new=private-docker-registry.net/test/app-dispo:1.0.20191115025305 reason="latest 1.0.20191115025305 (1970-01-01 00:00:00 +0000 UTC) > current 1.0.20191112025302 (1970-01-01 00:00:00 +0000 UTC)"

The log line
ts=2019-11-15T04:46:45.190432924Z caller=releaser.go:59 component=sync-loop jobID=cbcffd5d-5f97-de26-2303-f46eed9d7492 type=release updates=2
is misleading, as neither of the two images will be updated through Flux.

Only after the failing chart-image was locked was Flux able to automate the other image again.

Expected behavior
Flux should properly log and skip failed automated releases, so that an error in one release does not affect all other automated releases.
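For illustration only, here is a minimal Go sketch of the difference between an all-or-nothing run and per-workload error handling. It is not the actual Flux sync-loop code; all types and function names below are made up:

```go
package main

import "log"

// update describes one pending image update. It is a stand-in for
// whatever structure the real sync loop uses (hypothetical).
type update struct {
	Workload string
	NewImage string
}

// applyUpdate stands in for committing the new image to git and
// verifying the change (hypothetical; always succeeds here).
func applyUpdate(u update) error { return nil }

// releaseAll applies each update independently: a failure for one
// workload is logged and skipped instead of aborting the whole run,
// which is the behaviour this issue asks for.
func releaseAll(updates []update) (applied int, failed []update) {
	for _, u := range updates {
		if err := applyUpdate(u); err != nil {
			log.Printf("skipping workload %s: %v", u.Workload, err)
			failed = append(failed, u)
			continue
		}
		applied++
	}
	return applied, failed
}

func main() {
	updates := []update{
		{Workload: "test:helmrelease/truckdispo", NewImage: "private-docker-registry.net/test/app-dispo:1.0.20191115025305"},
		{Workload: "test:helmrelease/rabbitmq", NewImage: "private-docker-registry.net/public/bitnami/rabbitmq:3.8.1"},
	}
	applied, failed := releaseAll(updates)
	log.Printf("applied=%d failed=%d", applied, len(failed))
}
```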

Logs

see above

Additional context

  • Flux version: 1.15.0
  • Helm Operator version: 1.0.0-rc3
  • Kubernetes version: 1.15.4
  • Git provider: private gitlab
  • Container registry provider: private harbor
@HaveFun83 HaveFun83 added blocked-needs-validation Issue is waiting to be validated before we can proceed bug labels Nov 18, 2019
@hiddeco
Member

hiddeco commented Nov 18, 2019

I think it would be better to make this behaviour possible by adding an additional flag to the fluxctl release command, e.g. --allow-failures (or the opposite, but this proposal would be backwards compatible).

This would make the behaviour and exit status code predictable based on the configured flags.
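Purely as a hypothetical sketch of that proposal (the --allow-failures flag does not exist in fluxctl; this only shows how an opt-in flag could be wired up with cobra):

```go
package main

import (
	"fmt"

	"github.com/spf13/cobra"
)

func main() {
	// Hypothetical: --allow-failures is only a proposal in this issue,
	// not an existing fluxctl flag.
	var allowFailures bool

	cmd := &cobra.Command{
		Use: "release",
		RunE: func(cmd *cobra.Command, args []string) error {
			// The release logic would consult allowFailures to decide
			// whether a failing workload aborts the whole run.
			fmt.Println("allow-failures:", allowFailures)
			return nil
		},
	}
	cmd.Flags().BoolVar(&allowFailures, "allow-failures", false,
		"apply the image updates that succeed and only report the ones that fail")

	_ = cmd.Execute()
}
```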

@hiddeco hiddeco added enhancement UX In pursuit of a delightful user experience and removed blocked-needs-validation Issue is waiting to be validated before we can proceed bug labels Nov 18, 2019
@hiddeco hiddeco changed the title fluxcd all automated releases failed if sync-loop failed Allow fluxctl release --update-all-images to partly fail Nov 18, 2019
@HaveFun83
Author

I think it would be better to make this behaviour possible by adding an additional flag to the fluxctl release command, e.g. --allow-failures (or the opposite, but this proposal would be backwards compatible).

This would make the behaviour and exit status code predictable based on the configured flags.

Cool, thanks for your quick response.
But we don't use fluxctl in this case, only the normal poll interval for new images and changes in the git repo.

@hiddeco hiddeco changed the title Allow fluxctl release --update-all-images to partly fail Allow image updates to partly fail Nov 18, 2019
@hiddeco
Member

hiddeco commented Nov 18, 2019

@squaremo given the above, is there a reason we currently have a 'succeed all or fail' policy for automated image updates?

@squaremo squaremo changed the title Allow image updates to partly fail Allow automated image updates to partly fail Nov 18, 2019
@squaremo
Member

I would grant that it's not always -- or even usually -- necessary to make updates atomic (all-or-nothing).

Given the nature of Kubernetes, no-one should be relying on atomic changes. Even if committed to git, and applied to the API at the same time, they won't take effect at the same time. But inconsistencies due to partial updates would persist indefinitely, and probably need human intervention to resolve. So at the least, the failing parts should be made quite visible.

@hiddeco
Member

hiddeco commented Nov 18, 2019

We could technically just add them to the commit note as 'attempted but failed', to make it observable outside the fluxd logs (and maybe even push them as a notification towards the upstream).

@squaremo
Member

We could technically just add them to the commit note as 'attempted but failed', to make it observable outside the fluxd logs (and maybe even push them as a notification towards the upstream).

Yes, and yes. And perhaps introduce (adapt?) a metric for the number of automated updates succeeding/failing.
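As a rough illustration of the metric idea, a sketch using prometheus/client_golang (the metric name and label are made up, and Flux's actual instrumentation may use a different library entirely):

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// automationUpdates counts automated image updates by outcome.
// The name and "success" label are illustrative only.
var automationUpdates = promauto.NewCounterVec(prometheus.CounterOpts{
	Name: "flux_automation_image_updates_total",
	Help: "Automated image updates attempted, partitioned by outcome.",
}, []string{"success"})

// recordUpdate would be called once per attempted workload update.
func recordUpdate(err error) {
	if err != nil {
		automationUpdates.WithLabelValues("false").Inc()
		return
	}
	automationUpdates.WithLabelValues("true").Inc()
}

func main() {
	recordUpdate(nil) // e.g. one successful update

	http.Handle("/metrics", promhttp.Handler())
	_ = http.ListenAndServe(":2112", nil)
}
```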
