This repository has been archived by the owner on Nov 1, 2022. It is now read-only.

Report errors at kubernetes level #2695

Closed
Docteur-RS opened this issue Dec 16, 2019 · 6 comments

Comments

@Docteur-RS

Describe the feature

Actual state

When a fatal error occurs in Flux, it gets logged to the container's output.
(By fatal errors I mean things like cloning failures or YAML syntax errors.)
The only way to see those errors and know the state of Flux is to read the logs using kubectl logs flux

Requested feature

An enhancement would be to have fatal errors reported directly in the pod's events.
Then we could simply run kubectl describe pod flux and get a general idea of what is going on.

If this is too complicated, maybe just set an annotation on the pod to advertise the status (either Ok or Error state)
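To make the annotation idea concrete, here is a minimal sketch of the JSON merge patch a process could send to set such an annotation. The annotation key example.com/flux-sync-status is hypothetical (Flux v1 defines no such annotation), and the sketch only builds the patch body; it does not talk to the API server.

```python
import json

# Hypothetical annotation key: Flux v1 does not define one. It only
# illustrates the "advertise Ok/Error via an annotation" proposal.
STATUS_ANNOTATION = "example.com/flux-sync-status"


def status_patch(ok: bool) -> str:
    """Build a JSON merge patch body that sets the status annotation."""
    patch = {
        "metadata": {
            "annotations": {STATUS_ANNOTATION: "Ok" if ok else "Error"}
        }
    }
    return json.dumps(patch, sort_keys=True)


if __name__ == "__main__":
    print(status_patch(ok=False))
```

A body like this could be applied by hand with kubectl patch pod <pod-name> --type merge -p '<body>'.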

What would the new user story look like?
When the last commit does not seem to be deployed in time and you have to investigate, users can check the events on the namespace or on the Flux pod directly without having to dig into the logs.
This allows everyone to debug Flux problems, not only people used to reading its logs.

What would the new interaction with Flux look like? E.g.

  • Commit/push
  • fluxctl sync
  • kubectl get events -n flux

Expected behavior
Errors (or at least the Flux state) should appear at the Kubernetes level.

@Docteur-RS Docteur-RS added blocked-needs-validation Issue is waiting to be validated before we can proceed enhancement labels Dec 16, 2019
@2opremio 2opremio added help wanted and removed blocked-needs-validation Issue is waiting to be validated before we can proceed labels Jan 10, 2020
@2opremio
Contributor

I don't know how hard it is for a pod to change its own status (or whether it's possible at all), but it seems like an abuse of Kubernetes' status mechanism, since the problem most likely isn't coming from the pod itself but from its environment.

I think Prometheus metrics are a much better solution (e.g. #2535).

@Docteur-RS
Author

I have been using Kubefed lately and I really loved having events reported at the Kubernetes level. The events were not logged on Kubefed's own pods but on each of the federated resources it created.

Come to think of it, it's similar to how the Helm Operator works. Maybe the status of each deployment could be added to the events of the HelmReleases (though this is not the right repo to debate that)?

For Flux itself, maybe it could report events at the namespace level?
However, I still think it would be great if a simple describe on the Flux pod could give us the status of the synchronization! ;-)

@stefanprodan
Member

stefanprodan commented Jan 10, 2020

Flux doesn't own the pods or the namespaces, so issuing events on those objects is not an option. Kubernetes compacts events, so relying on events for critical information is not a good idea. I don't see any advantage of running describe over reading the logs.
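Event compaction is easier to see with a toy model. The sketch below is a simplified assumption, not client-go's actual event correlator: repeated identical events for the same object collapse into one record whose count and last-seen time are bumped, and records expire after a TTL (kube-apiserver's --event-ttl, one hour by default). Object and reason names used with it are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# Simplified model (an assumption, not client-go's EventCorrelator):
# identical events for the same object merge into a single record.
EventKey = Tuple[str, str, str]  # (involved object, reason, message)


@dataclass
class EventRecord:
    count: int
    first_seen: float
    last_seen: float


class EventStore:
    def __init__(self, ttl_seconds: float = 3600.0) -> None:
        # kube-apiserver keeps events for --event-ttl (1h by default).
        self.ttl = ttl_seconds
        self.records: Dict[EventKey, EventRecord] = {}

    def record(self, key: EventKey, now: float) -> None:
        rec = self.records.get(key)
        if rec is None:
            self.records[key] = EventRecord(1, now, now)
        else:
            # Repeats only bump the counter; individual occurrences are lost.
            rec.count += 1
            rec.last_seen = now

    def visible(self, now: float) -> Dict[EventKey, EventRecord]:
        # Records not refreshed within the TTL disappear entirely.
        return {
            k: r for k, r in self.records.items() if now - r.last_seen < self.ttl
        }
```

After a burst of identical sync failures only one record with a count survives, and once the TTL has passed since the last occurrence even that is gone, which is why events make a poor store for critical status.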

@Docteur-RS
Author

Docteur-RS commented Jan 13, 2020

OK, let's forget about putting them in the events section, then.
=> Why not propagate them into a status section on the Flux pod?

We can all read events from the logs, but there is quite a lot going on in there...
Am I the only one getting thousands of lines that have nothing to do with the current GitOps status? [flux logs snippet example]

In my personal experience, I often think something's wrong, go through the logs, and find nothing. Usually that's because nothing is wrong in Flux and the problem comes from elsewhere, but if I had access to k describe flux_pod => status: synced or status: synced error, I would lose less time wondering what is happening inside Flux's brain and could act accordingly.
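Until Flux exposes its status elsewhere, a rough way to cut through the log noise is to keep only entries that carry an error. This sketch assumes the daemon writes logfmt (key=value pairs, values optionally double-quoted, as go-kit loggers produce) and that failures carry an err key; both are assumptions about the log format rather than documented guarantees, and lines with unbalanced quotes are simply skipped.

```python
import shlex
from typing import Dict, Iterable, List


def parse_logfmt(line: str) -> Dict[str, str]:
    """Parse one logfmt line into a dict of key -> value."""
    fields: Dict[str, str] = {}
    try:
        # shlex joins key="quoted value" into the single token key=quoted value.
        tokens = shlex.split(line)
    except ValueError:
        # Unbalanced quotes: treat the line as unparseable.
        return fields
    for token in tokens:
        key, sep, value = token.partition("=")
        if sep:  # ignore stray tokens without '='
            fields[key] = value
    return fields


def error_lines(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Keep only the entries that carry an 'err' field (assumed convention)."""
    return [f for f in map(parse_logfmt, lines) if "err" in f]
```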

I don't represent a huge corporation with hundreds of Flux instances or more. But I do have around forty Flux instances deploying Kubefed resources, and it's starting to get messy in there...

I think that quick feedback on what each Flux instance is doing has some value. But maybe it's just me ;-)

@Docteur-RS
Author

Examples of valid statuses I can think of:

Field one:
  • Empty => repo has not been cloned
  • cloned => repo has been cloned but not synced yet
  • Synced => repo has been synced
  • cloneError => cloning the repo has failed
  • SyncedError => syncing the files has failed

A second field would hold the current commit hash.
A third field would store any error messages. Especially for the apply step, when there's an error in a specific file, getting its name would be life-changing...
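The proposed fields can be modelled as a small state holder. This is a hypothetical sketch of the proposal, not something Flux v1 implements: the state values are copied from the list (with Empty represented as an empty string), and on_clone/on_apply are invented names for the points where the daemon would update the status.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SyncState(Enum):
    # Values from the proposal; "Empty" is modelled as "".
    EMPTY = ""
    CLONED = "cloned"
    SYNCED = "Synced"
    CLONE_ERROR = "cloneError"
    SYNCED_ERROR = "SyncedError"


@dataclass
class FluxStatus:
    state: SyncState = SyncState.EMPTY  # field one
    commit: Optional[str] = None        # field two: current commit hash
    error: Optional[str] = None         # field three: last error message

    def on_clone(self, commit: Optional[str], err: Optional[str] = None) -> None:
        """Hypothetical hook called after a clone/fetch attempt."""
        if err:
            # Keep the last known commit; only the state and error change.
            self.state, self.error = SyncState.CLONE_ERROR, err
        else:
            self.state, self.commit, self.error = SyncState.CLONED, commit, None

    def on_apply(self, err: Optional[str] = None, path: Optional[str] = None) -> None:
        """Hypothetical hook called after applying manifests."""
        if err:
            self.state = SyncState.SYNCED_ERROR
            # Prefix the failing file name when it is known.
            self.error = f"{path}: {err}" if path else err
        else:
            self.state, self.error = SyncState.SYNCED, None
```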

Finally, I would love a status.runtime section showing what Flux is doing at the moment.
I'm having a hard time knowing when the next synchronisation will take place.
There seem to be a synchronisation step and an apply step. I often find myself rushing a commit to get it in before the next sync/pull, hoping it will be part of the next release.
If I could know a little more about what Flux was up to, it would feel less like a black box.
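Part of the answer to "when is the next sync?" is arithmetic. Flux v1's daemon polls git every --git-poll-interval and syncs at least every --sync-interval (both 5m by default). Assuming ticks are evenly spaced from a known last run (which ignores fluxctl sync, jitter, and the extra syncs that newly fetched commits can trigger), the next tick can be estimated like this:

```python
from datetime import datetime, timedelta


def next_tick(last: datetime, interval: timedelta, now: datetime) -> datetime:
    """Return the first time last + k*interval (k >= 1) strictly after `now`.

    Assumes perfectly periodic ticks anchored at `last`.
    """
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    if now < last:
        return last + interval
    elapsed = now - last
    k = elapsed // interval + 1  # timedelta // timedelta -> int
    return last + k * interval
```

With a 5-minute interval and a last poll at 10:00, a commit pushed at 10:02 would be fetched at 10:05 at the earliest under these assumptions.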

IDK, maybe this is just my dream. But it doesn't hurt to ask!

@kingdonb
Member

Thank you for the suggestion. It cannot be implemented in Flux v1 for reasons I think were explained in the thread.

In the next version of Flux, this feedback has been incorporated. We still do not write Flux status errors into pod events, since that is still not possible, but the design of Flux v2, also known as the GitOps Toolkit, is based on an API of CRDs. The failure states that can be encountered (like "failed to get latest commit from git", "failed to apply manifests to cluster", ...) are now observable as Kubernetes Events on the Kustomization, HelmRelease, and GitRepository CRDs.
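With Flux v2 the same information can also be read from the CRDs' status. The sketch below assumes the JSON produced by kubectl get kustomizations -A -o json and reads each object's Ready condition and status.lastAppliedRevision; it is an illustration only, and the flux get kustomizations command reports similar information directly.

```python
import json
from typing import Any, Dict, List, Tuple


def ready_condition(obj: Dict[str, Any]) -> Dict[str, Any]:
    """Return the Ready condition of a Flux v2 object, or {} if absent."""
    for cond in obj.get("status", {}).get("conditions", []):
        if cond.get("type") == "Ready":
            return cond
    return {}


def summarize(list_json: str) -> List[Tuple[str, str, str, str]]:
    """Turn `kubectl get kustomizations -o json` output into
    (namespace/name, ready, last applied revision, message) rows."""
    rows = []
    for item in json.loads(list_json).get("items", []):
        meta = item.get("metadata", {})
        cond = ready_condition(item)
        rows.append((
            f"{meta.get('namespace', '')}/{meta.get('name', '')}",
            cond.get("status", "Unknown"),
            item.get("status", {}).get("lastAppliedRevision", ""),
            cond.get("message", ""),
        ))
    return rows
```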

Flux v1 is in maintenance mode now, and is not adding any new features unless they are critical.

As Flux contributor efforts have been focused on Flux v2, the Flux project has moved to a new repo, fluxcd/flux2.

In the interest of reducing the number of open issues not directly related to supporting Flux v1 in maintenance mode, and respecting you may have moved on already, I will go ahead and close out this issue for now.

If you have a use case for Flux that isn't covered well in the new Flux v2 (which is a total rewrite), we want to hear about it.

If you've been following our development efforts then of course we hope you are able to upgrade, here's more info on how to find support with that: https://fluxcd.io/support/
