fix: completed workflow tracking (#3)
Problem:

The workflow cache can contain out-of-date information that resurrects a completed workflow. In that scenario, an on-exit handler with `when: "{{workflow.status}} != Succeeded"` may fire erroneously. Firing enough copies of this workflow into a real cluster fast enough reproduces the problem; tested on 3.4.8, 3.4.9, and master:

```
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: exit-
spec:
  entrypoint: entrypoint
  onExit: exit-handler
  templates:
    - name: entrypoint
      container:
        image: quay.io/pch/whalesay:latest
        command: [ cowsay ]
        args: [ "hello world" ]
        resources:
          limits:
            memory: 32Mi
            cpu: 100m
    - name: ohnoes
      container:
        image: quay.io/pch/whalesay:latest
        command: [ cowsay ]
        args: [ "oh noes" ]
        resources:
          limits:
            memory: 32Mi
            cpu: 100m
    - name: exit-handler
      steps:
        - - name: ohnoes
            when: "{{workflow.status}} != Succeeded"
            template: ohnoes
```

It does not reproduce easily on a toy cluster in k3d, I believe because API latencies there are too small and the whole machine starts to fall apart before the problem occurs. It happens rarely in real workloads, but if `ohnoes` is replaced by a notification, you'll be told that your workflow has failed even though it has succeeded by the time you go and look at it.

Story:

The workflow controller is busy, and the k8s API is busy. The k8s API is providing workflow updates, which run through the workflow informer cache. Normally these cannot go 'backwards'. Backwards would mean an update takes our view of the workflow back to an earlier point in time because we've received a high-latency update from the k8s API. We prevent many of these using the UpdateFunc handler, which stops us reconciling at a point we've already seen.

The problem is that as a workflow completes we drop it from the informer cache (it no longer meets `reconciliationNeeded`). This allows it to come back into the cache in an old form as a new resource. If this happens, we attempt to reconcile it. If the pod representing the node that was running in that reappearing old form of the workflow no longer exists, we (correctly) treat the workflow as in error. So we have a workflow exiting in error; we test the onExit `when` clause, decide it's true, and launch the pod to handle it (`ohnoes` in my example). This runs regardless of the fact that the workflow controller then discovers the workflow's resourceVersion is out of date and cannot write the current state back to the cluster.

Fixes considered:

* Attempt to ensure we're up to date before actioning the onExit. This seemed like a very convoluted code path for special onExit handling, and likely to introduce more bugs than it fixed. It would either put much more load on the API by making the controller fetch the latest version, or attempt the write-back first to see whether we were out of date.
* Attempt to prevent old versions re-entering the cache, rather than catching them at the reconcile stage. I tried this, but there is still a hole where an old version makes it through.

So the fix is to separately track workflows that have been evicted from the cache and not process them at all (sketched below). This tracking has a lifetime of 10 minutes, as I saw latencies of up to a few minutes during heavy workload testing.

Signed-off-by: Alan Clucas <[email protected]>
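To make the chosen fix concrete, here is a minimal Go sketch of the idea, not the code from this commit: a TTL-bounded set of recently completed workflow keys that the controller consults before reconciling. The names (`recentlyCompletedTracker`, `add`, `contains`) and the package layout are illustrative assumptions, not the actual identifiers in the Argo Workflows controller.

```
package controller

import (
	"sync"
	"time"
)

// recentlyCompletedTracker is an illustrative sketch of remembering workflows
// that recently completed and were evicted from the informer cache, so that a
// stale copy re-entering the cache is not reconciled again.
type recentlyCompletedTracker struct {
	mu      sync.Mutex
	ttl     time.Duration
	entries map[string]time.Time // workflow key ("namespace/name") -> completion time
}

func newRecentlyCompletedTracker(ttl time.Duration) *recentlyCompletedTracker {
	return &recentlyCompletedTracker{
		ttl:     ttl,
		entries: map[string]time.Time{},
	}
}

// add records that the workflow with this key just completed and was dropped
// from the cache.
func (t *recentlyCompletedTracker) add(key string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.entries[key] = time.Now()
}

// contains reports whether the key completed within the TTL, expiring stale
// entries as it goes.
func (t *recentlyCompletedTracker) contains(key string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	for k, when := range t.entries {
		if time.Since(when) > t.ttl {
			delete(t.entries, k)
		}
	}
	_, ok := t.entries[key]
	return ok
}
```

In this sketch, the controller would call `add(key)` when a completed workflow is evicted from the cache (with a 10-minute TTL, matching the commit), and the reconcile path would skip any workflow for which `contains(key)` is true, so a resurrected old copy is ignored rather than reconciled.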