containers in containersets not appropriately reporting status if terminated #8545

Closed
3 tasks done
Tracked by #8632
jeszii opened this issue Apr 29, 2022 · 21 comments · Fixed by #8620

Comments


jeszii commented Apr 29, 2022

Checklist

  • Double-checked my configuration.
  • Tested using the latest version.
  • Used the Emissary executor.

Summary

What happened/what you expected to happen?
When a workflow is terminated because it hit its deadline, I expect it to actually finish terminating; instead it keeps waiting for the containerSet containers to terminate, even though they have already finished with an error in k8s.

An image of the workflow after timeout termination:
[screenshot: Screen Shot 2022-04-29 at 7 05 21 PM]

An image of the containers for shard-13 in k8s:
[screenshot: Screen Shot 2022-04-29 at 7 07 00 PM]

What version are you running?
3.3.2

Reproducible Workflow

reproducible-workflow.txt

Logs from the workflow controller:

controller-logs.txt

The workflow's pods that are problematic:

items: []
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

Logs from the workflow's wait container:

wait-logs.txt


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.


alexec commented Apr 29, 2022

I think this might be fixed by #8478. Could you please try :latest?


jeszii commented Apr 29, 2022

Hi @alexec, thanks for the swift reply. :D
Can you elaborate on what needs to be upgraded to latest?

I had a colleague run the workflow on a fresh Argo setup using this manifest, and the problem seems to persist.


alexec commented Apr 29, 2022

You need to upgrade the workflow-controller. Can you double check? It's pretty easy to apply the wrong manifests.
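For reference, a minimal sketch of bumping the controller to :latest, assuming the quick-start manifests (namespace argo, deployment and container both named workflow-controller):

# point the controller at the :latest image
kubectl -n argo set image deployment/workflow-controller \
  workflow-controller=quay.io/argoproj/workflow-controller:latest
# wait for the rollout to complete before re-running the workflow
kubectl -n argo rollout status deployment/workflow-controller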


jeszii commented Apr 29, 2022

I can double-check, but it won't be until next week, sorry! I have attached the workflow that reproduces the error in case you'd like to run it beforehand.


cwood-uk commented May 3, 2022

We built Argo locally off the master branch and ran the attached workflow; we are still seeing the same errors.
[screenshot: Screenshot 2022-05-03 at 10 27 45]


jeszii commented May 3, 2022

thanks @cwood-uk


alexec commented May 3, 2022

This issue is missing workflow YAML to run locally.


jeszii commented May 3, 2022

@alexec The YAML is the reproducible-workflow.txt attached to the issue; GitHub doesn't allow uploading YAML files. I've attached it here again. :)
reproducible-workflow.txt

alexec added this to the 2022 Q1 milestone May 3, 2022
alexec removed this from the 2022 Q1 milestone May 3, 2022
alexec self-assigned this May 3, 2022

alexec commented May 3, 2022

Thank you. Can repro locally.


alexec commented May 3, 2022

Restarting the controller does not fix this.


alexec commented May 4, 2022

Pods do complete correctly. The workflow status is not being updated. This points to a problem in the controller, e.g. in operator.go.
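A sketch of how the mismatch can be seen from outside the controller (pod and workflow names below are placeholders):

# the pod itself reaches a terminal phase...
kubectl -n argo get pod <containerset-pod> -o jsonpath='{.status.phase}'
# ...but the corresponding nodes in the workflow status stay Running
kubectl -n argo get wf <workflow-name> \
  -o jsonpath='{range .status.nodes.*}{.displayName}{"\t"}{.phase}{"\n"}{end}'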


alexec commented May 4, 2022

The workflow stops reconciling, but is not labelled as completed.


alexec commented May 4, 2022

Hypothesis: pods are being labelled as complete before the node status has been updated.
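One way to probe that hypothesis from the outside, assuming the workflows.argoproj.io/completed label is what the controller uses to stop watching a pod (workflow name is a placeholder):

# pods the controller has already marked as done
kubectl -n argo get pods -l workflows.argoproj.io/completed=true
# compare with the node phases recorded in the workflow status
kubectl -n argo get wf <workflow-name> -o jsonpath='{.status.nodes.*.phase}'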

alexec changed the title from "containers in containersets not appropriately reporting status" to "containers in containersets not appropriately reporting status if terminated" May 4, 2022

alexec commented May 4, 2022

# Minimal reproduction: a DAG of ten containerSet shards cut short by activeDeadlineSeconds
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: masscan-dummy-scan
spec:
  entrypoint: run-plugin
  activeDeadlineSeconds: 20
  arguments:
    parameters:
      - name: plugin
        value: "masscan"
  templates:
    - name: run-plugin
      dag:
        tasks:
          - name: shard-1
            template: run-masscan
          - name: shard-2
            template: run-masscan
          - name: shard-3
            template: run-masscan
          - name: shard-4
            template: run-masscan
          - name: shard-5
            template: run-masscan
          - name: shard-6
            template: run-masscan
          - name: shard-7
            template: run-masscan
          - name: shard-8
            template: run-masscan
          - name: shard-9
            template: run-masscan
          - name: shard-10
            template: run-masscan
    - name: run-masscan
      containerSet:
        containers:
          - name: a
            image: "debian:9.5-slim"
            command:
              - sleep
            args:
              - "10"
          - name: b
            image: "debian:9.5-slim"
            command:
              - sleep
            args:
              - "80"
            dependencies:
              - a
          - name: c
            image: "debian:9.5-slim"
            command:
              - sleep
            args:
              - "30"
            dependencies:
              - b

Simpler. Faster.
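To run it, something like the following should reproduce the hang once the 20-second deadline fires (the file name is a placeholder; assumes the argo namespace):

kubectl -n argo apply -f masscan-dummy-scan.yaml
argo -n argo submit --watch --from workflowtemplate/masscan-dummy-scan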


alexec commented May 4, 2022

This is caused by exec_control.go deleting pods. Instead, let's terminate them.
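A sketch of the symptom that deletion causes, with placeholder names: once the deadline fires the pods are removed outright, so their terminal statuses never make it back into the workflow status.

# the containerset pods are gone (matches the empty List attached above)...
kubectl -n argo get pods -l workflows.argoproj.io/workflow=<workflow-name>
# ...yet the workflow still shows those nodes as Running
argo -n argo get <workflow-name>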

alexec added a commit to alexec/argo-workflows that referenced this issue May 5, 2022
alexec added the size/S (1-2 days) label and removed the triage label May 5, 2022

jeszii commented May 6, 2022

Thanks for all your work on this; let me know if you need any help from us.

alexec added a commit that referenced this issue May 6, 2022
sarabala1979 mentioned this issue May 25, 2022

jeszii commented Jun 14, 2022

Hey @alexec, I noticed this was cherry-picked into 3.3.6, so I upgraded, but I still seem to be getting the same issue: the workflow is still running after 'termination'. Am I right in thinking this was fixed in 3.3.6, or have I jumped the gun?
[screenshot: Screen Shot 2022-06-14 at 11 55 16 AM]
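For anyone comparing versions, a quick way to confirm which controller build is actually running (assumes the default install layout; argo version reports the CLI, the kubectl command reports the controller image):

kubectl -n argo get deployment workflow-controller \
  -o jsonpath='{.spec.template.spec.containers[0].image}'
argo version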

sarabala1979 mentioned this issue Jun 20, 2022

jeszii commented Jun 22, 2022

Hey @sarabala1979, can you confirm whether this has been merged into the :latest tag? Right now I am still getting the same errors.
[screenshot: Screen Shot 2022-06-22 at 10 41 22 AM]

sarabala1979 mentioned this issue Jun 23, 2022
sarabala1979 mentioned this issue Jul 30, 2022

the1schwartz commented Sep 20, 2022

[screenshot: Screenshot 2022-09-20 at 10 11 47]

I can still reproduce this in 3.3.9.


jeszii commented Sep 20, 2022

@the1schwartz @alexec I can also still reproduce this in 3.4.0

[screenshot: Screen Shot 2022-09-20 at 1 16 55 PM]


jeszii commented Oct 4, 2022

fyi @alexec
