containers in containersets not appropriately reporting status if terminated #8545

Closed
3 tasks done
Tracked by #8632
jeszii opened this issue Apr 29, 2022 · 21 comments · Fixed by #8620

Comments


jeszii commented Apr 29, 2022

Checklist

  • Double-checked my configuration.
  • Tested using the latest version.
  • Used the Emissary executor.

Summary

What happened/what you expected to happen?
When a workflow is terminated because it hit its deadline, I expect it to actually finish terminating; instead it keeps waiting for the containerSet containers to terminate, even though they have already finished with an error in k8s.

An image of the workflow after timeout termination:
[screenshot: Screen Shot 2022-04-29 at 7 05 21 PM]

An image of the containers for shard-13 in k8s:
[screenshot: Screen Shot 2022-04-29 at 7 07 00 PM]

What version are you running?
3.3.2

Reproducible Workflow

reproducible-workflow.txt

Logs from the workflow controller:

controller-logs.txt

The workflow's pods that are problematic:

items: []
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

Logs from the workflow's wait container:

wait-logs.txt


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.


alexec commented Apr 29, 2022

I think this might be fixed by #8478. Could you please try :latest?


jeszii commented Apr 29, 2022

Hi @alexec, thanks for the swift reply. :D
Can you elaborate on what needs to be upgraded to latest?

I had a colleague run the workflow on a fresh Argo setup using this manifest, and the problem seems to persist.


alexec commented Apr 29, 2022

You need to upgrade the workflow-controller. Can you double check? It's pretty easy to apply the wrong manifests.
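For reference, a minimal sketch of bumping the controller to :latest, assuming the quick-start manifests (namespace argo, deployment and container both named workflow-controller):

# point the controller at the :latest image
kubectl -n argo set image deployment/workflow-controller \
  workflow-controller=quay.io/argoproj/workflow-controller:latest
# wait for the rollout to complete before re-running the workflow
kubectl -n argo rollout status deployment/workflow-controller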


jeszii commented Apr 29, 2022

I can double-check, but it won't be until next week, sorry! I have attached the workflow that reproduces the error in case you'd like to run it beforehand.


cwood-uk commented May 3, 2022

We built Argo locally off the master branch and ran the attached workflow; we are still seeing the same errors.
[screenshot: Screenshot 2022-05-03 at 10 27 45]


jeszii commented May 3, 2022

thanks @cwood-uk


alexec commented May 3, 2022

This issue is missing workflow YAML to run locally.


jeszii commented May 3, 2022

@alexec The YAML is the reproducible-workflow.txt attached to the issue; GitHub doesn't allow uploading YAML files. I've attached it here again. :)
reproducible-workflow.txt

alexec added this to the 2022 Q1 milestone May 3, 2022
alexec removed this from the 2022 Q1 milestone May 3, 2022
alexec self-assigned this May 3, 2022

alexec commented May 3, 2022

Thank you. Can repro locally.


alexec commented May 3, 2022

Restarting the controller does not fix this.


alexec commented May 4, 2022

Pods do complete correctly. The workflow status is not being updated. This points to a problem in the controller, e.g. in operator.go.
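A sketch of how the mismatch can be seen from outside the controller (pod and workflow names below are placeholders):

# the pod itself reaches a terminal phase...
kubectl -n argo get pod <containerset-pod> -o jsonpath='{.status.phase}'
# ...but the corresponding nodes in the workflow status stay Running
kubectl -n argo get wf <workflow-name> \
  -o jsonpath='{range .status.nodes.*}{.displayName}{"\t"}{.phase}{"\n"}{end}'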


alexec commented May 4, 2022

The workflow stops reconciling, but is not labelled as completed.


alexec commented May 4, 2022

Hypothesis: pods are being labelled as complete before the node status has been updated.
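One way to probe that hypothesis from the outside, assuming the workflows.argoproj.io/completed label is what the controller uses to stop watching a pod (workflow name is a placeholder):

# pods the controller has already marked as done
kubectl -n argo get pods -l workflows.argoproj.io/completed=true
# compare with the node phases recorded in the workflow status
kubectl -n argo get wf <workflow-name> -o jsonpath='{.status.nodes.*.phase}'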

alexec changed the title from "containers in containersets not appropriately reporting status" to "containers in containersets not appropriately reporting status if terminated" May 4, 2022

alexec commented May 4, 2022

# Minimal reproduction: a DAG of ten containerSet shards cut short by activeDeadlineSeconds
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: masscan-dummy-scan
spec:
  entrypoint: run-plugin
  activeDeadlineSeconds: 20
  arguments:
    parameters:
      - name: plugin
        value: "masscan"
  templates:
    - name: run-plugin
      dag:
        tasks:
          - name: shard-1
            template: run-masscan
          - name: shard-2
            template: run-masscan
          - name: shard-3
            template: run-masscan
          - name: shard-4
            template: run-masscan
          - name: shard-5
            template: run-masscan
          - name: shard-6
            template: run-masscan
          - name: shard-7
            template: run-masscan
          - name: shard-8
            template: run-masscan
          - name: shard-9
            template: run-masscan
          - name: shard-10
            template: run-masscan
    - name: run-masscan
      containerSet:
        containers:
          - name: a
            image: "debian:9.5-slim"
            command:
              - sleep
            args:
              - "10"
          - name: b
            image: "debian:9.5-slim"
            command:
              - sleep
            args:
              - "80"
            dependencies:
              - a
          - name: c
            image: "debian:9.5-slim"
            command:
              - sleep
            args:
              - "30"
            dependencies:
              - b

Simpler. Faster.
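To run it, something like the following should reproduce the hang once the 20-second deadline fires (the file name is a placeholder; assumes the argo namespace):

kubectl -n argo apply -f masscan-dummy-scan.yaml
argo -n argo submit --watch --from workflowtemplate/masscan-dummy-scan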


alexec commented May 4, 2022

This is caused by exec_control.go deleting pods. Instead, let's terminate them.
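A sketch of the symptom that deletion causes, with placeholder names: once the deadline fires the pods are removed outright, so their terminal statuses never make it back into the workflow status.

# the containerset pods are gone (matches the empty List attached above)...
kubectl -n argo get pods -l workflows.argoproj.io/workflow=<workflow-name>
# ...yet the workflow still shows those nodes as Running
argo -n argo get <workflow-name>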

alexec added a commit to alexec/argo-workflows that referenced this issue May 5, 2022
alexec added the size/S (1-2 days) label and removed the triage label May 5, 2022

jeszii commented May 6, 2022

Thanks for all your work on this; let me know if you need any help from us.

alexec added a commit that referenced this issue May 6, 2022
sarabala1979 mentioned this issue May 25, 2022

jeszii commented Jun 14, 2022

Hey @alexec, I noticed this was cherry-picked into 3.3.6, so I upgraded, but I still seem to be getting the same issue: the workflow is still running after 'termination'. Am I right in thinking this was fixed in 3.3.6, or have I jumped the gun?
[screenshot: Screen Shot 2022-06-14 at 11 55 16 AM]
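For anyone comparing versions, a quick way to confirm which controller build is actually running (assumes the default install layout; argo version reports the CLI, the kubectl command reports the controller image):

kubectl -n argo get deployment workflow-controller \
  -o jsonpath='{.spec.template.spec.containers[0].image}'
argo version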

sarabala1979 mentioned this issue Jun 20, 2022

jeszii commented Jun 22, 2022

Hey @sarabala1979, can you confirm whether this has been merged into the :latest tag? Right now I am still getting the same errors.
[screenshot: Screen Shot 2022-06-22 at 10 41 22 AM]

sarabala1979 mentioned this issue Jun 23, 2022
sarabala1979 mentioned this issue Jul 30, 2022

the1schwartz commented Sep 20, 2022

[screenshot: Screenshot 2022-09-20 at 10 11 47]

I can still reproduce this in 3.3.9.


jeszii commented Sep 20, 2022

@the1schwartz @alexec I can also still reproduce this in 3.4.0

[screenshot: Screen Shot 2022-09-20 at 1 16 55 PM]


jeszii commented Oct 4, 2022

fyi @alexec
