Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug: combination of mutex and memoization is broken #11853

Closed
2 of 3 tasks
Paritosh-Anand opened this issue Sep 20, 2023 · 4 comments · Fixed by #11883
Closed
2 of 3 tasks

bug: combination of mutex and memoization is broken #11853

Paritosh-Anand opened this issue Sep 20, 2023 · 4 comments · Fixed by #11883

Comments

@Paritosh-Anand
Copy link

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issues exists when I tested with :latest
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what you expected to happen?

What happened

We have use cases where we do have memoization and synchronization together for a workflow step.

  1. For the 1st workflow execution of the step memoization cache is hit, then mutex lock is acquired by the 1st execution of the step.
  2. other execution of the same step waits for lock to be released from the 1st execution.
  3. 1st execution of the step is success and lock is released. However other execution of the same step keeps waiting for the lock and eventually fail after exceeding the deadline.

What is expected to happen

  1. For the 1st workflow execution of the step memoization cache is hit, then mutex lock is acquired by the 1st execution of the step.
  2. other execution of the same step waits for lock to be released from the 1st execution.
  3. once 1st execution is of the step is success and lock is released, then the next execution of the step should be able to acquire the lock and once the execution begins memoization will result in cache hit. Eventually all the execution of the step will get a chance to be executed and complete.

Related issues

#11219 - Mutex check happens after memoization cache check and lock info missing from UI
#11612 - parent level memoization is broken

How to replicate

Run the attached example workflow twice.

Version

v3.4.11

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflows that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: example-steps-
spec:
  entrypoint: main

  templates:
    - name: main
      steps:
        - - name: job-1
            template: sleep
            arguments:
              parameters:
                - name: sleep_duration
                  value: 30
          - name: job-2
            template: sleep
            arguments:
              parameters:
                - name: sleep_duration
                  value: 15

    - name: sleep
      synchronization:
        mutex:
          name: mutex-example-steps-simple
      inputs:
        parameters:
          - name: sleep_duration
      script:
        image: alpine:latest
        command: [/bin/sh]
        source: |
          echo "Sleeping for {{ inputs.parameters.sleep_duration }}"
          sleep {{ inputs.parameters.sleep_duration }}
      memoize:
        key: "memo-key-5"
        cache:
          configMap:
            name: cache-example-steps-simple

Logs from the workflow controller

=================start: 1st workflow logs=================

vscode ➜ ~/go/src/github.com/argoproj/argo-workflows (master) $ grep 'example-steps-kzp7m' logs/controller.log 
time="2023-09-20T23:29:43.774Z" level=info msg="Processing workflow" namespace=argo workflow=example-steps-kzp7m
time="2023-09-20T23:29:43.798Z" level=info msg="Updated phase  -> Running" namespace=argo workflow=example-steps-kzp7m
time="2023-09-20T23:29:43.799Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=argo workflow=example-steps-kzp7m
time="2023-09-20T23:29:43.808Z" level=info msg="was unable to obtain node for , letting display name to be nodeName" namespace=argo workflow=example-steps-kzp7m
time="2023-09-20T23:29:43.808Z" level=info msg="Steps node example-steps-kzp7m initialized Running" namespace=argo workflow=example-steps-kzp7m
time="2023-09-20T23:29:43.817Z" level=info msg="StepGroup node example-steps-kzp7m-462101173 initialized Running" namespace=argo workflow=example-steps-kzp7m
time="2023-09-20T23:29:43.818Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=argo workflow=example-steps-kzp7m
time="2023-09-20T23:29:43.818Z" level=info msg="argo/Mutex/mutex-example-steps-simple acquired by argo/example-steps-kzp7m/example-steps-kzp7m-699090314. Lock availability: 0/1" mutex=argo/Mutex/mutex-example-steps-simple
time="2023-09-20T23:29:43.818Z" level=info msg="Node example-steps-kzp7m[0].job-1 acquired synchronization lock" namespace=argo workflow=example-steps-kzp7m
time="2023-09-20T23:29:43.828Z" level=info msg="Pod node example-steps-kzp7m-699090314 initialized Pending" namespace=argo workflow=example-steps-kzp7m
time="2023-09-20T23:29:43.828Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=argo workflow=example-steps-kzp7m
time="2023-09-20T23:29:43.828Z" level=info msg="Pod node example-steps-kzp7m-682312695 initialized Pending" namespace=argo workflow=example-steps-kzp7m
time="2023-09-20T23:29:43.828Z" level=info msg="Could not acquire lock named: &{argo mutex-example-steps-simple  Mutex}" namespace=argo workflow=example-steps-kzp7m
time="2023-09-20T23:29:43.828Z" level=info msg="Workflow step group node example-steps-kzp7m-462101173 not yet completed" namespace=argo workflow=example-steps-kzp7m
time="2023-09-20T23:29:43.828Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=example-steps-kzp7m
time="2023-09-20T23:29:43.829Z" level=info msg=reconcileAgentPod namespace=argo workflow=example-steps-kzp7m
time="2023-09-20T23:29:43.842Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=389208 workflow=example-steps-kzp7m
time="2023-09-20T23:29:44.761Z" level=info msg="Processing workflow" namespace=argo workflow=example-steps-kzp7m
time="2023-09-20T23:29:44.762Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=0 workflow=example-steps-kzp7m
time="2023-09-20T23:29:44.763Z" level=info msg="Workflow pod is missing" namespace=argo nodeName="example-steps-kzp7m[0].job-2" nodePhase=Pending recentlyStarted=true workflow=example-steps-kzp7m
time="2023-09-20T23:29:44.763Z" level=info msg="Lock has been released by argo/example-steps-kzp7m/example-steps-kzp7m-699090314. Available locks: 1" mutex=argo/Mutex/mutex-example-steps-simple
time="2023-09-20T23:29:44.763Z" level=info msg="argo/Mutex/mutex-example-steps-simple acquired by argo/example-steps-kzp7m/example-steps-kzp7m-682312695. Lock availability: 0/1" mutex=argo/Mutex/mutex-example-steps-simple
time="2023-09-20T23:29:44.763Z" level=info msg="Node example-steps-kzp7m[0].job-2 acquired synchronization lock" namespace=argo workflow=example-steps-kzp7m
time="2023-09-20T23:29:44.769Z" level=info msg="Node example-steps-kzp7m[0].job-2 is using mutex with memoize. Cache is hit." namespace=argo workflow=example-steps-kzp7m
time="2023-09-20T23:29:44.769Z" level=info msg="Pod node example-steps-kzp7m-682312695 updated Succeeded[]" namespace=argo workflow=example-steps-kzp7m
time="2023-09-20T23:29:44.769Z" level=info msg="Step group node example-steps-kzp7m-462101173 successful" namespace=argo workflow=example-steps-kzp7m
time="2023-09-20T23:29:44.769Z" level=info msg="node example-steps-kzp7m-462101173 phase Running -> Succeeded" namespace=argo workflow=example-steps-kzp7m
time="2023-09-20T23:29:44.770Z" level=info msg="node example-steps-kzp7m-462101173 finished: 2023-09-20 23:29:44.769777636 +0000 UTC" namespace=argo workflow=example-steps-kzp7m
time="2023-09-20T23:29:44.770Z" level=info msg="Outbound nodes of example-steps-kzp7m-699090314 is [example-steps-kzp7m-699090314]" namespace=argo workflow=example-steps-kzp7m
time="2023-09-20T23:29:44.770Z" level=info msg="Outbound nodes of example-steps-kzp7m-682312695 is [example-steps-kzp7m-682312695]" namespace=argo workflow=example-steps-kzp7m
time="2023-09-20T23:29:44.770Z" level=info msg="Outbound nodes of example-steps-kzp7m is [example-steps-kzp7m-699090314 example-steps-kzp7m-682312695]" namespace=argo workflow=example-steps-kzp7m
time="2023-09-20T23:29:44.770Z" level=info msg="node example-steps-kzp7m phase Running -> Succeeded" namespace=argo workflow=example-steps-kzp7m
time="2023-09-20T23:29:44.770Z" level=info msg="node example-steps-kzp7m finished: 2023-09-20 23:29:44.770509886 +0000 UTC" namespace=argo workflow=example-steps-kzp7m
time="2023-09-20T23:29:44.770Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=example-steps-kzp7m
time="2023-09-20T23:29:44.770Z" level=info msg=reconcileAgentPod namespace=argo workflow=example-steps-kzp7m
time="2023-09-20T23:29:44.770Z" level=info msg="Updated phase Running -> Succeeded" namespace=argo workflow=example-steps-kzp7m
time="2023-09-20T23:29:44.770Z" level=info msg="Marking workflow completed" namespace=argo workflow=example-steps-kzp7m
time="2023-09-20T23:29:44.773Z" level=info msg="Lock has been released by argo/example-steps-kzp7m/example-steps-kzp7m-682312695. Available locks: 1" mutex=argo/Mutex/mutex-example-steps-simple
time="2023-09-20T23:29:44.773Z" level=info msg="argo/example-steps-kzp7m/example-steps-kzp7m-682312695 released a lock from argo/Mutex/mutex-example-steps-simple"
time="2023-09-20T23:29:44.773Z" level=info msg="Released all acquired locks" key=example-steps-kzp7m namespace=argo workflow=example-steps-kzp7m
time="2023-09-20T23:29:44.778Z" level=info msg="cleaning up pod" action=deletePod key=argo/example-steps-kzp7m-1340600742-agent/deletePod
time="2023-09-20T23:29:44.780Z" level=info msg="argo/example-steps-kzp7m/example-steps-kzp7m-699090314 released a lock from argo/Mutex/mutex-example-steps-simple"
time="2023-09-20T23:29:44.780Z" level=info msg="Workflow update successful" namespace=argo phase=Succeeded resourceVersion=389222 workflow=example-steps-kzp7m

==========================================================
=================start: 2nd workflow logs=================

vscode ➜ ~/go/src/github.com/argoproj/argo-workflows (master) $ grep 'example-steps-x9hzk' logs/controller.log 
time="2023-09-20T23:29:44.726Z" level=info msg="Processing workflow" namespace=argo workflow=example-steps-x9hzk
time="2023-09-20T23:29:44.730Z" level=info msg="Updated phase  -> Running" namespace=argo workflow=example-steps-x9hzk
time="2023-09-20T23:29:44.731Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=argo workflow=example-steps-x9hzk
time="2023-09-20T23:29:44.731Z" level=info msg="was unable to obtain node for , letting display name to be nodeName" namespace=argo workflow=example-steps-x9hzk
time="2023-09-20T23:29:44.731Z" level=info msg="Steps node example-steps-x9hzk initialized Running" namespace=argo workflow=example-steps-x9hzk
time="2023-09-20T23:29:44.731Z" level=info msg="StepGroup node example-steps-x9hzk-3983811418 initialized Running" namespace=argo workflow=example-steps-x9hzk
time="2023-09-20T23:29:44.731Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=argo workflow=example-steps-x9hzk
time="2023-09-20T23:29:44.732Z" level=info msg="Pod node example-steps-x9hzk-617220597 initialized Pending" namespace=argo workflow=example-steps-x9hzk
time="2023-09-20T23:29:44.732Z" level=info msg="Could not acquire lock named: &{argo mutex-example-steps-simple  Mutex}" namespace=argo workflow=example-steps-x9hzk
time="2023-09-20T23:29:44.732Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=argo workflow=example-steps-x9hzk
time="2023-09-20T23:29:44.732Z" level=info msg="Pod node example-steps-x9hzk-566887740 initialized Pending" namespace=argo workflow=example-steps-x9hzk
time="2023-09-20T23:29:44.732Z" level=info msg="Could not acquire lock named: &{argo mutex-example-steps-simple  Mutex}" namespace=argo workflow=example-steps-x9hzk
time="2023-09-20T23:29:44.732Z" level=info msg="Workflow step group node example-steps-x9hzk-3983811418 not yet completed" namespace=argo workflow=example-steps-x9hzk
time="2023-09-20T23:29:44.733Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=example-steps-x9hzk
time="2023-09-20T23:29:44.733Z" level=info msg=reconcileAgentPod namespace=argo workflow=example-steps-x9hzk
time="2023-09-20T23:29:44.738Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=389217 workflow=example-steps-x9hzk
time="2023-09-20T23:29:45.726Z" level=info msg="Processing workflow" namespace=argo workflow=example-steps-x9hzk
time="2023-09-20T23:29:45.727Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=0 workflow=example-steps-x9hzk
time="2023-09-20T23:29:45.727Z" level=info msg="Workflow pod is missing" namespace=argo nodeName="example-steps-x9hzk[0].job-2" nodePhase=Pending recentlyStarted=true workflow=example-steps-x9hzk
time="2023-09-20T23:29:45.727Z" level=info msg="Workflow pod is missing" namespace=argo nodeName="example-steps-x9hzk[0].job-1" nodePhase=Pending recentlyStarted=true workflow=example-steps-x9hzk
time="2023-09-20T23:29:45.727Z" level=info msg="argo/Mutex/mutex-example-steps-simple acquired by argo/example-steps-x9hzk/example-steps-x9hzk-617220597. Lock availability: 0/1" mutex=argo/Mutex/mutex-example-steps-simple
time="2023-09-20T23:29:45.727Z" level=info msg="Node example-steps-x9hzk[0].job-1 acquired synchronization lock" namespace=argo workflow=example-steps-x9hzk
time="2023-09-20T23:29:45.733Z" level=info msg="Node example-steps-x9hzk[0].job-1 is using mutex with memoize. Cache is hit." namespace=argo workflow=example-steps-x9hzk
time="2023-09-20T23:29:45.733Z" level=info msg="Pod node example-steps-x9hzk-617220597 updated Succeeded[]" namespace=argo workflow=example-steps-x9hzk
time="2023-09-20T23:29:45.733Z" level=info msg="Could not acquire lock named: &{argo mutex-example-steps-simple  Mutex}" namespace=argo workflow=example-steps-x9hzk
time="2023-09-20T23:29:45.733Z" level=info msg="Workflow step group node example-steps-x9hzk-3983811418 not yet completed" namespace=argo workflow=example-steps-x9hzk
time="2023-09-20T23:29:45.733Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=example-steps-x9hzk
time="2023-09-20T23:29:45.733Z" level=info msg=reconcileAgentPod namespace=argo workflow=example-steps-x9hzk
time="2023-09-20T23:29:45.738Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=389236 workflow=example-steps-x9hzk
time="2023-09-20T23:49:45.736Z" level=info msg="Processing workflow" namespace=argo workflow=example-steps-x9hzk
time="2023-09-20T23:49:45.739Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=0 workflow=example-steps-x9hzk
time="2023-09-20T23:49:45.739Z" level=info msg="Workflow pod is missing" namespace=argo nodeName="example-steps-x9hzk[0].job-2" nodePhase=Pending recentlyStarted=false workflow=example-steps-x9hzk
time="2023-09-20T23:49:45.739Z" level=info msg="node example-steps-x9hzk-566887740 phase Pending -> Failed" namespace=argo workflow=example-steps-x9hzk
time="2023-09-20T23:49:45.739Z" level=info msg="node example-steps-x9hzk-566887740 message: Step exceeded its deadline" namespace=argo workflow=example-steps-x9hzk
time="2023-09-20T23:49:45.739Z" level=info msg="node example-steps-x9hzk-566887740 finished: 2023-09-20 23:49:45.739736261 +0000 UTC" namespace=argo workflow=example-steps-x9hzk
time="2023-09-20T23:49:45.741Z" level=info msg="Lock has been released by argo/example-steps-x9hzk/example-steps-x9hzk-617220597. Available locks: 1" mutex=argo/Mutex/mutex-example-steps-simple
time="2023-09-20T23:49:45.741Z" level=info msg="Step group node example-steps-x9hzk-3983811418 deemed failed: child 'example-steps-x9hzk-566887740' failed" namespace=argo workflow=example-steps-x9hzk
time="2023-09-20T23:49:45.741Z" level=info msg="node example-steps-x9hzk-3983811418 phase Running -> Failed" namespace=argo workflow=example-steps-x9hzk
time="2023-09-20T23:49:45.741Z" level=info msg="node example-steps-x9hzk-3983811418 message: child 'example-steps-x9hzk-566887740' failed" namespace=argo workflow=example-steps-x9hzk
time="2023-09-20T23:49:45.741Z" level=info msg="node example-steps-x9hzk-3983811418 finished: 2023-09-20 23:49:45.741660511 +0000 UTC" namespace=argo workflow=example-steps-x9hzk
time="2023-09-20T23:49:45.741Z" level=info msg="step group example-steps-x9hzk-3983811418 was unsuccessful: child 'example-steps-x9hzk-566887740' failed" namespace=argo workflow=example-steps-x9hzk
time="2023-09-20T23:49:45.741Z" level=info msg="Outbound nodes of example-steps-x9hzk-617220597 is [example-steps-x9hzk-617220597]" namespace=argo workflow=example-steps-x9hzk
time="2023-09-20T23:49:45.741Z" level=info msg="Outbound nodes of example-steps-x9hzk-566887740 is [example-steps-x9hzk-566887740]" namespace=argo workflow=example-steps-x9hzk
time="2023-09-20T23:49:45.741Z" level=info msg="Outbound nodes of example-steps-x9hzk is [example-steps-x9hzk-617220597 example-steps-x9hzk-566887740]" namespace=argo workflow=example-steps-x9hzk
time="2023-09-20T23:49:45.741Z" level=info msg="node example-steps-x9hzk phase Running -> Failed" namespace=argo workflow=example-steps-x9hzk
time="2023-09-20T23:49:45.741Z" level=info msg="node example-steps-x9hzk message: child 'example-steps-x9hzk-566887740' failed" namespace=argo workflow=example-steps-x9hzk
time="2023-09-20T23:49:45.741Z" level=info msg="node example-steps-x9hzk finished: 2023-09-20 23:49:45.741840594 +0000 UTC" namespace=argo workflow=example-steps-x9hzk
time="2023-09-20T23:49:45.744Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=example-steps-x9hzk
time="2023-09-20T23:49:45.744Z" level=info msg=reconcileAgentPod namespace=argo workflow=example-steps-x9hzk
time="2023-09-20T23:49:45.744Z" level=info msg="Updated phase Running -> Failed" namespace=argo workflow=example-steps-x9hzk
time="2023-09-20T23:49:45.744Z" level=info msg="Updated message  -> child 'example-steps-x9hzk-566887740' failed" namespace=argo workflow=example-steps-x9hzk
time="2023-09-20T23:49:45.744Z" level=info msg="Marking workflow completed" namespace=argo workflow=example-steps-x9hzk
time="2023-09-20T23:49:45.745Z" level=info msg="Released all acquired locks" key=example-steps-x9hzk namespace=argo workflow=example-steps-x9hzk
time="2023-09-20T23:49:45.751Z" level=info msg="cleaning up pod" action=deletePod key=argo/example-steps-x9hzk-1340600742-agent/deletePod
time="2023-09-20T23:49:45.757Z" level=info msg="Workflow update successful" namespace=argo phase=Failed resourceVersion=389839 workflow=example-steps-x9hzk
time="2023-09-20T23:49:45.759Z" level=info msg="argo/example-steps-x9hzk/example-steps-x9hzk-617220597 released a lock from argo/Mutex/mutex-example-steps-simple"

==========================================================
=================start: 3rd workflow logs=================

vscode ➜ ~/go/src/github.com/argoproj/argo-workflows (master) $ grep 'example-steps-l9xg9' logs/controller.log 
time="2023-09-20T23:29:45.677Z" level=info msg="Processing workflow" namespace=argo workflow=example-steps-l9xg9
time="2023-09-20T23:29:45.681Z" level=info msg="Updated phase  -> Running" namespace=argo workflow=example-steps-l9xg9
time="2023-09-20T23:29:45.681Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=argo workflow=example-steps-l9xg9
time="2023-09-20T23:29:45.681Z" level=info msg="was unable to obtain node for , letting display name to be nodeName" namespace=argo workflow=example-steps-l9xg9
time="2023-09-20T23:29:45.681Z" level=info msg="Steps node example-steps-l9xg9 initialized Running" namespace=argo workflow=example-steps-l9xg9
time="2023-09-20T23:29:45.681Z" level=info msg="StepGroup node example-steps-l9xg9-2046647879 initialized Running" namespace=argo workflow=example-steps-l9xg9
time="2023-09-20T23:29:45.681Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=argo workflow=example-steps-l9xg9
time="2023-09-20T23:29:45.681Z" level=info msg="Pod node example-steps-l9xg9-2376714340 initialized Pending" namespace=argo workflow=example-steps-l9xg9
time="2023-09-20T23:29:45.681Z" level=info msg="Could not acquire lock named: &{argo mutex-example-steps-simple  Mutex}" namespace=argo workflow=example-steps-l9xg9
time="2023-09-20T23:29:45.681Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=argo workflow=example-steps-l9xg9
time="2023-09-20T23:29:45.682Z" level=info msg="Pod node example-steps-l9xg9-2427047197 initialized Pending" namespace=argo workflow=example-steps-l9xg9
time="2023-09-20T23:29:45.682Z" level=info msg="Could not acquire lock named: &{argo mutex-example-steps-simple  Mutex}" namespace=argo workflow=example-steps-l9xg9
time="2023-09-20T23:29:45.682Z" level=info msg="Workflow step group node example-steps-l9xg9-2046647879 not yet completed" namespace=argo workflow=example-steps-l9xg9
time="2023-09-20T23:29:45.682Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=example-steps-l9xg9
time="2023-09-20T23:29:45.682Z" level=info msg=reconcileAgentPod namespace=argo workflow=example-steps-l9xg9
time="2023-09-20T23:29:45.690Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=389232 workflow=example-steps-l9xg9
time="2023-09-20T23:29:46.680Z" level=info msg="Processing workflow" namespace=argo workflow=example-steps-l9xg9
time="2023-09-20T23:29:46.680Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=0 workflow=example-steps-l9xg9
time="2023-09-20T23:29:46.680Z" level=info msg="Workflow pod is missing" namespace=argo nodeName="example-steps-l9xg9[0].job-1" nodePhase=Pending recentlyStarted=true workflow=example-steps-l9xg9
time="2023-09-20T23:29:46.681Z" level=info msg="Workflow pod is missing" namespace=argo nodeName="example-steps-l9xg9[0].job-2" nodePhase=Pending recentlyStarted=true workflow=example-steps-l9xg9
time="2023-09-20T23:29:46.681Z" level=info msg="Could not acquire lock named: &{argo mutex-example-steps-simple  Mutex}" namespace=argo workflow=example-steps-l9xg9
time="2023-09-20T23:29:46.681Z" level=info msg="Could not acquire lock named: &{argo mutex-example-steps-simple  Mutex}" namespace=argo workflow=example-steps-l9xg9
time="2023-09-20T23:29:46.681Z" level=info msg="Workflow step group node example-steps-l9xg9-2046647879 not yet completed" namespace=argo workflow=example-steps-l9xg9
time="2023-09-20T23:29:46.681Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=example-steps-l9xg9
time="2023-09-20T23:29:46.681Z" level=info msg=reconcileAgentPod namespace=argo workflow=example-steps-l9xg9
time="2023-09-20T23:29:46.688Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=389240 workflow=example-steps-l9xg9
time="2023-09-20T23:49:46.691Z" level=info msg="Processing workflow" namespace=argo workflow=example-steps-l9xg9
time="2023-09-20T23:49:46.693Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=0 workflow=example-steps-l9xg9
time="2023-09-20T23:49:46.694Z" level=info msg="Workflow pod is missing" namespace=argo nodeName="example-steps-l9xg9[0].job-1" nodePhase=Pending recentlyStarted=false workflow=example-steps-l9xg9
time="2023-09-20T23:49:46.694Z" level=info msg="Workflow pod is missing" namespace=argo nodeName="example-steps-l9xg9[0].job-2" nodePhase=Pending recentlyStarted=false workflow=example-steps-l9xg9
time="2023-09-20T23:49:46.694Z" level=info msg="node example-steps-l9xg9-2376714340 phase Pending -> Failed" namespace=argo workflow=example-steps-l9xg9
time="2023-09-20T23:49:46.694Z" level=info msg="node example-steps-l9xg9-2376714340 message: Step exceeded its deadline" namespace=argo workflow=example-steps-l9xg9
time="2023-09-20T23:49:46.694Z" level=info msg="node example-steps-l9xg9-2376714340 finished: 2023-09-20 23:49:46.694289137 +0000 UTC" namespace=argo workflow=example-steps-l9xg9
time="2023-09-20T23:49:46.694Z" level=info msg="node example-steps-l9xg9-2427047197 phase Pending -> Failed" namespace=argo workflow=example-steps-l9xg9
time="2023-09-20T23:49:46.694Z" level=info msg="node example-steps-l9xg9-2427047197 message: Step exceeded its deadline" namespace=argo workflow=example-steps-l9xg9
time="2023-09-20T23:49:46.694Z" level=info msg="node example-steps-l9xg9-2427047197 finished: 2023-09-20 23:49:46.694431595 +0000 UTC" namespace=argo workflow=example-steps-l9xg9
time="2023-09-20T23:49:46.698Z" level=info msg="Step group node example-steps-l9xg9-2046647879 deemed failed: child 'example-steps-l9xg9-2376714340' failed" namespace=argo workflow=example-steps-l9xg9
time="2023-09-20T23:49:46.698Z" level=info msg="node example-steps-l9xg9-2046647879 phase Running -> Failed" namespace=argo workflow=example-steps-l9xg9
time="2023-09-20T23:49:46.698Z" level=info msg="node example-steps-l9xg9-2046647879 message: child 'example-steps-l9xg9-2376714340' failed" namespace=argo workflow=example-steps-l9xg9
time="2023-09-20T23:49:46.698Z" level=info msg="node example-steps-l9xg9-2046647879 finished: 2023-09-20 23:49:46.698960637 +0000 UTC" namespace=argo workflow=example-steps-l9xg9
time="2023-09-20T23:49:46.699Z" level=info msg="step group example-steps-l9xg9-2046647879 was unsuccessful: child 'example-steps-l9xg9-2376714340' failed" namespace=argo workflow=example-steps-l9xg9
time="2023-09-20T23:49:46.699Z" level=info msg="Outbound nodes of example-steps-l9xg9-2376714340 is [example-steps-l9xg9-2376714340]" namespace=argo workflow=example-steps-l9xg9
time="2023-09-20T23:49:46.699Z" level=info msg="Outbound nodes of example-steps-l9xg9-2427047197 is [example-steps-l9xg9-2427047197]" namespace=argo workflow=example-steps-l9xg9
time="2023-09-20T23:49:46.699Z" level=info msg="Outbound nodes of example-steps-l9xg9 is [example-steps-l9xg9-2376714340 example-steps-l9xg9-2427047197]" namespace=argo workflow=example-steps-l9xg9
time="2023-09-20T23:49:46.699Z" level=info msg="node example-steps-l9xg9 phase Running -> Failed" namespace=argo workflow=example-steps-l9xg9
time="2023-09-20T23:49:46.699Z" level=info msg="node example-steps-l9xg9 message: child 'example-steps-l9xg9-2376714340' failed" namespace=argo workflow=example-steps-l9xg9
time="2023-09-20T23:49:46.699Z" level=info msg="node example-steps-l9xg9 finished: 2023-09-20 23:49:46.699670595 +0000 UTC" namespace=argo workflow=example-steps-l9xg9
time="2023-09-20T23:49:46.699Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=example-steps-l9xg9
time="2023-09-20T23:49:46.699Z" level=info msg=reconcileAgentPod namespace=argo workflow=example-steps-l9xg9
time="2023-09-20T23:49:46.699Z" level=info msg="Updated phase Running -> Failed" namespace=argo workflow=example-steps-l9xg9
time="2023-09-20T23:49:46.700Z" level=info msg="Updated message  -> child 'example-steps-l9xg9-2376714340' failed" namespace=argo workflow=example-steps-l9xg9
time="2023-09-20T23:49:46.700Z" level=info msg="Marking workflow completed" namespace=argo workflow=example-steps-l9xg9
time="2023-09-20T23:49:46.702Z" level=info msg="Released all acquired locks" key=example-steps-l9xg9 namespace=argo workflow=example-steps-l9xg9
time="2023-09-20T23:49:46.706Z" level=info msg="cleaning up pod" action=deletePod key=argo/example-steps-l9xg9-1340600742-agent/deletePod
time="2023-09-20T23:49:46.720Z" level=info msg="Workflow update successful" namespace=argo phase=Failed resourceVersion=389846 workflow=example-steps-l9xg9

Logs from in your workflow's wait container

kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded
@Paritosh-Anand Paritosh-Anand changed the title bug: mutex and memoization is broken bug: combination of mutex and memoization is broken Sep 21, 2023
@terrytangyuan
Copy link
Member

cc @shmruin Has this been fixed?

@agilgur5 agilgur5 added the P3 Low priority label Sep 21, 2023
@shmruin
Copy link
Contributor

shmruin commented Sep 21, 2023

@terrytangyuan , I run this example code both master and release-3.4.11 branch but cannot reproduce the problem. (in my local environment)
Both works correctly as expected and latest release of 3.4.11 also looks good.
But if this situation happens, it sounds like the mutex key is not released on specific situation.
I will keep looking into it.
@Paritosh-Anand , any other hints for that situation? like workflow duration or running multiple workflows at the same time. any small things would be helpful.

run_result_mutex_sync

@Paritosh-Anand
Copy link
Author

Thanks for investigating this @shmruin

I checked again today on v3.4.11 and I was able to reproduce by submitting 3 workflow at the same time.

@shmruin
Copy link
Contributor

shmruin commented Sep 23, 2023

@Paritosh-Anand OK, I see. I can reproduce this error by executing multiple workflows with same key at the same time.
I'll check the logic and try to fix this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants