Path to non-existing key on GCS fails to be handled as optional input #6276

Closed
tymokvo opened this issue Jul 2, 2021 · 4 comments

tymokvo commented Jul 2, 2021

Summary

Preamble

What happened/what you expected to happen?

Before we start, around 2/3 of issues can be fixed by one of the following:

  • Have you double-checked your configuration? Maybe 30% of issues are wrong configuration.

Yep, other downloaded artifacts are fine.

  • Have you tested to see if it is fixed in the latest version? Maybe 20% of issues are fixed by this.

Nope, but the code in question hasn't been touched for 16 months.

  • Have you tried using the PNS executor instead of Docker? Maybe 50% of artifact related issues are fixed by this.

Nope, but this doesn't seem relevant as I believe the cause is the GCS storage client failing to make a file or raise an error.

Description

I expected an optional input to not cause an error when it does not exist on the local filesystem.

Instead, the optional input caused this failure: executor error: rename /argo/inputs/artifacts/ambient-cache.tmp /argo/inputs/artifacts/ambient-cache: no such file or directory.

When Google Cloud Storage is the artifact repository, passing optional outputs/inputs between steps can fail unexpectedly. If one step does not create an optional output, the key does not exist on GCS, yet a subsequent step still tries to load it as an optional input. For a missing key, the object listing comes back empty, so the download loop never executes and no error is raised or caught for the missing optional artifact.

In this line, errors.CodeNotFound is never raised because the listByPrefix method called by artDriver.Load returns an empty list, so every other operation on the file is skipped.

I pulled out the methods from the argo-workflows/workflow/artifacts/gcs package into a gist here to demonstrate the problem. The listByPrefix method may return an empty list from the GCS API, which means that the loop in which downloadObject is called never executes. Thus, none of the file I/O errors are ever raised from inside downloadObject. The line in question is here.

It seems like the simplest fix would be to raise an error in the case of an empty array from listByPrefix that will be sent up the stack to be handled by the executor.
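
To make the failure mode and the proposed fix concrete, here is a minimal, self-contained sketch. It is not the Argo source: `lister`, `loadBuggy`, `loadFixed`, `loadInput`, and `errNotFound` are hypothetical stand-ins for listByPrefix, the GCS driver's load path, the executor's optional-artifact handling, and errors.CodeNotFound. It only assumes that the caller skips an optional artifact when it sees a not-found error.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound is a hypothetical stand-in for Argo's errors.CodeNotFound.
var errNotFound = errors.New("not found")

// lister is a hypothetical stand-in for the GCS prefix listing call.
type lister func(prefix string) []string

// loadBuggy mirrors the reported behavior: when nothing matches the key,
// the loop body never runs, nothing is downloaded, and nil is returned.
func loadBuggy(list lister, key string, download func(string) error) error {
	for _, obj := range list(key) {
		if err := download(obj); err != nil {
			return err
		}
	}
	return nil
}

// loadFixed applies the proposed fix: an empty listing is turned into a
// not-found error that is passed up the stack.
func loadFixed(list lister, key string, download func(string) error) error {
	objs := list(key)
	if len(objs) == 0 {
		return fmt.Errorf("no objects under key %q: %w", key, errNotFound)
	}
	for _, obj := range objs {
		if err := download(obj); err != nil {
			return err
		}
	}
	return nil
}

// loadInput is a hypothetical caller. It skips a missing artifact only
// when the artifact is marked optional; every other error is returned.
func loadInput(load func() error, optional bool) (skipped bool, err error) {
	err = load()
	if err != nil && optional && errors.Is(err, errNotFound) {
		return true, nil
	}
	return false, err
}

func main() {
	emptyBucket := func(string) []string { return nil } // key does not exist
	download := func(string) error { return nil }       // never reached here
	key := "blobs/argo-test/not-exists.txt"

	// Buggy path: no error, so the caller goes on to open a file that
	// was never written.
	fmt.Println("buggy:", loadBuggy(emptyBucket, key, download))

	// Fixed path: the caller can now tell "missing" apart from "loaded".
	fixed := func() error { return loadFixed(emptyBucket, key, download) }
	skipped, err := loadInput(fixed, true)
	fmt.Println("fixed, optional:", skipped, err)
	_, err = loadInput(fixed, false)
	fmt.Println("fixed, required:", err)
}
```

With the buggy loader the caller has no signal that the artifact is absent, which matches the later "no such file or directory" failure. With the fixed loader the same missing key is skipped for an optional input and still fails for a required one.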

Diagnostics

👀 Yes! We need all of your diagnostics, please make sure you add it all, otherwise we'll go around in circles asking you for it:

What Kubernetes provider are you using?

GKE

What version of Argo Workflows are you running?

v3.1.0-rc14

What executor are you running? Docker/K8SAPI/Kubelet/PNS/Emissary

Emissary

Did this work in a previous version? I.e. is it a regression?

Unknown.

Are you pasting thousands of log lines? That's too much information.

Nope

Workflow

Our workflows that experience this have hundreds to thousands of lines, and seem like they would be more distracting than helpful here. I believe this issue is specific to the GCS storage adapter rather than the workflow itself. I'm happy to try and craft a minimal workflow if that does seem useful though. The gist that is linked above reproduces the failure case.


Failing Workflow

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  namespace: argo
  generateName: gcs-test-
  labels:
    workflows.argoproj.io/container-runtime-executor: emissary
    disposable: 'true'
spec:
  entrypoint: gcs-test
  templates:
    - name: gcs-test
      inputs:
        artifacts:
          - name: my-art
            path: /my-artifact
            optional: true
            gcs:
              bucket: pollination-public
              key: blobs/argo-test/not-exists.txt
              serviceAccountKeySecret:
                name: gcs-creds
                key: serviceAccountKey
      container:
        image: ubuntu:latest
        command: [sh, -c]
        args: ["echo 'no artifact :('"]

Working Workflow

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  namespace: argo
  generateName: gcs-test-
  labels:
    workflows.argoproj.io/container-runtime-executor: emissary
    disposable: 'true'
spec:
  entrypoint: gcs-test
  templates:
    - name: gcs-test
      inputs:
        artifacts:
          - name: my-art
            path: /my-artifact
            optional: true
            http:
              url: ''
      container:
        image: ubuntu:latest
        command: [sh, -c]
        args: ["echo 'no artifact :('"]

Misc.

Thank you for working on argo! Overall, it has been really great to work with. 😸

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

@sarabala1979
Member

@tymokvo Is this issue happening consistently?
Can you share the wait container logs and controller logs? Can you provide a sample reproducible workflow YAML?

@tymokvo
Author

tymokvo commented Jul 7, 2021

@sarabala1979 thanks for the quick reply! I updated the issue with a failing workflow using GCS and, for comparison, one that succeeds using HTTP.

Init container logs:

time="2021-07-07T01:40:57.165Z" level=info msg="Starting Workflow Executor" executorType=emissary version=v3.1.0-rc14
time="2021-07-07T01:40:57.169Z" level=info msg="Executor initialized" includeScriptOutput=false namespace=argo podName=gcs-test-4gtk4 template="{\"name\":\"gcs-test\",\"inputs\":{\"artifacts\":[{\"name\":\"my-art\",\"path\":\"/my-artifact\",\"gcs\":{\"bucket\":\"pollination-public\",\"serviceAccountKeySecret\":{\"name\":\"gcs-creds\",\"key\":\"serviceAccountKey\"},\"key\":\"blobs/argo-test/not-exists.txt\"},\"optional\":true}]},\"outputs\":{},\"metadata\":{},\"container\":{\"name\":\"\",\"image\":\"ubuntu:latest\",\"command\":[\"sh\",\"-c\"],\"args\":[\"echo 'no artifact :('\"],\"resources\":{}},\"archiveLocation\":{\"archiveLogs\":true,\"gcs\":{\"bucket\":\"pollination-server-staging\",\"serviceAccountKeySecret\":{\"name\":\"gcs-creds\",\"key\":\"serviceAccountKey\"},\"key\":\"gcs-test-4gtk4/gcs-test-4gtk4\"}}}" version="&Version{Version:v3.1.0-rc14,BuildDate:2021-06-10T18:04:46Z,GitCommit:d385e6107ab8d4ea4826bd6972608f8fbc86fbe5,GitTag:v3.1.0-rc14,GitTreeState:clean,GoVersion:go1.15.7,Compiler:gc,Platform:linux/amd64,}"
time="2021-07-07T01:40:57.235Z" level=info msg="Start loading input artifacts..."
time="2021-07-07T01:40:57.235Z" level=info msg="Downloading artifact: my-art"
time="2021-07-07T01:40:57.235Z" level=info msg="GCS Load path: /argo/inputs/artifacts/my-art.tmp, key: blobs/argo-test/not-exists.txt"
time="2021-07-07T01:40:57.375Z" level=info msg="Detecting if /argo/inputs/artifacts/my-art.tmp is a tarball"
time="2021-07-07T01:40:57.375Z" level=error msg="executor error: open /argo/inputs/artifacts/my-art.tmp: no such file or directory"
time="2021-07-07T01:40:57.375Z" level=info msg="Alloc=6978 TotalAlloc=15938 Sys=73553 NumGC=5 Goroutines=6"
time="2021-07-07T01:40:57.375Z" level=fatal msg="open /argo/inputs/artifacts/my-art.tmp: no such file or directory"

Wait container logs:

Error from server (BadRequest): container "wait" in pod "gcs-test-4gtk4" is waiting to start: PodInitializing

Controller logs:

time="2021-07-07T01:40:55.860Z" level=info msg="Processing workflow" namespace=argo workflow=gcs-test-4gtk4
time="2021-07-07T01:40:55.864Z" level=info msg="Get configmaps 200"
time="2021-07-07T01:40:55.864Z" level=info msg="resolved artifact repository" artifactRepositoryRef="argo/#"
time="2021-07-07T01:40:55.866Z" level=info msg="Get configmaps 200"
time="2021-07-07T01:40:55.867Z" level=info msg="Updated phase  -> Running" namespace=argo workflow=gcs-test-4gtk4
time="2021-07-07T01:40:55.867Z" level=info msg="Pod node gcs-test-4gtk4 initialized Pending" namespace=argo workflow=gcs-test-4gtk4
time="2021-07-07T01:40:55.872Z" level=info msg="Create events 201"
time="2021-07-07T01:40:55.882Z" level=info msg="Create pods 201"
time="2021-07-07T01:40:55.883Z" level=info msg="Created pod: gcs-test-4gtk4 (gcs-test-4gtk4)" namespace=argo workflow=gcs-test-4gtk4
time="2021-07-07T01:40:55.889Z" level=info msg="Update workflows 200"
time="2021-07-07T01:40:55.890Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=61169066 workflow=gcs-test-4gtk4
time="2021-07-07T01:40:56.188Z" level=info msg="Watch workflows 200"
time="2021-07-07T01:40:58.742Z" level=info msg="Get leases 200"
time="2021-07-07T01:40:58.745Z" level=info msg="Update leases 200"
time="2021-07-07T01:41:03.749Z" level=info msg="Get leases 200"
time="2021-07-07T01:41:03.752Z" level=info msg="Update leases 200"
time="2021-07-07T01:41:05.884Z" level=info msg="Processing workflow" namespace=argo workflow=gcs-test-4gtk4
time="2021-07-07T01:41:05.887Z" level=info msg="Get configmaps 200"
time="2021-07-07T01:41:05.888Z" level=info msg="Pod failed: Error (exit code 1): open /argo/inputs/artifacts/my-art.tmp: no such file or directory" displayName=gcs-test-4gtk4 namespace=argo pod=gcs-test-4gtk4 templateName=gcs-test workflow=gcs-test-4gtk4
time="2021-07-07T01:41:05.888Z" level=info msg="Updating node gcs-test-4gtk4 status Pending -> Error" namespace=argo workflow=gcs-test-4gtk4
time="2021-07-07T01:41:05.888Z" level=info msg="Updating node gcs-test-4gtk4 message: Error (exit code 1): open /argo/inputs/artifacts/my-art.tmp: no such file or directory" namespace=argo workflow=gcs-test-4gtk4
time="2021-07-07T01:41:05.890Z" level=info msg="Updated phase Running -> Error" namespace=argo workflow=gcs-test-4gtk4
time="2021-07-07T01:41:05.890Z" level=info msg="Updated message  -> Error (exit code 1): open /argo/inputs/artifacts/my-art.tmp: no such file or directory" namespace=argo workflow=gcs-test-4gtk4
time="2021-07-07T01:41:05.890Z" level=info msg="Marking workflow completed" namespace=argo workflow=gcs-test-4gtk4
time="2021-07-07T01:41:05.890Z" level=info msg="Marking workflow as pending archiving" namespace=argo workflow=gcs-test-4gtk4
time="2021-07-07T01:41:05.890Z" level=info msg="Checking daemoned children of " namespace=argo workflow=gcs-test-4gtk4
time="2021-07-07T01:41:05.897Z" level=info msg="Create events 201"
time="2021-07-07T01:41:05.899Z" level=info msg="Update workflows 200"
time="2021-07-07T01:41:05.900Z" level=info msg="Workflow update successful" namespace=argo phase=Error resourceVersion=61169135 workflow=gcs-test-4gtk4
time="2021-07-07T01:41:05.902Z" level=info msg="archiving workflow" namespace=argo uid=ee1bbe7a-5b72-46b2-b1b2-6803b2049533 workflow=gcs-test-4gtk4
time="2021-07-07T01:41:05.904Z" level=info msg="Create events 201"
time="2021-07-07T01:41:05.907Z" level=info msg="cleaning up pod" action=labelPodCompleted key=argo/gcs-test-4gtk4/labelPodCompleted
time="2021-07-07T01:41:05.908Z" level=info msg="Create events 201"
time="2021-07-07T01:41:05.916Z" level=info msg="Patch pods 200"
time="2021-07-07T01:41:05.929Z" level=info msg="Patch workflows 200"
time="2021-07-07T01:41:05.930Z" level=info msg="archiving workflow" namespace=argo uid=ee1bbe7a-5b72-46b2-b1b2-6803b2049533 workflow=gcs-test-4gtk4

@tymokvo
Author

tymokvo commented Jul 15, 2021

@sarabala1979 I have patched our fork of argo with the change that I proposed here and can confirm that it solves the issue. Should I open a PR or is there more info I can provide?

@alexec
Contributor

alexec commented Jul 27, 2021

Maybe fixed by #6393. Re-open if not.
