
can't create TFJob from ResourceOp #2624

Closed
ryandawsonuk opened this issue Nov 18, 2019 · 7 comments
Assignees
Ark-kun
Labels
area/backend, lifecycle/stale

Comments

@ryandawsonuk
Contributor

ryandawsonuk commented Nov 18, 2019

I'm trying to run a pipeline that creates a TFJob using the Python below. It used to work with a previous version of Pipelines, and I'm trying to update it to reflect changes in TFJob, but it fails before even attempting to create the TFJob.

Much the same TFJob definition, with fixed values for the parameters, works if I submit it directly (i.e. not through Pipelines). If I extract the pipeline YAML and do an argo submit -n kubeflow, that also works (it creates the TFJob and runs to completion). It only fails when I upload it through the Pipelines UI.
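
For reference, the working direct-submission path is roughly the following (the script name and the extracted pipeline.yaml are placeholders for my local files):

# compile the pipeline and extract the Argo workflow YAML from the archive
python mnist_tfjob_volume.py
tar -xzf mnist_tfjob_volume.py.tar.gz
# submit the extracted workflow directly to Argo
argo submit pipeline.yaml -n kubeflow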

The error is:

time="2019-11-18T16:20:25Z" level=info msg="Executor (version: v2.3.0, build_date: 2019-05-20T22:10:54Z) initialized (pod: kubeflow/seldon-mnist-tfjob-gslnf-3364565779) with template:\n{\"name\":\"train\",\"inputs\":{\"parameters\":[{\"name\":\"docker-repo-training\",\"value\":\"seldonio/deepmnistclassifier_trainer\"},{\"name\":\"docker-tag-training\",\"value\":\"0.3\"},{\"name\":\"modelpvc-name\",\"value\":\"seldon-mnist-tfjob-gslnf-modelpvc\"}]},\"outputs\":{\"parameters\":[{\"name\":\"train-manifest\",\"valueFrom\":{\"jsonPath\":\"{}\"}},{\"name\":\"train-name\",\"valueFrom\":{\"jsonPath\":\"{.metadata.name}\"}}]},\"metadata\":{},\"resource\":{\"action\":\"create\",\"manifest\":\"apiVersion: kubeflow.org/v1\\nkind: TFJob\\nmetadata:\\n  name: mnist-train-374f3388-c9be-4bd1-9871-06b7f5efe262\\n  namespace: kubeflow\\n  ownerReferences:\\n  - apiVersion: argoproj.io/v1alpha1\\n    controller: true\\n    kind: Workflow\\n    name: 'seldon-mnist-tfjob-gslnf'\\n    uid: '374f3388-c9be-4bd1-9871-06b7f5efe262'\\nspec:\\n  tfReplicaSpecs:\\n    Worker:\\n      replicas: 1\\n      template:\\n        spec:\\n          containers:\\n          - image: 'seldonio/deepmnistclassifier_trainer:0.3'\\n            name: tensorflow\\n            volumeMounts:\\n            - mountPath: /data\\n              name: persistent-storage\\n          restartPolicy: OnFailure\\n          volumes:\\n          - name: persistent-storage\\n            persistentVolumeClaim:\\n              claimName: 'seldon-mnist-tfjob-gslnf-modelpvc'\\n      tfReplicaType: MASTER\\n\"}}"
time="2019-11-18T16:20:25Z" level=info msg="Loading manifest to /tmp/manifest.yaml"
time="2019-11-18T16:20:25Z" level=info msg="kubectl create -f /tmp/manifest.yaml -o json"
time="2019-11-18T16:20:26Z" level=info msg=kubeflow/TFJob.kubeflow.org/mnist-train-374f3388-c9be-4bd1-9871-06b7f5efe262
time="2019-11-18T16:20:26Z" level=info msg="Saving resource output parameters"
time="2019-11-18T16:20:26Z" level=info msg="[kubectl get TFJob.kubeflow.org/mnist-train-374f3388-c9be-4bd1-9871-06b7f5efe262 -o jsonpath={} -n kubeflow]"
time="2019-11-18T16:20:27Z" level=error msg="`[kubectl get TFJob.kubeflow.org/mnist-train-374f3388-c9be-4bd1-9871-06b7f5efe262 -o jsonpath={} -n kubeflow]` stderr:\nError from server (NotFound): tfjobs.kubeflow.org \"mnist-train-374f3388-c9be-4bd1-9871-06b7f5efe262\" not found\n"
time="2019-11-18T16:20:27Z" level=error msg="executor error: exit status 1\ngithub.com/argoproj/argo/errors.Wrap\n\t/go/src/github.com/argoproj/argo/errors/errors.go:88\ngithub.com/argoproj/argo/errors.InternalWrapError\n\t/go/src/github.com/argoproj/argo/errors/errors.go:71\ngithub.com/argoproj/argo/workflow/executor.(*WorkflowExecutor).SaveResourceParameters\n\t/go/src/github.com/argoproj/argo/workflow/executor/resource.go:290\ngithub.com/argoproj/argo/cmd/argoexec/commands.execResource\n\t/go/src/github.com/argoproj/argo/cmd/argoexec/commands/resource.go:55\ngithub.com/argoproj/argo/cmd/argoexec/commands.NewResourceCommand.func1\n\t/go/src/github.com/argoproj/argo/cmd/argoexec/commands/resource.go:21\ngithub.com/spf13/cobra.(*Command).execute\n\t/go/src/github.com/spf13/cobra/command.go:766\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/go/src/github.com/spf13/cobra/command.go:852\ngithub.com/spf13/cobra.(*Command).Execute\n\t/go/src/github.com/spf13/cobra/command.go:800\nmain.main\n\t/go/src/github.com/argoproj/argo/cmd/argoexec/main.go:17\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:201\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1333"
time="2019-11-18T16:20:27Z" level=fatal msg="exit status 1\ngithub.com/argoproj/argo/errors.Wrap\n\t/go/src/github.com/argoproj/argo/errors/errors.go:88\ngithub.com/argoproj/argo/errors.InternalWrapError\n\t/go/src/github.com/argoproj/argo/errors/errors.go:71\ngithub.com/argoproj/argo/workflow/executor.(*WorkflowExecutor).SaveResourceParameters\n\t/go/src/github.com/argoproj/argo/workflow/executor/resource.go:290\ngithub.com/argoproj/argo/cmd/argoexec/commands.execResource\n\t/go/src/github.com/argoproj/argo/cmd/argoexec/commands/resource.go:55\ngithub.com/argoproj/argo/cmd/argoexec/commands.NewResourceCommand.func1\n\t/go/src/github.com/argoproj/argo/cmd/argoexec/commands/resource.go:21\ngithub.com/spf13/cobra.(*Command).execute\n\t/go/src/github.com/spf13/cobra/command.go:766\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/go/src/github.com/spf13/cobra/command.go:852\ngithub.com/spf13/cobra.(*Command).Execute\n\t/go/src/github.com/spf13/cobra/command.go:800\nmain.main\n\t/go/src/github.com/argoproj/argo/cmd/argoexec/main.go:17\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:201\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1333"

The Python is:


from kubernetes import client as k8s_client
import kfp.dsl as dsl
import json
from string import Template

@dsl.pipeline(
    name="Seldon MNIST TFJob",
    description="Example of training seldon MNIST TF model. Like kubeflow/example-seldon but using existing images."
)
def mnist_tfjob_volume(docker_repo_training='seldonio/deepmnistclassifier_trainer',
                       docker_tag_training='0.3',
                       docker_repo_serving='seldonio/deepmnistclassifier_runtime',
                       docker_tag_serving='0.3'):

    # use a volume for storing the model
    # here the model is saved and mounted into a pre-defined image for serving
    # alternatively the model can be baked into the image - for that see mabdeploy-seldon.py
    # requires seldon v0.3.0 or higher
    modelvolop = dsl.VolumeOp(
        name="modelpvc",
        resource_name="modelpvc",
        size="50Mi",
        modes=dsl.VOLUME_MODE_RWO
    )

    tfjobjson_template = Template("""
{
	"apiVersion": "kubeflow.org/v1",
	"kind": "TFJob",
	"metadata": {
		"name": "mnist-train-{{workflow.uid}}",
        "namespace": "kubeflow",
		"ownerReferences": [
		{
			"apiVersion": "argoproj.io/v1alpha1",
			"kind": "Workflow",
			"controller": true,
			"name": "{{workflow.name}}",
			"uid": "{{workflow.uid}}"
		}
	    ]
	},
	"spec": {
		"tfReplicaSpecs": {
			"Worker": {
				"replicas": 1,
				"template": {
					"spec": {
						"containers": [
							{
								"image": "$dockerrepotraining:$dockertagtraining",
								"name": "tensorflow",
								"volumeMounts": [
									{
										"mountPath": "/data",
										"name": "persistent-storage"
									}
								]
							}
						],
						"restartPolicy": "OnFailure",
						"volumes": [
							{
								"name": "persistent-storage",
								"persistentVolumeClaim": {
									"claimName": "$modelpvc"
								}
							}
						]
					}
				},
				"tfReplicaType": "MASTER"
			}
		}
	}
}
""")

    tfjobjson = tfjobjson_template.substitute({
        'dockerrepotraining': str(docker_repo_training),
        'dockertagtraining': str(docker_tag_training),
        'modelpvc': modelvolop.outputs["name"],
    })

    tfjob = json.loads(tfjobjson)

    train = dsl.ResourceOp(
        name="train",
        k8s_resource=tfjob
    )



if __name__ == "__main__":
    import kfp.compiler as compiler
    compiler.Compiler().compile(mnist_tfjob_volume, __file__ + ".tar.gz")
@Ark-kun
Contributor

Ark-kun commented Nov 18, 2019

> If I extract the pipeline yaml and do a argo submit -n kubeflow then that works (creates TFJob and runs to completion). It only fails when I upload it through to the pipelines UI.

This looks strange. @IronPan Can you please take a look?

@Ark-kun
Contributor

Ark-kun commented Nov 18, 2019

We probably need to compare the Pod YAML in both cases.
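
Something like this could capture the resource step's pod from each run for a diff (the pod names are placeholders):

# dump the pod spec from the UI-triggered run and from the direct argo-submit run
kubectl -n kubeflow get pod <ui-run-pod-name> -o yaml > ui-run-pod.yaml
kubectl -n kubeflow get pod <argo-submit-pod-name> -o yaml > argo-submit-pod.yaml
diff ui-run-pod.yaml argo-submit-pod.yaml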

@ryandawsonuk
Contributor Author

ryandawsonuk commented Nov 19, 2019

@Ark-kun I'm not sure what you mean about comparing the Pod YAML? I've extracted what I can from doing argo get to compare the specs that are submitted to Argo. I'm finding it difficult to debug because in the failure case it doesn't even seem to submit the TFJob. My local argo client is v2.3.0-rc3. Perhaps this is broken in the argo client code bundled inside the pipelines code. The message does look like it could be an Argo error.
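
For reference, this is roughly what I ran to pull the submitted workflow spec (the workflow name is taken from the failing run's log; adjust as needed):

argo get seldon-mnist-tfjob-gslnf -n kubeflow -o yaml > failing-workflow.yaml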

I'm running on GKE, installed with kfctl built from master, using the existing_arrikto/istio_dex platform.

@gaoning777
Contributor

gaoning777 commented Jan 6, 2020

/unassign @gaoning777
/assign @Ark-kun

k8s-ci-robot assigned Ark-kun and unassigned gaoning777 on Jan 6, 2020
@ryandawsonuk
Contributor Author

ryandawsonuk commented Feb 13, 2020

Does this need triage or prioritising? Presumably this is still a recommended way of running TFJob with volumes?

@stale

stale bot commented Jun 24, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

The stale bot added the lifecycle/stale label on Jun 24, 2020
@stale

stale bot commented Jul 1, 2020

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

The stale bot closed this as completed on Jul 1, 2020