
can't create TFJob from ResourceOp #2624

Closed
ryandawsonuk opened this issue Nov 18, 2019 · 7 comments
Assignees
Ark-kun
Labels
area/backend, lifecycle/stale

Comments

@ryandawsonuk
Contributor

ryandawsonuk commented Nov 18, 2019

I'm trying to run a pipeline that creates a TFJob using the Python below. It used to work with a previous version of Pipelines, and I'm trying to update it to reflect changes in TFJob, but it fails before even attempting to create the TFJob.

Much the same TFJob definition, with fixed values for the parameters, works if I submit it directly (i.e. not through Pipelines). If I extract the pipeline YAML and do an argo submit -n kubeflow, that also works (it creates the TFJob and runs to completion). It only fails when I upload it through the Pipelines UI.
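
For reference, the working direct-submission path is roughly the following (the script name and the extracted pipeline.yaml are placeholders for my local files):

# compile the pipeline and extract the Argo workflow YAML from the archive
python mnist_tfjob_volume.py
tar -xzf mnist_tfjob_volume.py.tar.gz
# submit the extracted workflow directly to Argo
argo submit pipeline.yaml -n kubeflow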

The error is:

time="2019-11-18T16:20:25Z" level=info msg="Executor (version: v2.3.0, build_date: 2019-05-20T22:10:54Z) initialized (pod: kubeflow/seldon-mnist-tfjob-gslnf-3364565779) with template:\n{\"name\":\"train\",\"inputs\":{\"parameters\":[{\"name\":\"docker-repo-training\",\"value\":\"seldonio/deepmnistclassifier_trainer\"},{\"name\":\"docker-tag-training\",\"value\":\"0.3\"},{\"name\":\"modelpvc-name\",\"value\":\"seldon-mnist-tfjob-gslnf-modelpvc\"}]},\"outputs\":{\"parameters\":[{\"name\":\"train-manifest\",\"valueFrom\":{\"jsonPath\":\"{}\"}},{\"name\":\"train-name\",\"valueFrom\":{\"jsonPath\":\"{.metadata.name}\"}}]},\"metadata\":{},\"resource\":{\"action\":\"create\",\"manifest\":\"apiVersion: kubeflow.org/v1\\nkind: TFJob\\nmetadata:\\n  name: mnist-train-374f3388-c9be-4bd1-9871-06b7f5efe262\\n  namespace: kubeflow\\n  ownerReferences:\\n  - apiVersion: argoproj.io/v1alpha1\\n    controller: true\\n    kind: Workflow\\n    name: 'seldon-mnist-tfjob-gslnf'\\n    uid: '374f3388-c9be-4bd1-9871-06b7f5efe262'\\nspec:\\n  tfReplicaSpecs:\\n    Worker:\\n      replicas: 1\\n      template:\\n        spec:\\n          containers:\\n          - image: 'seldonio/deepmnistclassifier_trainer:0.3'\\n            name: tensorflow\\n            volumeMounts:\\n            - mountPath: /data\\n              name: persistent-storage\\n          restartPolicy: OnFailure\\n          volumes:\\n          - name: persistent-storage\\n            persistentVolumeClaim:\\n              claimName: 'seldon-mnist-tfjob-gslnf-modelpvc'\\n      tfReplicaType: MASTER\\n\"}}"
time="2019-11-18T16:20:25Z" level=info msg="Loading manifest to /tmp/manifest.yaml"
time="2019-11-18T16:20:25Z" level=info msg="kubectl create -f /tmp/manifest.yaml -o json"
time="2019-11-18T16:20:26Z" level=info msg=kubeflow/TFJob.kubeflow.org/mnist-train-374f3388-c9be-4bd1-9871-06b7f5efe262
time="2019-11-18T16:20:26Z" level=info msg="Saving resource output parameters"
time="2019-11-18T16:20:26Z" level=info msg="[kubectl get TFJob.kubeflow.org/mnist-train-374f3388-c9be-4bd1-9871-06b7f5efe262 -o jsonpath={} -n kubeflow]"
time="2019-11-18T16:20:27Z" level=error msg="`[kubectl get TFJob.kubeflow.org/mnist-train-374f3388-c9be-4bd1-9871-06b7f5efe262 -o jsonpath={} -n kubeflow]` stderr:\nError from server (NotFound): tfjobs.kubeflow.org \"mnist-train-374f3388-c9be-4bd1-9871-06b7f5efe262\" not found\n"
time="2019-11-18T16:20:27Z" level=error msg="executor error: exit status 1\ngithub.com/argoproj/argo/errors.Wrap\n\t/go/src/github.com/argoproj/argo/errors/errors.go:88\ngithub.com/argoproj/argo/errors.InternalWrapError\n\t/go/src/github.com/argoproj/argo/errors/errors.go:71\ngithub.com/argoproj/argo/workflow/executor.(*WorkflowExecutor).SaveResourceParameters\n\t/go/src/github.com/argoproj/argo/workflow/executor/resource.go:290\ngithub.com/argoproj/argo/cmd/argoexec/commands.execResource\n\t/go/src/github.com/argoproj/argo/cmd/argoexec/commands/resource.go:55\ngithub.com/argoproj/argo/cmd/argoexec/commands.NewResourceCommand.func1\n\t/go/src/github.com/argoproj/argo/cmd/argoexec/commands/resource.go:21\ngithub.com/spf13/cobra.(*Command).execute\n\t/go/src/github.com/spf13/cobra/command.go:766\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/go/src/github.com/spf13/cobra/command.go:852\ngithub.com/spf13/cobra.(*Command).Execute\n\t/go/src/github.com/spf13/cobra/command.go:800\nmain.main\n\t/go/src/github.com/argoproj/argo/cmd/argoexec/main.go:17\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:201\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1333"
time="2019-11-18T16:20:27Z" level=fatal msg="exit status 1\ngithub.com/argoproj/argo/errors.Wrap\n\t/go/src/github.com/argoproj/argo/errors/errors.go:88\ngithub.com/argoproj/argo/errors.InternalWrapError\n\t/go/src/github.com/argoproj/argo/errors/errors.go:71\ngithub.com/argoproj/argo/workflow/executor.(*WorkflowExecutor).SaveResourceParameters\n\t/go/src/github.com/argoproj/argo/workflow/executor/resource.go:290\ngithub.com/argoproj/argo/cmd/argoexec/commands.execResource\n\t/go/src/github.com/argoproj/argo/cmd/argoexec/commands/resource.go:55\ngithub.com/argoproj/argo/cmd/argoexec/commands.NewResourceCommand.func1\n\t/go/src/github.com/argoproj/argo/cmd/argoexec/commands/resource.go:21\ngithub.com/spf13/cobra.(*Command).execute\n\t/go/src/github.com/spf13/cobra/command.go:766\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/go/src/github.com/spf13/cobra/command.go:852\ngithub.com/spf13/cobra.(*Command).Execute\n\t/go/src/github.com/spf13/cobra/command.go:800\nmain.main\n\t/go/src/github.com/argoproj/argo/cmd/argoexec/main.go:17\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:201\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1333"

The Python is:


from kubernetes import client as k8s_client
import kfp.dsl as dsl
import json
from string import Template

@dsl.pipeline(
    name="Seldon MNIST TFJob",
    description="Example of training seldon MNIST TF model. Like kubeflow/example-seldon but using existing images."
)
def mnist_tfjob_volume(docker_repo_training='seldonio/deepmnistclassifier_trainer',
                       docker_tag_training='0.3',
                       docker_repo_serving='seldonio/deepmnistclassifier_runtime',
                       docker_tag_serving='0.3'):

    # use a volume for storing the model
    # here the model is saved and mounted into a pre-defined image for serving
    # alternatively the model can be baked into the image - for that see mabdeploy-seldon.py
    # requires seldon v0.3.0 or higher
    modelvolop = dsl.VolumeOp(
        name="modelpvc",
        resource_name="modelpvc",
        size="50Mi",
        modes=dsl.VOLUME_MODE_RWO
    )

    tfjobjson_template = Template("""
{
	"apiVersion": "kubeflow.org/v1",
	"kind": "TFJob",
	"metadata": {
		"name": "mnist-train-{{workflow.uid}}",
        "namespace": "kubeflow",
		"ownerReferences": [
		{
			"apiVersion": "argoproj.io/v1alpha1",
			"kind": "Workflow",
			"controller": true,
			"name": "{{workflow.name}}",
			"uid": "{{workflow.uid}}"
		}
	    ]
	},
	"spec": {
		"tfReplicaSpecs": {
			"Worker": {
				"replicas": 1,
				"template": {
					"spec": {
						"containers": [
							{
								"image": "$dockerrepotraining:$dockertagtraining",
								"name": "tensorflow",
								"volumeMounts": [
									{
										"mountPath": "/data",
										"name": "persistent-storage"
									}
								]
							}
						],
						"restartPolicy": "OnFailure",
						"volumes": [
							{
								"name": "persistent-storage",
								"persistentVolumeClaim": {
									"claimName": "$modelpvc"
								}
							}
						]
					}
				},
				"tfReplicaType": "MASTER"
			}
		}
	}
}
""")

    tfjobjson = tfjobjson_template.substitute({
        'dockerrepotraining': str(docker_repo_training),
        'dockertagtraining': str(docker_tag_training),
        'modelpvc': modelvolop.outputs["name"],
    })

    tfjob = json.loads(tfjobjson)

    train = dsl.ResourceOp(
        name="train",
        k8s_resource=tfjob
    )



if __name__ == "__main__":
    import kfp.compiler as compiler
    compiler.Compiler().compile(mnist_tfjob_volume, __file__ + ".tar.gz")
@Ark-kun
Contributor

Ark-kun commented Nov 18, 2019

> If I extract the pipeline yaml and do a argo submit -n kubeflow then that works (creates TFJob and runs to completion). It only fails when I upload it through to the pipelines UI.

This looks strange. @IronPan Can you please take a look?

@Ark-kun
Contributor

Ark-kun commented Nov 18, 2019

We probably need to compare the Pod YAML in both cases.
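
Something like this could capture the resource step's pod from each run for a diff (the pod names are placeholders):

# dump the pod spec from the UI-triggered run and from the direct argo-submit run
kubectl -n kubeflow get pod <ui-run-pod-name> -o yaml > ui-run-pod.yaml
kubectl -n kubeflow get pod <argo-submit-pod-name> -o yaml > argo-submit-pod.yaml
diff ui-run-pod.yaml argo-submit-pod.yaml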

@ryandawsonuk
Contributor Author

ryandawsonuk commented Nov 19, 2019

@Ark-kun I'm not sure what you mean about comparing the Pod YAML? I've extracted what I can from doing argo get to compare the specs that are submitted to Argo. I'm finding it difficult to debug because in the failure case it doesn't even seem to submit the TFJob. My local argo client is v2.3.0-rc3. Perhaps this is broken in the argo client code bundled inside the pipelines code. The message does look like it could be an Argo error.
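
For reference, this is roughly what I ran to pull the submitted workflow spec (the workflow name is taken from the failing run's log; adjust as needed):

argo get seldon-mnist-tfjob-gslnf -n kubeflow -o yaml > failing-workflow.yaml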

I'm running on GKE, installed with kfctl built from master, using the existing_arrikto/istio_dex platform.

@gaoning777
Contributor

gaoning777 commented Jan 6, 2020

/unassign @gaoning777
/assign @Ark-kun

k8s-ci-robot assigned Ark-kun and unassigned gaoning777 on Jan 6, 2020
@ryandawsonuk
Contributor Author

ryandawsonuk commented Feb 13, 2020

Does this need triage or prioritising? Presumably this is still a recommended way of running TFJob with volumes?

@stale

stale bot commented Jun 24, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

The stale bot added the lifecycle/stale label on Jun 24, 2020
@stale

stale bot commented Jul 1, 2020

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

The stale bot closed this as completed on Jul 1, 2020