vcjob should store err_msg and set its status to failed when its task failed #740

Closed
silenceli opened this issue Mar 12, 2020 · 2 comments · Fixed by #746

silenceli commented Mar 12, 2020

Launch a vcjob with only one task. If the task fails (OOM or some other reason), we can find the failure reason from the pod but not from the vcjob.

I think the vcjob should record whether each of its tasks finished or failed, and the failure reason when a task fails. If one of the tasks fails and the user has set restartPolicy to Never, I think the status of the vcjob should be Failed instead of Completed.

As a user, if I only monitor the vcjob, I cannot tell whether it succeeded or failed.
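
The behaviour requested here can be sketched as a small aggregation rule. This is only an illustration of the proposal, not Volcano's controller code: `TaskResult`, `derive_job_status`, and the error message format are hypothetical names, and the rule assumes each task reports a Kubernetes pod phase plus an optional termination reason.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class TaskResult:
    """Hypothetical summary of one task pod; not a Volcano type."""
    task: str                     # task name, e.g. "master"
    pod_phase: str                # Kubernetes pod phase: "Succeeded", "Failed", "Running", ...
    reason: Optional[str] = None  # termination reason, e.g. "OOMKilled"


def derive_job_status(results: List[TaskResult],
                      restart_policy: str = "Never") -> Tuple[str, List[str]]:
    """Return (job_phase, err_msgs) under the rule proposed in this issue."""
    # Collect one message per failed task so the job can store the reasons.
    err_msgs = [
        f"task {r.task} failed: {r.reason or 'unknown reason'}"
        for r in results
        if r.pod_phase == "Failed"
    ]
    # A failed task that will not be restarted makes the whole job Failed.
    if err_msgs and restart_policy == "Never":
        return "Failed", err_msgs
    # Only report Completed when every task actually succeeded.
    if results and all(r.pod_phase == "Succeeded" for r in results):
        return "Completed", []
    return "Running", err_msgs
```

Under this rule, a single `master` task whose pod is `Failed` with reason `OOMKilled` would give the job phase `Failed` and the message `task master failed: OOMKilled`, rather than `Completed`.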

An example:

  1. Start a vcjob with only one task:
{
 "kind": "Job", 
 "spec": {
  "queue": "training-jarvis-dev-team-k8s-training-test", 
  "tasks": [
   {
    "policies": [
     {
      "action": "CompleteJob", 
      "event": "TaskCompleted"
     }
    ], 
    "name": "master", 
    "template": {
     "spec": {
      "restartPolicy": "Never", 
      "initContainers": [
       {
        "image": "docker-registry.qiyi.virtual/jarvis-image/jarvis-k8s-training-runtime:zdev-03", 
        "volumeMounts": [
         {
          "mountPath": "/opt/ml/bin", 
          "name": "app-volume"
         }
        ], 
        "name": "runtime"
       }
      ], 
      "containers": [
       {
        "securityContext": {
         "capabilities": {
          "add": [
           "SYS_ADMIN"
          ]
         }
        }, 
        "name": "jarvis-single-job", 
        "image": "docker-registry.qiyi.virtual/jarvis-public/conda-tensorflow-1.7-py36:v1.2", 
        "volumeMounts": [
         {
          "mountPath": "/opt/ml/bin", 
          "name": "app-volume"
         }
        ], 
        "command": [
         "/bin/bash", 
         "/opt/ml/bin/start.sh"
        ], 
        "ports": [
         {
          "containerPort": 2222, 
          "name": "tfjob-port"
         }
        ], 
        "resources": {
         "requests": {
          "qiyi.com/fuse": 1, 
          "cpu": 8.0, 
          "nvidia.com/gpu": 1.0, 
          "memory": "200.0Mi"
         }, 
         "limits": {
          "qiyi.com/fuse": 1, 
          "cpu": 8.0, 
          "nvidia.com/gpu": 1.0, 
          "memory": "200.0Mi"
         }
        }
       }
      ], 
      "volumes": [
       {
        "emptyDir": {}, 
        "name": "app-volume"
       }
      ]
     }
    }, 
    "replicas": 1
   }
  ], 
  "policies": [
   {
    "action": "RestartJob", 
    "event": "PodEvicted"
   }
  ], 
  "schedulerName": "volcano", 
  "minAvailable": 1, 
  "plugins": {
   "svc": [], 
   "env": []
  }
 }, 
 "apiVersion": "batch.volcano.sh/v1alpha1", 
 "metadata": {
  "labels": {
   "cluster_name": "k8s-training-test", 
   "group": "jarvis-public", 
   "name": "tensorflow-models200312190939", 
   "creator": "lihao04", 
   "jarvis-qrn-type": "k8s-training", 
   "team_name": "jarvis-dev-team"
  }, 
  "namespace": "training-jarvis-dev-team-k8s-training-test", 
  "name": "tensorflow-models200312190939"
 }
}
  2. The vcjob starts.
    We can see that the vcjob's task is running:
[root@localhost ~]# kubectl get vcjob --all-namespaces
NAMESPACE                                    NAME                            AGE
training-jarvis-dev-team-k8s-training-test   tensorflow-models200312183411   39m
training-jarvis-dev-team-k8s-training-test   tensorflow-models200312190939   68s

[root@localhost ~]# kubectl get pod --all-namespaces
...
training-jarvis-dev-team-k8s-training-test   tensorflow-models200312190939-master-0                           1/1     Running   0          15s    10.244.3.93     mesos-gpu-online001-bjdxt9.cloud.qiyi.domain   <none>           <none>
...
  3. An OOM occurs; the pod is set to OOMKilled, but the vcjob is set to Completed:
# Pod shows OOMKilled
[root@localhost ~]# kubectl get pod --all-namespaces
...
training-jarvis-dev-team-k8s-training-test   tensorflow-models200312190939-master-0                           0/1     OOMKilled   0          2m6s   10.244.3.93     mesos-gpu-online001-bjdxt9.cloud.qiyi.domain   <none>           <none>
...

# vcjob shows Completed
[root@localhost ~]# kubectl get vcjob tensorflow-models200312190939 -n training-jarvis-dev-team-k8s-training-test -o json
{
    "apiVersion": "batch.volcano.sh/v1alpha1",
    "kind": "Job",
    "metadata": {
        "creationTimestamp": "2020-03-12T11:12:52Z",
        "generation": 1,
        "labels": {
            "cluster_name": "k8s-training-test",
            "creator": "lihao04",
            "group": "jarvis-public",
            "jarvis-qrn-type": "k8s-training",
            "name": "tensorflow-models200312190939",
            "team_name": "jarvis-dev-team"
        },
        "name": "tensorflow-models200312190939",
        "namespace": "training-jarvis-dev-team-k8s-training-test",
        "resourceVersion": "30907790",
        "selfLink": "/apis/batch.volcano.sh/v1alpha1/namespaces/training-jarvis-dev-team-k8s-training-test/jobs/tensorflow-models200312190939",
        "uid": "88f4e5e3-07ab-44cf-8e27-690fdf1c025b"
    },
    "spec": {
        "minAvailable": 1,
        "plugins": {
            "env": [],
            "svc": []
        },
        "policies": [
            {
                "action": "RestartJob",
                "event": "PodEvicted"
            }
        ],
        "queue": "training-jarvis-dev-team-k8s-training-test",
        "schedulerName": "volcano",
        "tasks": [
            {
                "name": "master",
                "policies": [
                    {
                        "action": "CompleteJob",
                        "event": "TaskCompleted"
                    }
                ],

...

    "status": {
        "minAvailable": 1,
        "state": {
            "lastTransitionTime": "2020-03-12T11:14:16Z",
            "phase": "Completed"
        },
        "succeeded": 1,
        "version": 3
    }
}
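
Until the vcjob itself records the failure, the reason has to be read from the pod. A minimal sketch of that lookup follows, assuming the input is the parsed output of `kubectl get pod <name> -o json`. The helper name `termination_reasons` and the trimmed sample are illustrative; the exit code 137 in the sample is an assumption (typical for OOM kills) and is not taken from the output above.

```python
import json
from typing import Dict, List


def termination_reasons(pod: Dict) -> List[str]:
    """Collect 'container: reason (exit code N)' for containers that exited non-zero."""
    reasons = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        terminated = cs.get("state", {}).get("terminated")
        # Skip containers that are still running or that exited cleanly.
        if terminated and terminated.get("exitCode", 0) != 0:
            reasons.append(
                f"{cs['name']}: {terminated.get('reason', 'Unknown')} "
                f"(exit code {terminated['exitCode']})"
            )
    return reasons


# Trimmed, hand-written sample shaped like a Pod object; values are illustrative.
sample_pod = json.loads("""
{
  "status": {
    "phase": "Failed",
    "containerStatuses": [
      {
        "name": "jarvis-single-job",
        "state": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}
      }
    ]
  }
}
""")

print(termination_reasons(sample_pod))
```

A monitoring script could run this against every pod belonging to a vcjob and treat a non-empty result as a failure, even when the vcjob phase reads `Completed`.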

k82cn commented Mar 22, 2020

@hzxuzhonghu, could you help with this one?

@hzxuzhonghu

/assign
