If creating the metricsCollector fails, trial worker state is not updated #298

Closed
hougangliu opened this issue Dec 18, 2018 · 2 comments

When a StudyJob is created with a malformed metricsCollector template, the StudyJob changes to Failed, as expected. However, the worker jobs that were already created and started are not deleted, even after they complete. In addition, studyJob.status.trials.workeridlist still reports each worker's condition as Running.

# kubectl get studyjob random-example3 -n kubeflow -o yaml
apiVersion: kubeflow.org/v1alpha1
kind: StudyJob
metadata:
  creationTimestamp: 2018-12-18T09:07:26Z
  generation: 1
  labels:
    controller-tools.k8s.io: "1.0"
  name: random-example3
  namespace: kubeflow
  resourceVersion: "15105533"
  selfLink: /apis/kubeflow.org/v1alpha1/namespaces/kubeflow/studyjobs/random-example3
  uid: 56d9b7d0-02a4-11e9-bdb4-005056ad997c
spec:
  metricsnames:
  - accuracy
  objectivevaluename: Validation-accuracy
  optimizationgoal: 0.99
  optimizationtype: maximize
  owner: crd
  parameterconfigs:
  - feasible:
      max: "0.03"
      min: "0.01"
    name: --lr
    parametertype: double
  - feasible:
      max: "5"
      min: "2"
    name: --num-layers
    parametertype: int
  - feasible:
      list:
      - sgd
      - adam
      - ftrl
    name: --optimizer
    parametertype: categorical
  requestcount: 4
  studyName: random-example3
  suggestionSpec:
    requestNumber: 3
    suggestionAlgorithm: random
    suggestionParameters:
    - name: SuggestionCount
      value: "0"
  workerSpec:
    goTemplate:
      rawTemplate: |-
        apiVersion: batch/v1
        kind: Job
        metadata:
          name: {{.WorkerID}}
          namespace: kubeflow
        spec:
          template:
            spec:
              containers:
              - name: {{.WorkerID}}
                image: katib/mxnet-mnist-example
                command:
                - "python"
                - "/mxnet/example/image-classification/train_mnist.py"
                - "--batch-size=64"
                {{- with .HyperParameters}}
                {{- range .}}
                - "{{.Name}}={{.Value}}"
                {{- end}}
                {{- end}}
              restartPolicy: Never
status:
  conditon: Failed
  earlyStoppingParameterId: ""
  studyid: i3fe786f1788c7df
  suggestionCount: 2
  suggestionParameterId: x3e163b1e2c4a444
  trials:
  - trialid: hb6a46812e7b4afb
    workeridlist:
    - completionTime: null
      conditon: Running
      kind: Job
      startTime: 2018-12-18T09:07:25Z
      workerid: w7b8bce3d1ff2bcf
  - trialid: v11510163d73c869
    workeridlist:
    - completionTime: null
      conditon: Running
      kind: Job
      startTime: 2018-12-18T09:07:25Z
      workerid: nae11a4e92b1493a
  - trialid: x24253c57f10b386
    workeridlist:
    - completionTime: null
      conditon: Running
      kind: Job
      startTime: 2018-12-18T09:07:25Z
      workerid: ybab4ed615748dce

# kubectl get job ybab4ed615748dce nae11a4e92b1493a w7b8bce3d1ff2bcf -n kubeflow
NAME               DESIRED   SUCCESSFUL   AGE
ybab4ed615748dce   1         1            48m
NAME               DESIRED   SUCCESSFUL   AGE
nae11a4e92b1493a   1         1            48m
NAME               DESIRED   SUCCESSFUL   AGE
w7b8bce3d1ff2bcf   1         1            48m
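
To make the expected behavior concrete: the Jobs above report SUCCESSFUL 1, while the workeridlist entries still say Running. Below is a minimal sketch of the status refresh I would expect to run even after the StudyJob is marked Failed. It is not the controller's actual code. The types `WorkerCondition` and `JobStatus`, the function `sync_worker_conditions`, and the condition strings `Completed` and `Failed` for workers are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


# Hypothetical, simplified stand-ins for the status types.
# They are NOT the real StudyJob controller types.
@dataclass
class WorkerCondition:
    workerid: str
    conditon: str = "Running"  # spelled as in the status output above
    completion_time: Optional[str] = None


@dataclass
class JobStatus:
    succeeded: int = 0
    failed: int = 0
    completion_time: Optional[str] = None


def sync_worker_conditions(
    workers: List[WorkerCondition], jobs: Dict[str, JobStatus]
) -> List[WorkerCondition]:
    """Refresh each worker's condition from its Job.

    Assumption: this runs on every reconcile, independent of whether the
    StudyJob itself has already been marked Failed.
    """
    for worker in workers:
        job = jobs.get(worker.workerid)
        if job is None:
            continue  # no Job found for this worker; leave it unchanged
        if job.succeeded > 0:
            worker.conditon = "Completed"
            worker.completion_time = job.completion_time
        elif job.failed > 0:
            worker.conditon = "Failed"
            worker.completion_time = job.completion_time
    return workers


if __name__ == "__main__":
    # Worker IDs taken from the output above; the Job statuses mirror
    # the "SUCCESSFUL 1" column.
    workers = [
        WorkerCondition("w7b8bce3d1ff2bcf"),
        WorkerCondition("nae11a4e92b1493a"),
        WorkerCondition("ybab4ed615748dce"),
    ]
    jobs = {w.workerid: JobStatus(succeeded=1) for w in workers}
    for w in sync_worker_conditions(workers, jobs):
        print(w.workerid, w.conditon)
```

With a refresh like this, the three workers would move out of Running once their Jobs succeed. Whether the completed Jobs should then also be deleted is a separate decision.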

hougangliu commented Dec 18, 2018

/assign @hougangliu

/assign
