StudyJob won't start and StudyJob Controller keeps crashing (invalid memory address) #358

Closed
rummens opened this issue Jan 31, 2019 · 7 comments

@rummens

rummens commented Jan 31, 2019

Hello,

We are experimenting with Katib but cannot get it to work. The random-example from the website works, but nothing else does. We can create the job (using the UI) and the resource appears in the cluster, but it won't start; it just sits there for a long time. Resources (CPU, GPU, memory) are available in the cluster. The job's image is on Docker Hub, so there should be no authentication issues.

Here is the content of the resource:

Name:         test6-job
Namespace:    kubeflow
Labels:       controller-tools.k8s.io=1.0
Annotations:  <none>
API Version:  kubeflow.org/v1alpha1
Kind:         StudyJob
Metadata:
  Creation Timestamp:  2019-01-31T16:26:08Z
  Generation:          1
  Resource Version:    8558014
  Self Link:           /apis/kubeflow.org/v1alpha1/namespaces/kubeflow/studyjobs/test6-job
  UID:                 ea58d78a-2574-11e9-8d26-fe059a32208b
Spec:
  Metrics Collector Spec:
    Go Template:
      Template Path:  defaultMetricsCollectorTemplate.yaml
  Metricsnames:
    loss_val
  Objectivevaluename:  ACCURACY
  Optimizationgoal:    0.95
  Optimizationtype:    minimize
  Owner:               crd
  Parameterconfigs:
    Feasible:
      Max:          10
      Min:          1
    Name:           --epochs
    Parametertype:  int
  Requestcount:     3
  Study Name:       test6
  Suggestion Spec:
    Suggestion Algorithm:  random
    Suggestion Parameters:
  Worker Spec:
    Go Template:
      Template Path:  juan.yaml
Status:
  Early Stopping Parameter Id:  
  Suggestion Parameter Id:      
Events:                         <none>

On a possibly related note, the studyjob-controller pod keeps crashing. Removing and re-adding it does not help. I tried reinstalling everything multiple times, but it still does not work. Here is the log:

kubectl logs studyjob-controller-774d45f695-6ptjq 
2019/01/31 16:31:48 Registering Components.
2019/01/31 16:31:48 Starting the Cmd.
ERROR: logging before flag.Parse: E0131 16:31:49.114419       1 runtime.go:66] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
/go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:72
/go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:65
/go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51
/usr/local/go/src/runtime/asm_amd64.s:522
/usr/local/go/src/runtime/panic.go:513
/usr/local/go/src/text/template/exec.go:160
/usr/local/go/src/runtime/asm_amd64.s:522
/usr/local/go/src/runtime/panic.go:513
/usr/local/go/src/runtime/panic.go:82
/usr/local/go/src/runtime/signal_unix.go:390
/usr/local/go/src/text/template/exec.go:214
/usr/local/go/src/text/template/exec.go:200
/go/src/github.com/kubeflow/katib/pkg/controller/studyjob/manifest_parser.go:177
/go/src/github.com/kubeflow/katib/pkg/controller/studyjob/utils.go:83
/go/src/github.com/kubeflow/katib/pkg/controller/studyjob/katib_api_util.go:28
/go/src/github.com/kubeflow/katib/pkg/controller/studyjob/studyjob_controller.go:190
/go/src/github.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:207
/go/src/github.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:157
/go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
/go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134
/go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88
/usr/local/go/src/runtime/asm_amd64.s:1333
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
        panic: runtime error: invalid memory address or nil pointer dereference [recovered]
        panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x662ffc]

goroutine 217 [running]:
github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
        /go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:58 +0x108
panic(0x1132940, 0x1e6cf90)
        /usr/local/go/src/runtime/panic.go:513 +0x1b9
text/template.errRecover(0xc000973810)
        /usr/local/go/src/text/template/exec.go:160 +0x1d0
panic(0x1132940, 0x1e6cf90)
        /usr/local/go/src/runtime/panic.go:513 +0x1b9
text/template.(*Template).execute(0x0, 0x13ec2a0, 0xc0006cfa40, 0x110eb60, 0xc0007accc0, 0x0, 0x0)
        /usr/local/go/src/text/template/exec.go:214 +0x1cc
text/template.(*Template).Execute(0x0, 0x13ec2a0, 0xc0006cfa40, 0x110eb60, 0xc0007accc0, 0x0, 0xc000486380)
        /usr/local/go/src/text/template/exec.go:200 +0x53
github.com/kubeflow/katib/pkg/controller/studyjob.getMetricsCollectorManifest(0x129edf6, 0x12, 0x129eef2, 0x12, 0xc00081d140, 0x24, 0xc00022b7dc, 0x3, 0xc00022be98, 0x8, ...)
        /go/src/github.com/kubeflow/katib/pkg/controller/studyjob/manifest_parser.go:177 +0x26f
github.com/kubeflow/katib/pkg/controller/studyjob.validateStudy(0xc000133b80, 0xc00022be98, 0x8, 0x9, 0x13f0760)
        /go/src/github.com/kubeflow/katib/pkg/controller/studyjob/utils.go:83 +0x5e3
github.com/kubeflow/katib/pkg/controller/studyjob.initializeStudy(0xc000133b80, 0xc00022be98, 0x8, 0x0, 0x0)
        /go/src/github.com/kubeflow/katib/pkg/controller/studyjob/katib_api_util.go:28 +0x83
github.com/kubeflow/katib/pkg/controller/studyjob.(*ReconcileStudyJobController).Reconcile(0xc0006955e0, 0xc00022be98, 0x8, 0xc00022bee0, 0x9, 0xc000662500, 0x0, 0x0)
        /go/src/github.com/kubeflow/katib/pkg/controller/studyjob/studyjob_controller.go:190 +0x66e
github.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0003101e0, 0x0)
        /go/src/github.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:207 +0x13a
github.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1()
        /go/src/github.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:157 +0x36
github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc000af8a00)
        /go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x54
github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000af8a00, 0x3b9aca00, 0x0, 0x1, 0xc00065a540)
        /go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134 +0xbe
github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait.Until(0xc000af8a00, 0x3b9aca00, 0xc00065a540)
        /go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88 +0x4d
created by github.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start
        /go/src/github.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:156 +0x357

What is going wrong? Is there any Katib log that we can look into? I looked at all Katib-related pods (katib-ui, vizier, etc.), but none of them report anything useful.

Thanks
Marcel

@johnugeorge
Member

johnugeorge commented Jan 31, 2019

I suspect it is a problem with template parsing. See the ConfigMap example in https://github.com/kubeflow/katib/blob/master/examples/grid-example.yaml and its usage in https://github.com/kubeflow/katib/blob/master/examples/grid-example.yaml

To debug, can you also try pasting the template code directly using the RawTemplate field (instead of TemplatePath) and run the StudyJob again?
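
For context on the template-parsing suspicion: in the stack trace above, text/template.(*Template).Execute is called with 0x0 as its first argument, i.e. on a nil template. That is consistent with a template that failed to load or parse and whose error was not checked, although this is an assumption about the controller code and not a confirmed diagnosis. The sketch below is not Katib code; renderManifest, the template text, and the WorkerID value are hypothetical names chosen for illustration. It only shows the general Go pattern of checking the parse error before calling Execute, so that a bad template surfaces as an error rather than a nil pointer dereference.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// renderManifest parses tmplText and executes it against data.
// It returns an error instead of panicking when the template is invalid.
// (Hypothetical helper for illustration; not part of Katib.)
func renderManifest(tmplText string, data interface{}) (string, error) {
	tmpl, err := template.New("manifest").Parse(tmplText)
	if err != nil {
		// Parse returns a nil *Template on failure. Calling Execute on it
		// would dereference a nil pointer, so stop here.
		return "", fmt.Errorf("parsing template: %w", err)
	}

	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, data); err != nil {
		return "", fmt.Errorf("executing template: %w", err)
	}
	return buf.String(), nil
}

func main() {
	// Hypothetical values for illustration only.
	data := map[string]string{"WorkerID": "worker-1"}

	// A well-formed template renders normally.
	out, err := renderManifest("name: {{.WorkerID}}", data)
	fmt.Println(out, err)

	// An unclosed action fails to parse and is reported as an error
	// instead of crashing the process.
	_, err = renderManifest("name: {{.WorkerID", data)
	fmt.Println(err != nil)
}
```

With a guard like this, a missing or malformed worker or metrics-collector template would show up as a readable error in the controller log rather than a crash loop.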

@juan-sv

juan-sv commented Feb 1, 2019

Hello,

I'm using the same installation as Marcel, and the command line isn't working either.
kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/random-example.yaml -n kubeflow
I get the message:
studyjob.kubeflow.org "random-example" created

but nothing shows up in the Study List.

Thanks for your help,
Juan

@hougangliu
Member

Do you mean that kubectl get studyjob -n kubeflow shows nothing?

@juan-sv

juan-sv commented Feb 4, 2019

Hello,

kubectl get studyjob -n kubeflow displays the job, but it seems to have taken several hours to create (its AGE is one day less than that of other tests we tried a few minutes earlier). Four studyjobs are listed, but only one is displayed in the Katib UI.

Is there any way to check the status of these jobs?

Thanks,

@johnugeorge
Member

For status: kubectl get studyjob <studyjobname> -n kubeflow -o yaml

@rummens
Author

rummens commented Feb 5, 2019

FYI, an update to Kubeflow 0.4.1 fixed the issue with the studyjob-controller; it runs fine now.

@juan-sv Do the jobs created via YAML appear in the UI? Do jobs created by the UI work?

@hougangliu
Member

/assign
