Metrics not reporting to Katib server - experiment timing out #1905
Comments
Can you provide
Can you also provide final experiment config by
Your experiment seems to be in the kubeflow namespace. Can you provide
I see that you use Katib from the master branch, while the training operator is from the 1.3 release.
And yes, I will try to install the versions you mentioned and see.
The problem was fixed after installing the specified versions. Thank you.
I am trying to create an experiment in a Kubeflow pipeline using Python, in which I can tune the hyperparameters of a simple script. I want to use Katib for the hyperparameter tuning from Python (not by applying a YAML file). The problem is that I can't report the metrics to the Katib server. Because the metrics are never reported, the experiment times out, so I need some help from the community.
Here is what I have tried:
The above given JSON is my trial spec. I have given the entire pipeline code below:
and this is my /opt/trainer/task.py code
The original task.py had more training logic. I removed it because my only problem is reporting metrics to the Katib server. Since I am simply picking a random number from a list and reporting it to the Katib server, I expected it to work, but it doesn't.
This gives me an error as given below:
So I changed the path to the full path, /opt/trainer/katib/metrics.log. With that change, the experiment just times out.
FYI: Katib can get into my container, and the pod logs say it succeeded, but the metrics are not reported. I would appreciate help from the community as soon as possible.
Please comment if you need any more information. I have tried many other things, but I can't post all of them here.
This is how I created my cluster and did all the installation:
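For readers following along without the original script, here is a minimal sketch of what a metrics-reporting task of this kind might look like. It is not the task.py from this issue. It assumes a file-based metrics collector reading plain `name=value` lines (Katib's default text format), and the metric name `accuracy`, the value list, and the helper names `report_metric` and `main` are all hypothetical. The metric name would have to match the objective metric name in the experiment spec.

```python
import os
import random


def report_metric(path, name, value):
    """Append one `name=value` line to the metrics file at `path`."""
    directory = os.path.dirname(path)
    if directory:
        # Create the metrics directory if the image does not ship it.
        os.makedirs(directory, exist_ok=True)
    with open(path, "a") as f:
        f.write(f"{name}={value}\n")


def main(metrics_path="/opt/trainer/katib/metrics.log"):
    # Stand-in for real training: pick a value from a fixed list.
    # The list and the metric name "accuracy" are illustrative assumptions.
    accuracy = random.choice([0.71, 0.82, 0.93])
    report_metric(metrics_path, "accuracy", accuracy)
    # Print the same line so it also shows up in the pod logs.
    print(f"accuracy={accuracy}")
```

A script built this way would call `main()` at the bottom. Whether the collector picks the file up still depends on the experiment's metrics collector settings pointing at the same path, and on compatible Katib and training operator versions.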
test.yaml file
references: