Search before asking
I searched the issues and found no similar issues.
KubeRay Component
ray-operator
What happened + What you expected to happen
When submitting a KubeRay job (RayJob CRD) whose entrypoint is guaranteed to fail, the RayJob's status is never updated to JobStatusFailed; the Status object is not created.
The spawned batchv1.Job has its BackoffLimit set to the default value of 6, which causes it to be rerun 6 more times, with the subsequent runs failing with this error message:
2023-10-11 01:08:19,318 INFO cli.py:36 -- Job submission server address: http://sample-cluster-head-svc.ray-system.svc.cluster.local:8265
Traceback (most recent call last):
  File "/home/ray/anaconda3/bin/ray", line 8, in <module>
    sys.exit(main())
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/scripts/scripts.py", line 2498, in main
    return cli()
  File "/home/ray/anaconda3/lib/python3.8/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ray/anaconda3/lib/python3.8/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ray/anaconda3/lib/python3.8/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/dashboard/modules/job/cli_utils.py", line 54, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/cli_logger.py", line 856, in wrapper
    return f(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/dashboard/modules/job/cli.py", line 262, in submit
    job_id = client.submit_job(
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/dashboard/modules/job/sdk.py", line 231, in submit_job
    self._raise_error(r)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/dashboard/modules/dashboard_sdk.py", line 283, in _raise_error
    raise RuntimeError(
RuntimeError: Request failed with status code 500: Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/dashboard/modules/job/job_head.py", line 287, in submit_job
    resp = await job_agent_client.submit_job_internal(submit_request)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/dashboard/modules/job/job_head.py", line 80, in submit_job_internal
    await self._raise_error(resp)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/dashboard/modules/job/job_head.py", line 68, in _raise_error
    raise RuntimeError(f"Request failed with status code {status}: {error_text}.")
RuntimeError: Request failed with status code 400: Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/dashboard/modules/job/job_agent.py", line 45, in submit_job
    submission_id = await self.get_job_manager().submit_job(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/dashboard/modules/job/job_manager.py", line 903, in submit_job
    raise ValueError(
ValueError: Job with submission_id failing-sample-job-g5fkm already exists. Please use a different submission_id.
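For context on the inner 400: the first run registers the submission ID, the entrypoint fails, and each batchv1.Job retry reruns ray job submit with the same submission ID, which the dashboard then rejects. A hand-run sketch of the same collision, assuming a made-up failing entrypoint (--address and --submission-id are standard ray job submit flags):

ray job submit --address http://sample-cluster-head-svc.ray-system.svc.cluster.local:8265 --submission-id failing-sample-job-g5fkm -- python -c "raise RuntimeError('fail fast')"
# first attempt: submission is registered, the entrypoint fails, the CLI exits non-zero
ray job submit --address http://sample-cluster-head-svc.ray-system.svc.cluster.local:8265 --submission-id failing-sample-job-g5fkm -- python -c "raise RuntimeError('fail fast')"
# every subsequent attempt: rejected with the ValueError above; the job is never actually rerun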
The expected behaviour is:
- The RayJob would have a failed status.
- Only one batchv1.Job would be submitted, and it would not be retried.
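For reference, the knob involved is the batch/v1 Job field spec.backoffLimit. A minimal sketch of a hand-written submitter-style Job that fails once and is never retried; the name, image, and command are hypothetical illustrations, not the operator-generated spec:

apiVersion: batch/v1
kind: Job
metadata:
  name: failing-sample-job-submitter  # hypothetical name
  namespace: ray-system
spec:
  backoffLimit: 0  # Kubernetes default is 6; 0 disables retries after the first failed Pod
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: ray-job-submitter
        image: rayproject/ray:2.7.0  # assumed image tag
        command: ["ray", "job", "submit", "--address", "http://sample-cluster-head-svc.ray-system.svc.cluster.local:8265", "--", "python", "-c", "raise RuntimeError('fail fast')"]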
Reproduction script
1. Create the ray-system namespace: kubectl create namespace ray-system
2. Create the Ray cluster using the attached sample: kubectl apply -f cluster.yaml.txt
3. Submit the Ray job using the attached sample: kubectl apply -f fail_fast_job.yaml.txt (a minimal sketch follows)
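For readers without the attachments, a minimal sketch of what fail_fast_job.yaml.txt could contain; the apiVersion, names, and failing entrypoint are assumptions, not the attached file:

apiVersion: ray.io/v1alpha1
kind: RayJob
metadata:
  name: failing-sample-job  # assumed; the log shows submission id failing-sample-job-g5fkm
  namespace: ray-system
spec:
  clusterSelector:
    ray.io/cluster: sample-cluster  # assumed cluster name, inferred from the head service in the log
  entrypoint: python -c "raise RuntimeError('fail fast')"  # guaranteed to fail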
After 20 minutes or so the batchv1.Job status looks like this:
status:
  conditions:
  - lastProbeTime: "2023-10-11T08:27:32Z"
    lastTransitionTime: "2023-10-11T08:27:32Z"
    message: Job has reached the specified backoff limit
    reason: BackoffLimitExceeded
    status: "True"
    type: Failed
  failed: 7
  startTime: "2023-10-11T08:07:29Z"
The RayJob status is not updated.
Deleting the RayJob does not delete the batchv1.Job. An odd thing I noticed while looking at the batchv1.Job YAML is the owner labels; I suspect they should point to the RayJob instance.
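Both observations can be double-checked with something like the following; the resource names are assumed from the submission ID in the log:

kubectl -n ray-system get rayjob failing-sample-job -o jsonpath='{.status}'
# stays un-updated per the report, instead of showing a failed job status
kubectl -n ray-system get job failing-sample-job -o jsonpath='{.metadata.ownerReferences}'
# if this pointed at the RayJob instance, deleting the RayJob would garbage-collect the Job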
Anything else
This might be related to #1478 and #1233.
Are you willing to submit a PR?