[Bug] KubeRay Job CRD doesn't get its status updated to "Failure" #1480

z103cb · 2023-10-11T08:35:45Z

Search before asking

I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

When submitting a KubeRay job (RayJob CRD) that is guaranteed to fail, the RayJob status is not updated to JobStatusFailed, and the Status object is not created.

The spawned batchv1.Job has its BackoffLimit set to the default value of 6, so it is retried 6 more times, and each subsequent run fails with this error message:

2023-10-11 01:08:19,318    INFO cli.py:36 -- Job submission server address: http://sample-cluster-head-svc.ray-system.svc.cluster.local:8265
Traceback (most recent call last):
  File "/home/ray/anaconda3/bin/ray", line 8, in <module>
    sys.exit(main())
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/scripts/scripts.py", line 2498, in main
    return cli()
  File "/home/ray/anaconda3/lib/python3.8/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ray/anaconda3/lib/python3.8/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ray/anaconda3/lib/python3.8/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/dashboard/modules/job/cli_utils.py", line 54, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/cli_logger.py", line 856, in wrapper
    return f(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/dashboard/modules/job/cli.py", line 262, in submit
    job_id = client.submit_job(
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/dashboard/modules/job/sdk.py", line 231, in submit_job
    self._raise_error(r)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/dashboard/modules/dashboard_sdk.py", line 283, in _raise_error
    raise RuntimeError(
RuntimeError: Request failed with status code 500: Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/dashboard/modules/job/job_head.py", line 287, in submit_job
    resp = await job_agent_client.submit_job_internal(submit_request)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/dashboard/modules/job/job_head.py", line 80, in submit_job_internal
    await self._raise_error(resp)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/dashboard/modules/job/job_head.py", line 68, in _raise_error
    raise RuntimeError(f"Request failed with status code {status}: {error_text}.")
RuntimeError: Request failed with status code 400: Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/dashboard/modules/job/job_agent.py", line 45, in submit_job
    submission_id = await self.get_job_manager().submit_job(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/dashboard/modules/job/job_manager.py", line 903, in submit_job
    raise ValueError(
ValueError: Job with submission_id failing-sample-job-g5fkm already exists. Please use a different submission_id.
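
The retry failure in the traceback can be illustrated with a small simulation. This is a sketch, not Ray or KubeRay code: `FakeJobManager` and `run_with_backoff` are hypothetical names, and the model only covers what this report describes (a submission ID can be registered once, and the Kubernetes Job makes one initial attempt plus `backoffLimit` retries). As a simplification, the failure of the job itself is raised directly from `submit_job` on the first attempt.

```python
class FakeJobManager:
    """Hypothetical stand-in for the Ray job manager; not the real API."""

    def __init__(self):
        self._seen = set()

    def submit_job(self, submission_id):
        # Mirror the ValueError from the traceback: an ID can be used only once.
        if submission_id in self._seen:
            raise ValueError(
                f"Job with submission_id {submission_id} already exists. "
                "Please use a different submission_id."
            )
        self._seen.add(submission_id)
        # Simplification: the job in this report is guaranteed to fail.
        raise RuntimeError("entrypoint failed")


def run_with_backoff(manager, submission_id, backoff_limit=6):
    """Simulate the submitter being retried; returns one error name per attempt."""
    errors = []
    # One initial attempt plus backoff_limit retries.
    for _ in range(backoff_limit + 1):
        try:
            manager.submit_job(submission_id)
        except (RuntimeError, ValueError) as exc:
            errors.append(type(exc).__name__)
    return errors


errors = run_with_backoff(FakeJobManager(), "failing-sample-job-g5fkm")
print(len(errors), errors[0], set(errors[1:]))
```

Under these assumptions only the first attempt reaches the job itself. Every retry reuses the same submission ID and is rejected, which is consistent with the `failed: 7` count reported further down.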

The expected behaviour is:

The RayJob would have a failed status
Only one batchv1.Job would be submitted and not retried.
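
One way to get the "submitted once, not retried" behaviour is to set `backoffLimit` to 0 on the submitter Job. The sketch below builds such a manifest as a plain dictionary. It is an assumption about a possible configuration, not a description of what the operator does or of how the eventual fix works. `build_submitter_job`, the container name, and the image are placeholders.

```python
import json


def build_submitter_job(name, namespace, backoff_limit=0):
    """Return a batch/v1 Job manifest as a plain dict (illustrative only)."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # 0 means the Job is marked failed after the first failed pod.
            "backoffLimit": backoff_limit,
            "template": {
                "spec": {
                    # Do not restart the container in place either.
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "submitter",  # hypothetical container name
                            "image": "example/submitter:latest",  # placeholder image
                        }
                    ],
                }
            },
        },
    }


manifest = build_submitter_job("failing-sample-job", "ray-system")
print(json.dumps(manifest, indent=2))
```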

Reproduction script

1. Create a kind cluster.
2. Deploy the KubeRay operator.
3. Create the namespace ray-system: kubectl create namespace ray-system
4. Create the Ray cluster using the attached sample: kubectl apply -f cluster.yaml.txt
5. Submit the Ray job using the attached sample Ray job: kubectl apply -f fail_fast_job.yaml.txt

After 20 minutes or so the batchv1.Job status looks like this:

status:
  conditions:
  - lastProbeTime: "2023-10-11T08:27:32Z"
    lastTransitionTime: "2023-10-11T08:27:32Z"
    message: Job has reached the specified backoff limit
    reason: BackoffLimitExceeded
    status: "True"
    type: Failed
  failed: 7
  startTime: "2023-10-11T08:07:29Z"

The RayJob status is not updated.
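
For reference, the failure is detectable from the batchv1.Job status alone. The sketch below is a hypothetical helper (`job_has_failed` is not a KubeRay function) that scans the conditions for a `Failed` condition with status `"True"`, using the status values reported above. A controller could use a check like this to mark the RayJob as failed; whether the operator does so is not established here.

```python
def job_has_failed(status):
    """Return (failed, reason) for a batch/v1 Job status given as a dict."""
    for cond in status.get("conditions", []):
        # Condition status is the string "True", not a boolean.
        if cond.get("type") == "Failed" and cond.get("status") == "True":
            return True, cond.get("reason")
    return False, None


# Status copied from the batchv1.Job in this report.
status = {
    "conditions": [
        {
            "lastProbeTime": "2023-10-11T08:27:32Z",
            "lastTransitionTime": "2023-10-11T08:27:32Z",
            "message": "Job has reached the specified backoff limit",
            "reason": "BackoffLimitExceeded",
            "status": "True",
            "type": "Failed",
        }
    ],
    "failed": 7,
    "startTime": "2023-10-11T08:07:29Z",
}

failed, reason = job_has_failed(status)
print(failed, reason)
```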

Deleting the RayJob does not delete the batchv1.Job. One odd thing I noticed in the batchv1.Job YAML is the owner references: they point to the RayCluster, but I suspect they should point to the RayJob instance.

apiVersion: batch/v1
kind: Job
metadata:
  creationTimestamp: "2023-10-11T08:07:29Z"
  generation: 1
  labels:
    controller-uid: 1d40a907-916f-426c-aa60-9c867ac1e389
    job-name: failing-sample-job
  name: failing-sample-job
  namespace: ray-system
  ownerReferences:
  - apiVersion: ray.io/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: RayCluster
    name: sample-cluster
    uid: 9e997f94-8574-40a1-b12d-b44a15dfaef3
  resourceVersion: "69390"
  uid: 1d40a907-916f-426c-aa60-9c867ac1e389
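
The ownership shown above would explain why deleting the RayJob leaves the batchv1.Job behind: Kubernetes garbage collection removes a dependent when its owner is deleted, and here the owner is the RayCluster. A minimal sketch for checking which object controls a manifest follows. `controller_owner` is a hypothetical helper, and the manifest is trimmed to the fields relevant to ownership.

```python
def controller_owner(manifest):
    """Return (kind, name) of the controlling owner reference, or None."""
    refs = manifest.get("metadata", {}).get("ownerReferences", [])
    for ref in refs:
        # At most one owner reference may have controller set to true.
        if ref.get("controller"):
            return ref.get("kind"), ref.get("name")
    return None


# Trimmed copy of the batchv1.Job metadata from this report.
job = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {
        "name": "failing-sample-job",
        "namespace": "ray-system",
        "ownerReferences": [
            {
                "apiVersion": "ray.io/v1alpha1",
                "blockOwnerDeletion": True,
                "controller": True,
                "kind": "RayCluster",
                "name": "sample-cluster",
                "uid": "9e997f94-8574-40a1-b12d-b44a15dfaef3",
            }
        ],
    },
}

owner = controller_owner(job)
print(owner)
```

For the Job in this report the controlling owner is the RayCluster `sample-cluster`, not a RayJob.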

Anything else

This might be related to #1478 and #1233.

Are you willing to submit a PR?

Yes I am willing to submit a PR!


astefanutti · 2023-10-11T17:44:04Z

@kevin85421 could you please assign it to me?

z103cb added the bug label Oct 11, 2023

kevin85421 assigned astefanutti Oct 11, 2023

kevin85421 added the rayjob label Oct 13, 2023

astefanutti mentioned this issue Oct 17, 2023 in [RayJob] Fix RayJob status reconciliation #1539 (merged)

kevin85421 closed this as completed in #1539 Oct 24, 2023
