[Bug] KubeRay Job CRD doesn't get its status updated to "Failure" #1480

Closed
1 of 2 tasks
z103cb opened this issue Oct 11, 2023 · 1 comment · Fixed by #1539
Labels: bug, rayjob

z103cb commented Oct 11, 2023

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

When submitting a KubeRay job (RayJob CRD) that is guaranteed to fail, the RayJob status is never updated to JobStatusFailed, and the Status object is not created.

The spawned batchv1.Job has its BackoffLimit set to the default value of 6, which causes it to be retried 6 more times. Each retry fails with this error message:

2023-10-11 01:08:19,318    INFO cli.py:36 -- Job submission server address: http://sample-cluster-head-svc.ray-system.svc.cluster.local:8265
Traceback (most recent call last):
  File "/home/ray/anaconda3/bin/ray", line 8, in <module>
    sys.exit(main())
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/scripts/scripts.py", line 2498, in main
    return cli()
  File "/home/ray/anaconda3/lib/python3.8/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ray/anaconda3/lib/python3.8/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ray/anaconda3/lib/python3.8/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/dashboard/modules/job/cli_utils.py", line 54, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/cli_logger.py", line 856, in wrapper
    return f(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/dashboard/modules/job/cli.py", line 262, in submit
    job_id = client.submit_job(
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/dashboard/modules/job/sdk.py", line 231, in submit_job
    self._raise_error(r)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/dashboard/modules/dashboard_sdk.py", line 283, in _raise_error
    raise RuntimeError(
RuntimeError: Request failed with status code 500: Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/dashboard/modules/job/job_head.py", line 287, in submit_job
    resp = await job_agent_client.submit_job_internal(submit_request)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/dashboard/modules/job/job_head.py", line 80, in submit_job_internal
    await self._raise_error(resp)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/dashboard/modules/job/job_head.py", line 68, in _raise_error
    raise RuntimeError(f"Request failed with status code {status}: {error_text}.")
RuntimeError: Request failed with status code 400: Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/dashboard/modules/job/job_agent.py", line 45, in submit_job
    submission_id = await self.get_job_manager().submit_job(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/dashboard/modules/job/job_manager.py", line 903, in submit_job
    raise ValueError(
ValueError: Job with submission_id failing-sample-job-g5fkm already exists. Please use a different submission_id.
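
The final ValueError suggests why every retry fails the same way: each retried pod appears to resubmit with the same submission_id (failing-sample-job-g5fkm), which Ray has already recorded for the first, failed attempt. A toy model of that behaviour follows. ToyJobManager, run_with_retries and the fail.py entrypoint are hypothetical names for illustration; this is not Ray's actual job manager code.

```python
class ToyJobManager:
    """Hypothetical stand-in for a job manager that rejects reused submission IDs."""

    def __init__(self):
        self._jobs = {}  # submission_id -> status

    def submit_job(self, submission_id, entrypoint):
        if submission_id in self._jobs:
            raise ValueError(
                f"Job with submission_id {submission_id} already exists. "
                "Please use a different submission_id."
            )
        # Record the job; in this toy model every job fails immediately.
        self._jobs[submission_id] = "FAILED"
        return submission_id


def run_with_retries(manager, submission_id, retries):
    """Submit once, then retry with the same ID, collecting the error messages."""
    errors = []
    for _ in range(1 + retries):
        try:
            manager.submit_job(submission_id, "python fail.py")
        except ValueError as exc:
            errors.append(str(exc))
    return errors


errors = run_with_retries(ToyJobManager(), "failing-sample-job-g5fkm", retries=6)
print(len(errors))
```

In this model the first submission is accepted (and fails as a job), while each of the 6 retries is rejected for reusing the ID, which matches the pattern described above.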

The expected behaviour is:

  1. The RayJob would have a failed status
  2. Only one batchv1.Job would be submitted and not retried.
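
At the Kubernetes level, one way to get the second expectation is to set spec.backoffLimit to 0 on the submitter Job so that a failed pod is not retried. A minimal sketch of such a manifest built as a Python dict follows. The helper build_submitter_job, the container name, the image and the fail.py command are placeholders, and this is not necessarily how KubeRay builds the Job or how the eventual fix works.

```python
import json


def build_submitter_job(name, namespace, command, backoff_limit=0):
    """Return a batch/v1 Job manifest whose pod is not retried on failure."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # 0 means: mark the Job as failed after the first failed pod.
            "backoffLimit": backoff_limit,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "submitter",        # placeholder
                            "image": "rayproject/ray",  # placeholder
                            "command": command,
                        }
                    ],
                }
            },
        },
    }


job = build_submitter_job(
    "failing-sample-job",
    "ray-system",
    ["ray", "job", "submit", "--", "python", "fail.py"],  # placeholder command
)
print(json.dumps(job, indent=2))
```

kubectl accepts JSON manifests, so the printed output could be saved to a file and passed to kubectl apply -f for experimentation.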

Reproduction script

  1. Create kind cluster
  2. Deploy KubeRay operator
  3. Create namespace ray-system: kubectl create namespace ray-system
  4. Create ray cluster using the attached sample: kubectl apply -f cluster.yaml.txt
  5. Submit ray job using the attached sample ray job: kubectl apply -f fail_fast_job.yaml.txt

After 20 minutes or so the batchv1.Job status looks like this:

status:
  conditions:
  - lastProbeTime: "2023-10-11T08:27:32Z"
    lastTransitionTime: "2023-10-11T08:27:32Z"
    message: Job has reached the specified backoff limit
    reason: BackoffLimitExceeded
    status: "True"
    type: Failed
  failed: 7
  startTime: "2023-10-11T08:07:29Z"

The RayJob status is not updated.
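
A controller watching the submitter Job could detect this terminal state by looking for a condition of type Failed whose status is "True". A minimal sketch over the status above as a plain dict follows. The helper job_has_failed is a hypothetical name; it illustrates the check only and is not the KubeRay operator's reconcile logic.

```python
def job_has_failed(status):
    """Return (failed, reason) for a batch/v1 Job status given as a dict."""
    for cond in status.get("conditions", []):
        # Condition status values are the strings "True", "False" or "Unknown".
        if cond.get("type") == "Failed" and cond.get("status") == "True":
            return True, cond.get("reason", "")
    return False, ""


# Status copied from the batchv1.Job shown in this report.
status = {
    "conditions": [
        {
            "lastProbeTime": "2023-10-11T08:27:32Z",
            "lastTransitionTime": "2023-10-11T08:27:32Z",
            "message": "Job has reached the specified backoff limit",
            "reason": "BackoffLimitExceeded",
            "status": "True",
            "type": "Failed",
        }
    ],
    "failed": 7,
    "startTime": "2023-10-11T08:07:29Z",
}

failed, reason = job_has_failed(status)
print(failed, reason)
```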

Deleting the RayJob does not delete the batchv1.Job. One odd thing I noticed in the batchv1.Job YAML is the ownerReferences: they point to the RayCluster, and I suspect they should point to the RayJob instance instead.

apiVersion: batch/v1
kind: Job
metadata:
  creationTimestamp: "2023-10-11T08:07:29Z"
  generation: 1
  labels:
    controller-uid: 1d40a907-916f-426c-aa60-9c867ac1e389
    job-name: failing-sample-job
  name: failing-sample-job
  namespace: ray-system
  ownerReferences:
  - apiVersion: ray.io/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: RayCluster
    name: sample-cluster
    uid: 9e997f94-8574-40a1-b12d-b44a15dfaef3
  resourceVersion: "69390"
  uid: 1d40a907-916f-426c-aa60-9c867ac1e389
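
Kubernetes garbage collection follows ownerReferences, so a Job owned by the RayCluster is not removed when the RayJob is deleted. A small sketch that reads the controlling owner out of the metadata above follows. The helper controller_owner is a hypothetical name used only for illustration.

```python
def controller_owner(metadata):
    """Return (kind, name) of the controlling owner reference, or None."""
    for ref in metadata.get("ownerReferences", []):
        if ref.get("controller"):
            return ref.get("kind"), ref.get("name")
    return None


# Metadata copied from the batchv1.Job shown in this report.
metadata = {
    "name": "failing-sample-job",
    "namespace": "ray-system",
    "ownerReferences": [
        {
            "apiVersion": "ray.io/v1alpha1",
            "blockOwnerDeletion": True,
            "controller": True,
            "kind": "RayCluster",
            "name": "sample-cluster",
            "uid": "9e997f94-8574-40a1-b12d-b44a15dfaef3",
        }
    ],
}

owner = controller_owner(metadata)
print(owner)
```

For cascading deletion to work as expected here, the controlling owner would presumably need to be of kind RayJob rather than RayCluster.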

Anything else

This might be related to #1478 and #1233.

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
z103cb added the bug label Oct 11, 2023
@astefanutti

@kevin85421 could you please assign it to me?
