[Test] Add e2e test for sample RayJob yaml on kind #935

Merged: 20 commits into ray-project:master on Apr 17, 2023

Conversation

@architkulkarni (Contributor) commented on Mar 1, 2023:

Why are these changes needed?

Adds a test for the RayJob sample YAML to the GitHub Actions CI, similar to the existing RayCluster tests.

There is now a lot of repeated code in the RayJob, RayService, and RayCluster tests; we should refactor this in the future.

Currently the test doesn't use any Rules; it just waits for the RayJob status to be SUCCEEDED. (This status originates from Ray.)
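
For illustration only, here is a minimal sketch of that kind of wait loop, not the PR's actual test code. It assumes the sample RayJob is named rayjob-sample and that the CR exposes its job status under .status.jobStatus:

```python
# Hedged sketch: poll the RayJob custom resource until its job status reaches
# SUCCEEDED or a timeout expires. The CR name, namespace, field path, and
# timeout are assumptions, not values taken from this PR.
import subprocess
import time

def wait_for_rayjob_success(name="rayjob-sample", namespace="default", timeout_s=600):
    start = time.time()
    while time.time() - start < timeout_s:
        result = subprocess.run(
            ["kubectl", "get", "rayjob", name, "-n", namespace,
             "-o", "jsonpath={.status.jobStatus}"],
            capture_output=True, text=True, check=False,
        )
        if result.stdout.strip() == "SUCCEEDED":
            return
        time.sleep(5)
    raise TimeoutError(f"RayJob {name} did not reach SUCCEEDED within {timeout_s}s")
```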

Passed 10/10 times locally.

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@architkulkarni (Contributor, Author):

@kevin85421 If Buildkite is ready, I can try to run these tests on Buildkite instead of GitHub Actions!

@kevin85421 (Member) left a comment:

Thank you for this contribution! Do you enable pylint in your IDE?

tests/framework/prototype.py: three resolved review comments (outdated)

Review thread on the following excerpt:

```python
        show_cluster_info(self.namespace)
        raise Exception("RayJobAddCREvent wait() timeout")

    def clean_up(self):
```
@kevin85421 (Member):

Will RayJob automatically clean up the Ray Pods after it succeeds?

@architkulkarni (Contributor, Author):

By default, the terminate cluster on completion flag is set to False, though we may change this behavior in the future. (I'm not sure it's intentional that the default is False, and it doesn't seem to be documented at the moment.)

@kevin85421 (Member):

The standard runner in the free GitHub Actions plan only has two CPUs, so the test may fail due to insufficient CPU. In that case, you can specify CPU in the resource requests/limits for both the head and worker Pods. See https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-cluster.external-redis.yaml for more details; that YAML is currently tested in GitHub Actions.

@architkulkarni (Contributor, Author):

> Thank you for this contribution! Do you enable pylint in your IDE?

Ah sorry, I didn't think about linting. If pylint is used for this project, shall I add it to https://github.com/ray-project/kuberay/blob/master/CONTRIBUTING.md? I could also open an issue to add the Python linter to CI.

@kevin85421 (Member) left a comment:

Would you mind running the test 10 times consecutively to check its flakiness and adding the result to the PR description? Thanks!
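
For local reference, here is a minimal sketch of one way to run such a repeated check; it is not part of this PR, and the test module path is a hypothetical placeholder:

```python
# Hedged sketch: run the e2e test 10 times in a row and stop at the first
# failure. "tests/test_sample_rayjob_yamls.py" is an assumed path, not
# necessarily the file added in this PR.
import subprocess

for i in range(10):
    print(f"Run {i + 1}/10")
    subprocess.run(
        ["python", "-m", "pytest", "tests/test_sample_rayjob_yamls.py", "-x"],
        check=True,  # raise on the first failing run
    )
```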

Review thread on the sample YAML diff:

```
@@ -17,12 +17,18 @@ spec:
  rayClusterSpec:
    rayVersion: '2.3.0' # should match the Ray version in the image of the containers
    # Ray head pod template
    autoscalerOptions:
```
@kevin85421 (Member):

Why did we decide to add autoscalerOptions?

@architkulkarni (Contributor, Author):

Oh, that was how I interpreted your message "In that case, you can specify CPU in resource request/limit for both head/worker Pods. See https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-cluster.external-redis.yaml for more details" but I wasn't 100% sure. Is there a better way to do it?

Also, should I move it to Buildkite CI, or is that part not ready yet?

@kevin85421 (Member):

> In that case, you can specify CPU in resource request/limit for both head/worker Pods.

```yaml
resources:
  limits:
    cpu: "1"
  requests:
    cpu: "200m"
```

> Also, should I move it to Buildkite CI, or is that part not ready yet?

We should move the test to Buildkite CI, but we can merge this PR as soon as possible and open a new PR to move it later. Does that make sense?

@architkulkarni (Contributor, Author):

Yup, makes sense.

Review thread on the following excerpt (cut off in the review view):

```python
    pod_exec_command(headpod_name, cr_namespace, f"ray job status {JOB_ID}")
    logger.info("Checking RayJob status succeeded")
    # Check that "succeeded" is in the output of the command.
    assert "succeeded" in shell_subprocess_run(
```
@kevin85421 (Member):

Checking the log message is a bit risky (example: #617).

@architkulkarni (Contributor, Author):

Ah, makes sense... let me think of a more programmatic approach, or at the very least match on a longer and more precise string.
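
For reference, one more programmatic option (not what this PR ended up using) is the Ray Job Submission SDK. The sketch below assumes the head pod's dashboard port 8265 has been port-forwarded to localhost and that the job's submission ID is known; both are assumptions:

```python
# Hedged sketch: query the job status via the Ray Job Submission SDK instead
# of grepping CLI output. Assumes the dashboard is reachable on localhost:8265
# (e.g. via `kubectl port-forward`) and that JOB_ID is the job's submission ID.
from ray.job_submission import JobSubmissionClient, JobStatus

JOB_ID = "rayjob-test-job"  # hypothetical submission ID; the test defines its own
client = JobSubmissionClient("http://127.0.0.1:8265")
status = client.get_job_status(JOB_ID)
assert status == JobStatus.SUCCEEDED, f"unexpected job status: {status}"
```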

tests/framework/utils.py: one resolved review comment
@kevin85421 (Member):

cc @Yicheng-Lu-llll for review

@architkulkarni (Contributor, Author):

When I run the test repeatedly (even just with EasyJobRule()), it fails pretty often with "Failed to start Job Supervisor actor: The name _ray_internal_job_actor_rayjob-sample-zwtkn (namespace=SUPERVISOR_ACTOR_RAY_NAMESPACE) is already taken. Please use a different name or get the existing actor using ray.get_actor('_ray_internal_job_actor_rayjob-sample-zwtkn', namespace='SUPERVISOR_ACTOR_RAY_NAMESPACE')."

It might be related to ray-project/ray#31356; I'll try to prioritize and fix that one. It might be more important for the KubeRay use case, since I remember we have an issue where the same job is submitted twice by KubeRay.

@architkulkarni (Contributor, Author):

It failed 5/10 times with the above error. It seems the RayJob feature itself is flaky, not the test code. I think I should fix #756 first (using your idea of running ray job submit in the container commands) before merging this PR, to avoid making CI flaky.

@architkulkarni (Contributor, Author):

@kevin85421 My fault, I forgot to enable the new test, so there's no need to review it now. I'll let you know when it's ready for review.

@architkulkarni (Contributor, Author):

Hi @kevin85421, I think the PR is now ready to be merged after your next review.

  • Passed 10/10 times locally
  • Passed in GitHub Actions CI
  • I removed the RayJobSuccessRule. We can't actually get the status of the job from the custom resource using ray job status, because we currently append random letters to the sample job name internally. If there's another way to get the status, we can add it later, or we can wait until after we refactor RayJob to be a k8s Job.
  • The "SUCCEEDED" status is checked in the RayJobAddCREvent itself. (The event is not considered "converged" until the job has succeeded.)

@kevin85421 (Member) left a comment:

LGTM

@kevin85421 merged commit ef290e0 into ray-project:master on Apr 17, 2023.
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023