-
Notifications
You must be signed in to change notification settings - Fork 432
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Test] Add e2e test for sample RayJob yaml on kind #935
[Test] Add e2e test for sample RayJob yaml on kind #935
Conversation
Signed-off-by: Archit Kulkarni <[email protected]>
Signed-off-by: Archit Kulkarni <[email protected]>
Signed-off-by: Archit Kulkarni <[email protected]>
@kevin85421 If buildkite is ready, I can try to run these tests on Buildkite instead of Github actions! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this contribution! Do you enable pylint
in your IDE?
show_cluster_info(self.namespace) | ||
raise Exception("RayJobAddCREvent wait() timeout") | ||
|
||
def clean_up(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will RayJob automatically clean up the Ray Pods after it succeeds?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By default, the terminate cluster on completion flag is set to False, though we may change this behavior in the future. (I'm not sure it's intentional that the default is False, and it doesn't seem to be documented at the moment)
The standard runner in free GitHub Actions plan only has two CPUs. Hence, the test may be failed due to insufficient CPU. In that case, you can specify CPU in resource request/limit for both head/worker Pods. See https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-cluster.external-redis.yaml for more details. The YAML is tested in GitHub Actions at this moment. |
Ah sorry, didn't think about linting. If |
Co-authored-by: Kai-Hsun Chen <[email protected]> Signed-off-by: Archit Kulkarni <[email protected]>
…lkarni/kuberay into test-rayjob-sample-yaml
Signed-off-by: Archit Kulkarni <[email protected]>
Signed-off-by: Archit Kulkarni <[email protected]>
Signed-off-by: Archit Kulkarni <[email protected]>
Signed-off-by: Archit Kulkarni <[email protected]>
…test-rayjob-sample-yaml
Signed-off-by: Archit Kulkarni <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you mind running 10 times consecutively to check the flakiness of this test and add the result to the PR description? Thanks!
@@ -17,12 +17,18 @@ spec: | |||
rayClusterSpec: | |||
rayVersion: '2.3.0' # should match the Ray version in the image of the containers | |||
# Ray head pod template | |||
autoscalerOptions: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we decide to add autoscalerOptions
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, that was how I interpreted your message "In that case, you can specify CPU in resource request/limit for both head/worker Pods. See https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-cluster.external-redis.yaml for more details" but I wasn't 100% sure. Is there a better way to do it?
Also, should I move it to Buildkite CI, or is that part not ready yet?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case, you can specify CPU in resource request/limit for both head/worker Pods.
kuberay/ray-operator/config/samples/ray-cluster.external-redis.yaml
Lines 153 to 157 in f747a99
resources: | |
limits: | |
cpu: "1" | |
requests: | |
cpu: "200m" |
Also, should I move it to Buildkite CI, or is that part not ready yet?
We should move the test to Buildkite CI, but we can merge this PR as soon as possible, and open a new PR to move it to Buildkite CI. Does it make sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, makes sense
tests/framework/prototype.py
Outdated
pod_exec_command(headpod_name, cr_namespace, f"ray job status {JOB_ID}") | ||
logger.info("Checking RayJob status succeeded") | ||
# Check that "succeeded" is in the output of the command. | ||
assert "succeeded" in shell_subprocess_run( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Checking log message is a bit risky (example: #617).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah makes sense... let me think of a more programmatic approach, or at the very least match on a longer and more precise string
cc @Yicheng-Lu-llll for review |
When I run the test repeatedly (even just with It might be related to ray-project/ray#31356, I'll try to prioritize and fix this one. It might be more important for the KubeRay use case, since I remember we have some issue where the same job is submitted twice by KubeRay. |
It failed 5/10 times with the above error. It seems the |
…test-rayjob-sample-yaml
…test-rayjob-sample-yaml
…test-rayjob-sample-yaml
…test-rayjob-sample-yaml Signed-off-by: Archit Kulkarni <[email protected]>
Signed-off-by: Archit Kulkarni <[email protected]>
@kevin85421 my fault, forgot to enable the new test, no need to review it now. I'll let you know when it's ready for review |
Signed-off-by: Archit Kulkarni <[email protected]>
…lkarni/kuberay into test-rayjob-sample-yaml
This reverts commit 23d5644.
Signed-off-by: Archit Kulkarni <[email protected]>
Hi @kevin85421 , I think the PR is ready to be merged now after your next review.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Add e2e test for sample RayJob yaml on kind
Why are these changes needed?
Adds a test for the
RayJob
sample YAML to the Github Actions CI, similar to the existingRayCluster
tests.There is now a lot of repeated code in the
RayJob
,RayService
andRayCluster
tests, we should refactor this in the future.Currently the test doesn't use any Rules, it just waits for the RayJob status to be SUCCEEDED. (This status originates from Ray.)
Passed 10/10 times locally.
Related issue number
Checks