[Test] Add e2e test for sample RayJob yaml on kind #935

Merged: 20 commits into ray-project:master on Apr 17, 2023

Conversation

@architkulkarni (Contributor) commented on Mar 1, 2023:

Why are these changes needed?

Adds a test for the RayJob sample YAML to the GitHub Actions CI, similar to the existing RayCluster tests.

There is now a lot of repeated code in the RayJob, RayService, and RayCluster tests; we should refactor this in the future.

Currently the test doesn't use any Rules; it just waits for the RayJob status to be SUCCEEDED. (This status originates from Ray.)
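
For illustration only, here is a minimal sketch of that kind of wait loop, not the PR's actual test code. It assumes the sample RayJob is named rayjob-sample and that the CR exposes its job status under .status.jobStatus:

```python
# Hedged sketch: poll the RayJob custom resource until its job status reaches
# SUCCEEDED or a timeout expires. The CR name, namespace, field path, and
# timeout are assumptions, not values taken from this PR.
import subprocess
import time

def wait_for_rayjob_success(name="rayjob-sample", namespace="default", timeout_s=600):
    start = time.time()
    while time.time() - start < timeout_s:
        result = subprocess.run(
            ["kubectl", "get", "rayjob", name, "-n", namespace,
             "-o", "jsonpath={.status.jobStatus}"],
            capture_output=True, text=True, check=False,
        )
        if result.stdout.strip() == "SUCCEEDED":
            return
        time.sleep(5)
    raise TimeoutError(f"RayJob {name} did not reach SUCCEEDED within {timeout_s}s")
```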

Passed 10/10 times locally.

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@architkulkarni (Contributor, Author):

@kevin85421 If Buildkite is ready, I can try to run these tests on Buildkite instead of GitHub Actions!

@kevin85421 (Member) left a comment:

Thank you for this contribution! Do you enable pylint in your IDE?

tests/framework/prototype.py: three resolved review comments (outdated)

Review thread on the following excerpt:

```python
        show_cluster_info(self.namespace)
        raise Exception("RayJobAddCREvent wait() timeout")

    def clean_up(self):
```
@kevin85421 (Member):

Will RayJob automatically clean up the Ray Pods after it succeeds?

@architkulkarni (Contributor, Author):

By default, the terminate cluster on completion flag is set to False, though we may change this behavior in the future. (I'm not sure it's intentional that the default is False, and it doesn't seem to be documented at the moment.)

@kevin85421 (Member):

The standard runner in the free GitHub Actions plan only has two CPUs, so the test may fail due to insufficient CPU. In that case, you can specify CPU in the resource requests/limits for both the head and worker Pods. See https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-cluster.external-redis.yaml for more details; that YAML is currently tested in GitHub Actions.

@architkulkarni (Contributor, Author):

> Thank you for this contribution! Do you enable pylint in your IDE?

Ah sorry, I didn't think about linting. If pylint is used for this project, shall I add it to https://github.com/ray-project/kuberay/blob/master/CONTRIBUTING.md? I could also open an issue to add the Python linter to CI.

@kevin85421 (Member) left a comment:

Would you mind running the test 10 times consecutively to check its flakiness and adding the result to the PR description? Thanks!
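
For local reference, here is a minimal sketch of one way to run such a repeated check; it is not part of this PR, and the test module path is a hypothetical placeholder:

```python
# Hedged sketch: run the e2e test 10 times in a row and stop at the first
# failure. "tests/test_sample_rayjob_yamls.py" is an assumed path, not
# necessarily the file added in this PR.
import subprocess

for i in range(10):
    print(f"Run {i + 1}/10")
    subprocess.run(
        ["python", "-m", "pytest", "tests/test_sample_rayjob_yamls.py", "-x"],
        check=True,  # raise on the first failing run
    )
```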

Review thread on the sample YAML diff:

```
@@ -17,12 +17,18 @@ spec:
  rayClusterSpec:
    rayVersion: '2.3.0' # should match the Ray version in the image of the containers
    # Ray head pod template
    autoscalerOptions:
```
@kevin85421 (Member):

Why did we decide to add autoscalerOptions?

@architkulkarni (Contributor, Author):

Oh, that was how I interpreted your message "In that case, you can specify CPU in resource request/limit for both head/worker Pods. See https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-cluster.external-redis.yaml for more details" but I wasn't 100% sure. Is there a better way to do it?

Also, should I move it to Buildkite CI, or is that part not ready yet?

@kevin85421 (Member):

> In that case, you can specify CPU in resource request/limit for both head/worker Pods.

```yaml
resources:
  limits:
    cpu: "1"
  requests:
    cpu: "200m"
```

> Also, should I move it to Buildkite CI, or is that part not ready yet?

We should move the test to Buildkite CI, but we can merge this PR as soon as possible and open a new PR to move it later. Does that make sense?

@architkulkarni (Contributor, Author):

Yup, makes sense.

Review thread on the following excerpt (cut off in the review view):

```python
    pod_exec_command(headpod_name, cr_namespace, f"ray job status {JOB_ID}")
    logger.info("Checking RayJob status succeeded")
    # Check that "succeeded" is in the output of the command.
    assert "succeeded" in shell_subprocess_run(
```
@kevin85421 (Member):

Checking the log message is a bit risky (example: #617).

@architkulkarni (Contributor, Author):

Ah, makes sense... let me think of a more programmatic approach, or at the very least match on a longer and more precise string.
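
For reference, one more programmatic option (not what this PR ended up using) is the Ray Job Submission SDK. The sketch below assumes the head pod's dashboard port 8265 has been port-forwarded to localhost and that the job's submission ID is known; both are assumptions:

```python
# Hedged sketch: query the job status via the Ray Job Submission SDK instead
# of grepping CLI output. Assumes the dashboard is reachable on localhost:8265
# (e.g. via `kubectl port-forward`) and that JOB_ID is the job's submission ID.
from ray.job_submission import JobSubmissionClient, JobStatus

JOB_ID = "rayjob-test-job"  # hypothetical submission ID; the test defines its own
client = JobSubmissionClient("http://127.0.0.1:8265")
status = client.get_job_status(JOB_ID)
assert status == JobStatus.SUCCEEDED, f"unexpected job status: {status}"
```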

tests/framework/utils.py: one resolved review comment
@kevin85421 (Member):

cc @Yicheng-Lu-llll for review

@architkulkarni (Contributor, Author):

When I run the test repeatedly (even just with EasyJobRule()), it fails pretty often with "Failed to start Job Supervisor actor: The name _ray_internal_job_actor_rayjob-sample-zwtkn (namespace=SUPERVISOR_ACTOR_RAY_NAMESPACE) is already taken. Please use a different name or get the existing actor using ray.get_actor('_ray_internal_job_actor_rayjob-sample-zwtkn', namespace='SUPERVISOR_ACTOR_RAY_NAMESPACE')."

It might be related to ray-project/ray#31356; I'll try to prioritize and fix that one. It might be more important for the KubeRay use case, since I remember we have an issue where the same job is submitted twice by KubeRay.

@architkulkarni (Contributor, Author):

It failed 5/10 times with the above error. It seems the RayJob feature itself is flaky, not the test code. I think I should fix #756 first (using your idea of running ray job submit in the container commands) before merging this PR, to avoid making CI flaky.

@architkulkarni (Contributor, Author):

@kevin85421 My fault, I forgot to enable the new test, so there's no need to review it now. I'll let you know when it's ready for review.

@architkulkarni (Contributor, Author):

Hi @kevin85421, I think the PR is now ready to be merged after your next review.

  • Passed 10/10 times locally
  • Passed in GitHub Actions CI
  • I removed the RayJobSuccessRule. We can't actually get the status of the job from the custom resource using ray job status, because we currently append random letters to the sample job name internally. If there's another way to get the status, we can add it later, or we can wait until after we refactor RayJob to be a k8s Job.
  • The "SUCCEEDED" status is checked in the RayJobAddCREvent itself. (The event is not considered "converged" until the job has succeeded.)

@kevin85421 (Member) left a comment:

LGTM

@kevin85421 merged commit ef290e0 into ray-project:master on Apr 17, 2023.
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023