Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RayJob] Add default CPU and memory for job submitter pod #1319

Merged
merged 7 commits into from
Oct 5, 2023

Conversation

architkulkarni
Copy link
Contributor

@architkulkarni architkulkarni commented Aug 11, 2023

Adds a default CPU and memory resource requirement for the pod that submits the RayJob.

image
Here's the memory and CPU usage of the job submitter pod with a job entrypoint script that prints "hello world" in a tight loop from Python. I tested it on a local kind cluster. Since the job submitter pod just calls ray job submit which runs the job remotely and streams the job output to stdout on the submitter pod, this kind of log streaming job should be the most stressful from the perspective of the job submitter pod.

Based on this, we set the requests to 500m CPU and 200MiB memory, and set the limits to 1 CPU and 1GiB memory.

Why are these changes needed?

It's good to provide reasonable defaults to reduce the burden on the user.

Related issue number

Closes #1198

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@@ -112,6 +113,16 @@ func GetDefaultSubmitterTemplate(rayJobInstance *rayv1alpha1.RayJob) v1.PodTempl
{
Name: "ray-job-submitter",
Image: image,
Resources: v1.ResourceRequirements{
Limits: v1.ResourceList{
v1.ResourceCPU: resource.MustParse("1"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we select 1 CPU and 1 GB of memory rather than other values?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a guess. Let me try to benchmark it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated using the benchmark result.

@@ -67,23 +67,3 @@ runs:
run: |
python tests/test_sample_raycluster_yamls.py
shell: bash

- name: Run tests for sample RayJob YAML files with the nightly operator.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we remove this one? Has this one been moved to Buildkite?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's merge this first: #1321 it adds it to buildkite.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It appears this has already been removed in some other PR.

@kevin85421 kevin85421 added the 1.0 label Sep 22, 2023
@kevin85421
Copy link
Member

This is necessary for Flyte. Let's merge this PR before v1.0.0-rc.1 cc @pingsutw

@architkulkarni
Copy link
Contributor Author

Got it, I'll prioritize the benchmarking for figure out how much CPU and memory to use. Until then, I guess Flyte can configure the submitterpodtemplate themselves, because we expose this in the RayJob yaml.

@kevin85421
Copy link
Member

Got it, I'll prioritize the benchmarking for figure out how much CPU and memory to use. Until then, I guess Flyte can configure the submitterpodtemplate themselves, because we expose this in the RayJob yaml.

Flyte provides a Python SDK for its Ray users. However, the current SDK does not expose submitterPodTemplate to users, potentially necessitating an API interface alteration to make submitterPodTemplate configurable. This modification could significantly impact the community; hence, providing a default value might be the most effective workaround for Flyte.

@architkulkarni
Copy link
Contributor Author

Tests are ok, buildkite/ray-ecosystem-ci-kuberay-ci/test-raycluster-sample-yamls-nightly-operator is expected to fail and is unrelated to this change.

@architkulkarni
Copy link
Contributor Author

@kevin85421 any other concerns about this PR?

Copy link
Member

@kevin85421 kevin85421 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Could you also update the doc in the Ray repo?

@architkulkarni architkulkarni merged commit 775715b into ray-project:master Oct 5, 2023
kevin85421 pushed a commit to kevin85421/kuberay that referenced this pull request Oct 17, 2023
…t#1319)

Adds a default CPU and memory resource requirement for the pod that submits the RayJob.

image
Here's the memory and CPU usage of the job submitter pod with a job entrypoint script that prints "hello world" in a tight loop from Python. I tested it on a local kind cluster. Since the job submitter pod just calls ray job submit which runs the job remotely and streams the job output to stdout on the submitter pod, this kind of log streaming job should be the most stressful from the perspective of the job submitter pod.

Based on this, we set the requests to 500m CPU and 200MiB memory, and set the limits to 1 CPU and 1GiB memory.

Why are these changes needed?
It's good to provide reasonable defaults to reduce the burden on the user.

Related issue number
Closes ray-project#1198

---------

Signed-off-by: Archit Kulkarni <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[RayJob] Set Resources field for job submitter pod
2 participants