Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a warning to discourage users from launching a KubeRay-incompatible autoscaler. #1102

Merged
merged 5 commits into from
May 24, 2023

Conversation

kevin85421
Copy link
Member

@kevin85421 kevin85421 commented May 22, 2023

Why are these changes needed?

The no-monitor option in rayStartParams is used to determine whether the monitor.py process should be launched or not. The monitor process is not only for autoscaling but observability (e.g. Prometheus).

KubeRay autoscaling is enabled (i.e. EnableInTreeAutoscaling: true)

Ray autoscaler supports various node providers such as AWS, GCP, Azure, and Kubernetes. However, the default autoscaler is not compatible with Kubernetes. Therefore, we disable the monitor process by default via setting no-monitor to true and inject a sidecar container for KubeRay autoscaler.

if instance.Spec.EnableInTreeAutoscaling != nil && *instance.Spec.EnableInTreeAutoscaling {
headSpec.RayStartParams["no-monitor"] = "true"

KubeRay autoscaling is disabled

Since the default value of no-monitor in Ray is false when not explicitly specified, the monitor process, along with the Kubernetes-incompatible autoscaler, will be launched by default. Therefore, the purpose of this pull request is to initiate the monitor process without including the autoscaler.

@wuisawesome said that we can disable the autoscaling feature by making the autoscaler become READONLY.

https://sourcegraph.com/github.com/ray-project/ray@ac2230fd28f642e4523422a1e1eb15ab92f63943/-/blob/python/ray/autoscaler/_private/monitor.py?L220-L230

If autoscaling-config is not set in the head Pod's rayStartParams, BASE_READONLY_CONFIG will be used, which creates a READONLY autoscaler by default. Hence, in this PR, our focus is to prevent users from launching a KubeRay-incompatible autoscaler when they specify the autoscaling-config themselves.

Related issue number

Checks

  • I've made sure the tests are passing.

  • Testing Strategy

    • Unit tests
    • Manual tests
    • This PR is not tested :(
Screen Shot 2023-05-22 at 2 36 13 PM
helm install kuberay-operator kuberay/kuberay-operator --version 0.5.0
helm install raycluster kuberay/ray-cluster --version 0.5.0

# Log in to the head Pod
kubectl exec -it $YOUR_HEAD_POD -- bash

# Check /tmp/ray/session_latest/logs/monitor.log in the head Pod
2023-05-22 13:23:39,259 INFO autoscaler.py:364 -- StandardAutoscaler: {'cluster_name': 'default', 'max_workers': 0, 'upscaling_speed': 1.0, 'docker': {}, 'idle_timeout_minutes': 0, 'provider': {'type': 'readonly', 'use_node_id_as_ip': True}, 'auth': {}, 'available_node_types': {'ray.head.default': {'resources': {}, 'node_config': {}, 'max_workers': 0}}, 'head_node_type': 'ray.head.default', 'file_mounts': {}, 'cluster_synced_files': [], 'file_mounts_sync_continuously': False, 'rsync_exclude': [], 'rsync_filter': [], 'initialization_commands': [], 'setup_commands': [], 'head_setup_commands': [], 'worker_setup_commands': [], 'head_start_ray_commands': [], 'worker_start_ray_commands': [], 'head_node': {}, 'worker_nodes': {}}
  • Note that StandardAutoscaler above is the same as BASE_READONLY_CONFIG if autoscaling-config is not set.

Set autoscaling-config

# Step 0: Build Docker image for this PR (controller:latest), and load into the Kind cluster.
# Step 1: Install the KubeRay operator with the new Docker image
 helm install kuberay-operator kuberay/kuberay-operator --version 0.5.0 --set image.repository=controller,image.tag=latest

# Step 2: Create a RayCluster with `autoscaling-config`
# Check this gist: https://gist.github.com/kevin85421/88315abaf9735722224e7787c9e54036

# Step 3: Check KubeRay operator log. You should be able to see the following log
# 2023-05-22T22:31:24.598Z        INFO    RayCluster-Controller   Detect autoscaling-config in head Pod's rayStartParams. The monitor process will initialize the monitor with the provided config. Please ensure the autoscaler is set to READONLY mode.

@kevin85421 kevin85421 changed the title [WIP] Add no-monitor by default Add a warning to discourage users from launching a KubeRay-incompatible autoscaler. May 22, 2023
@wuisawesome
Copy link

+1 to leaving the monitor process in existence and disabling features like autoscaling if needed.

@kevin85421 I'm a little fuzzy on the overall integration with Ray. Could you summarize what the ray start command has (does it have a --no-monitor flag?) and where the monitor.py process is expected to be running (in the head node container or in the sidecar?) under different scenarios?

  1. The default with autoscaling disabled
  2. the default with autoscaling enabled
  3. If someone changes the rayStartParams on the head node with autoscaling enabled
  4. If someone changes the rayStartParams on the head node with autoscaling disabled

@kevin85421
Copy link
Member Author

@wuisawesome Thank you for the comment!

  • Case 1: The default with autoscaling disabled (EnableInTreeAutoscaling: false)

    • The value of no-monitor option is based on user's rayStartParams. The default value of no-monitor in Ray is false.
  • Case 2: The default with autoscaling enabled (EnableInTreeAutoscaling: true)

    • no-monitor will be set to "true" automatically. See here for more details.
  • Case 3: If someone changes the rayStartParams on the head node with autoscaling enabled.

    • no-monitor will be set to "true" automatically. See here for more details.
  • Case 4: If someone changes the rayStartParams on the head node with autoscaling disabled

    • The value of no-monitor option is based on user's rayStartParams. The default value of no-monitor in Ray is false.

@kevin85421 kevin85421 marked this pull request as ready for review May 22, 2023 23:17
@kevin85421
Copy link
Member Author

cc @Yicheng-Lu-llll

@Yicheng-Lu-llll
Copy link
Contributor

Yicheng-Lu-llll commented May 24, 2023

LGTM!

I'm a bit curious, though. Under what circumstances would a user need to configure autoscaling-config for KubeRay? It seems there is no need due to incompatible.

If a user do provide autoscaling-config , I am wondering how to set incompatible autoscaler to READONLY.

@kevin85421
Copy link
Member Author

@Yicheng-Lu-llll thank you for your comments!

I'm a bit curious, though. Under what circumstances would a user need to configure autoscaling-config for KubeRay? It seems there is no need due to incompatible.

Honestly, I do not know. I would prefer not to hardcode autoscaling-config here, considering the lesson we learned from the init container injection.

Based on the comments in ReadOnlyNodeProvider, the READONLY autoscaler also serves the purpose of collecting cluster state. Therefore, users may configure autoscaling-config for observability reasons. cc @wuisawesome @DmitriGekhtman for further inputs

If a user do provide autoscaling-config , I am wondering how to set incompatible autoscaler to READONLY.

https://sourcegraph.com/github.com/ray-project/ray@ac2230fd28f642e4523422a1e1eb15ab92f63943/-/blob/python/ray/autoscaler/_private/monitor.py?L102

Copy link

@wuisawesome wuisawesome left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sgtm. I almost wonder if we should take the more opinionated stance and require the --no-monitor flag at all times, but will defer to you on this.

@kevin85421
Copy link
Member Author

Discuss with @wuisawesome offline:

  • Consider running the monitor process as a sidecar container in any case.
  • The monitor process is responsible for Prometheus metrics, logs, "events," health check information, information to power Ray status, and autoscaling behavior.

@kevin85421
Copy link
Member Author

#1109

lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023
…le autoscaler. (ray-project#1102)

Add a warning to discourage users from launching a KubeRay-incompatible autoscaler.
@kevin85421
Copy link
Member Author

More context about no-monitor: ray-project/ray#40911 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants