[RayService] Add support for multi-app config in yaml-string format #1156
Conversation
Signed-off-by: cindyz <[email protected]>
@@ -443,7 +443,7 @@ func (r *RayServiceReconciler) cleanUpServeConfigCache(rayServiceInstance *rayv1
 	configPrefix := r.generateConfigKeyPrefix(rayServiceInstance)

 	// Clean up RayCluster serve deployment configs.
-	for key := range r.ServeDeploymentConfigs.Items() {
+	for key := range r.ServeConfigs.Items() {
The logic of this function is quite confusing and seems to be causing a memory leak. To elaborate: when a RayService is deleted, does the cache for that RayService instance actually get removed? I need to run some experiments to verify this. This is technical debt and is not caused by this pull request, so we do not need to address it within the scope of this PR.
ImportPath:  configV1Spec.ImportPath,
RuntimeEnv:  applicationRuntimeEnv,
Deployments: convertedDeploymentSpecs,
Port:        configV1Spec.Port,
Why do we add `Port` here? The behavior seems to be different from the following code snippet.
kuberay/ray-operator/controllers/ray/rayservice_controller.go
Lines 662 to 666 in 0ae9fc1

servingClusterDeployments := utils.ServingClusterDeployments{
	ImportPath:  rayServiceInstance.Spec.ServeDeploymentGraphSpec.ImportPath,
	RuntimeEnv:  runtimeEnv,
	Deployments: rayDashboardClient.ConvertServeConfig(rayServiceInstance.Spec.ServeDeploymentGraphSpec.ServeConfigSpecs),
}
It's part of the spec. The lines you linked were only used for printing out the JSON in the logs; I think they just missed the port. In `UpdateDeployments`, the port was included before as well.
Based on my understanding, if we intend to make the port dynamically configurable, we will also need to update the configuration of Pods and Services. However, it is worth noting that RayCluster currently does not support dynamic updates to the configuration of Pods and Services.
I see. I believe this is existing functionality so I don't want to change it in this PR - can you create a separate issue for this?
I am fine with this change, but the corresponding function for this code snippet appears to be this one instead of `UpdateDeployments` in dashboard_httpclient.go.
Yeah, if you look closely, that code snippet is only used to print a JSON to the logs. It's not actually sent to the Serve API.
I will open an issue later and resolve this comment. However, it seems that the ports for containers in a Pod cannot be updated dynamically. See this link for more details. In this case, the port can only be updated if users expose multiple ports, of which only a few will be used. I am not familiar with serving, so I'm unsure if there are any use cases for this situation.
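Since container ports on a running Pod cannot be updated in place, the workaround mentioned above is to expose several ports up front and let a later Serve config pick one of them. An illustrative container snippet (the container name, image tag, and spare port number are assumptions for illustration, not taken from this PR):

```yaml
# Illustrative Pod container fragment: declare extra ports in advance so a
# later Serve config can switch ports without a Pod-spec change.
containers:
  - name: ray-head
    image: rayproject/ray:2.4.0
    ports:
      - containerPort: 8000   # default Serve port
      - containerPort: 9000   # spare port, declared up front
```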
Signed-off-by: cindyz <[email protected]>
cachedServeConfig, isServeConfig := cachedConfigObj.(rayv1alpha1.ServeDeploymentGraphSpec)
if !isServeConfig {
	shouldUpdate = true
	reason = fmt.Sprintf("No Serve config has been cached for cluster %s with key %s", rayClusterInstance.Name, cacheKey)
The message seems to be not precise enough. You can refer to https://go.dev/ref/spec#Type_assertions for more details about type assertions and the section "Implementing an interface" in https://go.dev/ref/spec#Method_sets about "implements".
More precisely, if T is not an interface type, x.(T) asserts that the dynamic type of x is identical to the type T. In this case, T must implement the (interface) type of x; otherwise the type assertion is invalid since it is not possible for x to store a value of type T. If T is an interface type, x.(T) asserts that the dynamic type of x implements the interface T.
In our case, `cachedConfigObj` (i.e. x) has an interface type and `ServeDeploymentGraphSpec` is a non-interface type, so `isServeConfig=false` means the dynamic type of `cachedConfigObj` is not `ServeDeploymentGraphSpec`. In other words, a serve configuration is cached, but the cached value is not a `ServeDeploymentGraphSpec`.
I see, I have updated the messages to include information about v1/v2 and the corresponding field types.
Signed-off-by: cindyz <[email protected]>
LGTM. I will approve this PR after I manually test it.
reason = fmt.Sprintf("No V1 Serve Config of type ServeDeploymentGraphSpec has been cached for cluster %s with key %s", rayClusterInstance.Name, cacheKey)
} else if !utils.CompareJsonStruct(cachedServeConfig, rayServiceInstance.Spec.ServeDeploymentGraphSpec) {
	shouldUpdate = true
	reason = fmt.Sprintf("Current V2 Serve config doesn't match cached Serve config for cluster %s with key %s", rayClusterInstance.Name, cacheKey)
- reason = fmt.Sprintf("Current V2 Serve config doesn't match cached Serve config for cluster %s with key %s", rayClusterInstance.Name, cacheKey)
+ reason = fmt.Sprintf("Current V1 Serve config doesn't match cached Serve config for cluster %s with key %s", rayClusterInstance.Name, cacheKey)

?
# Step 0: Create a Kubernetes cluster
kind create cluster --image=kindest/node:v1.23.0
# Step 1: Install a KubeRay operator (path: helm-chart/kuberay-operator/)
helm install kuberay-operator . --set image.repository=controller,image.tag=latest
# Step 2: Create a multi-app RayService (path: ray-operator/config/samples)
kubectl apply -f ray_v1alpha1_rayservice-multi-app.yaml
# Step 3: Forward the serve port after the RayService is ready.
kubectl port-forward svc/rayservice-sample-serve-svc 8000
# Step 4: Test two applications
curl -X POST -H 'Content-Type: application/json' localhost:8000/fruit/ -d '["PEAR", 12]'
# expected output: 12
curl -X POST -H 'Content-Type: application/json' localhost:8000/calc/ -d '["ADD", 1]'
# expected output: "4 pizzas please!"
# Step 5: Edit the `rayVersion` to 2.100.0
kubectl edit rayservices.ray.io rayservice-sample
# Step 6: After the new RayCluster is ready, repeat Step 4.
[RayService] Add support for multi-app config in yaml-string format (ray-project#1156)
Why are these changes needed?

The new field `serveConfigV2` under a RayService spec takes a yaml-formatted string (https://yaml-multiline.info/) which matches a Ray Serve multi-app config. Example from `kubectl describe rayservice rayservice-sample`:
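The example output itself is not recoverable from this page. As a hedged illustration only, a multi-app `serveConfigV2` string generally looks like the following; the application names, route prefixes, and import paths here are assumptions modeled on the Ray Serve multi-app config schema, not the PR's actual sample:

```yaml
# Illustrative multi-app serveConfigV2: a yaml-formatted string holding
# one entry per Serve application.
serveConfigV2: |
  applications:
    - name: fruit_app
      route_prefix: /fruit
      import_path: fruit.deployment_graph
    - name: calc_app
      route_prefix: /calc
      import_path: conditional_dag.serve_dag
```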
Notes:

`oneOf` is what we would need to ensure only one of `serveConfig` or `serveConfigV2` is specified. However, kubebuilder doesn't support generating `oneOf` (see "Support more complex validations (oneOf/anyOf/allOf/etc)", kubernetes-sigs/controller-tools#461), so we perform validation in the service controller and consistently throw errors if the user specifies both.

Related issue number
Checks