
[Feature] Generic data abstraction on top of CRD #53

Closed
2 tasks done
Jeffwan opened this issue Oct 3, 2021 · 7 comments
Assignees
Labels
apiserver enhancement New feature or request

Comments

@Jeffwan
Collaborator

Jeffwan commented Oct 3, 2021

Search before asking

  • I had searched in the issues and found no similar feature requirement.

Description

In our system, not everyone uses kubectl to operate clusters directly. There are a few major reasons.

  1. The current Ray operator is friendly to users who are already familiar with the Kubernetes operator pattern. For most data scientists, this approach actually steepens the learning curve.

  2. Using kubectl requires a sophisticated permission system. Some Kubernetes clusters don't enable user-level authentication; in my company, we use loose RBAC management and our corporate SSO system is not integrated with Kubernetes OIDC at all.

Due to the above reasons, I think it is worth building a generic abstraction on top of the RayCluster CRD. With core API support, we can easily build backend services, a CLI, etc., to bridge users. Underneath, it still uses Kubernetes to manage the real data.

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

/cc @chenk008 @akanso @chaomengyuan

@Jeffwan
Collaborator Author

Jeffwan commented Oct 3, 2021

API Definition

In order to better manage resources at the API level, a few proto files will be defined to describe resources. Technically, we could reuse the Kubernetes resources directly. However, the RayCluster CRD is probably not the best data structure to describe a cluster. At the same time, we want to leave some flexibility to use a database to store historical data in the near future (for pagination, for example).

message Cluster {
  // Output. Unique Cluster ID. Generated by API server.
  string id = 1;

  // Required input field. Unique Cluster name provided by user.
  string name = 2;

  // Required input field. Cluster's namespace provided by user
  string namespace = 3;

  // Required field. This field indicates the user who owns the cluster.
  string user = 4;

  // Optional input field. Ray cluster version
  string version = 5;

  // Optional field.
  enum Environment {
    DEV = 0;
    TESTING = 1;
    STAGING = 2;
    PRODUCTION = 3;
  }
  Environment environment = 6;

  // Required field. This field will be used to retrieve the right Ray container
  string cluster_runtime = 7;
  
  // Required field. This field references the ComputeRuntime that defines the head and worker group specs
  string compute_runtime = 8;

  // Output. The time that the cluster was created.
  google.protobuf.Timestamp created_at = 9;

  // Output. The time that the cluster was deleted.
  google.protobuf.Timestamp deleted_at = 10;
}

ComputeRuntime is equivalent to our head and worker pod template specs. Currently we only define some basic information; richer features like node affinity, tolerations, etc. have not been included yet.

message ComputeRuntime {
  string id = 1;
  string name = 2;
  enum Cloud {
      ALIBABA = 0;
      AWS = 1;
      AZURE = 2;
      GCP = 3;
      ON_PREM = 4;
  }
  Cloud cloud = 3;
  string region = 4;
  string availability_zone = 5;
  HeadGroupSpec head_group_spec = 6;
  repeated WorkerGroupSpec worker_group_spec = 7;
}

message HeadGroupSpec {
  // Optional
  Resource resource = 1;
  // Optional
  map<string, string> ray_start_params = 2;
  // Optional
  string service_type = 3;
  // Output: internal/external service endpoint
  string service_address = 4;
}

message WorkerGroupSpec {
  // Optional input field.
  string group_name = 1;
  // Required input field
  int32 replicas = 2;
  // Optional
  int32 min_replicas = 3;
  // Optional
  int32 max_replicas = 4;
  // Optional
  Resource resource = 5;
  // Optional
  map<string, string> ray_start_params = 6;
}

ClusterRuntime is used to build the node image. This is inspired by Anyscale. It is optional for some clusters; people can use a base image plus a job-level runtime as well.

message ClusterRuntime {
  string id = 1;
  string name = 2;
  string base_image = 3;
  repeated string pip_packages = 4;
  repeated string system_packages = 5;
  map<string, string> environment_variables = 6;
  string custom_commands = 7;
  // Output
  string image = 8;
}

Tech stack

The .proto files define the core API and the gRPC and gateway services. A go_client and swagger files can be generated easily for further usage.

[architecture diagram]
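
To make the stack concrete, here is a minimal sketch (in Go) of how the gRPC server and the grpc-gateway HTTP proxy could be wired together. The generated package path and the ClusterService identifiers (RegisterClusterServiceServer, UnimplementedClusterServiceServer, RegisterClusterServiceHandlerFromEndpoint) are assumptions about what protoc would emit for a hypothetical ClusterService; the real names depend on the final .proto definitions.

package main

import (
    "context"
    "log"
    "net"
    "net/http"

    "github.com/grpc-ecosystem/grpc-gateway/v2/runtime"
    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials/insecure"

    pb "example.com/kuberay/proto/go_client" // hypothetical generated package
)

func main() {
    ctx := context.Background()

    // 1. Serve the core gRPC API on :8887.
    lis, err := net.Listen("tcp", ":8887")
    if err != nil {
        log.Fatal(err)
    }
    grpcServer := grpc.NewServer()
    // Placeholder implementation; a real server would back this with the
    // Kubernetes client that creates/updates RayCluster CRs.
    pb.RegisterClusterServiceServer(grpcServer, &pb.UnimplementedClusterServiceServer{})
    go func() {
        if err := grpcServer.Serve(lis); err != nil {
            log.Fatal(err)
        }
    }()

    // 2. Expose the same API as REST/JSON on :8888 through grpc-gateway.
    mux := runtime.NewServeMux()
    opts := []grpc.DialOption{grpc.WithTransportCredentials(insecure.NewCredentials())}
    if err := pb.RegisterClusterServiceHandlerFromEndpoint(ctx, mux, "localhost:8887", opts); err != nil {
        log.Fatal(err)
    }
    log.Fatal(http.ListenAndServe(":8888", mux))
}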

Jeffwan mentioned this issue Oct 3, 2021
@chaomengyuan
Contributor

Just a small comment: the names "ClusterRuntime" and "ComputeRuntime" are a little bit confusing. For me, the actual definition of "ComputeRuntime" is more like a "ClusterRuntime".

@Jeffwan
Collaborator Author

Jeffwan commented Oct 8, 2021

@chaomengyuan I think we can come up with other ideas for images and try not to confuse users.

@chenk008
Contributor

chenk008 commented Oct 12, 2021

Just a small comment: the names "ClusterRuntime" and "ComputeRuntime" are a little bit confusing. For me, the actual definition of "ComputeRuntime" is more like a "ClusterRuntime".

I think so, too. ComputeRuntime is a little confusing. ClusterRuntime is similar to the runtime env in Ray.

@Jeffwan
Collaborator Author

Jeffwan commented Oct 17, 2021

I made some changes to the API definition. /cc @chenk008 @chaomengyuan

  1. Removed the confusing ClusterRuntime; use a simple string image instead.
  • Provide a separate Image message to build images. In the future, once we have a workspace concept, we can reuse the image concept.
  2. Changed ComputeRuntime to ClusterSpec and created a reusable concept, ComputeTemplate, to describe resources (it doesn't carry any Ray information, so it's reusable).

Please take a look. I am also wondering whether we want to use references (like a foreign key) or embed objects here. Since we don't use a database, we need to translate each object into a ConfigMap and then link everything together at the cluster level; a rough sketch of that translation follows the proto definitions below.


message Cluster {
  // Output. Unique Cluster ID. Generated by API server.
  string id = 1;

  // Required input field. Unique Cluster name provided by user.
  string name = 2;

  // Required input field. Cluster's namespace provided by user
  string namespace = 3;

  // Required field. This field indicates the user who owns the cluster.
  string user = 4;

  // Optional input field. Ray cluster version
  string version = 5;

  // Optional field.
  enum Environment {
    DEV = 0;
    TESTING = 1;
    STAGING = 2;
    PRODUCTION = 3;
  }
  Environment environment = 6;

  // Required field. This field will be used to retrieve the right Ray container
  string image = 7;
  
  // Required field. This field describes the Ray cluster spec
  ClusterSpec cluster_spec = 8;

  // Output. The time that the cluster was created.
  google.protobuf.Timestamp created_at = 9;

  // Output. The time that the cluster was deleted.
  google.protobuf.Timestamp deleted_at = 10;
}

message ClusterSpec {
  // The ID of the compute template
  string id = 1;
  // The name of the compute template
  string name = 2;
  // The head group configuration
  HeadGroupSpec head_group_spec = 3;
  // The worker group configurations
  repeated WorkerGroupSpec worker_group_spec = 4;
}

message HeadGroupSpec {
  // Optional
  ComputeTemplate compute_template = 1;
  // Optional
  map<string, string> ray_start_params = 2;
  // Optional
  string service_type = 3;
  // Output: internal/external service endpoint
  string service_address = 4;
}

message WorkerGroupSpec {
  // Optional input field.
  string group_name = 1;
  // Required input field
  int32 replicas = 2;
  // Optional
  int32 min_replicas = 3;
  // Optional
  int32 max_replicas = 4;
  // Optional
  ComputeTemplate compute_template = 5;
  // Optional
  map<string, string> ray_start_params = 6;
}

message ComputeTemplate {
  // The ID of the compute template
  string id = 1;
  // The name of the compute template
  string name = 2;
  // Number of cpus
  uint32 cpu = 3;
  // Amount of memory
  uint32 memory = 4;
  // Number of gpus
  uint32 gpu = 5;
  // The detail gpu accelerator type
  string gpu_accelerator = 6;
}

message Image {
  string id = 1;
  string name = 2;
  string base_image = 3;
  repeated string pip_packages = 4;
  repeated string conda_packages = 5;
  repeated string system_packages = 6;
  map<string, string> environment_variables = 7;
  string custom_commands = 8;
  // Output
  string image = 9;
}
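
As noted above, with no database behind the apiserver, objects such as ComputeTemplate would be persisted as ConfigMaps and linked together at the cluster level. Below is a rough sketch of that idea, assuming client-go is used; the struct, the label key, and the data layout are illustrative, not the final implementation.

package store

import (
    "context"
    "strconv"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// ComputeTemplate mirrors the proto message above (simplified).
type ComputeTemplate struct {
    Name           string
    CPU            uint32
    Memory         uint32
    GPU            uint32
    GPUAccelerator string
}

// SaveComputeTemplate writes the template into a ConfigMap so that a Cluster
// can later reference it by name, similar to a foreign key.
func SaveComputeTemplate(ctx context.Context, client kubernetes.Interface, namespace string, t ComputeTemplate) error {
    cm := &corev1.ConfigMap{
        ObjectMeta: metav1.ObjectMeta{
            Name:      t.Name,
            Namespace: namespace,
            Labels:    map[string]string{"ray.io/config-type": "compute-template"}, // illustrative label key
        },
        Data: map[string]string{
            "cpu":             strconv.FormatUint(uint64(t.CPU), 10),
            "memory":          strconv.FormatUint(uint64(t.Memory), 10),
            "gpu":             strconv.FormatUint(uint64(t.GPU), 10),
            "gpu_accelerator": t.GPUAccelerator,
        },
    }
    _, err := client.CoreV1().ConfigMaps(namespace).Create(ctx, cm, metav1.CreateOptions{})
    return err
}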

Jeffwan self-assigned this Oct 17, 2021
@Jeffwan
Collaborator Author

Jeffwan commented Oct 27, 2021

Let's split this story into separate sub-issues:

  1. Core Cluster and Image messages
  2. gRPC and gRPC gateway services
  3. Scripts to generate Go clients and swagger files (see the client usage sketch below)
  4. Code generation
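
For item 3, here is a hedged usage sketch of the generated go_client from a CLI or backend service. The ClusterService, NewClusterServiceClient, CreateCluster, and CreateClusterRequest identifiers are assumptions about what the generated client might look like; only the Cluster fields come from the messages above.

package main

import (
    "context"
    "log"

    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials/insecure"

    pb "example.com/kuberay/proto/go_client" // hypothetical generated package
)

func main() {
    conn, err := grpc.Dial("localhost:8887", grpc.WithTransportCredentials(insecure.NewCredentials()))
    if err != nil {
        log.Fatal(err)
    }
    defer conn.Close()

    // Create a cluster through the API server instead of kubectl.
    client := pb.NewClusterServiceClient(conn)
    _, err = client.CreateCluster(context.Background(), &pb.CreateClusterRequest{
        Cluster: &pb.Cluster{
            Name:      "demo",
            Namespace: "ray-system",
            User:      "jeffwan",
            Image:     "rayproject/ray:1.9.0", // illustrative image tag
        },
    })
    if err != nil {
        log.Fatal(err)
    }
}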

@DmitriGekhtman
Collaborator

@Jeffwan
The KubeRay API server has been implemented, so can we close this?
