From 5690fe93bd63a79d7cf0a516bfbf3670fc58eb45 Mon Sep 17 00:00:00 2001
From: "chenyu.jiang" 
Date: Thu, 1 Dec 2022 14:15:09 -0800
Subject: [PATCH 01/11] update docs

---
 apiserver/DEVELOPMENT.md                      |  58 ++++
 apiserver/README.md                           | 123 +++++++-
 .../best-practice/worker-head-reconnection.md |   8 +-
 docs/design/protobuf-grpc-service.md          | 275 ++++++------------
 docs/guidance/gcs-ft.md                       |   2 +-
 docs/troubleshooting.md                       |   3 -
 6 files changed, 263 insertions(+), 206 deletions(-)
 create mode 100644 apiserver/DEVELOPMENT.md

diff --git a/apiserver/DEVELOPMENT.md b/apiserver/DEVELOPMENT.md
new file mode 100644
index 0000000000..7e5f200728
--- /dev/null
+++ b/apiserver/DEVELOPMENT.md
@@ -0,0 +1,58 @@
+# User Guide
+
+This guide documents the purpose and deployment of kuberay-apiserver.
+
+## Requirements
+
+| software | version  |                                                                link |
+| :------- | :------: | ------------------------------------------------------------------: |
+| kubectl  | v1.18.3+ | [download](https://kubernetes.io/docs/tasks/tools/install-kubectl/) |
+| go       | v1.13+   | [download](https://golang.org/dl/)                                  |
+| docker   | 19.03+   | [download](https://docs.docker.com/install/)                        |
+
+## Purpose
+Lifecycle management of Ray clusters may not be friendly for Kubernetes non-experts.
+The backend service is intended to provide a RESTful web service for managing Ray cluster Kubernetes resources.
+
+## Build and Deployment
+The backend service can be deployed locally or in the Kubernetes cluster itself. The HTTP service listens on port 8888.
+
+### Pre-requisites
+An admin kubeconfig file must be located at `~/.kube/config`.
+
+### Local Deployment
+#### Build
+```
+go build -a -o raymgr cmd/main.go
+```
+
+#### Start Service
+```
+./raymgr
+```
+#### Access
+`localhost:8888`
+
+### Kubernetes Deployment
+#### Build
+```
+./docker-image-builder.sh
+```
+This script will build and optionally push the image to the remote docker hub (hub.byted.org, TODO: make it configurable).
+#### Start Service
+```
+kubectl apply -f deploy/
+```
+#### Access
+To get the port:
+
+```
+NODE_PORT=$(kubectl get -o jsonpath="{.spec.ports[0].nodePort}" services backend-service -n ray-system)
+```
+To get a node IP:
+```
+NODE_IP=$(kubectl get nodes -o jsonpath='{ $.items[*].status.addresses[?(@.type=="InternalIP")].address }')
+```
+and pick any IP address.
+
+Use `NODE_IP:NODE_PORT` to access the service.
\ No newline at end of file
diff --git a/apiserver/README.md b/apiserver/README.md
index d6e9be2cb2..0eea3bc10d 100644
--- a/apiserver/README.md
+++ b/apiserver/README.md
@@ -1,26 +1,137 @@
-# KubeRay ApiServer
+# KubeRay APIServer
 
-The KubeRay ApiServer provides gRPC and HTTP APIs to manage KubeRay resources.
+The KubeRay APIServer provides gRPC and HTTP APIs to manage KubeRay resources.
 
-!!! note
+**Note**
 
-    The KubeRay ApiServer is an optional component. It provides a layer of simplified
+    The KubeRay APIServer is an optional component. It provides a layer of simplified
     configuration for KubeRay resources. The KubeRay API server is used internally
    by some organizations to back user interfaces for KubeRay resource management.
 
-    The KubeRay ApiServer is community-managed and is not officially endorsed by the
+    The KubeRay APIServer is community-managed and is not officially endorsed by the
    Ray maintainers. At this time, the only officially supported methods for managing
    KubeRay resources are

    - Direct management of KubeRay custom resources via kubectl, kustomize, and Kubernetes language clients.
    - Helm charts. 
- KubeRay ApiServer maintainer contacts (GitHub handles): + KubeRay APIServer maintainer contacts (GitHub handles): @Jeffwan @scarlet25151 ## Usage +You can just install the KubeRay APIServer within the same kubernetes cluster by using the [helm chart](https://github.com/ray-project/kuberay/tree/master/helm-chart/kuberay-apiserver) or just use [kustomize](https://github.com/ray-project/kuberay/tree/master/apiserver/deploy/base) + +After the deployment we may use the `{{baseUrl}}` to access the + +- (default) for nodeport access, we provide the default http port `31888` for connection and you can connect it using. + +- for ingress access, you will need to create your own ingress + +The requests parameters detail can be seen in [KubeRay swagger](https://github.com/ray-project/kuberay/tree/master/proto/swagger), here we only present some basic example: + +## Setup end-to-end test + +0. (optional) you may use your local kind cluster or minikube + +```bash +kind create cluster --name ray-test +``` + +1. Deploy the KubeRay APIServer within the same cluster of KubeRay operator + +```bash +helm -n ray-system install kuberay-apiserver kuberay/helm-chart/kuberay-apiserver +``` +or + +``` +kubectl apply -k "github.com/ray-project/kuberay/apiserver/deploy/base?ref=${KUBERAY_VERSION}&timeout=90s +``` + +2. you can access by your nodeport + +``` +curl localhost:31888 +{"code":5, "message":"Not Found"} +``` + +3. you can just create `RayCluster` or `RayJobs` or `RayService` by just dials the endpoints + +``` +curl -XPOST 'localhost:31888/apis/v1alpha2/namespaces/ray-system/services' \ +--header 'Content-Type: application/json' \ +--data '{ + "name": "user-test-1", + "namespace": "default", + "user": "test", + "serveDeploymentGraphSpec": { + "importPath": "fruit.deployment_graph, + "runtimeEnv": "https://github.com/ray-project/test_dag/archive/c620251044717ace0a4c19d766d43c5099af8a77.zip\"\n", + "serveConfigs": [ + { + "deploymentName": "OrangeStand", + "replicas": 1, + "userConfig": "price: 2", + "actorOptions": { + "cpusPerActor": 0.1 + } + }, + { + "deploymentName": "PearStand", + "replicas": 1, + "userConfig": "price: 1", + "actorOptions": { + "cpusPerActor": 0.1 + } + }, + { + "deploymentName": "FruitMarket", + "replicas": 1, + "actorOptions": { + "cpusPerActor": 0.1 + } + },{ + "deploymentName": "DAGDriver", + "replicas": 1, + "routePrefix": "/", + "actorOptions": { + "cpusPerActor": 0.1 + } + }] + }, + "clusterSpec": { + "headGroupSpec": { + "computeTemplate": "default-template", + "image": "hub.byted.org/kuberay/ray:2.0.0", + "serviceType": "NodePort", + "rayStartParams": { + "port": "6379", + "node-ip-address": "$MY_POD_IP", + "dashboard-host": "0.0.0.0", + "metrics-export-port": "8080" + }, + "volumes": [] + }, + "workerGroupSpec": [ + { + "groupName": "small-wg", + "computeTemplate": "default-template", + "image": "hub.byted.org/kuberay/ray:2.0.0", + "replicas": 1, + "minReplicas": 0, + "maxReplicas": 5, + "rayStartParams": { + "node-ip-address": "$MY_POD_IP" + } + } + ] + } +}' +``` +then you can have the resource running in kubernetes cluster. 
+
 ### Compute Template

 #### Create compute templates in a given namespace
diff --git a/docs/best-practice/worker-head-reconnection.md b/docs/best-practice/worker-head-reconnection.md
index 1229f55bc8..84a35c1beb 100644
--- a/docs/best-practice/worker-head-reconnection.md
+++ b/docs/best-practice/worker-head-reconnection.md
@@ -6,7 +6,7 @@ For a `RayCluster` with a head and several workers, if a worker is crashed, it w
 
 ## Explanation
 
-When the head pod was deleted, it will be recreated with a new IP by KubeRay controller,and the GCS server address is changed accordingly. The Raylets of all workers will try to get GCS address from Redis in ‘ReconnectGcsServer’, but the redis_clients always use the previous head IP, so they will always fail to get new GCS address. The Raylets will not exit until max retries are reached. There are two configurations determining this long delay:
+When the head pod is deleted, it is recreated with a new IP by the KubeRay controller, and the GCS server address changes accordingly. The Raylets of all workers try to get the GCS address from Redis in `ReconnectGcsServer`, but the redis_clients always use the previous head IP, so they always fail to get the new GCS address. The Raylets will not exit until max retries are reached. There are two configurations determining this long delay:
 
 ```
 /// The interval at which the gcs rpc client will check if gcs rpc server is ready.
@@ -22,12 +22,10 @@ It retries 600 times and each interval is 1s, resulting in total 600s timeout, i
 
 ## Best Practice
 
-GCS FT feature [#20498](https://github.com/ray-project/ray/issues/20498) is planned in Ray Core Roadmap. When this feature is released, expect a stable head and GCS such that worker-head connection lost issue will not appear anymore.
+GCS FT feature is now alpha release, for further understand we can rely on the FT feature. To enable the GCS, please refer to [Ray GCS Fault Tolerance](https://github.com/ray-project/kuberay/blob/master/docs/guidance/gcs-ft.md)
 
-Before that, to solve the workers-head connection lost, there are two options:
+Also, to solve the workers-head connection lost, there are two others options:
 
 - Make head more stable: when creating the cluster, allocate sufficient amount of resources on head pod such that it tends to be stable and not easy to crash. You can also set {"num-cpus": "0"} in "rayStartParams" of "headGroupSpec" such that Ray scheduler will skip the head node when scheduling workloads. This also helps to maintain the stability of the head.
 
 - Make reconnection shorter: for version <= 1.9.1, you can set this head param --system-config='{"ping_gcs_rpc_server_max_retries": 20}' to reduce the delay from 600s down to 20s before workers reconnect to the new head.
-
-> Note: we should update this doc when GCS FT feature gets updated.
diff --git a/docs/design/protobuf-grpc-service.md b/docs/design/protobuf-grpc-service.md
index 3d1e8ca108..1b4e9e0b33 100644
--- a/docs/design/protobuf-grpc-service.md
+++ b/docs/design/protobuf-grpc-service.md
@@ -8,11 +8,11 @@ There're few major blockers for users to use KubeRay Operator directly.
 
 - Using kubectl requires sophisticated permission system. Some kubernetes clusters do not enable user level authentication. In some companies, devops use loose RBAC management and corp SSO system is not integrated with Kubernetes OIDC at all.
 
-Due to above reason, it's worth to build generic abstraction on top of RayCluster CRD. With the core api support, we can easily build backend services, cli, etc to bridge users without Kubernetes experiences to KubeRay.
+Due to above reason, it's worth to build generic abstraction on top of RayCluster CRD. With the core API support, we can easily build backend services, cli, etc to bridge users without Kubernetes experiences to KubeRay.
 
 ## Goals
 
-- The api definition should be flexible enough to support different kinds of clients (e.g. backend, cli etc).
+- The APIs definition should be flexible enough to support different kinds of clients (e.g. backend, cli etc).
 - This backend service underneath should leverage generate clients to interact with existing RayCluster custom resources.
 - New added components should be plugable to existing operator.
 
@@ -44,9 +44,11 @@ In order to better define resources at the API level, a few proto files will be
 - Some of the Kubernetes API like `tolerance` and `node affinity` are too complicated to be converted to an API.
 - We want to leave some flexibility to use database to store history data in the near future (for example, pagination, list options etc).
 
 We end up propsing a simple and easy API which can cover most of the daily requirements. 
+
+For example, the protobuf definition of the `RayCluster`:
 
-```
+```proto
 service ClusterService {
   // Creates a new Cluster.
   rpc CreateCluster(CreateClusterRequest) returns (Cluster) {
@@ -104,11 +106,18 @@ message GetClusterRequest {
 message ListClustersRequest {
   // The namespace of the clusters to be retrieved.
   string namespace = 1;
+
 }
 
 message ListClustersResponse {
   // A list of clusters returned.
   repeated Cluster clusters = 1;
+
+  // The total number of clusters for the given query.
+  // int32 total_size = 2;
+
+  // The token to list the next page of clusters.
+  // string next_page_token = 3;
 }
 
 message ListAllClustersRequest {}
@@ -116,6 +125,12 @@ message ListAllClustersRequest {}
 message ListAllClustersResponse {
   // A list of clusters returned.
   repeated Cluster clusters = 1;
+
+  // The total number of clusters for the given query.
+  // int32 total_size = 2;
+
+  // The token to list the next page of clusters.
+  // string next_page_token = 3;
 }
 
 message DeleteClusterRequest {
@@ -155,6 +170,18 @@ message Cluster {
 
   // Output. The time that the cluster deleted.
   google.protobuf.Timestamp deleted_at = 8;
+
+  // Output. The state of the cluster, reflecting status.state.
+  string cluster_state = 9;
+
+  // Output. The list of events related to the cluster.
+  repeated ClusterEvent events = 10;
+
+  // Output. The service endpoints of the cluster.
+  map<string, string> service_endpoint = 11;
+
+  // Optional input field. Container environment variables from user.
+  map<string, string> envs = 12;
 }
 
 message ClusterSpec {
@@ -164,6 +191,33 @@ message ClusterSpec {
   repeated WorkerGroupSpec worker_group_spec = 2;
 }
 
+message Volume {
+  string mount_path = 1;
+  enum VolumeType {
+    PERSISTENT_VOLUME_CLAIM = 0;
+    HOST_PATH = 1;
+  }
+  VolumeType volume_type = 2;
+  string name = 3;
+  string source = 4;
+  bool read_only = 5;
+
+  // If a host path is indicated, we need to let the user indicate which type
+  // they would like to use.
+  enum HostPathType {
+    DIRECTORY = 0;
+    FILE = 1;
+  }
+  HostPathType host_path_type = 6;
+
+  enum MountPropagationMode {
+    NONE = 0;
+    HOSTTOCONTAINER = 1;
+    BIDIRECTIONAL = 2;
+  }
+  MountPropagationMode mount_propagation_mode = 7;
+}
+
 message HeadGroupSpec {
   // Optional. The computeTemplate of head node group
   string compute_template = 1;
@@ -171,8 +225,10 @@ message HeadGroupSpec {
   string image = 2;
   // Optional. The service type (ClusterIP, NodePort, Load balancer) of the head node
   string service_type = 3;
-  // Optional. The ray start parames of head node group
+  // Optional. The ray start params of head node group
   map<string, string> ray_start_params = 4;
+  // Optional. The volumes mount to head pod
+  repeated Volume volumes = 5;
 }
 
 message WorkerGroupSpec {
@@ -190,201 +246,37 @@ message WorkerGroupSpec {
   int32 max_replicas = 6;
-  // Optional. The ray start parames of worker node group
+  // Optional. The ray start params of worker node group
   map<string, string> ray_start_params = 7;
+  // Optional. The volumes mount to worker pods
+  repeated Volume volumes = 8;
 }
 
-service ComputeTemplateService {
-  // Creates a new compute template.
-  rpc CreateComputeTemplate(CreateComputeTemplateRequest) returns (ComputeTemplate) {
-    option (google.api.http) = {
-      post: "/apis/v1alpha2/compute_templates"
-      body: "compute_template"
-    };
-  }
-
-  // Finds a specific compute template by its name and namespace.
-  rpc GetComputeTemplate(GetComputeTemplateRequest) returns (ComputeTemplate) {
-    option (google.api.http) = {
-      get: "/apis/v1alpha2/namespaces/{namespace}/compute_templates/{name}"
-    };
-  }
+message ClusterEvent {
+  // Output. Unique Event Id.
+  string id = 1;
 
-  // Finds all compute templates in a given namespace. Supports pagination, and sorting on certain fields.
-  rpc ListComputeTemplates(ListComputeTemplatesRequest) returns (ListComputeTemplatesResponse) {
-    option (google.api.http) = {
-      get: "/apis/v1alpha2/namespaces/{namespace}/compute_templates"
-    };
-  }
+  // Output. Human readable name for event.
+  string name = 2;
 
-  // Finds all compute templates in all namespaces. Supports pagination, and sorting on certain fields.
-  rpc ListAllComputeTemplates(ListAllComputeTemplatesRequest) returns (ListAllComputeTemplatesResponse) {
-    option (google.api.http) = {
-      get: "/apis/v1alpha2/compute_templates"
-    };
-  }
+  // Output. The creation time of the event.
+  google.protobuf.Timestamp created_at = 3;
 
-  // Deletes a compute template by its name and namespace
-  rpc DeleteComputeTemplate(DeleteComputeTemplateRequest) returns (google.protobuf.Empty) {
-    option (google.api.http) = {
-      delete: "/apis/v1alpha2/namespaces/{namespace}/compute_templates/{name}"
-    };
-  }
-}
+  // Output. The first time the event occurred.
+  google.protobuf.Timestamp first_timestamp = 4;
 
-message CreateComputeTemplateRequest {
-  // The compute template to be created.
-  ComputeTemplate compute_template = 1;
-  // The namespace of the compute template to be created
-  string namespace = 2;
-}
+  // Output. The last time the event occurred.
+  google.protobuf.Timestamp last_timestamp = 5;
 
-message GetComputeTemplateRequest {
-  // The name of the ComputeTemplate to be retrieved.
-  string name = 1;
-  // The namespace of the compute template to be retrieved.
-  string namespace = 2;
-}
-
-message ListComputeTemplatesRequest {
-  // The namespace of the compute templates to be retrieved.
-  string namespace = 1;
-  // TODO: support pagination later
-}
+  // Output. The reason for the transition into the object's current status.
+  string reason = 6;
 
-message ListComputeTemplatesResponse {
-  repeated ComputeTemplate compute_templates = 1;
-}
-
-message ListAllComputeTemplatesRequest {
-  // TODO: support pagination later
-}
+  // Output. A human-readable description of the status of this operation.
+ string message = 7; -message DeleteComputeTemplateRequest { - // The name of the compute template to be deleted. - string name = 1; - // The namespace of the compute template to be deleted. - string namespace = 2; -} - -// ComputeTemplate can be reused by any compute units like worker group, workspace, image build job, etc -message ComputeTemplate { - // The name of the compute template - string name = 1; - // The namespace of the compute template - string namespace = 2; - // Number of cpus - uint32 cpu = 3; - // Number of memory - uint32 memory = 4; - // Number of gpus - uint32 gpu = 5; - // The detail gpu accelerator type - string gpu_accelerator = 6; -} - - -service ImageTemplateService { - // Creates a new ImageTemplate. - rpc CreateImageTemplate(CreateImageTemplateRequest) returns (ImageTemplate) { - option (google.api.http) = { - post: "/apis/v1alpha2/image_templates" - body: "image_template" - }; - } - - // Finds a specific ImageTemplate by ID. - rpc GetImageTemplate(GetImageTemplateRequest) returns (ImageTemplate) { - option (google.api.http) = { - get: "/apis/v1alpha2/namespaces/{namespace}/image_templates/{name}" - }; - } - - // Finds all ImageTemplates. Supports pagination, and sorting on certain fields. - rpc ListImageTemplates(ListImageTemplatesRequest) returns (ListImageTemplatesResponse) { - option (google.api.http) = { - get: "/apis/v1alpha2/namespaces/{namespace}/image_templates" - }; - } - - // Deletes an ImageTemplate. - rpc DeleteImageTemplate(DeleteImageTemplateRequest) returns (google.protobuf.Empty) { - option (google.api.http) = { - delete: "/apis/v1alpha2/namespaces/{namespace}/image_templates/{name}" - }; - } -} - -message CreateImageTemplateRequest { - // The image template to be created. - ImageTemplate image_template = 1; - // The namespace of the image template to be created. - string namespace = 2; -} - -message GetImageTemplateRequest { - // The name of the image template to be retrieved. - string name = 1; - // The namespace of the image template to be retrieved. - string namespace = 2; -} - -message ListImageTemplatesRequest { - // The namespace of the image templates to be retrieved. - string namespace = 1; - // TODO: support pagingation later -} - -message ListImageTemplatesResponse { - // A list of Compute returned. - repeated ImageTemplate image_templates = 1; -} - -message ListAllImageTemplatesRequest { - // TODO: support pagingation later -} - -message ListAllImageTemplatesResponse { - // A list of Compute returned. - repeated ImageTemplate image_templates = 1; -} - -message DeleteImageTemplateRequest { - // The name of the image template to be deleted. - string name = 1; - // The namespace of the image template to be deleted. - string namespace = 2; -} - -// ImageTemplate can be used by worker group and workspce. -// They can be distinguish by different entrypoints -message ImageTemplate { - // The ID of the image template - string name = 1; - // The namespace of the image template - string namespace = 2; - // The base container image to be used for image building - string base_image = 3; - // The pip packages to install - repeated string pip_packages = 4; - // The conda packages to install - repeated string conda_packages = 5; - // The system packages to install - repeated string system_packages = 6; - // The environment variables to set - map environment_variables = 7; - // The post install commands to execute - string custom_commands = 8; - // Output. 
The result image generated - string image = 9; -} - -message Status { - string error = 1; - int32 code = 2; - repeated google.protobuf.Any details = 3; + // Output. Type of this event (Normal, Warning), new types could be added in the future + string type = 8; + + // Output. The number of times this event has occurred. + int32 count = 9; } ``` @@ -413,5 +305,6 @@ The service will implement gPRC server as following graph shows. ## Implementation History - 2021-11-25: inital proposal accepted. +- 2022-12-01: new protobuf definition released. > Note: we should update doc when there's a large update. diff --git a/docs/guidance/gcs-ft.md b/docs/guidance/gcs-ft.md index 38940da056..3932936348 100644 --- a/docs/guidance/gcs-ft.md +++ b/docs/guidance/gcs-ft.md @@ -1,4 +1,4 @@ -## Ray GCS Fault Tolerance(GCS FT) (Alpha Release) +## Ray GCS Fault Tolerance (GCS FT) (Alpha Release) > Note: This feature is in alpha release, there are a few limitations and stabilization will be done in future release from both Ray and KubeRay side. diff --git a/docs/troubleshooting.md b/docs/troubleshooting.md index 362c1146d1..898ee8f96f 100644 --- a/docs/troubleshooting.md +++ b/docs/troubleshooting.md @@ -55,9 +55,6 @@ In above cases, you will need to check if the client ray version is compatible w For example, when you deployed `kuberay/ray-operator/config/samples/ray-cluster.mini.yaml`, you need to be aware that `spec.rayVersion` and images version is the same with your expect ray release and same with your ray client version. ---- **NOTE:** _In ray code, the version check will only go through major and minor version, so the python and ray image's minor version match is enough. Also the ray upstream community provide different python version support from 3.6 to 3.9, you can choose the image to match your python version._ - ---- \ No newline at end of file From 9f1270ba032c4e47eb3bfda344d2b31d7ccbbaf6 Mon Sep 17 00:00:00 2001 From: Chenyu Jiang <38214590+scarlet25151@users.noreply.github.com> Date: Thu, 1 Dec 2022 15:43:37 -0800 Subject: [PATCH 02/11] Update docs/best-practice/worker-head-reconnection.md Co-authored-by: Dmitri Gekhtman <62982571+DmitriGekhtman@users.noreply.github.com> Signed-off-by: Chenyu Jiang <38214590+scarlet25151@users.noreply.github.com> --- docs/best-practice/worker-head-reconnection.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/best-practice/worker-head-reconnection.md b/docs/best-practice/worker-head-reconnection.md index 84a35c1beb..8f25ffd68f 100644 --- a/docs/best-practice/worker-head-reconnection.md +++ b/docs/best-practice/worker-head-reconnection.md @@ -22,7 +22,7 @@ It retries 600 times and each interval is 1s, resulting in total 600s timeout, i ## Best Practice -GCS FT feature is now alpha release, for further understand we can rely on the FT feature. To enable the GCS, please refer to [Ray GCS Fault Tolerance](https://github.com/ray-project/kuberay/blob/master/docs/guidance/gcs-ft.md) +The GCS Fault-Tolerance (FT) feature is alpha release. 
To enable GCS FT, please refer to [Ray GCS Fault Tolerance](https://github.com/ray-project/kuberay/blob/master/docs/guidance/gcs-ft.md) Also, to solve the workers-head connection lost, there are two others options: From bf0bc4fe00245cb2535e784efa43dbdb82ff06ec Mon Sep 17 00:00:00 2001 From: Chenyu Jiang <38214590+scarlet25151@users.noreply.github.com> Date: Thu, 1 Dec 2022 15:44:06 -0800 Subject: [PATCH 03/11] Update docs/best-practice/worker-head-reconnection.md Co-authored-by: Dmitri Gekhtman <62982571+DmitriGekhtman@users.noreply.github.com> Signed-off-by: Chenyu Jiang <38214590+scarlet25151@users.noreply.github.com> --- docs/best-practice/worker-head-reconnection.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/best-practice/worker-head-reconnection.md b/docs/best-practice/worker-head-reconnection.md index 8f25ffd68f..40ed82716c 100644 --- a/docs/best-practice/worker-head-reconnection.md +++ b/docs/best-practice/worker-head-reconnection.md @@ -24,7 +24,7 @@ It retries 600 times and each interval is 1s, resulting in total 600s timeout, i The GCS Fault-Tolerance (FT) feature is alpha release. To enable GCS FT, please refer to [Ray GCS Fault Tolerance](https://github.com/ray-project/kuberay/blob/master/docs/guidance/gcs-ft.md) -Also, to solve the workers-head connection lost, there are two others options: +To reduce the chances of a lost worker-head connection, there are two other options: - Make head more stable: when creating the cluster, allocate sufficient amount of resources on head pod such that it tends to be stable and not easy to crash. You can also set {"num-cpus": "0"} in "rayStartParams" of "headGroupSpec" such that Ray scheduler will skip the head node when scheduling workloads. This also helps to maintain the stability of the head. From b92c280dc0cbf9179f489c8d58e9d77c736d8090 Mon Sep 17 00:00:00 2001 From: Chenyu Jiang <38214590+scarlet25151@users.noreply.github.com> Date: Thu, 1 Dec 2022 15:44:13 -0800 Subject: [PATCH 04/11] Update docs/design/protobuf-grpc-service.md Co-authored-by: Dmitri Gekhtman <62982571+DmitriGekhtman@users.noreply.github.com> Signed-off-by: Chenyu Jiang <38214590+scarlet25151@users.noreply.github.com> --- docs/design/protobuf-grpc-service.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/design/protobuf-grpc-service.md b/docs/design/protobuf-grpc-service.md index 1b4e9e0b33..e2f52c9a9d 100644 --- a/docs/design/protobuf-grpc-service.md +++ b/docs/design/protobuf-grpc-service.md @@ -8,7 +8,7 @@ There're few major blockers for users to use KubeRay Operator directly. - Using kubectl requires sophisticated permission system. Some kubernetes clusters do not enable user level authentication. In some companies, devops use loose RBAC management and corp SSO system is not integrated with Kubernetes OIDC at all. -Due to above reason, it's worth to build generic abstraction on top of RayCluster CRD. With the core API support, we can easily build backend services, cli, etc to bridge users without Kubernetes experiences to KubeRay. +For the above reasons, it's worth it to build a generic abstraction on top of the RayCluster CRD. With the core API support, we can easily build backend services, cli, etc to bridge users without Kubernetes experience to KubeRay. 
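+
+For instance, instead of hand-writing a RayCluster manifest for kubectl, a client could create a cluster with a single HTTP call (a sketch against the `ClusterService` endpoint defined below; the host, names, image, and compute template are illustrative only):
+
+```bash
+# POST maps to ClusterService.CreateCluster via grpc-gateway
+curl -X POST 'localhost:31888/apis/v1alpha2/namespaces/ray-system/clusters' \
+--header 'Content-Type: application/json' \
+--data '{
+  "name": "demo-cluster",
+  "namespace": "ray-system",
+  "user": "demo",
+  "clusterSpec": {
+    "headGroupSpec": {
+      "computeTemplate": "default-template",
+      "image": "rayproject/ray:2.1.0",
+      "rayStartParams": {"dashboard-host": "0.0.0.0"}
+    }
+  }
+}'
+```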
## Goals From fdff575d1e8429864cc1191b1c06488a5451c517 Mon Sep 17 00:00:00 2001 From: Chenyu Jiang <38214590+scarlet25151@users.noreply.github.com> Date: Thu, 1 Dec 2022 15:44:37 -0800 Subject: [PATCH 05/11] Update docs/design/protobuf-grpc-service.md Co-authored-by: Dmitri Gekhtman <62982571+DmitriGekhtman@users.noreply.github.com> Signed-off-by: Chenyu Jiang <38214590+scarlet25151@users.noreply.github.com> --- docs/design/protobuf-grpc-service.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/design/protobuf-grpc-service.md b/docs/design/protobuf-grpc-service.md index e2f52c9a9d..49c01738c5 100644 --- a/docs/design/protobuf-grpc-service.md +++ b/docs/design/protobuf-grpc-service.md @@ -12,7 +12,7 @@ For the above reasons, it's worth it to build a generic abstraction on top of th ## Goals -- The APIs definition should be flexible enough to support different kinds of clients (e.g. backend, cli etc). +- The API definition should be flexible enough to support different kinds of clients (e.g. backend, cli etc). - This backend service underneath should leverage generate clients to interact with existing RayCluster custom resources. - New added components should be plugable to existing operator. From 12ba008d9cb7f6666572aa23d2d92161061a7b52 Mon Sep 17 00:00:00 2001 From: Chenyu Jiang <38214590+scarlet25151@users.noreply.github.com> Date: Thu, 1 Dec 2022 15:44:47 -0800 Subject: [PATCH 06/11] Update docs/design/protobuf-grpc-service.md Co-authored-by: Dmitri Gekhtman <62982571+DmitriGekhtman@users.noreply.github.com> Signed-off-by: Chenyu Jiang <38214590+scarlet25151@users.noreply.github.com> --- docs/design/protobuf-grpc-service.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/design/protobuf-grpc-service.md b/docs/design/protobuf-grpc-service.md index 49c01738c5..b3bb41647d 100644 --- a/docs/design/protobuf-grpc-service.md +++ b/docs/design/protobuf-grpc-service.md @@ -44,7 +44,7 @@ In order to better define resources at the API level, a few proto files will be - Some of the Kubernetes API like `tolerance` and `node affinity` are too complicated to be converted to an API. - We want to leave some flexibility to use database to store history data in the near future (for example, pagination, list options etc). -We end up propsing a simple and easy API which can cover most of the daily requirements. +To resolve these issues, we provide a simple API which can cover most common use-cases. For example, the protobuf definition of the `RayCluster`: From 9126605b186087aa02fccf02430a2879df596b4a Mon Sep 17 00:00:00 2001 From: "chenyu.jiang" Date: Thu, 1 Dec 2022 15:49:50 -0800 Subject: [PATCH 07/11] remove dead codes and add roles in apiserver deployment --- apiserver/DEVELOPMENT.md | 2 +- apiserver/README.md | 19 +++++++------------ apiserver/deploy/base/apiserver.yaml | 1 + docs/design/protobuf-grpc-service.md | 12 ------------ docs/guidance/gcs-ft.md | 2 +- .../kuberay-apiserver/templates/role.yaml | 1 + 6 files changed, 11 insertions(+), 26 deletions(-) diff --git a/apiserver/DEVELOPMENT.md b/apiserver/DEVELOPMENT.md index 7e5f200728..15fe9cabda 100644 --- a/apiserver/DEVELOPMENT.md +++ b/apiserver/DEVELOPMENT.md @@ -38,7 +38,7 @@ localhost:8888 ``` ./docker-image-builder.sh ``` -This script will build and optionally push the image to the remote docker hub (hub.byted.org, TODO: make it configurable). +This script will build and optionally push the image to the remote docker hub (hub.byted.org). 
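+
+If your cluster pulls images from a different registry, a manual retag-and-push also works (a sketch; the exact image tag produced by the script is an assumption, so check the script output first):
+
+```bash
+# Retag the locally built apiserver image and push it to your own registry
+docker tag kuberay/apiserver:latest <your-registry>/kuberay/apiserver:latest
+docker push <your-registry>/kuberay/apiserver:latest
+```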
#### Start Service ``` kubectl apply -f deploy/ diff --git a/apiserver/README.md b/apiserver/README.md index 0eea3bc10d..27afe08841 100644 --- a/apiserver/README.md +++ b/apiserver/README.md @@ -44,13 +44,8 @@ kind create cluster --name ray-test ```bash helm -n ray-system install kuberay-apiserver kuberay/helm-chart/kuberay-apiserver ``` -or -``` -kubectl apply -k "github.com/ray-project/kuberay/apiserver/deploy/base?ref=${KUBERAY_VERSION}&timeout=90s -``` - -2. you can access by your nodeport +1. you can access by your nodeport ``` curl localhost:31888 @@ -64,11 +59,11 @@ curl -XPOST 'localhost:31888/apis/v1alpha2/namespaces/ray-system/services' \ --header 'Content-Type: application/json' \ --data '{ "name": "user-test-1", - "namespace": "default", - "user": "test", + "namespace": "ray-system", + "user": "user", "serveDeploymentGraphSpec": { - "importPath": "fruit.deployment_graph, - "runtimeEnv": "https://github.com/ray-project/test_dag/archive/c620251044717ace0a4c19d766d43c5099af8a77.zip\"\n", + "importPath": "fruit.deployment_graph", + "runtimeEnv": "working_dir: \"https://github.com/ray-project/test_dag/archive/c620251044717ace0a4c19d766d43c5099af8a77.zip\"\n", "serveConfigs": [ { "deploymentName": "OrangeStand", @@ -104,7 +99,7 @@ curl -XPOST 'localhost:31888/apis/v1alpha2/namespaces/ray-system/services' \ "clusterSpec": { "headGroupSpec": { "computeTemplate": "default-template", - "image": "hub.byted.org/kuberay/ray:2.0.0", + "image": "rayproject/ray:2.1.0", "serviceType": "NodePort", "rayStartParams": { "port": "6379", @@ -118,7 +113,7 @@ curl -XPOST 'localhost:31888/apis/v1alpha2/namespaces/ray-system/services' \ { "groupName": "small-wg", "computeTemplate": "default-template", - "image": "hub.byted.org/kuberay/ray:2.0.0", + "image": "rayproject/ray:2.1.0", "replicas": 1, "minReplicas": 0, "maxReplicas": 5, diff --git a/apiserver/deploy/base/apiserver.yaml b/apiserver/deploy/base/apiserver.yaml index d9c6fcd59c..7652425c7a 100644 --- a/apiserver/deploy/base/apiserver.yaml +++ b/apiserver/deploy/base/apiserver.yaml @@ -84,6 +84,7 @@ rules: resources: - rayclusters - rayjobs + - rayservices verbs: - create - delete diff --git a/docs/design/protobuf-grpc-service.md b/docs/design/protobuf-grpc-service.md index b3bb41647d..a8ddaf6b20 100644 --- a/docs/design/protobuf-grpc-service.md +++ b/docs/design/protobuf-grpc-service.md @@ -112,12 +112,6 @@ message ListClustersRequest { message ListClustersResponse { // A list of clusters returned. repeated Cluster clusters = 1; - - // The total number of clusters for the given query. - // int32 total_size = 2; - - // The token to list the next page of clusters. - // string next_page_token = 3; } message ListAllClustersRequest {} @@ -125,12 +119,6 @@ message ListAllClustersRequest {} message ListAllClustersResponse { // A list of clusters returned. repeated Cluster clusters = 1; - - // The total number of clusters for the given query. - // int32 total_size = 2; - - // The token to list the next page of clusters. - // string next_page_token = 3; } message DeleteClusterRequest { diff --git a/docs/guidance/gcs-ft.md b/docs/guidance/gcs-ft.md index 3932936348..ade2fc1bb2 100644 --- a/docs/guidance/gcs-ft.md +++ b/docs/guidance/gcs-ft.md @@ -1,6 +1,6 @@ ## Ray GCS Fault Tolerance (GCS FT) (Alpha Release) -> Note: This feature is in alpha release, there are a few limitations and stabilization will be done in future release from both Ray and KubeRay side. +> Note: This feature is alpha. 
Ray GCS FT enables GCS server to use external storage backend. As a result, Ray clusters can tolerant GCS failures and recover from failures without affecting important services such as detached Actors & RayServe deployments. diff --git a/helm-chart/kuberay-apiserver/templates/role.yaml b/helm-chart/kuberay-apiserver/templates/role.yaml index bc57e3cd53..41ba652cf2 100644 --- a/helm-chart/kuberay-apiserver/templates/role.yaml +++ b/helm-chart/kuberay-apiserver/templates/role.yaml @@ -12,6 +12,7 @@ rules: resources: - rayclusters - rayjobs + - rayservices verbs: - create - delete From 7ecfc88397dea44187961d2bc81e48b36d0f6229 Mon Sep 17 00:00:00 2001 From: Chenyu Jiang <38214590+scarlet25151@users.noreply.github.com> Date: Thu, 1 Dec 2022 18:51:19 -0800 Subject: [PATCH 08/11] Update apiserver/README.md Co-authored-by: Dmitri Gekhtman <62982571+DmitriGekhtman@users.noreply.github.com> Signed-off-by: Chenyu Jiang <38214590+scarlet25151@users.noreply.github.com> --- apiserver/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/apiserver/README.md b/apiserver/README.md index 27afe08841..fa391c149b 100644 --- a/apiserver/README.md +++ b/apiserver/README.md @@ -21,7 +21,7 @@ The KubeRay APIServer provides gRPC and HTTP APIs to manage KubeRay resources. ## Usage -You can just install the KubeRay APIServer within the same kubernetes cluster by using the [helm chart](https://github.com/ray-project/kuberay/tree/master/helm-chart/kuberay-apiserver) or just use [kustomize](https://github.com/ray-project/kuberay/tree/master/apiserver/deploy/base) +You can install the KubeRay APIServer by using the [helm chart](https://github.com/ray-project/kuberay/tree/master/helm-chart/kuberay-apiserver) or [kustomize](https://github.com/ray-project/kuberay/tree/master/apiserver/deploy/base) After the deployment we may use the `{{baseUrl}}` to access the From c7ef4437b91d8b0b722c98e6dc90b4ca156986e1 Mon Sep 17 00:00:00 2001 From: Chenyu Jiang <38214590+scarlet25151@users.noreply.github.com> Date: Thu, 1 Dec 2022 18:51:29 -0800 Subject: [PATCH 09/11] Update apiserver/README.md Co-authored-by: Dmitri Gekhtman <62982571+DmitriGekhtman@users.noreply.github.com> Signed-off-by: Chenyu Jiang <38214590+scarlet25151@users.noreply.github.com> --- apiserver/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/apiserver/README.md b/apiserver/README.md index fa391c149b..2ad7be00ba 100644 --- a/apiserver/README.md +++ b/apiserver/README.md @@ -52,7 +52,7 @@ curl localhost:31888 {"code":5, "message":"Not Found"} ``` -3. you can just create `RayCluster` or `RayJobs` or `RayService` by just dials the endpoints +3. 
you can create `RayCluster` or `RayJobs` or `RayService` by dialing the endpoints

From b589902471e8f3350ff8ec1396b615c5d6f55073 Mon Sep 17 00:00:00 2001
From: Chenyu Jiang <38214590+scarlet25151@users.noreply.github.com>
Date: Thu, 1 Dec 2022 18:51:39 -0800
Subject: [PATCH 10/11] Update apiserver/README.md

Co-authored-by: Dmitri Gekhtman <62982571+DmitriGekhtman@users.noreply.github.com>
Signed-off-by: Chenyu Jiang <38214590+scarlet25151@users.noreply.github.com>
---
 apiserver/README.md | 48 ++++++++++++++++++++++++++++++++++++---------
 1 file changed, 39 insertions(+), 9 deletions(-)

diff --git a/apiserver/README.md b/apiserver/README.md
index 2ad7be00ba..5c9b30a31a 100644
--- a/apiserver/README.md
+++ b/apiserver/README.md
@@ -31,12 +31,44 @@ After the deployment, you can use `{{baseUrl}}` to access the service:

 The request parameters are described in detail in the [KubeRay swagger](https://github.com/ray-project/kuberay/tree/master/proto/swagger); here we only present some basic examples:

-## Setup end-to-end test
+### Setup end-to-end test

-0. (optional) you may use your local kind cluster or minikube
+0. (Optional) You may use your local kind cluster or minikube

 ```bash
-kind create cluster --name ray-test
+cat <<EOF | kind create cluster --name ray-test --config -
+kind: Cluster
+apiVersion: kind.x-k8s.io/v1alpha4
+nodes:
+- role: control-plane
+  extraPortMappings:
+  - containerPort: 31888
+    hostPort: 31888
+    listenAddress: "0.0.0.0"
+EOF
 ```

From: "chenyu.jiang" 
Date: Fri, 2 Dec 2022 14:12:11 -0800
Subject: [PATCH 11/11] add explain for compute template

---
 apiserver/README.md | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/apiserver/README.md b/apiserver/README.md
index 5c9b30a31a..cc6e1fb2ae 100644
--- a/apiserver/README.md
+++ b/apiserver/README.md
@@ -86,7 +86,16 @@ curl localhost:31888

 3. You can create `RayCluster`, `RayJobs` or `RayService` by dialing the endpoints. The following is a simple example for creating the `RayService` object; follow [swagger support](https://ray-project.github.io/kuberay/components/apiserver/#swagger-support) to get the complete definitions of APIs.

-```
+```shell
+curl -X POST 'localhost:31888/apis/v1alpha2/namespaces/ray-system/compute_templates' \
+--header 'Content-Type: application/json' \
+--data '{
+  "name": "default-template",
+  "namespace": "ray-system",
+  "cpu": 2,
+  "memory": 4
+}'
+
 curl -X POST 'localhost:31888/apis/v1alpha2/namespaces/ray-system/services' \
 --header 'Content-Type: application/json' \
 --data '{
@@ -157,8 +166,15 @@

 ```
 The Ray resource will then be created in your Kubernetes cluster.

+## Full definition of payload
+
 ### Compute Template

+To simplify resource configuration, we abstract the pod template resources
+into a `compute template`: you define the resources once in a `compute template`
+and then choose the appropriate template for the `head` and `workergroup`
+when creating the actual `RayCluster`, `RayJobs` or `RayService` objects.
+
 #### Create compute templates in a given namespace

 ```
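# A minimal sketch of a compute template create request, mirroring the
# compute_templates example above; the endpoint and fields follow the
# v1alpha2 proto, and <namespace> is a placeholder to fill in.
curl -X POST 'localhost:31888/apis/v1alpha2/namespaces/<namespace>/compute_templates' \
--header 'Content-Type: application/json' \
--data '{
  "name": "default-template",
  "namespace": "<namespace>",
  "cpu": 2,
  "memory": 4
}'
```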