Add RayJob docs and development docs #404

Merged · 2 commits · Jul 25, 2022

Changes from 1 commit
50 changes: 10 additions & 40 deletions README.md
@@ -3,15 +3,13 @@
[![Build Status](https://github.com/ray-project/kuberay/workflows/Go-build-and-test/badge.svg)](https://github.com/ray-project/kuberay/actions)
[![Go Report Card](https://goreportcard.com/badge/github.com/ray-project/kuberay)](https://goreportcard.com/report/github.com/ray-project/kuberay)

KubeRay is an open source toolkit to run Ray applications on Kubernetes. It provides several tools to improve the experience of running and managing Ray on Kubernetes.

- Ray Operator
- Backend services to create/delete cluster resources
- Kubectl plugin/CLI to operate CRD objects
- Native Job and Serving integration with Clusters (incubating)
- Kubernetes event dumper for ray clusters/pod/services (future work)
- Operator Integration with Kubernetes node problem detector (future work)

@@ -23,55 +21,27 @@ You can view detailed documentation and guides at [https://ray-project.github.io

### Use Yaml

Please choose the version you would like to install. We will use the nightly version (`master`) as an example.

| Version | Stable | Suggested Kubernetes Version |
|---------|:------:|-----------------------------:|
| master  |   N    |              v1.23 and above |
| v0.2.0  |  Yes   |                 v1.19 - 1.22 |

```
export KUBERAY_VERSION=master
kubectl create -k "github.com/ray-project/kuberay/manifests/cluster-scope-resources?ref=${KUBERAY_VERSION}"
kubectl apply -k "github.com/ray-project/kuberay/manifests/base?ref=${KUBERAY_VERSION}"
```

> Observe that we must use `kubectl create` to install cluster-scoped resources.
> The corresponding `kubectl apply` command will not work. See [KubeRay issue #271](https://github.com/ray-project/kuberay/issues/271).

#### Single Namespace version

Some users may only have access to a single namespace when deploying KubeRay. To deploy KubeRay in a single namespace, use the following commands.

```
# Nightly version
export KUBERAY_NAMESPACE=<my-awesome-namespace>
# executed by cluster admin
kustomize build "github.com/ray-project/kuberay/manifests/overlays/single-namespace-resources" | envsubst | kubectl create -f -
# executed by user
kustomize build "github.com/ray-project/kuberay/manifests/overlays/single-namespace" | envsubst | kubectl apply -f -

```

### Use helm chart

A Helm chart is a collection of files that describe a related set of Kubernetes resources. It helps users deploy the ray-operator and Ray clusters conveniently.
Please read [kuberay-operator](helm-chart/kuberay-operator/README.md) to deploy an operator and [ray-cluster](helm-chart/ray-cluster/README.md) to deploy a custom cluster.
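For example, a minimal install from this repo's chart directories might look like the following sketch; the release names are illustrative, and the chart READMEs above are the authoritative reference.

```shell
# Install the KubeRay operator from the local chart
helm install kuberay-operator helm-chart/kuberay-operator

# Then install a Ray cluster from the local chart
helm install ray-cluster helm-chart/ray-cluster
```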

### Monitor

We have added a parameter `--metrics-expose-port=8080` to open the port and expose metrics for both the Ray cluster and our control plane. We also leverage the [Prometheus Operator](https://github.com/prometheus-operator/prometheus-operator) to start the whole monitoring system.

You can quickly deploy the monitoring stack on your own Kubernetes cluster using the scripts in `install`:
```shell
./install/prometheus/install.sh
```
It will set up the Prometheus stack and deploy the related service monitors in `config/prometheus`.

You can then use the JSON files in `config/grafana` to generate the dashboards.

## Development

Please read our [CONTRIBUTING](CONTRIBUTING.md) guide before making a pull request. Refer to our [DEVELOPMENT](./ray-operator/DEVELOPMENT.md) guide to build and run tests locally.
14 changes: 14 additions & 0 deletions docs/deploy/installation.md
@@ -16,3 +16,17 @@ kubectl apply -k "github.com/ray-project/kuberay/manifests/base?ref=v0.2.0"

> Observe that we must use `kubectl create` to install cluster-scoped resources.
> The corresponding `kubectl apply` command will not work. See [KubeRay issue #271](https://github.com/ray-project/kuberay/issues/271).

#### Single Namespace version

Some users may only have access to a single namespace when deploying KubeRay. To deploy KubeRay in a single namespace, use the following commands.

```
# Nightly version
export KUBERAY_NAMESPACE=<my-awesome-namespace>
# executed by cluster admin
kustomize build "github.com/ray-project/kuberay/manifests/overlays/single-namespace-resources" | envsubst | kubectl create -f -
# executed by user
kustomize build "github.com/ray-project/kuberay/manifests/overlays/single-namespace" | envsubst | kubectl apply -f -
```
57 changes: 57 additions & 0 deletions docs/development/development.md
@@ -0,0 +1,57 @@
## KubeRay Development Guidance

Clone this repo locally:

```
mkdir -p $GOPATH/src/github.com/ray-project
cd $GOPATH/src/github.com/ray-project
git clone https://github.com/ray-project/kuberay.git
```

### Develop proto and OpenAPI

Generate the Go clients and the Swagger file:

```
make generate
```

### Develop KubeRay Operator

```
cd ray-operator

# Build the code
make build

# Run tests
make test

# Build container image
make docker-build
```

### Develop KubeRay APIServer

```
cd apiserver

# Build code
go build cmd/main.go
```

### Develop KubeRay CLI

```
cd cli
go build -o kuberay -a main.go
./kuberay help
```

### Deploy Docs locally

You don't need to set up a local `mkdocs` environment; to check the static website locally, run:

```
# Serve the docs at http://localhost:8000
docker run --rm -it -p 8000:8000 -v ${PWD}:/docs squidfunk/mkdocs-material serve --dev-addr=0.0.0.0:8000
```
File renamed without changes.
4 changes: 2 additions & 2 deletions docs/guidance/gcs-ha.md
@@ -24,7 +24,7 @@ metadata:
ray.io/external-storage-namespace: "my-raycluster-storage-namespace" # <- optional, to specify the external storage namespace
...
```
An example can be found at [ray-cluster.external-redis.yaml](https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-cluster.external-redis.yaml)

When the annotation `ray.io/ha-enabled` is added with a `true` value, KubeRay will enable the Ray GCS HA feature. This feature contains several components:

@@ -65,7 +65,7 @@ you need to add the `RAY_REDIS_ADDRESS` environment variable to the head node template

Also, you can specify a storage namespace for your Ray cluster by using the annotation `ray.io/external-storage-namespace`.

An example can be found at [ray-cluster.external-redis.yaml](https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-cluster.external-redis.yaml)

#### KubeRay Operator Controller

14 changes: 14 additions & 0 deletions docs/guidance/observability.md
@@ -0,0 +1,14 @@
# Observability

### Monitor

We have added a parameter `--metrics-expose-port=8080` to open the port and expose metrics for both the Ray cluster and our control plane. We also leverage the [Prometheus Operator](https://github.com/prometheus-operator/prometheus-operator) to start the whole monitoring system.

You can quickly deploy the monitoring stack on your own Kubernetes cluster using the scripts in `install`:

```shell
./install/prometheus/install.sh
```
It will set up the Prometheus stack and deploy the related service monitors in `config/prometheus`.

You can then use the JSON files in `config/grafana` to generate the dashboards.
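For reference, the service monitors created by the install script have roughly the following shape. This is a hedged sketch rather than the exact manifest shipped in `config/prometheus`: the names, labels, and namespaces below are illustrative assumptions.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kuberay-operator-monitor   # illustrative name
  namespace: prometheus-system     # assumes the namespace of the Prometheus stack
spec:
  # Select the KubeRay operator service; this label is an assumption.
  selector:
    matchLabels:
      app.kubernetes.io/name: kuberay
  namespaceSelector:
    matchNames:
      - ray-system                 # assumes KubeRay's default namespace
  endpoints:
    - port: metrics                # the service port backing --metrics-expose-port=8080
```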
129 changes: 129 additions & 0 deletions docs/guidance/rayjob.md
@@ -0,0 +1,129 @@
## Ray Job (alpha)
> **Review comment (Collaborator):** We should explain what the operational difference is between a Ray Service and a Ray Job.
>
> **Author reply:** Added some details. The cluster can be deleted by the RayJob controller once the job succeeds or fails.

> Note: This is the alpha version of Ray Job support in KubeRay. There will be ongoing improvements for Ray Job in future releases.

### Prerequisites

* Ray 1.10 and above.
* KubeRay v0.3.0 or master

### What is a RayJob?

The RayJob is a new custom resource (CR) supported by KubeRay in v0.3.0.

A RayJob manages 2 things:
* RayCluster: Manages the resources in the Kubernetes cluster.
* Job: Manages the Ray job submitted to the RayCluster.
> **Review comment (Collaborator):** We should probably link to the Ray docs to explain what a Ray Serve Deployment Graph is.
> cc @simon-mo @brucez-anyscale
>
> **Author reply:** I originally copied some content and structure from RayService and forgot to clean it up. Now it's in good shape.

### What does the RayJob provide?

* Kubernetes-native support for Ray clusters and Ray jobs. You can use a Kubernetes config to define a Ray cluster and its job, and use `kubectl` to create them. The cluster can be deleted automatically by the RayJob controller once the job succeeds or fails.

### Deploy KubeRay

Make sure KubeRay v0.3.0 is deployed in your cluster.
For installation details, please check the [installation guidance](../deploy/installation.md).

### Run an example Job

An example config file to deploy a RayJob is included here:
[ray_v1alpha1_rayjob.yaml](https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray_v1alpha1_rayjob.yaml)

```shell
# Create a ray job.
$ kubectl apply -f config/samples/ray_v1alpha1_rayjob.yaml
```

```shell
# List running RayJobs.
$ kubectl get rayjob
NAME AGE
rayjob-sample 7s
```

```shell
# The RayJob sample above will create a RayCluster.
# The RayCluster will create a few resources, including pods and services. You can check them with the following commands:
$ kubectl get rayclusters
$ kubectl get pod
```

### RayJob Configuration

- `entrypoint` - The shell command to run for this job.
- `jobId` - Optional. Job ID to specify for the job. If not provided, one will be generated.
- `metadata` - Arbitrary user-provided metadata for the job.
- `runtimeEnv` - base64-encoded string of the runtime environment JSON string.
- `shutdownAfterJobFinishes` - Whether to delete the cluster after the job finishes.
- `ttlSecondsAfterFinished` - TTL to clean up the cluster. This only works if `shutdownAfterJobFinishes` is set.
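
For illustration, a minimal RayJob manifest might look like the following. This is a hedged sketch rather than the shipped sample: the entrypoint, the encoded runtime environment, and the abbreviated cluster spec below are illustrative assumptions; refer to [ray_v1alpha1_rayjob.yaml](https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray_v1alpha1_rayjob.yaml) for the complete sample.

```yaml
apiVersion: ray.io/v1alpha1
kind: RayJob
metadata:
  name: rayjob-sample
spec:
  entrypoint: python /tmp/code/script.py     # illustrative entrypoint
  # base64 of the runtime_env JSON, here {"pip": ["requests"]}
  runtimeEnv: eyJwaXAiOiBbInJlcXVlc3RzIl19
  shutdownAfterJobFinishes: true             # delete the cluster once the job finishes
  ttlSecondsAfterFinished: 60                # only honored when shutdownAfterJobFinishes is set
  # rayClusterSpec follows the RayCluster spec; this fragment is abbreviated.
  rayClusterSpec:
    rayVersion: '1.13.0'
    headGroupSpec:
      serviceType: ClusterIP
      rayStartParams:
        dashboard-host: '0.0.0.0'
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:1.13.0
    workerGroupSpecs:
      - groupName: small-group
        replicas: 1
        minReplicas: 1
        maxReplicas: 1
        rayStartParams: {}
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray:1.13.0
```

You can generate the `runtimeEnv` value with, for example, `echo -n '{"pip": ["requests"]}' | base64`.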

### RayJob Observability

You can use `kubectl logs` to check the operator logs or the head/worker node logs.
You can also use `kubectl describe rayjobs rayjob-sample` to check the states and event logs of your RayJob instance:

```
Status:
Dashboard URL: rayjob-sample-raycluster-vnl8w-head-svc.ray-system.svc.cluster.local:8265
End Time: 2022-07-24T02:04:56Z
Job Deployment Status: Complete
Job Id: test-hehe
Job Status: SUCCEEDED
Message: Job finished successfully.
Ray Cluster Name: rayjob-sample-raycluster-vnl8w
Ray Cluster Status:
Available Worker Replicas: 1
Endpoints:
Client: 32572
Dashboard: 32276
Gcs - Server: 30679
Last Update Time: 2022-07-24T02:04:43Z
State: ready
Start Time: 2022-07-24T02:04:49Z
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Created 90s rayjob-controller Created cluster rayjob-sample-raycluster-vnl8w
Normal Submitted 82s rayjob-controller Submit Job test-hehe
Normal Deleted 15s rayjob-controller Deleted cluster rayjob-sample-raycluster-vnl8w
```


If the job cannot run successfully, you can also see the failure from the status.
```
Status:
Dashboard URL: rayjob-sample-raycluster-nrdm8-head-svc.ray-system.svc.cluster.local:8265
End Time: 2022-07-24T02:01:39Z
Job Deployment Status: Complete
Job Id: test-hehe
Job Status: FAILED
Message: Job failed due to an application error, last available logs:
python: can't open file '/tmp/code/script.ppy': [Errno 2] No such file or directory

Ray Cluster Name: rayjob-sample-raycluster-nrdm8
Ray Cluster Status:
Available Worker Replicas: 1
Endpoints:
Client: 31852
Dashboard: 32606
Gcs - Server: 32436
Last Update Time: 2022-07-24T02:01:30Z
State: ready
Start Time: 2022-07-24T02:01:38Z
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Created 2m9s rayjob-controller Created cluster rayjob-sample-raycluster-nrdm8
Normal Submitted 2m rayjob-controller Submit Job test-hehe
Normal Deleted 58s rayjob-controller Deleted cluster rayjob-sample-raycluster-nrdm8
```


### Delete the RayJob instance

```shell
$ kubectl delete -f config/samples/ray_v1alpha1_rayjob.yaml
```
2 changes: 1 addition & 1 deletion docs/index.md
@@ -31,8 +31,8 @@ KubeRay provides several tools to improve running and managing Ray's experience
- Ray Operator
- Backend services to create/delete cluster resources
- Kubectl plugin/CLI to operate CRD objects
- Native Job and Serving integration with Clusters (incubating)
- Kubernetes event dumper for ray clusters/pod/services (future work)
- Operator Integration with Kubernetes node problem detector (future work)

5 changes: 4 additions & 1 deletion mkdocs.yml
@@ -29,17 +29,20 @@ nav:
- KubeRay CLI: components/cli.md
- Features:
- RayService: guidance/rayservice.md
- RayJob: guidance/rayjob.md
- Ray GCS HA: guidance/gcs-ha.md
- Autoscaling: guidance/autoscaler.md
- Ingress: guidance/ingress.md
- Observability: guidance/observability.md
- Best Practice:
- Worker reconnection: best-practice/worker-head-reconnection.md
- Troubleshooting:
- Guidance: troubleshooting.md
- Designs:
- Core API and Backend Service: design/protobuf-grpc-service.md
- Development:
- Development: development/development.md
- Release: development/release.md

# Customization
extra: