Add RayJob docs and development docs (#404)
* Add RayJob docs and development docs

* Address code review feedback
Jeffwan authored Jul 25, 2022
1 parent 8dcc8a7 commit de0b21c
Showing 9 changed files with 229 additions and 44 deletions.
50 changes: 10 additions & 40 deletions README.md
@@ -3,15 +3,13 @@
[![Build Status](https://github.com/ray-project/kuberay/workflows/Go-build-and-test/badge.svg)](https://github.com/ray-project/kuberay/actions)
[![Go Report Card](https://goreportcard.com/badge/github.com/ray-project/kuberay)](https://goreportcard.com/report/github.com/ray-project/kuberay)

KubeRay is an open source toolkit to run Ray applications on Kubernetes.

KubeRay provides several tools to improve running and managing Ray's experience on Kubernetes.
KubeRay is an open source toolkit to run Ray applications on Kubernetes. It provides several tools to improve running and managing Ray on Kubernetes.

- Ray Operator
- Backend services to create/delete cluster resources
- Kubectl plugin/CLI to operate CRD objects
- Native Job and Serving integration with Clusters
- Data Scientist centric workspace for fast prototyping (incubating)
- Native Job and Serving integration with Clusters (incubating)
- Kubernetes event dumper for ray clusters/pod/services (future work)
- Operator Integration with Kubernetes node problem detector (future work)

@@ -23,55 +23,27 @@ You can view detailed documentation and guides at [https://ray-project.github.io

### Use YAML

#### Nightly version

```
kubectl create -k "github.com/ray-project/kuberay/manifests/cluster-scope-resources"
kubectl apply -k "github.com/ray-project/kuberay/manifests/base"
```
Please choose the version you would like to install. We will use the nightly version `master` as an example:

#### Stable version
| Version | Stable | Suggested Kubernetes Version |
|----------|:-------:|------------------------------:|
| master | No | v1.23 and above |
| v0.2.0 | Yes | v1.19 - 1.22 |

```
kubectl create -k "github.com/ray-project/kuberay/manifests/cluster-scope-resources?ref=v0.2.0"
kubectl apply -k "github.com/ray-project/kuberay/manifests/base?ref=v0.2.0"
export KUBERAY_VERSION=master
kubectl create -k "github.com/ray-project/kuberay/manifests/cluster-scope-resources?ref=${KUBERAY_VERSION}"
kubectl apply -k "github.com/ray-project/kuberay/manifests/base?ref=${KUBERAY_VERSION}"
```

> Observe that we must use `kubectl create` to install cluster-scoped resources.
> The corresponding `kubectl apply` command will not work. See [KubeRay issue #271](https://github.com/ray-project/kuberay/issues/271).

#### Single Namespace version

A user may only have access to a single namespace while deploying KubeRay. To deploy KubeRay in a single namespace, use the following commands.

```
# Nightly version
export KUBERAY_NAMESPACE=<my-awesome-namespace>
# executed by cluster admin
kustomize build "github.com/ray-project/kuberay/manifests/overlays/single-namespace-resources" | envsubst | kubectl create -f -
# executed by user
kustomize build "github.com/ray-project/kuberay/manifests/overlays/single-namespace" | envsubst | kubectl apply -f -
```

### Use helm chart

A Helm chart is a collection of files that describe a related set of Kubernetes resources. It helps users deploy the ray-operator and Ray clusters conveniently.
Please read [kuberay-operator](helm-chart/kuberay-operator/README.md) to deploy an operator and [ray-cluster](helm-chart/ray-cluster/README.md) to deploy a custom cluster.
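
For illustration, a minimal sketch of a Helm-based install, assuming a local checkout of this repository and default chart values (the chart paths and release names here are just examples):

```shell
# Sketch only: install the operator chart, then a demo Ray cluster, from a local checkout.
helm install kuberay-operator helm-chart/kuberay-operator
helm install ray-cluster helm-chart/ray-cluster
```

See the linked chart READMEs for the supported values and namespaces.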

### Monitor

We have added a parameter `--metrics-expose-port=8080` to open the port and expose metrics for both the Ray cluster and our control plane. We also leverage the [Prometheus Operator](https://github.com/prometheus-operator/prometheus-operator) to start the whole monitoring system.

You can quickly deploy the monitoring stack on your own Kubernetes cluster by using the install script:
```shell
./install/prometheus/install.sh
```
It will set up the Prometheus stack and deploy the related service monitors in `config/prometheus`.

You can then use the JSON files in `config/grafana` to generate the dashboards.

## Development

Please read our [CONTRIBUTING](CONTRIBUTING.md) guide before making a pull request. Refer to our [DEVELOPMENT](./ray-operator/DEVELOPMENT.md) to build and run tests locally.
14 changes: 14 additions & 0 deletions docs/deploy/installation.md
@@ -16,3 +16,17 @@ kubectl apply -k "github.com/ray-project/kuberay/manifests/base?ref=v0.2.0"

> Observe that we must use `kubectl create` to install cluster-scoped resources.
> The corresponding `kubectl apply` command will not work. See [KubeRay issue #271](https://github.com/ray-project/kuberay/issues/271).

#### Single Namespace version

A user may only have access to a single namespace while deploying KubeRay. To deploy KubeRay in a single namespace, use the following commands.

```
# Nightly version
export KUBERAY_NAMESPACE=<my-awesome-namespace>
# executed by cluster admin
kustomize build "github.com/ray-project/kuberay/manifests/overlays/single-namespace-resources" | envsubst | kubectl create -f -
# executed by user
kustomize build "github.com/ray-project/kuberay/manifests/overlays/single-namespace" | envsubst | kubectl apply -f -
```
57 changes: 57 additions & 0 deletions docs/development/development.md
@@ -0,0 +1,57 @@
## KubeRay Development Guidance

Clone this repository locally:

```
mkdir -p $GOPATH/src/github.com/ray-project
cd $GOPATH/src/github.com/ray-project
git clone https://github.com/ray-project/kuberay.git
```

### Develop proto and OpenAPI

Generate the Go clients and the Swagger file:

```
make generate
```

### Develop KubeRay Operator

```
cd ray-operator
# Build the code
make build
# Run tests
make test
# Build container image
make docker-build
```

### Develop KubeRay APIServer

```
cd apiserver
# Build code
go build cmd/main.go
```

### Develop KubeRay CLI

```
cd cli
go build -o kuberay -a main.go
./kuberay help
```

### Deploy Docs locally

You don't need to configure an `mkdocs` environment; to check the static website locally, run this command:

```
# Serves the docs at http://localhost:8000 (the image's default command runs `mkdocs serve`)
docker run --rm -it -p 8000:8000 -v ${PWD}:/docs squidfunk/mkdocs-material
```
File renamed without changes.
4 changes: 2 additions & 2 deletions docs/guidance/gcs-ha.md
@@ -24,7 +24,7 @@ metadata:
ray.io/external-storage-namespace: "my-raycluster-storage-namespace" # <- optional, to specify the external storage namespace
...
```
An example can be found at [ray-cluster.external-redis.yaml](../../ray-operator/config/samples/ray-cluster.external-redis.yaml)
An example can be found at [ray-cluster.external-redis.yaml](https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-cluster.external-redis.yaml)

When annotation `ray.io/ha-enabled` is added with a `true` value, KubeRay will enable Ray GCS HA feature. This feature
contains several components:
@@ -65,7 +65,7 @@ you need to add `RAY_REDIS_ADDRESS` environment variable to the head node templa

Also, you can specify a storage namespace for your Ray cluster by using an annotation `ray.io/external-storage-namespace`

An example can be found at [ray-cluster.external-redis.yaml](../../ray-operator/config/samples/ray-cluster.external-redis.yaml)
An example can be found at [ray-cluster.external-redis.yaml](https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-cluster.external-redis.yaml)

#### KubeRay Operator Controller

14 changes: 14 additions & 0 deletions docs/guidance/observability.md
@@ -0,0 +1,14 @@
# Observability

### Monitor

We have added a parameter `--metrics-expose-port=8080` to open the port and expose metrics for both the Ray cluster and our control plane. We also leverage the [Prometheus Operator](https://github.com/prometheus-operator/prometheus-operator) to start the whole monitoring system.

You can quickly deploy the monitoring stack on your own Kubernetes cluster by using the install script:

```shell
./install/prometheus/install.sh
```
It will set up the Prometheus stack and deploy the related service monitors in `config/prometheus`.

You can then use the JSON files in `config/grafana` to generate the dashboards.
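
As a quick sanity check, you can scrape the exposed metrics endpoint directly. This is only a sketch: the deployment name `kuberay-operator` and the `ray-system` namespace below are assumptions based on the default manifests.

```shell
# Sketch only: forward the assumed metrics port of the operator and fetch the Prometheus metrics.
kubectl port-forward deploy/kuberay-operator 8080:8080 -n ray-system &
curl http://localhost:8080/metrics
```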
127 changes: 127 additions & 0 deletions docs/guidance/rayjob.md
@@ -0,0 +1,127 @@
## Ray Job (alpha)

> Note: This is the alpha version of Ray Job support in KubeRay. There will be ongoing improvements for Ray Job in future releases.

### Prerequisites

* Ray 1.10 and above.
* KubeRay v0.3.0 or master

### What is a RayJob?

The RayJob is a new custom resource (CR) supported by KubeRay in v0.3.0.

A RayJob manages two things:
* RayCluster: manages the resources in the Kubernetes cluster.
* Job: manages the user's job in the Ray cluster.

### What does the RayJob provide?

* Kubernetes-native support for Ray clusters and Ray jobs. You can use a Kubernetes config to define a Ray cluster and the jobs that run in it, then use `kubectl` to create the cluster and its job. The cluster can be deleted automatically after the job finishes.


### Deploy KubeRay

Make sure KubeRay v0.3.0 is deployed in your cluster.
For installation details, please check the [installation guidance](../deploy/installation.md).
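
For example, following the nightly path from the installation guidance (a sketch; pin a released ref instead of `master` once v0.3.0 is tagged):

```shell
# Sketch only: install the cluster-scoped resources, then the operator, from the nightly manifests.
kubectl create -k "github.com/ray-project/kuberay/manifests/cluster-scope-resources"
kubectl apply -k "github.com/ray-project/kuberay/manifests/base"
```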

### Run an example Job

There is an example config file to deploy a RayJob included here:
[ray_v1alpha1_rayjob.yaml](https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray_v1alpha1_rayjob.yaml)

```shell
# Create a ray job.
$ kubectl apply -f config/samples/ray_v1alpha1_rayjob.yaml
```

```shell
# List running RayJobs.
$ kubectl get rayjob
NAME AGE
rayjob-sample 7s
```

```shell
# The RayJob sample creates a RayCluster.
# The RayCluster creates several resources, including pods and services; you can check them with the following commands.
$ kubectl get rayclusters
$ kubectl get pod
```

### RayJob Configuration

- `entrypoint` - The shell command to run for this job.
- `jobId` - Optional. The job ID to use for the job. If not provided, one will be generated.
- `metadata` - Arbitrary user-provided metadata for the job.
- `runtimeEnv` - A base64-encoded string of the runtime environment JSON string.
- `shutdownAfterJobFinishes` - Whether to recycle the cluster after the job finishes.
- `ttlSecondsAfterFinished` - The TTL for cleaning up the cluster. This only takes effect if `shutdownAfterJobFinishes` is set.
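
As an illustration, a RayJob manifest using these fields might look like the following sketch. The `apiVersion`/`kind` and the commented-out fields are assumptions here; refer to [ray_v1alpha1_rayjob.yaml](https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray_v1alpha1_rayjob.yaml) for the authoritative layout.

```shell
# Sketch only: field names come from the list above; apiVersion/kind and the
# commented-out fields are assumptions -- see the sample YAML for a complete spec.
cat <<'EOF' | kubectl apply -f -
apiVersion: ray.io/v1alpha1
kind: RayJob
metadata:
  name: rayjob-sample
spec:
  entrypoint: python /home/ray/samples/sample_code.py
  # runtimeEnv: <base64 of the runtime_env JSON string>
  shutdownAfterJobFinishes: true
  ttlSecondsAfterFinished: 60
  # rayClusterSpec: <RayCluster template for the cluster that runs this job>
EOF
```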

### RayJob Observability

You can use `kubectl logs` to check the operator logs or the head/worker node logs.
You can also use `kubectl describe rayjobs rayjob-sample` to check the states and event logs of your RayJob instance.
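
For example (a sketch only: the `ray-system` namespace, operator deployment name, and generated pod name below are assumptions):

```shell
# Sketch only: operator logs, head-node logs, and the RayJob status/events.
kubectl logs -n ray-system deploy/kuberay-operator
kubectl logs rayjob-sample-raycluster-vnl8w-head-xxxxx
kubectl describe rayjobs rayjob-sample
```

The `kubectl describe` output of a successfully finished RayJob looks like this: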

```
Status:
Dashboard URL: rayjob-sample-raycluster-vnl8w-head-svc.ray-system.svc.cluster.local:8265
End Time: 2022-07-24T02:04:56Z
Job Deployment Status: Complete
Job Id: test-hehe
Job Status: SUCCEEDED
Message: Job finished successfully.
Ray Cluster Name: rayjob-sample-raycluster-vnl8w
Ray Cluster Status:
Available Worker Replicas: 1
Endpoints:
Client: 32572
Dashboard: 32276
Gcs - Server: 30679
Last Update Time: 2022-07-24T02:04:43Z
State: ready
Start Time: 2022-07-24T02:04:49Z
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Created 90s rayjob-controller Created cluster rayjob-sample-raycluster-vnl8w
Normal Submitted 82s rayjob-controller Submit Job test-hehe
Normal Deleted 15s rayjob-controller Deleted cluster rayjob-sample-raycluster-vnl8w
```


If the job cannot run successfully, you can see that from the status as well.
```
Status:
Dashboard URL: rayjob-sample-raycluster-nrdm8-head-svc.ray-system.svc.cluster.local:8265
End Time: 2022-07-24T02:01:39Z
Job Deployment Status: Complete
Job Id: test-hehe
Job Status: FAILED
Message: Job failed due to an application error, last available logs:
python: can't open file '/tmp/code/script.ppy': [Errno 2] No such file or directory
Ray Cluster Name: rayjob-sample-raycluster-nrdm8
Ray Cluster Status:
Available Worker Replicas: 1
Endpoints:
Client: 31852
Dashboard: 32606
Gcs - Server: 32436
Last Update Time: 2022-07-24T02:01:30Z
State: ready
Start Time: 2022-07-24T02:01:38Z
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Created 2m9s rayjob-controller Created cluster rayjob-sample-raycluster-nrdm8
Normal Submitted 2m rayjob-controller Submit Job test-hehe
Normal Deleted 58s rayjob-controller Deleted cluster rayjob-sample-raycluster-nrdm8
```


### Delete the RayJob instance

```shell
$ kubectl delete -f config/samples/ray_v1alpha1_rayjob.yaml
```
2 changes: 1 addition & 1 deletion docs/index.md
@@ -31,8 +31,8 @@ KubeRay provides several tools to improve running and managing Ray's experience
- Ray Operator
- Backend services to create/delete cluster resources
- Kubectl plugin/CLI to operate CRD objects
- Native Job and Serving integration with Clusters
- Data Scientist centric workspace for fast prototyping (incubating)
- Native Job and Serving integration with Clusters (incubating)
- Kubernetes event dumper for ray clusters/pod/services (future work)
- Operator Integration with Kubernetes node problem detector (future work)

5 changes: 4 additions & 1 deletion mkdocs.yml
@@ -29,17 +29,20 @@ nav:
- KubeRay CLI: components/cli.md
- Features:
- RayService: guidance/rayservice.md
- RayJob: guidance/rayjob.md
- Ray GCS HA: guidance/gcs-ha.md
- Autoscaling: guidance/autoscaler.md
- Ingress: guidance/ingress.md
- Observability: guidance/observability.md
- Best Practice:
- Worker reconnection: best-practice/worker-head-reconnection.md
- Troubleshooting:
- Guidance: troubleshooting.md
- Designs:
- Core API and Backend Service: design/protobuf-grpc-service.md
- Development:
- Release: release/README.md
- Development: development/development.md
- Release: development/release.md

# Customization
extra:
