[RayService] Stable Diffusion example #1181

Merged · 43 commits · Jun 23, 2023

Commits:
- 922cd6c: stable diffusion (kevin85421, Jun 21, 2023)
- 755d29d: update (kevin85421, Jun 21, 2023)
- 2d2f693: add doc (kevin85421, Jun 22, 2023)
- 2ed88aa: fix (kevin85421, Jun 22, 2023)
- 0eb5db1: update mobilenet (kevin85421, Jun 22, 2023)
- 7e6988f: Update docs/guidance/aws-eks-gpu-cluster.md (kevin85421, Jun 22, 2023)
- cec4e96: Update docs/guidance/aws-eks-gpu-cluster.md (kevin85421, Jun 22, 2023)
- 969c500: Update docs/guidance/stable-diffusion-rayservice.md (kevin85421, Jun 22, 2023)
- 2316503: Update ray-operator/config/samples/ray-service.stable-diffusion.yaml (kevin85421, Jun 22, 2023)
- 942e280: Update ray-operator/config/samples/ray-service.stable-diffusion.yaml (kevin85421, Jun 22, 2023)
- 479b4d0: Update docs/guidance/stable-diffusion-rayservice.md (kevin85421, Jun 22, 2023)
- 4d018b2: update (kevin85421, Jun 22, 2023)
- fa89e62: update (kevin85421, Jun 22, 2023)
- 5670005: update (kevin85421, Jun 22, 2023)
- cb91250: update (kevin85421, Jun 22, 2023)
- ebe49bd: update (kevin85421, Jun 22, 2023)
- bf4b911: update (kevin85421, Jun 22, 2023)
- 59be339: Update docs/guidance/aws-eks-gpu-cluster.md (kevin85421, Jun 23, 2023)
- 04d9409: Update docs/guidance/aws-eks-gpu-cluster.md (kevin85421, Jun 23, 2023)
- a3b4759: Update docs/guidance/aws-eks-gpu-cluster.md (kevin85421, Jun 23, 2023)
- 2914ed5: Update docs/guidance/aws-eks-gpu-cluster.md (kevin85421, Jun 23, 2023)
- 4c262e2: Update docs/guidance/aws-eks-gpu-cluster.md (kevin85421, Jun 23, 2023)
- c8e7db9: Update docs/guidance/aws-eks-gpu-cluster.md (kevin85421, Jun 23, 2023)
- 6ed733d: Update docs/guidance/stable-diffusion-rayservice.md (kevin85421, Jun 23, 2023)
- 24347d3: Update ray-operator/config/samples/ray-service.stable-diffusion.yaml (kevin85421, Jun 23, 2023)
- 880e8dd: Update ray-operator/config/samples/ray-service.stable-diffusion.yaml (kevin85421, Jun 23, 2023)
- f64c7bc: Update docs/guidance/stable-diffusion-rayservice.md (kevin85421, Jun 23, 2023)
- 06cccc4: Update docs/guidance/aws-eks-gpu-cluster.md (kevin85421, Jun 23, 2023)
- 08d3ecd: Update docs/guidance/aws-eks-gpu-cluster.md (kevin85421, Jun 23, 2023)
- 64d0bca: Update docs/guidance/aws-eks-gpu-cluster.md (kevin85421, Jun 23, 2023)
- 7edc49f: update (kevin85421, Jun 23, 2023)
- 419da4b: Update docs/guidance/stable-diffusion-rayservice.md (kevin85421, Jun 23, 2023)
- 3a672c6: Update docs/guidance/stable-diffusion-rayservice.md (kevin85421, Jun 23, 2023)
- 60422bc: Update docs/guidance/stable-diffusion-rayservice.md (kevin85421, Jun 23, 2023)
- 2f5ed4c: Update docs/guidance/stable-diffusion-rayservice.md (kevin85421, Jun 23, 2023)
- 700f3cb: Update docs/guidance/stable-diffusion-rayservice.md (kevin85421, Jun 23, 2023)
- 027ee9a: Update docs/guidance/stable-diffusion-rayservice.md (kevin85421, Jun 23, 2023)
- bb03165: Update docs/guidance/stable-diffusion-rayservice.md (kevin85421, Jun 23, 2023)
- f8d6191: Update docs/guidance/stable-diffusion-rayservice.md (kevin85421, Jun 23, 2023)
- afca969: Update docs/guidance/aws-eks-gpu-cluster.md (kevin85421, Jun 23, 2023)
- 788033f: Update docs/guidance/aws-eks-gpu-cluster.md (kevin85421, Jun 23, 2023)
- 81c82ca: Update docs/guidance/aws-eks-gpu-cluster.md (kevin85421, Jun 23, 2023)
- a071fe3: update (kevin85421, Jun 23, 2023)

74 changes: 74 additions & 0 deletions docs/guidance/aws-eks-gpu-cluster.md
@@ -0,0 +1,74 @@
# Start Amazon EKS Cluster with GPUs for KubeRay

## Step 1: Create a Kubernetes cluster on Amazon EKS

Follow the first two steps in [this AWS documentation](https://docs.aws.amazon.com/eks/latest/userguide/getting-started-console.html#)
to: (1) create your Amazon EKS cluster and (2) configure your computer to communicate with your cluster.

## Step 2: Create node groups for the Amazon EKS cluster

Follow "Step 3: Create nodes" in [this AWS documentation](https://docs.aws.amazon.com/eks/latest/userguide/getting-started-console.html#) to create node groups. The following section provides more detailed information.

### Create a CPU node group

Typically, avoid running GPU workloads on the Ray head. Create a CPU node group for all Pods except Ray GPU
workers, such as the KubeRay operator, Ray head, and CoreDNS Pods.

Here's a common configuration that works for most KubeRay examples in the docs:
* Instance type: [**m5.xlarge**](https://aws.amazon.com/ec2/instance-types/m5/) (4 vCPU; 16 GB RAM)
* Disk size: 256 GB
* Desired size: 1, Min size: 0, Max size: 1
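
If you prefer the CLI to the console, the same node group can be sketched as an `eksctl` config. This is a hedged example, not part of the original guide: the cluster name and region are placeholders, and the fields follow eksctl's `ClusterConfig` schema.

```yaml
# cpu-nodegroup.yaml: a sketch of the CPU node group described above.
# The metadata values are placeholders; adjust them to your cluster.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-eks-cluster   # placeholder: your EKS cluster name
  region: us-west-2      # placeholder: your AWS region
managedNodeGroups:
  - name: cpu-node-group
    instanceType: m5.xlarge
    volumeSize: 256       # disk size in GiB
    desiredCapacity: 1
    minSize: 0
    maxSize: 1
```

Create it with `eksctl create nodegroup --config-file=cpu-nodegroup.yaml`.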

### Create a GPU node group

Create a GPU node group for Ray GPU workers.

1. Here's a common configuration that works for most KubeRay examples in the docs:
* AMI type: Bottlerocket NVIDIA (BOTTLEROCKET_x86_64_NVIDIA)
* Instance type: [**g5.xlarge**](https://aws.amazon.com/ec2/instance-types/g5/) (1 GPU; 24 GB GPU Memory; 4 vCPUs; 16 GB RAM)
* Disk size: 1024 GB
* Desired size: 1, Min size: 0, Max size: 1

2. **Follow Step 4 to install the NVIDIA device plugin if your AMI type requires it.**
   * If you use `AMI type: Bottlerocket NVIDIA`, there is no need to install the NVIDIA device plugin.
   * For other AMI types, you may need to install the NVIDIA device plugin DaemonSet to run GPU-enabled containers in your Amazon EKS cluster.
     If the GPU nodes have taints, add `tolerations` to `nvidia-device-plugin.yml` so the DaemonSet can schedule Pods on the GPU nodes.

3. Add a Kubernetes taint to prevent scheduling CPU Pods on this GPU node group. For KubeRay examples, add the following taint to the GPU nodes: `Key: ray.io/node-type, Value: worker, Effect: NoSchedule`, and include the corresponding `tolerations` for GPU Ray worker Pods.
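
If the taint wasn't set when the node group was created, one way to add it after the fact is with `kubectl`; this is a sketch, and `<gpu-node-name>` is a placeholder. Note that for managed node groups, taints applied directly to nodes may not survive node replacement, so setting the taint on the node group itself is more durable.

```sh
# Taint each GPU node so only Pods with a matching toleration schedule there.
kubectl taint nodes <gpu-node-name> ray.io/node-type=worker:NoSchedule

# Verify the taint was applied.
kubectl describe node <gpu-node-name> | grep -A 2 Taints
```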

> Warning: GPU nodes are extremely expensive. Please remember to delete the cluster if you no longer need it.

## Step 3: Verify the node groups

> **Note:** If you encounter permission issues with `eksctl`, open the "Command line or programmatic access" page
for your AWS account and copy the credential environment variables, including `AWS_ACCESS_KEY_ID`,
`AWS_SECRET_ACCESS_KEY`, and `AWS_SESSION_TOKEN`.

```sh
eksctl get nodegroup --cluster ${YOUR_EKS_NAME}

# CLUSTER NODEGROUP STATUS CREATED MIN SIZE MAX SIZE DESIRED CAPACITY INSTANCE TYPE IMAGE ID ASG NAME TYPE
# ${YOUR_EKS_NAME} cpu-node-group ACTIVE 2023-06-05T21:31:49Z 0 1 1 m5.xlarge AL2_x86_64 eks-cpu-node-group-... managed
# ${YOUR_EKS_NAME} gpu-node-group ACTIVE 2023-06-05T22:01:44Z 0 1 1 g5.12xlarge BOTTLEROCKET_x86_64_NVIDIA eks-gpu-node-group-... managed
```

## Step 4: Install the NVIDIA device plugin DaemonSet

> **Note:** If you encounter permission issues with `kubectl`, follow "Step 2: Configure your computer to communicate with your cluster"
in the [AWS documentation](https://docs.aws.amazon.com/eks/latest/userguide/getting-started-console.html#).

Install the NVIDIA device plugin DaemonSet to run GPU-enabled containers in your Amazon EKS cluster. See [Amazon EKS optimized accelerated Amazon Linux AMIs](https://docs.aws.amazon.com/eks/latest/userguide/eks-optimized-ami.html#gpu-ami)
or the [NVIDIA/k8s-device-plugin](https://github.com/NVIDIA/k8s-device-plugin) repository for more details.

```sh
# Install the DaemonSet
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.9.0/nvidia-device-plugin.yml

# Verify that your nodes have allocatable GPUs
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"

# Example output:
# NAME GPU
# ip-....us-west-2.compute.internal 4
# ip-....us-west-2.compute.internal <none>
```
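
As an optional sanity check (not part of the original guide), you can run a short-lived Pod that requests a GPU and prints `nvidia-smi` output; the CUDA image tag here is an assumption and may need updating.

```yaml
# gpu-test.yaml: a throwaway Pod to confirm GPU scheduling works end to end.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:11.8.0-base-ubuntu22.04  # assumed public image tag
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
  # Tolerate the taint added to the GPU node group in Step 2.
  tolerations:
    - key: "ray.io/node-type"
      operator: "Equal"
      value: "worker"
      effect: "NoSchedule"
```

Apply it with `kubectl apply -f gpu-test.yaml`, then inspect `kubectl logs gpu-test`; a populated `nvidia-smi` table means the device plugin and taints are working.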
5 changes: 3 additions & 2 deletions docs/guidance/mobilenet-rayservice.md
@@ -1,4 +1,4 @@
-# RayService: MobileNet example
+# Serve a MobileNet image classifier using RayService

> **Note:** The Python files for the Ray Serve application and its client are in the repository [ray-project/serve_config_examples](https://github.com/ray-project/serve_config_examples).

@@ -10,7 +10,8 @@ kind create cluster --image=kindest/node:v1.23.0

## Step 2: Install KubeRay operator

-Follow [this document](../../helm-chart/kuberay-operator/README.md) to install the latest stable KubeRay operator via Helm repository.
+Follow [this document](../../helm-chart/kuberay-operator/README.md) to install the nightly KubeRay operator via
+Helm. Note that the YAML file in Step 3 uses `serveConfigV2`, which is first supported by KubeRay v0.6.0.

> Review comment from kevin85421 (Member, Author): cc @zcin

## Step 3: Install a RayService

55 changes: 55 additions & 0 deletions docs/guidance/stable-diffusion-rayservice.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,55 @@
# Serve a StableDiffusion text-to-image model using RayService

> **Note:** The Python files for the Ray Serve application and its client are in the [ray-project/serve_config_examples](https://github.com/ray-project/serve_config_examples) repo
and [the Ray documentation](https://docs.ray.io/en/latest/serve/tutorials/stable-diffusion.html).

## Step 1: Create a Kubernetes cluster with GPUs

Follow [aws-eks-gpu-cluster.md](./aws-eks-gpu-cluster.md) to create an AWS EKS cluster with 1
CPU (`m5.xlarge`) node and 1 GPU (`g5.xlarge`) node.

## Step 2: Install KubeRay operator

Follow [this document](../../helm-chart/kuberay-operator/README.md) to install the nightly KubeRay operator via
Helm. Note that the YAML file in Step 3 uses `serveConfigV2`, which is first supported by KubeRay v0.6.0.
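
For reference, the nightly install described in that README roughly follows this shape; treat the paths and release name as a sketch, with the README as the source of truth.

```sh
# Clone KubeRay and install the operator chart from the working tree,
# which tracks the nightly operator (paths are a sketch; see the README).
git clone https://github.com/ray-project/kuberay.git
cd kuberay/helm-chart/kuberay-operator
helm install kuberay-operator .

# The KubeRay operator Pod should reach the Running state.
kubectl get pods
```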

## Step 3: Install a RayService

```sh
# path: ray-operator/config/samples/
kubectl apply -f ray-service.stable-diffusion.yaml
```

* The `tolerations` for workers must match the taints on the GPU node group. Without the tolerations, worker Pods won't be scheduled on GPU nodes.
```yaml
# These tolerations match the taint (ray.io/node-type=worker:NoSchedule) added to the GPU node group.
tolerations:
- key: "ray.io/node-type"
operator: "Equal"
value: "worker"
effect: "NoSchedule"
```
* Install `diffusers` in `runtime_env` as it is not included by default in the `ray-ml` image.
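
Before moving on, it can be worth confirming that the RayService and its Pods are healthy; this is a hedged sketch, and the exact status columns vary by KubeRay version.

```sh
# Check the RayService custom resource.
kubectl get rayservice stable-diffusion

# Watch the Ray Pods; the GPU worker should be scheduled on the g5.xlarge node.
kubectl get pods -o wide
```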

## Step 4: Forward the port of Serve

```sh
kubectl port-forward svc/stable-diffusion-serve-svc 8000
```

Note that the RayService's Kubernetes service will be created after the Serve applications are ready and running. This process may take approximately 1 minute after all Pods in the RayCluster are running.
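
Until the application is ready, the service won't exist, so a quick way to know when you can port-forward is to poll for it:

```sh
# Repeat until the service appears; before that, this returns NotFound.
kubectl get svc stable-diffusion-serve-svc
```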

## Step 5: Send a request to the text-to-image model

```sh
# Step 5.1: Download `stable_diffusion_req.py`
curl -LO https://raw.githubusercontent.com/ray-project/serve_config_examples/master/stable_diffusion/stable_diffusion_req.py

# Step 5.2: Update `prompt` in `stable_diffusion_req.py`.

# Step 5.3: Send a request to the Stable Diffusion model.
python stable_diffusion_req.py
# Check output.png
```
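
If you'd rather not edit the Python script, a `curl` equivalent can be sketched as follows; the `/imagine` endpoint and `prompt` query parameter are assumptions based on the linked Ray Serve tutorial.

```sh
# Hypothetical curl equivalent of stable_diffusion_req.py;
# the endpoint shape is assumed from the Ray Serve Stable Diffusion tutorial.
curl -G "http://127.0.0.1:8000/imagine" \
  --data-urlencode "prompt=a cute cat is dancing on the grass" \
  -o output.png
```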

![image](../images/stable_diffusion_example.png)
Binary file added docs/images/stable_diffusion_example.png
28 changes: 15 additions & 13 deletions ray-operator/config/samples/ray-service.mobilenet.yaml
@@ -5,11 +5,13 @@ metadata:
spec:
serviceUnhealthySecondThreshold: 300 # Config for the health check threshold for service. Default value is 60.
deploymentUnhealthySecondThreshold: 300 # Config for the health check threshold for deployments. Default value is 60.
-  serveConfig:
-    importPath: mobilenet.mobilenet:app
-    runtimeEnv: |
-      working_dir: "https://github.com/ray-project/serve_config_examples/archive/b393e77bbd6aba0881e3d94c05f968f05a387b96.zip"
-      pip: ["python-multipart==0.0.6"]
+  serveConfigV2: |
+    applications:
+      - name: mobilenet
+        import_path: mobilenet.mobilenet:app
+        runtime_env:
+          working_dir: "https://github.com/ray-project/serve_config_examples/archive/b393e77bbd6aba0881e3d94c05f968f05a387b96.zip"
+          pip: ["python-multipart==0.0.6"]
rayClusterConfig:
rayVersion: '2.5.0' # should match the Ray version in the image of the containers
######################headGroupSpecs#################################
@@ -28,11 +30,11 @@ spec:
image: rayproject/ray-ml:2.5.0
resources:
limits:
-            cpu: 2
-            memory: 8Gi
+            cpu: 1
+            memory: 4Gi
requests:
-            cpu: 2
-            memory: 8Gi
+            cpu: 1
+            memory: 4Gi
ports:
- containerPort: 6379
name: gcs-server
@@ -65,8 +67,8 @@ spec:
command: ["/bin/sh","-c","ray stop"]
resources:
limits:
-            cpu: "2"
-            memory: "8Gi"
+            cpu: 1
+            memory: 4Gi
requests:
-            cpu: "2"
-            memory: "8Gi"
+            cpu: 1
+            memory: 4Gi
80 changes: 80 additions & 0 deletions ray-operator/config/samples/ray-service.stable-diffusion.yaml
@@ -0,0 +1,80 @@
apiVersion: ray.io/v1alpha1
kind: RayService
metadata:
name: stable-diffusion
spec:
serviceUnhealthySecondThreshold: 300 # Config for the health check threshold for service. Default value is 60.
deploymentUnhealthySecondThreshold: 300 # Config for the health check threshold for deployments. Default value is 60.
serveConfigV2: |
applications:
- name: stable_diffusion
import_path: stable_diffusion.stable_diffusion:entrypoint
runtime_env:
working_dir: "https://github.com/ray-project/serve_config_examples/archive/d6acf9b99ef076a1848f506670e1290a11654ec2.zip"
pip: ["diffusers==0.12.1"]
rayClusterConfig:
rayVersion: '2.5.0' # Should match the Ray version in the image of the containers
######################headGroupSpecs#################################
# Ray head pod template.
headGroupSpec:
# The `rayStartParams` are used to configure the `ray start` command.
# See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
# See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
rayStartParams:
dashboard-host: '0.0.0.0'
# Pod template
template:
spec:
containers:
- name: ray-head
image: rayproject/ray-ml:2.5.0
ports:
- containerPort: 6379
name: gcs
- containerPort: 8265
name: dashboard
- containerPort: 10001
name: client
- containerPort: 8000
name: serve
volumeMounts:
- mountPath: /tmp/ray
name: ray-logs
resources:
limits:
cpu: "2"
memory: "8G"
requests:
cpu: "2"
memory: "8G"
volumes:
- name: ray-logs
emptyDir: {}
workerGroupSpecs:
# The number of Pod replicas in this worker group.
- replicas: 1
minReplicas: 1
maxReplicas: 10
groupName: gpu-group
rayStartParams: {}
# Pod template
template:
spec:
containers:
- name: ray-worker
image: rayproject/ray-ml:2.5.0
resources:
limits:
cpu: 4
memory: "16G"
nvidia.com/gpu: 1
requests:
cpu: 3
memory: "12G"
nvidia.com/gpu: 1
# These tolerations match the taint added to the GPU node group.
tolerations:
- key: "ray.io/node-type"
operator: "Equal"
value: "worker"
effect: "NoSchedule"