Example Pod to connect Ray client to a remote Ray cluster with TLS enabled #994

Merged
54 changes: 34 additions & 20 deletions docs/guidance/tls.md
@@ -1,15 +1,15 @@
# TLS Authentication

Ray can be configured to use TLS on its gRPC channels. This means that
connecting to the Ray head will require an appropriate
set of credentials and also that data exchanged between various processes
(client, head, workers) will be encrypted ([Ray's document](https://docs.ray.io/en/latest/ray-core/configure.html?highlight=tls#tls-authentication)).

This document provides detailed instructions for generating a public-private
key pair and CA certificate for configuring KubeRay.

> Warning: Enabling TLS will cause a performance hit due to the extra
overhead of mutual authentication and encryption. Testing has shown that
this overhead is large for small workloads and becomes relatively smaller
for large workloads. The exact overhead will depend on the nature of your
workload.
@@ -43,12 +43,12 @@ kubectl apply -f ray-operator/config/samples/ray-cluster.tls.yaml

`ray-cluster.tls.yaml` will create:
* A Kubernetes Secret containing the CA's private key (`ca.key`) and self-signed certificate (`ca.crt`) (**Step 1**)
* A Kubernetes ConfigMap containing the scripts `gencert_head.sh` and `gencert_worker.sh`, which allow Ray Pods to generate private keys
(`tls.key`) and CA-signed certificates (`tls.crt`) (**Step 2**)
* A RayCluster with the TLS environment variables configured (**Step 3**)

The certificate (`tls.crt`) for a Ray Pod is signed using the CA's private key (`ca.key`). Additionally, all Ray Pods have the CA's public key included in `ca.crt`, which allows them to verify the signatures on certificates from other Ray Pods.

# Step 1: Generate a private key and self-signed certificate for CA

In this document, a self-signed certificate is used, but users also have the
@@ -76,7 +76,7 @@ openssl x509 -in ca.crt -noout -text
kubectl create secret generic ca-tls --from-file=ca.key --from-file=ca.crt
```
* `ca.key`: CA's private key
* `ca.crt`: CA's self-signed certificate

This step is optional because the `ca.key` and `ca.crt` files have
already been included in the Kubernetes Secret specified in [ray-cluster.tls.yaml](../../ray-operator/config/samples/ray-cluster.tls.yaml).
@@ -85,12 +85,12 @@ already been included in the Kubernetes Secret specified in [ray-cluster.tls.yam

In [ray-cluster.tls.yaml](../../ray-operator/config/samples/ray-cluster.tls.yaml), each Ray
Pod (both head and workers) generates its own private key file (`tls.key`) and CA-signed
certificate file (`tls.crt`) in its init container. We generate separate files for each Pod
because worker Pods do not have deterministic DNS names, and we cannot use the same
certificate across different Pods.

In the YAML file, you'll find a ConfigMap named `tls` that contains two shell scripts:
`gencert_head.sh` and `gencert_worker.sh`. These scripts generate the private key
and certificate files (`tls.key` and `tls.crt`) for the Ray head and worker Pods.
Alternatively, you can bake the shell scripts directly into the Docker image used
by the init containers instead of relying on a ConfigMap.
@@ -99,13 +99,13 @@ Please find below a brief explanation of what happens in each of these scripts:
1. A 2048-bit RSA private key is generated and saved as `/etc/ray/tls/tls.key`.
2. A Certificate Signing Request (CSR) is generated using the private key file (`tls.key`)
and the `csr.conf` configuration file.
3. A certificate (`tls.crt`) is generated by signing the previously generated CSR with the private key of the
Certificate Authority (`ca.key`).
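
The three steps above can be sketched with plain `openssl` commands. This is a minimal illustration, not a copy of the scripts in the ConfigMap: the function name `issue_pod_cert`, the `CN` value, the output file names other than `tls.key` and `tls.crt`, and the reduced `[ alt_names ]` entries are all assumptions made for the example.

```sh
# Hypothetical helper: issue a CA-signed certificate for one Pod.
# $1 = directory holding ca.key and ca.crt
# $2 = writable output directory
# $3 = DNS name to list in subjectAltName (for example, the head Service FQDN)
issue_pod_cert() {
    ca_dir=$1; out_dir=$2; dns_name=$3

    # Step 1: generate a 2048-bit RSA private key.
    openssl genrsa -out "$out_dir/tls.key" 2048 2>/dev/null

    # Step 2: build a CSR from the key and a csr.conf that carries the SAN list.
    cat > "$out_dir/csr.conf" <<EOF
[ req ]
prompt = no
default_md = sha256
req_extensions = req_ext
distinguished_name = dn
[ dn ]
CN = ray-pod
[ req_ext ]
subjectAltName = @alt_names
[ alt_names ]
DNS.1 = localhost
DNS.2 = $dns_name
EOF
    openssl req -new -key "$out_dir/tls.key" -out "$out_dir/tls.csr" \
        -config "$out_dir/csr.conf"

    # Step 3: sign the CSR with the CA key. cert.conf copies the SANs into
    # the issued certificate, since x509 -req does not take them from the CSR here.
    cat > "$out_dir/cert.conf" <<EOF
basicConstraints = CA:FALSE
subjectAltName = @alt_names
[ alt_names ]
DNS.1 = localhost
DNS.2 = $dns_name
EOF
    openssl x509 -req -in "$out_dir/tls.csr" \
        -CA "$ca_dir/ca.crt" -CAkey "$ca_dir/ca.key" \
        -CAserial "$out_dir/ca.srl" -CAcreateserial \
        -out "$out_dir/tls.crt" -days 365 -sha256 \
        -extfile "$out_dir/cert.conf" 2>/dev/null
}
```

Running `openssl verify -CAfile ca.crt tls.crt` against the result should confirm that the certificate chains to the CA. The serial file is written to the output directory in this sketch so that the CA directory can stay read-only.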

The only difference between `gencert_head.sh` and `gencert_worker.sh` is the `[ alt_names ]`
section in `csr.conf` and `cert.conf`. The worker Pods use the fully qualified domain name
(FQDN) of the head Kubernetes Service to establish a connection with the head Pod.
Therefore, the `[ alt_names ]` section for the head Pod needs to include the FQDN of the head
Kubernetes Service. The head Pod, in turn, uses `$POD_IP` to communicate with worker Pods.

```sh
@@ -131,7 +131,7 @@ To enable TLS authentication in your Ray cluster, set the following environment

- `RAY_USE_TLS`: Either 1 or 0 to use/not-use TLS. If this is set to 1 then all of the environment variables below must be set. Default: 0.
- `RAY_TLS_SERVER_CERT`: Location of a certificate file which is presented to other endpoints so as to achieve mutual authentication (i.e. `tls.crt`).
- `RAY_TLS_SERVER_KEY`: Location of a private key file which is the cryptographic means to prove to other endpoints that you are the authorized user of a given certificate (i.e. `tls.key`).
- `RAY_TLS_CA_CERT`: Location of a CA certificate file which allows TLS to decide whether an endpoint’s certificate has been signed by the correct authority (i.e. `ca.crt`).

For more information on how to configure Ray with TLS authentication, please refer to [Ray's document](https://docs.ray.io/en/latest/ray-core/configure.html#tls-authentication).
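
The rule that `RAY_USE_TLS=1` requires the other three variables can be checked before Ray starts. The following is a minimal sketch under that assumption; `check_ray_tls_env` is a hypothetical helper for illustration and is not part of Ray or KubeRay.

```sh
# Hypothetical helper: report which TLS settings are missing or unreadable.
# Returns 0 when TLS is disabled or fully configured, 1 otherwise.
check_ray_tls_env() {
    # TLS is off unless RAY_USE_TLS is exactly 1 (the default is 0).
    if [ "${RAY_USE_TLS:-0}" != "1" ]; then
        echo "TLS disabled"
        return 0
    fi
    missing=0
    for name in RAY_TLS_SERVER_CERT RAY_TLS_SERVER_KEY RAY_TLS_CA_CERT; do
        # Look up the value of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "unset: $name"
            missing=1
        elif [ ! -r "$value" ]; then
            echo "unreadable: $name=$value"
            missing=1
        fi
    done
    [ "$missing" -eq 0 ] && echo "TLS configuration looks complete"
    return "$missing"
}
```

Such a check could run at the start of a container command, so that a missing `tls.crt` or `ca.crt` mount is reported by name rather than surfacing later as a connection failure.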
@@ -142,8 +142,8 @@ For more information on how to configure Ray with TLS authentication, please ref
# Log in to the worker Pod
kubectl exec -it ${WORKER_POD} -- bash

# Since the head Pod has the certificate of $FQ_RAY_IP, the connection to the worker Pods
# will be established successfully, and the exit code of the ray health-check command
# should be 0.
ray health-check --address $FQ_RAY_IP:6379
echo $? # 0
@@ -154,8 +154,22 @@ echo $? # 0
ray health-check --address $RAY_IP:6379

# If you add `DNS.3 = $RAY_IP` to the [alt_names] section in `gencert_head.sh`,
# the head Pod will generate the certificate of $RAY_IP.
#
# For KubeRay versions prior to 0.5.0, this step is necessary because Ray workers in earlier
# versions use $RAY_IP to connect with Ray head.
```
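
When a health check fails for one address but succeeds for another, it can help to see which names a certificate actually covers. The sketch below assumes a hypothetical helper named `cert_has_san`; it relies only on the text output of `openssl x509`, which lists subject alternative names as `DNS:<name>` or `IP Address:<addr>`.

```sh
# Hypothetical helper: succeed if certificate $1 lists $2 as a DNS or IP SAN.
cert_has_san() {
    cert=$1; name=$2
    # Take the line after the SAN header, split it on commas, trim spaces,
    # and look for an exact, literal match on either SAN form.
    openssl x509 -in "$cert" -noout -text \
        | grep -A1 'Subject Alternative Name' \
        | tr ',' '\n' \
        | sed 's/^ *//; s/ *$//' \
        | grep -qxF -e "DNS:$name" -e "IP Address:$name"
}
```

For example, running `cert_has_san /etc/ray/tls/tls.crt "$FQ_RAY_IP"` inside the head Pod should succeed with the sample configuration, while the same check with `$RAY_IP` is expected to fail unless that name was added to `[alt_names]`.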

# Step 5: Connect to the cluster with Ray client using TLS for interactive development
To learn more, see [interactive development](https://docs.ray.io/en/latest/cluster/running-applications/job-submission/ray-client.html#ray-client-interactive-development) and [TLS authentication](https://docs.ray.io/en/latest/ray-core/configure.html?highlight=tls#tls-authentication).

To connect to the Ray cluster from a Pod:
```sh
# Create a client pod and connect to cluster
kubectl apply -f ray-operator/config/samples/ray-pod.tls.yaml
kubectl logs ray-client-tls
```
Verify that the output is similar to:
```
{'CPU': 2.0, 'node:10.254.20.20': 1.0, 'object_store_memory': 771128524.0, 'memory': 3000000000.0, 'node:10.254.16.25': 1.0}
```
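
The client Pod prints the cluster resources only after its init container has generated the certificate and `ray.init` has connected, so `kubectl logs` may not show the expected line right away. A small polling helper is one way to wait for it. This is a sketch: `wait_for_match` is a hypothetical name, and the attempt count and delay in the usage line are arbitrary.

```sh
# Hypothetical helper: rerun a command until its output contains a literal string.
# $1 = string to look for, $2 = attempts, $3 = delay in seconds, rest = command.
wait_for_match() {
    pattern=$1; attempts=$2; delay=$3; shift 3
    i=1
    while [ "$i" -le "$attempts" ]; do
        # Print the first matching line and stop as soon as one appears.
        if "$@" 2>/dev/null | grep -m 1 -F -e "$pattern"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Usage (needs kubectl and the client Pod from ray-pod.tls.yaml):
# wait_for_match "'CPU'" 30 5 kubectl logs ray-client-tls
```

The helper returns a non-zero status if the string never appears, which makes it usable in a script that should fail when the client cannot connect.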
60 changes: 60 additions & 0 deletions ray-operator/config/samples/ray-pod.tls.yaml
@@ -0,0 +1,60 @@
apiVersion: v1
kind: Pod
metadata:
  name: ray-client-tls
  namespace: default
spec:
  containers:
    - name: client
      image: rayproject/ray:2.3.0
      args: [/bin/sh, -c, 'python -c "import ray; ray.init(\"ray://$FQ_RAY_IP:10001\"); print(ray.cluster_resources())" && sleep infinity']
      env:
        - name: RAY_USE_TLS
          value: "1"
        - name: RAY_TLS_SERVER_CERT
          value: "/etc/ray/tls/tls.crt"
        - name: RAY_TLS_SERVER_KEY
          value: "/etc/ray/tls/tls.key"
        - name: RAY_TLS_CA_CERT
          value: "/etc/ca/tls/ca.crt"
        # FQ_RAY_IP depends on ray-operator/config/samples/ray-cluster.tls.yaml
        - name: FQ_RAY_IP
          value: raycluster-tls-head-svc.default.svc.cluster.local
      volumeMounts:
        - mountPath: /etc/ca/tls
          name: ca-tls
          readOnly: true
        - mountPath: /etc/ray/tls
          name: ray-tls
        - mountPath: /etc/gen/tls
          name: gen-tls-script
  initContainers:
    - name: gen-cert
      image: rayproject/ray:2.3.0
      args: ["/bin/sh", "-c", "cp -R /etc/ca/tls /etc/ray && /etc/gen/tls/gencert_worker.sh"]
      volumeMounts:
        - mountPath: /etc/ca/tls
          name: ca-tls
          readOnly: true
        - mountPath: /etc/ray/tls
          name: ray-tls
        - mountPath: /etc/gen/tls
          name: gen-tls-script
      env:
        - name: POD_IP
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
  volumes:
    - name: gen-tls-script
      configMap:
        name: tls
        defaultMode: 0777
        items:
          - key: gencert_worker.sh
            path: gencert_worker.sh
    - name: ca-tls
      secret:
        secretName: ca-tls
    - name: ray-tls
      emptyDir: {}