
[Bug?][Documentation] nginx Ingress for client times out #729

Closed
1 of 2 tasks
tbukic opened this issue Nov 15, 2022 · 13 comments · Fixed by #1051
Labels: bug (Something isn't working), P3 (Nice-to-have, low urgency, won't-do until priority is increased)

Comments


tbukic commented Nov 15, 2022

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

Others

What happened + What you expected to happen

Wrapping up this discussion:

I'm still not sure whether this is a bug or just me misunderstanding the documentation, but when I try to access the Ray Client through an Ingress at client.[hostname] (i.e. `ray.init("ray://client.[hostname]")`), I get timeout errors.

Reproduction script

For comparison, this configuration serves the dashboard just fine:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: {{ .Values.name }}-ingress-dashboard 
  namespace: {{ .Values.namespace }}
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
spec:
  rules:
    - host: dashboard.{{ .Values.hostRoot }}
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: {{ .Values.name }}-head-svc
                port:
                  name: dashboard 

The configurations that fail are:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: {{ .Values.name }}-ingress-client
  namespace: {{ .Values.namespace }}
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
    nginx.ingress.kubernetes.io/backend-protocol: "GRPC"
    nginx.ingress.kubernetes.io/server-snippet: |
      underscores_in_headers on;
      ignore_invalid_headers on;
spec:
  rules:
    - host: client.{{ .Values.hostRoot }}
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: {{ .Values.name }}-head-svc
                port:
                  name: client

and

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: {{ .Values.name }}-ingress-client
  namespace: {{ .Values.namespace }}
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
    nginx.ingress.kubernetes.io/backend-protocol: "GRPC"
    nginx.ingress.kubernetes.io/server-snippet: |
      underscores_in_headers on;
      ignore_invalid_headers on;
spec:
  ingressClassName: nginx
  rules:
    - host: client.{{ .Values.hostRoot }}
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: {{ .Values.name }}-head-svc
                port:
                  number: 10001
  tls:
  - hosts:
      - client.{{ .Values.hostRoot }}
    secretName: {{ .Values.httpsSecretName }}

(the same happens without TLS as well :) ). My efforts so far were inspired by this example - which is basically this paragraph from the documentation - and by @kevin85421's comments on Slack.

Anything else

I'd prefer to access the client at client.[host], which is why I don't follow the existing example for exposing the dashboard. Also, my dashboard works fine, as does the client when its port is forwarded from either the pod or the service and accessed via localhost:[port]. I want to simplify client access as much as possible for DS colleagues who don't use k8s, so I'm trying to avoid kubectl and port-forwarding.

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@tbukic tbukic added the bug Something isn't working label Nov 15, 2022
@kevin85421 kevin85421 self-assigned this Nov 16, 2022
kevin85421 (Member) commented:

Thanks @tbukic for pointing this out!

DmitriGekhtman (Collaborator) commented:

I want to simplify client access as much as possible for DS colleagues who don't use k8s, so I'm trying to avoid kubectl and port-forwarding.

One pattern we've seen from users is

  • Setting up a Jupyter notebook environment in the K8s cluster
  • Connecting via Ray Client in the notebook
  • Exposing the Jupyter notebook to external traffic, which is a fairly common thing to do.

Related:
#725

DmitriGekhtman (Collaborator) commented Nov 16, 2022

Actually, the experience of setting up networking to expose Ray Client running in a K8s cluster was hell for the Ray maintainers at Anyscale.
For a data science platform, I wouldn't recommend going down the route of exposing Ray client directly.

DmitriGekhtman (Collaborator) commented:

@ckw017 are there any obvious fixes for the ingress config that you can see after a quick glance?

ckw017 (Member) commented Nov 16, 2022

Hmm, nothing obvious stands out. The main gotcha we saw before was this, where reconfiguring ingresses would cause them to close active connections incorrectly.

One thing to check here is if the service that the ingress is pointing to has the ray client port exposed in its configuration, and to try connecting ray client through that service from another node inside that cluster (using the service's DNS). This would rule out problems with the service configuration.
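As a rough first step, plain TCP reachability of the client port can be verified from another pod before involving Ray or gRPC at all. Here's a minimal stdlib sketch; the service DNS name and port in the usage comment are placeholders, not values from this issue:

```python
import socket

def port_reachable(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False

# Run from a pod inside the cluster, using the service's DNS name (placeholder):
# port_reachable("raycluster-head-svc.my-namespace", 10001)
```

If this returns False from inside the cluster, the problem is in the Service definition rather than the Ingress.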

@DmitriGekhtman DmitriGekhtman added the P2 Important issue, but not time critical label Nov 16, 2022
tbukic (Author) commented Nov 17, 2022

Thanks to everybody!

@ckw017 I've checked, and it works:

  • if I connect using ray://[service_name].[namespace]:10001 from another pod
  • if I connect to the service's ADDRESS and IP (obtained from the pod's env) from another pod
  • if I port-forward the head's port 10001 and connect to ray://localhost:[forwarded_port]

Connecting through the ingress host still doesn't work.

If it means anything, I get multiple entries like this in ingress logs:

I1117 18:55:47.583755       1 status.go:299] "updating Ingress status" namespace="[namespace]" ingress="[client-ingress-pod-name]" currentValue=[] newValue=[{IP:[ingress-addr-1] Hostname: Ports:[]} {IP:[ingress-addr-2] Hostname: Ports:[]}]
I1117 18:55:47.597219       1 event.go:285] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"[namespace]", Name:"[client-ingress-pod-name]", UID:"[UID]", APIVersion:"networking.k8s.io/v1", ResourceVersion:"1096235631", FieldPath:""}): type: 'Normal' reason: 'Sync' Scheduled for sync
I1117 18:55:47.608717       1 event.go:285] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"[namespace]", Name:"[client-ingress-pod-name]", UID:"[UID]", APIVersion:"networking.k8s.io/v1", ResourceVersion:"1096235636", FieldPath:""}): type: 'Normal' reason: 'Sync' Scheduled for sync

This is the describe output for the current version of my ingress:

Name:             [release-name]-ingress-client
Labels:           app.kubernetes.io/managed-by=Helm
Namespace:        [namespace]
Address:          [ingress-addr-1],[ingress-addr-2]
Ingress Class:    nginx
Default backend:  <default>
TLS:
  [namespace]-tls terminates client.[host]
Rules:
  Host                                                      Path  Backends
  ----                                                      ----  --------
  client.[host]  
                                                            /   [service-name]:client ([correct-ip-of-head-node]:10001)
Annotations:                                                field.cattle.io/publicEndpoints:
                                                              [{"addresses":["[ingress-addr-1]","[ingress-addr-2]"],"port":443,"protocol":"HTTPS","serviceName":"[namespace]:[service-name]"...
                                                            ginx.ingress.kubernetes.io/rewrite-target: /
                                                            meta.helm.sh/release-name: [release-name]
                                                            meta.helm.sh/release-namespace: [namespace]
                                                            nginx.ingress.kubernetes.io/backend-protocol: GRPC
                                                            nginx.ingress.kubernetes.io/server-snippet:
                                                              underscores_in_headers on;
                                                              ignore_invalid_headers on;
Events:
  Type    Reason  Age                From                      Message
  ----    ------  ----               ----                      -------
  Normal  Sync    36m (x3 over 36m)  nginx-ingress-controller  Scheduled for sync
  Normal  Sync    36m (x3 over 36m)  nginx-ingress-controller  Scheduled for sync

@DmitriGekhtman, thank you for your reply! We're already thinking about Jupyter as a long-term solution. After seeing Kubeflow mentioned in #725, I'm interested in exploring how it works with KubeRay as well. Both are somewhat longer-term goals; at the moment it'd be ideal if we just had client access.

The temporary hack is a LoadBalancer exposed outside of k8s plus a hardcoded DNS name, but ingress is the next goal. I understand it may not be possible, or may be very hard. I hope not having to care about authenticating users will make life easier for the DS team.

ckw017 (Member) commented Nov 17, 2022

Hm, shot in the dark, but can you try pathType: ImplementationSpecific instead of pathType: Prefix?

tbukic (Author) commented Nov 17, 2022

Good idea, but it still times out. :/

ckw017 (Member) commented Nov 17, 2022

Hmm, can you try with these annotations:

    # note: also try dropping rewrite-target
    kubernetes.io/ingress.class: nginx
    nginx.ingress.kubernetes.io/backend-protocol: GRPC
    nginx.ingress.kubernetes.io/server-snippet: underscores_in_headers on; ignore_invalid_headers on;

I think if there's a way to sanity-check whether gRPC works in general (as opposed to Ray Client specifically), that would help narrow this down.
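One way to sanity-check raw gRPC connectivity independently of Ray Client is to wait for a bare channel to become ready. This is a sketch assuming the `grpcio` package is installed; the target address in the usage comment is a placeholder:

```python
import grpc

def grpc_reachable(target: str, timeout_s: float = 5.0, credentials=None) -> bool:
    """Return True if a gRPC channel to `target` becomes ready within timeout_s."""
    if credentials is None:
        channel = grpc.insecure_channel(target)
    else:
        channel = grpc.secure_channel(target, credentials)
    try:
        # Blocks until the channel leaves the connecting state or times out.
        grpc.channel_ready_future(channel).result(timeout=timeout_s)
        return True
    except grpc.FutureTimeoutError:
        return False
    finally:
        channel.close()

# e.g., through a TLS-terminating ingress (placeholder host):
# grpc_reachable("client.example.com:443", credentials=grpc.ssl_channel_credentials())
```

If this fails against the ingress host while succeeding against the in-cluster service, the problem lies in the ingress's gRPC handling rather than in Ray.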

@DmitriGekhtman DmitriGekhtman added P3 Nice-to-have, low urgency, won't-do until priority is increased and removed P2 Important issue, but not time critical labels Dec 9, 2022
jacobdanovitch commented:

The issue is that ingress-nginx only supports secure gRPC (kubernetes/ingress-nginx#4095 (comment)). I was able to connect as a client using cert-manager with the following Ingress definition:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    # some of these may not be strictly necessary, haven't tried
    nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/ssl-passthrough: "true"
    nginx.ingress.kubernetes.io/backend-protocol: GRPC
    nginx.ingress.kubernetes.io/server-snippet: |
      underscores_in_headers on;
      ignore_invalid_headers on;
  name: ray-cluster
spec:
  ingressClassName: nginx 
  rules:
    - host: ray.example.com
      http:
        paths:
        - backend:
            service: 
              name: ray-head-svc
              port: 
                number: 10001
          path: /
          pathType: ImplementationSpecific
  tls:
    - hosts:
      - ray.example.com
      secretName: ray-cluster-cert

Once cert-manager provisions your certificate, get the cert and key:

# (k is an alias for kubectl)
k get secrets/ray-cluster-cert -o json | jq -r '.data."tls.crt"' | base64 -d > /tmp/ray.crt
k get secrets/ray-cluster-cert -o json | jq -r '.data."tls.key"' | base64 -d > /tmp/ray.key
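For environments without `jq`, the same extraction can be sketched in Python. The layout below mirrors what `kubectl get -o json` returns for a TLS Secret (the `data` values are base64-encoded); the secret name in the usage comment is the one from the Ingress above:

```python
import base64
import json

def decode_tls_secret(secret_json: str) -> tuple[bytes, bytes]:
    """Return (cert, key) bytes from a kubectl-style TLS Secret JSON string."""
    data = json.loads(secret_json)["data"]
    cert = base64.b64decode(data["tls.crt"])
    key = base64.b64decode(data["tls.key"])
    return cert, key

# Usage (fetch the secret with kubectl, as in the shell commands above):
# import subprocess
# raw = subprocess.check_output(
#     ["kubectl", "get", "secrets/ray-cluster-cert", "-o", "json"])
# cert, key = decode_tls_secret(raw)
```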

Then you can make a secure connection in client mode:

import ray
import grpc

with open('/tmp/ray.key', 'rb') as key, open('/tmp/ray.crt', 'rb') as crt:
    credentials = grpc.ssl_channel_credentials(private_key=key.read(), certificate_chain=crt.read())

print('Connecting to Ray...')
ray.init(address='ray://ray.example.com:443', _credentials=credentials)
print('Connected.')
print(ray.available_resources())

kevin85421 (Member) commented:

Thanks @jacobdanovitch! This is very helpful. I will add it to the documentation in #955.

kevin85421 (Member) commented:

@tedhtchang will take this issue.

[Note]:
As @DmitriGekhtman said (#729 (comment), #729 (comment)), the KubeRay / Ray communities do not encourage users to expose the Ray client directly. However, there are still some use cases where users need to do so. In such cases, it would be beneficial to have documentation that outlines the ingress setup and the related pros and cons.

tedhtchang (Contributor) commented:

@kevin85421 I can provide an example and steps to set up an Nginx Ingress controller on a local Kind cluster for use with ray.init().
