Istio and Knative problematic Gateway ports mapping in KF 1.4 #2082

Closed
kimwnasptd opened this issue Dec 6, 2021 · 3 comments · Fixed by #2092

Comments

@kimwnasptd

After a handful of installations we ran into an upstream Istio issue, triggered by Knative (knative/serving#10160 (comment)), in how ports are opened in the Gateway Pod. Istio shipped a hotfix, istio/istio#33021, which was back-ported all the way to Istio 1.9.6, the version that comes with KF 1.4: https://github.com/istio/istio/commits/1.9.6.

This issue, which is triggered by Knative's Gateways and Services, can result in 404s from InferenceServices. We've seen users report 404s from KFServing as early as the first RC of 1.4 (#2007), but hadn't gotten to the bottom of it at the time.

I'll provide some more technical details in the comments below on how to diagnose the specific issue, the root cause as well as how to fix it, but the key points I want to raise are:

  1. This is triggered by the order in which Knative's and Istio's resources are applied
  2. This affects all installations that use the Knative manifests we provide, and can result in unusable installations

Because of the above, I'd like us to consider a KF 1.4.1 release that includes the fix for this problem (knative-extensions/net-istio#636) in our manifests. cc @kubeflow/wg-manifests-leads @kubeflow/release-team

@kimwnasptd

How to diagnose

  1. Getting 404 errors when talking to InferenceServices
  2. The logs of the Gateway Pod used by the Knative local gateway show warnings about duplicate listeners for 8081 [1]
  3. The listeners in the Gateway Pod used by the Knative local gateway are not correct [2]
[1] Duplicate listener logs

These are the duplicate-listener warnings, which is also what people reported in Knative (knative/serving#10160).

2021-12-02T11:08:06.418987Z	warning	envoy config	gRPC config for type.googleapis.com/envoy.config.listener.v3.Listener rejected: Error adding/updating listener(s) 0.0.0.0_8081: duplicate listener 0.0.0.0_8081 found
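
To check for this warning without reading through the logs by hand, something like the following could work. It is a minimal sketch: the function name find_duplicate_listeners is made up for illustration, and it only recognizes the message format shown above.

```shell
# Reads gateway logs on stdin and prints each distinct listener address that
# Envoy rejected as a duplicate (for example 0.0.0.0_8081).
# Exits non-zero when no duplicate-listener warning is present.
find_duplicate_listeners() {
  grep -o 'duplicate listener [^ ]* found' | awk '{print $3}' | sort -u | grep .
}
```

A possible invocation, assuming the gateway runs as a Deployment named cluster-local-gateway in istio-system (inferred from the Pod name used in [2]):

kubectl logs -n istio-system deploy/cluster-local-gateway | find_duplicate_listeners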

[2] Incorrect Gateway listeners

The listeners below are the correct ones. If you see something different, then you've hit the issue described here.

# find the proxy name
$ istioctl proxy-status | grep cluster-local-gateway

# Find the listeners
$ istioctl proxy-config listeners cluster-local-gateway-b76ff5885-qsrmk.istio-system

ADDRESS PORT  MATCH DESTINATION
0.0.0.0 8080  ALL   Route: http.80
0.0.0.0 8081  ALL   Route: http.8081
0.0.0.0 15021 ALL   Inline Route: /healthz/ready*
0.0.0.0 15090 ALL   Inline Route: /stats/prometheus*
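
The comparison against the table above can also be scripted. The sketch below is only an illustration: check_gateway_listeners is a hypothetical helper, and the expected port and route pairs are copied from the output above, so they would need adjusting if your Gateways use different ports.

```shell
# Reads `istioctl proxy-config listeners` output on stdin and checks that
# port 8080 serves route http.80 and port 8081 serves route http.8081,
# as in the table quoted in this thread.
check_gateway_listeners() {
  awk '
    $2 == "8080" && $0 ~ /Route: http\.80[ \t]*$/   { ok80 = 1 }
    $2 == "8081" && $0 ~ /Route: http\.8081[ \t]*$/ { ok81 = 1 }
    END {
      if (ok80 && ok81) { print "listeners OK"; exit 0 }
      print "listener mismatch: possible port translation issue"
      exit 1
    }'
}
```

It could be fed from the second command above, for example by appending `| check_gateway_listeners` to the `istioctl proxy-config listeners` call with your own Pod name.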

@kimwnasptd

kimwnasptd commented Dec 6, 2021

Root cause

This bug was caused by:

  1. The ports in a Gateway don't correspond 1:1 to the Gateway Pod's ports. Istio also uses K8s Services to decide which ports to open in the underlying Gateway Pod
  2. We have both a K8s Service for Istio's cluster-local-gateway and a K8s Service for knative-local-gateway. Both use port 80, but they map it to ports 8080 and 8081 respectively

There's also a design doc in Istio that fully describes this problem.

In some cases, depending on the order in which the Services were created, istiod will decide that the port it should open for the cluster-local-gateway Gateway is 8081 rather than 8080. Because both Services use port 80, Istio will sometimes pick the Knative Service's mapping.

I'm also posting the picture from the above design doc, which really helps visualize this:
[Image: Screenshot_20211202_165412]
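
To spot this kind of overlap in a cluster, one option is to list each Service's port and targetPort and flag any port that maps to more than one target. The helper below is a rough sketch, not anything Istio provides: find_conflicting_ports is a hypothetical name, and the "service port targetPort" input format is an assumption of this example.

```shell
# Reads lines of "<service> <port> <targetPort>" on stdin and prints every
# Service port that is mapped to more than one distinct targetPort.
find_conflicting_ports() {
  awk '
    !(($2, $3) in seen) {
      seen[$2, $3] = 1
      # Append this targetPort to the list already collected for the port.
      targets[$2] = (count[$2] > 0) ? (targets[$2] "," $3) : $3
      count[$2]++
    }
    END {
      for (p in count)
        if (count[p] > 1) print "port " p " -> targetPorts " targets[p]
    }'
}
```

For the two Services described above, the input lines `cluster-local-gateway 80 8080` and `knative-local-gateway 80 8081` make the function report port 80 as mapped to both 8080 and 8081. The input can be produced in whatever way is convenient, for instance from `kubectl get svc -n istio-system -o yaml`.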

@kimwnasptd

kimwnasptd commented Dec 6, 2021

Solution

Istio introduced a new label, experimental.istio.io/disable-gateway-port-translation: "true" (istio/istio#33021), which tells Istio not to use a Service when calculating which port to open in the Gateway Pod.

We will need to set this label on the knative-local-gateway Service, so that Istio doesn't consider it when opening a port for the cluster-local-gateway Gateway, which uses port 80. This is also how Knative worked around the problem:
knative/serving#10160
https://github.com/knative-sandbox/net-istio/pull/636/files#diff-4c605039e6b79864782eca911094b91ad9d98f8d1ada447ac8d4aa91d5d814a1
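
For an existing cluster, the label could also be applied directly with kubectl as a stopgap. This is a minimal sketch: the label key and value are the ones quoted above, while the function name disable_port_translation and the istio-system namespace default are assumptions of this example. A later re-apply of unpatched manifests may drop the label again, so the manifest change is still the lasting fix.

```shell
# Label key and value as quoted in this thread (istio/istio#33021).
LABEL='experimental.istio.io/disable-gateway-port-translation=true'

# Adds the label to a Service so Istio ignores it for port translation.
# Defaults are assumptions: Service knative-local-gateway in istio-system.
disable_port_translation() {
  svc="${1:-knative-local-gateway}"
  ns="${2:-istio-system}"
  kubectl label service "$svc" -n "$ns" "$LABEL" --overwrite
}
```

Calling `disable_port_translation` with no arguments uses the defaults; pass a Service name and namespace to override them. Re-checking the Gateway Pod's listeners afterwards, as described in the diagnosis comment, is a reasonable way to confirm the effect.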

EDIT: Important detail: we are OK with the current Istio version, 1.9.6, since the required fix on Istio's side (istio/istio#33021) was back-ported all the way to 1.9.6.
