
Breaking change in helm chart 4.4.1 / controller 1.5.2 #9464

Closed
JVMartin opened this issue Dec 30, 2022 · 20 comments
Labels
kind/bug: Categorizes issue or PR as related to a bug.
priority/critical-urgent: Highest priority. Must be actively worked on as someone's top priority right now.
triage/accepted: Indicates an issue or PR is ready to be actively worked on.

Comments

@JVMartin

JVMartin commented Dec 30, 2022

As soon as we automatically upgraded to 4.4.1, many of our services became inaccessible. We looked for a common denominator and discovered that Ingress definitions using multiple paths/prefixes broke in strange and bizarre ways.

They would serve the default fake certificate instead of the cert-manager-brokered certificate, and if we bypassed the fake certificate with e.g. curl --insecure, nginx would 404 on all paths/prefixes.

Here is a sample broken Ingress (identifying information redacted):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: test-api
  namespace: test-production
  annotations:
    kubernetes.io/ingress.class: nginx
    cert-manager.io/cluster-issuer: letsencrypt-prod
    nginx.ingress.kubernetes.io/rewrite-target: /$1
spec:
  tls:
  - hosts:
    - test.com
    secretName: test-cert
  rules:
  - host: test.com
    http:
      paths:
      - path: /?(.*)
        pathType: Prefix
        backend:
          service:
            name: mirror
            port:
              number: 80
      - path: /v1/?(.*)
        pathType: Prefix
        backend:
          service:
            name: test-api
            port:
              number: 80

This is on DOKS v1.24.4.

This caused major outages for us across the board and was extremely insidious to debug.

There were no errors in the ingress controller pods, no errors in the ingress descriptions (k describe ingress), no issues modifying the ingress definitions (k apply -f), and no issues in the compiled nginx configs in the ingress controller pods. Because the issue was so insidious to debug, we had a long downtime and huge impact.

Reverting to 4.4.0 immediately solved the issues.
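For anyone else bitten by an automatic upgrade: pinning the chart version until a fix lands keeps new patch releases from being picked up implicitly. A minimal sketch, assuming the chart is consumed as a dependency of an umbrella Chart.yaml (the umbrella chart name below is a placeholder):

# Chart.yaml of a hypothetical umbrella chart
apiVersion: v2
name: platform-ingress
version: 0.1.0
dependencies:
  - name: ingress-nginx
    repository: https://kubernetes.github.io/ingress-nginx
    # Pin an exact chart version instead of a range such as "~4.4.0",
    # so automation cannot silently move the cluster to 4.4.1.
    version: 4.4.0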

@JVMartin added the kind/bug (Categorizes issue or PR as related to a bug.) label Dec 30, 2022
@k8s-ci-robot added the needs-triage (Indicates an issue or PR lacks a `triage/foo` label and requires one.) and needs-priority labels Dec 30, 2022
@rikatz
Contributor

rikatz commented Dec 30, 2022

Reason: https://github.com/kubernetes/ingress-nginx/pull/9309/files

We had to add some proper validation on paths, and the "?" character was left out. I will check with @strongjz and do a quick patch.

@strongjz
Member

/priority critical-urgent
/triage accepted

@k8s-ci-robot added the priority/critical-urgent and triage/accepted labels and removed the needs-priority and needs-triage labels Dec 30, 2022
@strongjz
Member

#9465

@JVMartin
Author

JVMartin commented Dec 30, 2022

Thanks so much for the quick turnaround on this @rikatz @strongjz

I'm out of my depth, but just wanted to raise this: even if the bug is, e.g., the Ingress yaml/config parser deciding that ? isn't a valid character in the regexp of a Prefix path definition, shouldn't we still have seen errors from somewhere while we were debugging? (Something like kubectl describe ingress saying "invalid path", or an error from kubectl logs on the ingress controller pods.)

Perhaps a separate issue would be to increase visibility into similar bugs like this one?

Or perhaps the "validation" part of the code thought the path was okay, but some sanity check post-validation (deeper in the call stack) disagreed, so there was simply no visibility into the underlying issue for me as a user. In that case all that is needed is to fix the bug, because it isn't really possible to surface this kind of error to the user.

@rikatz
Contributor

rikatz commented Dec 30, 2022

Yes, this is actually a bigger problem we have on ingress: the feedback we provide to users is "poor" :)

We have had plans to surface this feedback as events (describe won't show anything, as we don't have status fields on Ingress).

We could also do validation in webhooks, but in your case the ingress object was already on the apiserver, so it was "after" anything we could do.

It is still our plan to give users better feedback on why an Ingress couldn't be reconciled, and to add some pre-validation before entering the sync loop.

Thanks for raising this bug :D We are rolling back and re-planning how to release this, and we will also have a minor release soon that allows regex only in "ImplementationSpecific" path types.
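As background on the webhook point: the chart already ships a validating admission webhook (it checks Ingress objects against the generated nginx configuration, so it is not a full path linter and does not catch every bad path rule). A minimal values.yaml sketch for making sure it is enabled and set to hard-fail, assuming the standard chart layout:

controller:
  admissionWebhooks:
    # Validating webhook that checks Ingress objects before the apiserver
    # persists them; it does not catch every invalid path.
    enabled: true
    # Reject the object on validation failure instead of admitting it anyway.
    failurePolicy: Fail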

@strongjz
Member

The 1.5.2 GitHub release was removed, and the helm chart has rolled forward to 4.4.2 with controller version 1.5.1.

@JVMartin
Author

JVMartin commented Dec 30, 2022

@rikatz Thanks for that insight.

I think the webhook validations you describe must not exist yet (or at least didn't help during this bug), because while we had 4.4.1 deployed we were making small tweaks to the Ingress and applying them (kubectl apply -f ingress.yaml), and the apiserver was happily accepting them with no errors (though we couldn't get anything to actually work with regexp paths until we discovered the helm release and rolled back to 4.4.0).

Are you saying that the nginx ingress controller is no longer going to support regexps in Prefix path types? I'll have to research the ImplementationSpecific type, thanks.
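For anyone researching the same thing: the regex-style paths from the manifest above can be expressed with pathType: ImplementationSpecific plus the use-regex annotation. A minimal sketch based on that manifest (not an official migration guide):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: test-api
  namespace: test-production
  annotations:
    kubernetes.io/ingress.class: nginx
    cert-manager.io/cluster-issuer: letsencrypt-prod
    # Explicitly opt in to regex path matching.
    nginx.ingress.kubernetes.io/use-regex: "true"
    nginx.ingress.kubernetes.io/rewrite-target: /$1
spec:
  tls:
  - hosts:
    - test.com
    secretName: test-cert
  rules:
  - host: test.com
    http:
      paths:
      # Regex paths belong under ImplementationSpecific rather than Prefix.
      - path: /v1/?(.*)
        pathType: ImplementationSpecific
        backend:
          service:
            name: test-api
            port:
              number: 80
      - path: /?(.*)
        pathType: ImplementationSpecific
        backend:
          service:
            name: mirror
            port:
              number: 80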

@JVMartin
Author

Ohh sorry I misread - you said the validations we could also do, as in they don't exist yet. Got it.

@mycrEEpy

mycrEEpy commented Jan 5, 2023

I don't know if it's a good idea, but maybe you should also remove the container image for v1.5.2? Automation systems like Renovate will still pick up this new version and suggest the update to the user or, worse, automatically roll it out because it's a patch version.

@plevart

plevart commented Jan 5, 2023

So there currently isn't a release available for Kubernetes 1.26?

@longwuyuan
Contributor

@plevart You can see that the CI tests against K8S v1.26, but yes, we need to wait for the next release:

k8s: [v1.23.13, v1.24.7, v1.25.3, v1.26.0]

@longwuyuan
Contributor

Don't know if it's a good idea but maybe you should also remove the container image for v1.5.2? Automation systems like Renovate will still pick up this new version and suggest the update to the user or worse, automatically roll it out due to being a patch version.

@mycrEEpy did you mean removing it from user environments or from the k8s registry?

@mycrEEpy

mycrEEpy commented Jan 7, 2023

@longwuyuan I meant the k8s registry, as tools like Renovate scan the tags API of the registries. But as already said, removing it might be a big problem for people who are currently running v1.5.2 in their environments (assuming they don't have any problems from #9465).

@longwuyuan
Contributor

Ok, thanks. If you have the URL, can you try a docker pull and update here? We pulled back the release, and maybe I could try a docker pull myself too. But please help by doing that docker pull yourself and updating here.

@mycrEEpy

@longwuyuan The tag v1.5.2 is still listed for the registry k8s.gcr.io/ingress-nginx/controller:

~$ reg tags k8s.gcr.io/ingress-nginx/controller
INFO[0000] domain: k8s.gcr.io                           
INFO[0000] server address: k8s.gcr.io                   
sha256-34ee929b111ffc7aa426ffd409af44da48e5a0eea1eb2207994d9e0c0882d143.sig
sha256-3870522ed937c9efb94bfa31a7eb16009831567a0d4cbe01846fc5486d622655.sig
sha256-4ba73c697770664c1e00e9f968de14e08f606ff961c76e5d7033a4a9c593c629.sig
sha256-54f7fe2c6c5a9db9a0ebf1131797109bb7a4d91f56b9b362bde2abd237dd1974.sig
sha256-5516d103a9c2ecc4f026efbd4b40662ce22dc1f824fb129ed121460aaa5c47f8.sig
sha256-7059739637c30865f74cae403fffa55c2cb6d9779cd15654480dd0e4f850d536.sig
sha256-92115f5062568ebbcd450cd2cf9bffdef8df9fc61e7d5868ba8a7c9d773e0961.sig
sha256-d1707ca76d3b044ab8a28277a2466a02100ee9f58a86af1535a3edf9323ea1b5.sig
sha256-d8196e3bc1e72547c5dec66d6556c0ff92a23f6d0919b206be170bc90d5f9185.sig
v0.34.0
v0.34.1
.....
v1.4.0
v1.5.1
v1.5.2

@longwuyuan
Contributor

Please show the output of kubectl -n ingress-nginx get po -o yaml | grep -i image

@longwuyuan
Contributor

@strongjz I think the GCP console shows only the staging registry. Would you know of a way to remove the v1.5.2 layers?

@strongjz
Member

Once images are moved to production they cannot be removed. Please make sure you are pinning the tag and/or the container digests in the Helm values or static deploys.
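A minimal values.yaml sketch of that pinning for the Helm chart, assuming the standard chart layout (the digest below is a placeholder, not a real value):

controller:
  image:
    # Pin a known-good controller version explicitly so automated upgrades
    # cannot move the deployment to a broken tag.
    tag: "v1.5.1"
    # Optionally pin the exact image digest as well (placeholder shown).
    digest: "sha256:<known-good-digest>"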

@github-project-automation bot moved this from In Progress to Done in [SIG Network] Ingress NGINX Jan 10, 2023
@mycrEEpy

mycrEEpy commented Jan 10, 2023

@strongjz That's what I thought. But even when using digests, automation tools will still find the broken v1.5.2 via the registry API and suggest updates to users. It's not your problem, but I just wanted to let you know.

@strongjz
Member

Thanks for the feedback; it's unfortunate, and I apologize for the inconvenience. We're working on improving the release process.
