
Many prometheus metrics disappeared after upgrade #3053

Closed
gjcarneiro opened this issue Sep 6, 2018 · 32 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@gjcarneiro

gjcarneiro commented Sep 6, 2018

Is this a request for help? (If yes, you should use our troubleshooting guide and community support channels, see https://kubernetes.io/docs/tasks/debug-application-cluster/troubleshooting/.): no

What keywords did you search in NGINX Ingress controller issues before filing this one? (If you have found any duplicates, you should instead reply there.): grafana, prometheus


Is this a BUG REPORT or FEATURE REQUEST? (choose one): BUG REPORT

NGINX Ingress controller version: 0.19

Kubernetes version (use kubectl version): 1.11.0

Environment:

  • Cloud provider or hardware configuration: baremetal
  • OS (e.g. from /etc/os-release): ubuntu 16.04
  • Kernel (e.g. uname -a):
  • Install tools: quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.19.0
  • Others:

What happened:

I was on 0.17, I had a nice looking grafana dashboard. When I upgraded to 0.19, half of the panels have no data.

What you expected to happen:

metrics shouldn't disappear on upgrade.

How to reproduce it (as minimally and precisely as possible):
Scrape the prometheus metrics endpoint:

$ http get http://<POD_IP>:10254/metrics | grep nginx_ingress_controller_bytes

The grep returns nothing. I still get many metrics, but many others are missing. Here are the metrics it is returning now:

metrics.txt
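A minimal sketch of one way to reach the metrics endpoint without knowing the pod IP, assuming kubectl access; the namespace and pod name are placeholders:

    $ kubectl -n ingress-nginx port-forward <controller-pod> 10254:10254 &
    $ curl -s http://127.0.0.1:10254/metrics | grep nginx_ingress_controller_bytes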

Anything else we need to know:
Container arguments:

        - args:
          - /nginx-ingress-controller
          - --default-backend-service=$(POD_NAMESPACE)/default-http-backend
          - --configmap=$(POD_NAMESPACE)/nginx-configuration
          - --tcp-services-configmap=$(POD_NAMESPACE)/tcp-services
          - --udp-services-configmap=$(POD_NAMESPACE)/udp-services
          - --publish-service=$(POD_NAMESPACE)/ingress-nginx
          - --annotations-prefix=nginx.ingress.kubernetes.io
          - --enable-dynamic-configuration=false

And configmap:

    compute-full-forwarded-for: "true"
    disable-ipv6: "true"
    disable-ipv6-dns: "true"
    load-balance: ip_hash
    proxy-read-timeout: "3600"
    proxy-send-timeout: "3600"
    use-proxy-protocol: "true"
    worker-processes: "4"
    worker-shutdown-timeout: "43200"
@aledbf
Member

aledbf commented Sep 6, 2018

I was on 0.17, I had a nice looking grafana dashboard. When I upgraded to 0.19, half of the panels have no data.

This works as expected. The prometheus metrics are not persistent, which means that right after you upgrade the version there are no stats. You need traffic for data to show up in prometheus.

@aledbf aledbf closed this as completed Sep 6, 2018
@gjcarneiro
Author

But I do have traffic. This nginx IC is used in production and is working fine. The upgrade was over 24 hours ago and still no metrics.

Are you sure there is no regression in the code?... Note that I have --enable-dynamic-configuration=false, not sure if that matters or not..

@aledbf
Member

aledbf commented Sep 6, 2018

Are you sure there is no regression in the code?... Note that I have --enable-dynamic-configuration=false, not sure if that matters or not..

The metrics work with or without dynamic mode.

@aledbf aledbf reopened this Sep 6, 2018
@gjcarneiro
Author

I was able to reproduce in the dev environment. Downgrade to 0.18 -> it works. Go back to 0.19 -> metrics gone.

@danielfm

I hit this as well.

I was running 0.17.1 and upgraded to 0.19.0, and several metrics apparently stopped being reported by the metrics endpoint. As mentioned by @gjcarneiro, downgrading to 0.18.0 also restored the lost metrics on my end (I used the latest nginx-ingress chart version with both).

Metrics dump for both versions (some labels were redacted):
https://gist.github.com/danielfm/d429b8fa055671d6fccb1ee1c1863ab9

I checked the changelog between 0.18.0 and 0.19.0, but could not find any change that would explain this, so any help is greatly appreciated.

@sergelogvinov

I have the same problem. After upgrading to 0.19.0 I lost many metrics.

@opskumu

opskumu commented Sep 18, 2018

I have the same problem. After upgrading to 0.19.0 I lost many metrics.

@davidcodesido

Same here. New installation on 0.19.0; I've been struggling through all the metrics examples online without seeing them in my nginx-ingress controller, and now I've found this ticket.

@opskumu

opskumu commented Sep 21, 2018

I was using a custom nginx.tmpl template configuration with 0.18.0. When upgrading from 0.18.0 to 0.19.0, the template was not updated to the new version. After updating nginx.tmpl to match 0.19.0, the metrics work OK.
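A sketch of one way to refresh a custom template, assuming it is mounted from a ConfigMap and that the upstream file still lives at rootfs/etc/nginx/template/nginx.tmpl in the release tag (both are assumptions about the setup):

    $ curl -sLo nginx.tmpl https://raw.githubusercontent.com/kubernetes/ingress-nginx/nginx-0.19.0/rootfs/etc/nginx/template/nginx.tmpl
    # re-apply any local customizations, then update the ConfigMap the controller mounts
    $ kubectl -n ingress-nginx create configmap nginx-template --from-file=nginx.tmpl --dry-run -o yaml | kubectl apply -f -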

@danielfm

I'm not using custom templates in my configuration. In my case, as I said before, I tried the exact same configuration with the exact same nginx-ingress chart version both with v0.18.0 and v0.19.0, and several metrics stopped being reported in v0.19.0 (see the GitHub gist I posted earlier for more information).

@rlees85

rlees85 commented Oct 12, 2018

I've just hit this too....

Nginx Ingress: 0.20.0
Kubernetes: 1.9

Any work around?

edit: using standard templates, but with custom snippets
further edit: I am getting the ingress metrics, but NO Nginx metrics, specifically the 2xx/3xx/4xx/5xx status codes

@BlueBlue-Lee

I have the same problem in our production environment. Prometheus can't collect many metrics.

nginx-ingress-controller:0.20.0
kubernetes v1.10.0
--enable-dynamic-configuration=false

Also, enable-dynamic-configuration shouldn't have any relation to metrics, but it seems it does.

@aledbf @ElvinEfendi @nicksardo

@rlees85

rlees85 commented Oct 19, 2018

I'm using the default setting for enable-dynamic-configuration which I believe is true. I also tried setting it to false. Couldn't get all the metrics either way on 0.19.0 and above.

@bernardoVale

Same issue here on 0.20.0 and dev

@bernardoVale

it works fine on 0.18.0

@paalkr
Contributor

paalkr commented Oct 23, 2018

I have the same issue with 0.19.0 and 0.20.0; no problem with 0.18.0.

@ElvinEfendi
Member

Is everyone missing only Nginx metrics while having no issue with controller metrics? If not, is there a pattern to which metrics are missing?

Are you using a custom template?

Do you see one or more of the following messages in the logs?
error when setting up timer.every
omitting metrics for the request, current batch is full
error while encoding metrics

Do you see any other Nginx error in the logs?

Can you strace an Nginx worker and see whether it's writing to unix:/tmp/prometheus-nginx.socket, and what it's writing? It should be Nginx metrics such as HTTP status; the full list is in the local function metrics() definition in the monitor Lua module.

Can you also strace the controller process and see whether it's reading from the same socket?
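A minimal sketch of these checks, assuming you can exec into the controller pod and that strace and pgrep are available there (they may need to be installed first); the namespace and pod name are placeholders:

    # check the controller logs for the messages listed above
    $ kubectl -n ingress-nginx logs <controller-pod> | grep -E 'timer.every|current batch is full|error while encoding metrics'

    # inside the pod (kubectl exec -it <controller-pod> -- sh), trace an nginx worker
    # writing to the metrics socket...
    $ strace -f -e trace=connect,write -p "$(pgrep -f 'nginx: worker process' | head -n1)"

    # ...and the controller process reading from it
    $ strace -f -e trace=read -p "$(pgrep -f nginx-ingress-controller | head -n1)"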

@Thubo

Thubo commented Oct 31, 2018

I'm hitting the same issue on Version 0.20.0:

With --enable-dynamic-configuration=false the metrics are missing, while with --enable-dynamic-configuration=true metrics are exported as expected.

@sczizzo

sczizzo commented Nov 1, 2018

@ElvinEfendi I think I can provide some more context.

We noticed this after upgrading from chart version 0.17.2 to 0.29.1. We do indeed have --enable-dynamic-configuration=false set after upgrading to 0.29.1. We did not have that set previously, but this option was already false by default in 0.17.2.

(I don't think v0.17.2 or 0.29.1 is particularly important, it's just what we had deployed. As others have noted, this problem seems to have first appeared in chart version 0.19 and only with dynamic configuration disabled.)

So I took a snapshot of /metrics on both versions:

I then did a diff on the metric names:
https://gist.github.com/sczizzo/c9778cf758b7ee37d254db591ad0a57b#file-metric-names-diff-txt
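A sketch of how such a metric-name diff can be produced; the port-forwarded address and file names are placeholders:

    $ curl -s http://127.0.0.1:10254/metrics | grep -v '^#' | sed 's/[{ ].*//' | sort -u > names-0.17.2.txt
    # repeat against the other deployment, then:
    $ diff names-0.17.2.txt names-0.29.1.txt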

So to answer your first set of questions, we're definitely getting some metrics. It also appears the naming scheme changed a bit between these releases.

We used to be able to, say, get the total number of responses by server_zone like so:

sum(rate(nginx_responses_total{job="nginx-ingress"}[5m])) by (server_zone)

But I don't see any way to accomplish this given the new set of metrics.
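For reference, a possible rough equivalent under the newer metric names, assuming nginx_ingress_controller_requests (which carries a status label) is what supersedes nginx_responses_total; the Prometheus address and grouping label are placeholders:

    $ curl -sG 'http://<PROMETHEUS_HOST>:9090/api/v1/query' \
        --data-urlencode 'query=sum(rate(nginx_ingress_controller_requests{job="nginx-ingress"}[5m])) by (ingress)'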

As to your second set of questions, we did not see any instance of "error when setting up timer.every", but we did see lots of "omitting metrics for the request, current batch is full" and "error while encoding metrics". As far as I can tell, with dynamic configuration disabled, monitor.init_worker() is never called, so we wouldn't expect to see that timer.every error (perhaps init_worker should be running in this case?). Otherwise no errors really jump out at me.

I haven't spent too much time digging into the running containers yet, but with chart 0.17.2 there actually is no /tmp/prometheus-nginx.socket; apparently that was added later. With v0.29.1 I was able to run strace on both the nginx process and the controller process, in addition to nc -U; I didn't see any sign that the socket was being written to.

@begemotik

begemotik commented Nov 23, 2018

We're experiencing the same issue here with version 0.19.0. With a custom nginx.tmpl we need enable-dynamic-configuration explicitly set to false, but that leads to the missing metrics. Is there any work going on for this?

@gjcarneiro
Author

From the changelog, it seems that --enable-dynamic-configuration=false will disappear as an option, and dynamic configuration will always be enabled.

To be honest, I've had a poor experience with dynamic configuration enabled, so I have a lot of misgivings about this development path. So I guess, if --enable-dynamic-configuration=false is indeed causing the problem, they are going to remove that option, so problem gone... :(

@ElvinEfendi
Member

@gjcarneiro I know it can be frustrating when things don't work as expected. We are doing our best to make ingress-nginx better. There were valid reasons to switch to dynamic mode and many users benefit from it. Going forward, supporting both modes is not feasible for us few maintainers.

I see that you've referenced the previous version's changelog, but have you tried the latest version, https://github.com/kubernetes/ingress-nginx/releases/tag/nginx-0.21.0 (dynamic mode only)? We have fixed several bugs in that release.

I've had a poor experience of dynamic configuration enabled

Instead of taking a step backwards and going to non-dynamic mode, can you try the latest version (0.21.0) and let us know what the poor experience you were referring to was? There's nothing fundamentally broken with dynamic mode as far as I know, and it provides important benefits. We can all work together to fix the small issues that arise.

@vishksaj

I am facing the same issue on nginx-ingress-controller:0.20.0. All the major upstream-related and nginx metrics are gone from the scrape service endpoints. Has this issue been addressed in the latest version (0.21.0)? Any permanent solutions?

@YuraBeznos

We have different k8s clusters with ingress-nginx 0.21.0. (kubernetes 1.9.10)

The problem is that the nginx_ingress_controller_ssl_expire_time_seconds keys don't exist in one cluster but are available in another.

I am just getting /metrics from ingress-nginx via Kubernetes API (with kubectl proxy).

We run nginx-ingress-controller with parameters like:

--configmap=kube-extra/nginx-ingress-controller 
--default-ssl-certificate=kube-extra/wildcard-internal-ingress 
--tcp-services-configmap=kube-extra/nginx-ingress-controller-tcp-ports 
--sort-backends=true 
--annotations-prefix=ingress.kubernetes.io 
--enable-ssl-chain-completion=false

Configuration for the different clusters is almost the same (except domains/certificates etc.).

The main difference I can see between the clusters is errors like:

Error getting SSL certificate "somename/somedomain": local SSL certificate somename/somedomain was not found. Using default certificate

in the cluster where the nginx_ingress_controller_ssl_expire_time_seconds keys aren't available.

If somebody has any suggestions I'll be happy to hear them.
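A sketch of two checks that might narrow this down, with placeholder names: confirm the secret referenced in the error actually exists, and fetch /metrics through the API-server pod proxy to see whether the metric is exported at all:

    $ kubectl -n <somename> get secret <somedomain>
    $ kubectl proxy &
    $ curl -s http://127.0.0.1:8001/api/v1/namespaces/<controller-namespace>/pods/<controller-pod>:10254/proxy/metrics | grep nginx_ingress_controller_ssl_expire_time_seconds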

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 9, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 9, 2019
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jaksky

jaksky commented Mar 5, 2020

Can somebody share the resolution?

@gjcarneiro
Author

Well, meanwhile I upgraded to the latest version, and all metrics are there. I'm pretty sure this is fixed now. No idea which exact version fixed it.

@Eduardo1911

Eduardo1911 commented Sep 15, 2020

@gjcarneiro which version do you currently use?
I am on 0.26.1 and getting somewhat the same issue.

[redacted] redacted@ip-redacted:$ curl -sS 192.169.192.75:10254/metrics | wc -l
4437
[redacted] redacted@ip-redacted:$ curl -sS 192.169.192.75:10254/metrics | wc -l
1848

One deployment with 2 replicas; one of the pods is missing the ssl expiry metrics which I am interested in. This happens regardless of how many times I recreate the pod.
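A quick loop to compare the replicas, with the pod IPs as placeholders for whatever kubectl get pods -o wide reports:

    $ for ip in <POD_IP_1> <POD_IP_2>; do echo -n "$ip: "; curl -sS "$ip:10254/metrics" | grep -c '^nginx_ingress_controller_ssl_expire_time_seconds'; done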

@gjcarneiro
Author

Also using 0.26.1. I don't know if all the metrics have been preserved, but they're essentially there.

For the case of ssl expiry, we have metrics:

$ http get 10.134.4.40:10254/metrics | grep nginx_ingress_controller_ssl_expire_time_seconds | wc -l
49
