
Many prometheus metrics disappeared after upgrade #3053

Closed
gjcarneiro opened this issue Sep 6, 2018 · 32 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@gjcarneiro

gjcarneiro commented Sep 6, 2018

Is this a request for help? (If yes, you should use our troubleshooting guide and community support channels, see https://kubernetes.io/docs/tasks/debug-application-cluster/troubleshooting/.): no

What keywords did you search in NGINX Ingress controller issues before filing this one? (If you have found any duplicates, you should instead reply there.): grafana, prometheus


Is this a BUG REPORT or FEATURE REQUEST? (choose one): BUG REPORT

NGINX Ingress controller version: 0.19

Kubernetes version (use kubectl version): 1.11.0

Environment:

  • Cloud provider or hardware configuration: baremetal
  • OS (e.g. from /etc/os-release): ubuntu 16.04
  • Kernel (e.g. uname -a):
  • Install tools: quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.19.0
  • Others:

What happened:

I was on 0.17, I had a nice looking grafana dashboard. When I upgraded to 0.19, half of the panels have no data.

What you expected to happen:

metrics shouldn't disappear on upgrade.

How to reproduce it (as minimally and precisely as possible):
Scrape the prometheus metrics endpoint:

$ http get http://<POD_IP>:10254/metrics | grep nginx_ingress_controller_bytes

The grep returns nothing. I still get many metrics, but many others are missing. Here are the metrics it is returning now:

metrics.txt
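A minimal sketch of one way to reach the metrics endpoint without knowing the pod IP, assuming kubectl access; the namespace and pod name are placeholders:

    $ kubectl -n ingress-nginx port-forward <controller-pod> 10254:10254 &
    $ curl -s http://127.0.0.1:10254/metrics | grep nginx_ingress_controller_bytes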

Anything else we need to know:
Container arguments:

        - args:
          - /nginx-ingress-controller
          - --default-backend-service=$(POD_NAMESPACE)/default-http-backend
          - --configmap=$(POD_NAMESPACE)/nginx-configuration
          - --tcp-services-configmap=$(POD_NAMESPACE)/tcp-services
          - --udp-services-configmap=$(POD_NAMESPACE)/udp-services
          - --publish-service=$(POD_NAMESPACE)/ingress-nginx
          - --annotations-prefix=nginx.ingress.kubernetes.io
          - --enable-dynamic-configuration=false

And configmap:

    compute-full-forwarded-for: "true"
    disable-ipv6: "true"
    disable-ipv6-dns: "true"
    load-balance: ip_hash
    proxy-read-timeout: "3600"
    proxy-send-timeout: "3600"
    use-proxy-protocol: "true"
    worker-processes: "4"
    worker-shutdown-timeout: "43200"
@aledbf
Member

aledbf commented Sep 6, 2018

I was on 0.17, I had a nice looking grafana dashboard. When I upgraded to 0.19, half of the panels have no data.

This works as expected. The prometheus metrics are not persistent, which means that right after you upgrade the version there are no stats. You need traffic for data to show up in prometheus.

@aledbf aledbf closed this as completed Sep 6, 2018
@gjcarneiro
Author

But I do have traffic. This nginx IC is used in production and is working fine. The upgrade was over 24 hours ago and still no metrics.

Are you sure there is no regression in the code?... Note that I have --enable-dynamic-configuration=false, not sure if that matters or not..

@aledbf
Member

aledbf commented Sep 6, 2018

Are you sure there is no regression in the code?... Note that I have --enable-dynamic-configuration=false, not sure if that matters or not..

The metrics work with or without dynamic mode.

@aledbf aledbf reopened this Sep 6, 2018
@gjcarneiro
Author

I was able to reproduce in the dev environment. Downgrade to 0.18 -> it works. Go back to 0.19 -> metrics gone.

@danielfm

I hit this as well.

I was running 0.17.1 and upgraded to 0.19.0, and several metrics apparently stopped being reported by the metrics endpoint. As mentioned by @gjcarneiro, downgrading to 0.18.0 also restored the lost metrics on my end (I used the latest nginx-ingress chart version with both).

Metrics dump for both versions (some labels were redacted):
https://gist.github.com/danielfm/d429b8fa055671d6fccb1ee1c1863ab9

I checked the changelog between 0.18.0 and 0.19.0, but could not find any change that would explain this, so any help is greatly appreciated.

@sergelogvinov

I have the same problem. After upgrading to 0.19.0 I lost many metrics.

@opskumu

opskumu commented Sep 18, 2018

I have the same problem. After upgrading to 0.19.0 I lost many metrics.

@davidcodesido

Same here. New installation on 0.19.0; I've been struggling through all the metrics examples online without seeing them in my nginx-ingress controller, and now I've found this ticket.

@opskumu

opskumu commented Sep 21, 2018

I was using a custom nginx.tmpl template configuration with 0.18.0. When upgrading from 0.18.0 to 0.19.0, the template was not updated to the new version. After updating nginx.tmpl to match 0.19.0, the metrics work OK.
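A sketch of one way to refresh a custom template, assuming it is mounted from a ConfigMap and that the upstream file still lives at rootfs/etc/nginx/template/nginx.tmpl in the release tag (both are assumptions about the setup):

    $ curl -sLo nginx.tmpl https://raw.githubusercontent.com/kubernetes/ingress-nginx/nginx-0.19.0/rootfs/etc/nginx/template/nginx.tmpl
    # re-apply any local customizations, then update the ConfigMap the controller mounts
    $ kubectl -n ingress-nginx create configmap nginx-template --from-file=nginx.tmpl --dry-run -o yaml | kubectl apply -f -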

@danielfm

I'm not using custom templates in my configuration. In my case, as I said before, I tried the exact same configuration with the exact same nginx-ingress chart version both with v0.18.0 and v0.19.0, and several metrics stopped being reported in v0.19.0 (see the GitHub gist I posted earlier for more information).

@rlees85

rlees85 commented Oct 12, 2018

I've just hit this too....

Nginx Ingress: 0.20.0
Kubernetes: 1.9

Any work around?

edit: using standard templates, but with custom snippets
further edit: I am getting the ingress metrics, but NO Nginx metrics, specifically the 2xx/3xx/4xx/5xx status codes

@BlueBlue-Lee

I have the same problem in our production environment. Prometheus can't collect many metrics.

nginx-ingress-controller:0.20.0
kubernetes v1.10.0
--enable-dynamic-configuration=false

Also, enable-dynamic-configuration shouldn't have any relation to metrics, but it seems it does.

@aledbf @ElvinEfendi @nicksardo

@rlees85

rlees85 commented Oct 19, 2018

I'm using the default setting for enable-dynamic-configuration which I believe is true. I also tried setting it to false. Couldn't get all the metrics either way on 0.19.0 and above.

@bernardoVale

Same issue here on 0.20.0 and dev

@bernardoVale

it works fine on 0.18.0

@paalkr
Contributor

paalkr commented Oct 23, 2018

I have the same issue with 0.19.0 and 0.20.0; no problem with 0.18.0.

@ElvinEfendi
Member

Is everyone missing only Nginx metrics while having no issue with controller metrics? If not, is there a pattern to which metrics are missing?

Are you using a custom template?

Do you see one or more of the following messages in the logs?
error when setting up timer.every
omitting metrics for the request, current batch is full
error while encoding metrics

Do you see any other Nginx error in the logs?

Can you strace an Nginx worker and see whether it's writing to unix:/tmp/prometheus-nginx.socket, and what it's writing? It should be Nginx metrics such as HTTP status; the full list is in the local function metrics() definition in the monitor Lua module.

Can you also strace the controller process and see whether it's reading from the same socket?
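A minimal sketch of these checks, assuming you can exec into the controller pod and that strace and pgrep are available there (they may need to be installed first); the namespace and pod name are placeholders:

    # check the controller logs for the messages listed above
    $ kubectl -n ingress-nginx logs <controller-pod> | grep -E 'timer.every|current batch is full|error while encoding metrics'

    # inside the pod (kubectl exec -it <controller-pod> -- sh), trace an nginx worker
    # writing to the metrics socket...
    $ strace -f -e trace=connect,write -p "$(pgrep -f 'nginx: worker process' | head -n1)"

    # ...and the controller process reading from it
    $ strace -f -e trace=read -p "$(pgrep -f nginx-ingress-controller | head -n1)"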

@Thubo

Thubo commented Oct 31, 2018

I'm hitting the same issue on Version 0.20.0:

With --enable-dynamic-configuration=false the metrics are missing, while with --enable-dynamic-configuration=true metrics are exported as expected.

@sczizzo

sczizzo commented Nov 1, 2018

@ElvinEfendi I think I can provide some more context.

We noticed this after upgrading from chart version 0.17.2 to 0.29.1. We do indeed have --enable-dynamic-configuration=false set after upgrading to 0.29.1. We did not have that set previously, but this option was already false by default in 0.17.2.

(I don't think v0.17.2 or 0.29.1 is particularly important, it's just what we had deployed. As others have noted, this problem seems to have first appeared in chart version 0.19 and only with dynamic configuration disabled.)

So I took a snapshot of /metrics on both versions:

I then did a diff on the metric names:
https://gist.github.com/sczizzo/c9778cf758b7ee37d254db591ad0a57b#file-metric-names-diff-txt
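A sketch of how such a metric-name diff can be produced; the port-forwarded address and file names are placeholders:

    $ curl -s http://127.0.0.1:10254/metrics | grep -v '^#' | sed 's/[{ ].*//' | sort -u > names-0.17.2.txt
    # repeat against the other deployment, then:
    $ diff names-0.17.2.txt names-0.29.1.txt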

So to answer your first set of questions, we're definitely getting some metrics. It also appears the naming scheme changed a bit between these releases.

We used to be able to, say, get the total number of responses by server_zone like so:

sum(rate(nginx_responses_total{job="nginx-ingress"}[5m])) by (server_zone)

But I don't see any way to accomplish this given the new set of metrics.
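For reference, a possible rough equivalent under the newer metric names, assuming nginx_ingress_controller_requests (which carries a status label) is what supersedes nginx_responses_total; the Prometheus address and grouping label are placeholders:

    $ curl -sG 'http://<PROMETHEUS_HOST>:9090/api/v1/query' \
        --data-urlencode 'query=sum(rate(nginx_ingress_controller_requests{job="nginx-ingress"}[5m])) by (ingress)'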

As to your second set of questions, we did not see any instance of "error when setting up timer.every", but we did see lots of "omitting metrics for the request, current batch is full" and "error while encoding metrics". As far as I can tell, with dynamic configuration disabled, monitor.init_worker() is never called, so we wouldn't expect to see that timer.every error (perhaps init_worker should be running in this case?). Otherwise no errors really jump out at me.

I haven't spent too much time digging into the running containers yet, but with chart 0.17.2 there actually is no /tmp/prometheus-nginx.socket; apparently that was added later. With v0.29.1 I was able to run strace on both the nginx process and the controller process, in addition to nc -U; I didn't see any sign that the socket was being written to.

@begemotik

begemotik commented Nov 23, 2018

We're experiencing the same issue here with version 0.19.0. With a custom nginx.tmpl we need enable-dynamic-configuration explicitly set to false, but that leads to the missing metrics. Is there any work going on for this?

@gjcarneiro
Author

From the changelog, it seems that --enable-dynamic-configuration=false will disappear as an option, and dynamic configuration will always be enabled.

To be honest, I've had a poor experience with dynamic configuration enabled, so I have a lot of misgivings about this development path. So I guess, if --enable-dynamic-configuration=false is indeed causing the problem, they are going to remove that option, so problem gone... :(

@ElvinEfendi
Member

@gjcarneiro I know it can be frustrating when things don't work as expected. We are doing our best to make ingress-nginx better. There were valid reasons to switch to dynamic mode and many users benefit from it. Going forward, supporting both modes is not feasible for us few maintainers.

I see that you've referenced the previous version's changelog, but have you tried the latest version, https://github.com/kubernetes/ingress-nginx/releases/tag/nginx-0.21.0 (dynamic mode only)? We have fixed several bugs in that release.

I've had a poor experience of dynamic configuration enabled

Instead of taking a step backwards and going to non-dynamic mode, can you try the latest version (0.21.0) and let us know what the poor experience you were referring to was? There's nothing fundamentally broken with dynamic mode as far as I know, and it provides important benefits. We can all work together to fix the small issues that arise.

@vishksaj

I am facing the same issue on nginx-ingress-controller:0.20.0. All the major upstream-related and nginx metrics are gone from the scrape service endpoints. Has this issue been addressed in the latest version (0.21.0)? Any permanent solutions?

@YuraBeznos

We have different k8s clusters with ingress-nginx 0.21.0. (kubernetes 1.9.10)

The problem is that the nginx_ingress_controller_ssl_expire_time_seconds keys don't exist in one cluster but are available in another.

I am just getting /metrics from ingress-nginx via Kubernetes API (with kubectl proxy).

We run nginx-ingress-controller with parameters like:

--configmap=kube-extra/nginx-ingress-controller 
--default-ssl-certificate=kube-extra/wildcard-internal-ingress 
--tcp-services-configmap=kube-extra/nginx-ingress-controller-tcp-ports 
--sort-backends=true 
--annotations-prefix=ingress.kubernetes.io 
--enable-ssl-chain-completion=false

Configuration for the different clusters is almost the same (except domains/certificates etc.).

The main difference I can see between the clusters is errors like:

Error getting SSL certificate "somename/somedomain": local SSL certificate somename/somedomain was not found. Using default certificate

in the cluster where the nginx_ingress_controller_ssl_expire_time_seconds keys aren't available.

If somebody has any suggestions I'll be happy to hear them.
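A sketch of two checks that might narrow this down, with placeholder names: confirm the secret referenced in the error actually exists, and fetch /metrics through the API-server pod proxy to see whether the metric is exported at all:

    $ kubectl -n <somename> get secret <somedomain>
    $ kubectl proxy &
    $ curl -s http://127.0.0.1:8001/api/v1/namespaces/<controller-namespace>/pods/<controller-pod>:10254/proxy/metrics | grep nginx_ingress_controller_ssl_expire_time_seconds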

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 9, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 9, 2019
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jaksky

jaksky commented Mar 5, 2020

Can somebody share the resolution?

@gjcarneiro
Author

Well, meanwhile I upgraded to the latest version, and all metrics are there. I'm pretty sure this is fixed now. No idea which exact version fixed it.

@Eduardo1911

Eduardo1911 commented Sep 15, 2020

@gjcarneiro which version do you currently use?
I am on 0.26.1 and getting somewhat the same issue.

[redacted] redacted@ip-redacted:$ curl -sS 192.169.192.75:10254/metrics | wc -l
4437
[redacted] redacted@ip-redacted:$ curl -sS 192.169.192.75:10254/metrics | wc -l
1848

One deployment with 2 replicas; one of the pods is missing the ssl expiry metrics which I am interested in. This happens regardless of how many times I recreate the pod.
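A quick loop to compare the replicas, with the pod IPs as placeholders for whatever kubectl get pods -o wide reports:

    $ for ip in <POD_IP_1> <POD_IP_2>; do echo -n "$ip: "; curl -sS "$ip:10254/metrics" | grep -c '^nginx_ingress_controller_ssl_expire_time_seconds'; done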

@gjcarneiro
Author

Also using 0.26.1. I don't know if all the metrics have been preserved, but they're essentially there.

For the case of ssl expiry, we have metrics:

$ http get 10.134.4.40:10254/metrics | grep nginx_ingress_controller_ssl_expire_time_seconds | wc -l
49
