
Metrics of type "counter" with labels of differing values are not shipped to Newrelic #39

Closed
ranimufid opened this issue Mar 5, 2020 · 13 comments
Assignees
douglascamata

Labels
feature request: Categorizes issue or PR as related to a new feature or enhancement.
support: Categorizes issue or PR as a support question.

Comments

@ranimufid commented Mar 5, 2020

Background
I am currently attempting to scrape Prometheus metrics from Pulsar components and ship them to New Relic.

  • nri-prometheus image version: newrelic/nri-prometheus:1.3.0

  • config map:

scrape_duration: "20s"
verbose: false
insecure_skip_verify: true
scrape_enabled_label: "prometheus.io/scrape"
require_scrape_enabled_label_for_nodes: true

Issue 1 🚨
It seems that nri-prometheus is unable to process counter metrics that have labels with differing values:

Does not arrive at New Relic 👎

bookie_journal_JOURNAL_SYNC_count{success="false"} 0
bookie_journal_JOURNAL_SYNC_count{success="true"} 24684154

Arrives at New Relic 👍

bookie_WRITE_BYTES 20991304885

Issue 2 🚨
We also see tonnes of error messages in the logs for metrics of type summary with NaN values.

Example

{"err":"invalid float is NaN","message":"invalid gauge field","name":"bookie_journal_JOURNAL_CREATION_LATENCY.percentiles"}

And the corresponding metric values:

# TYPE bookie_journal_JOURNAL_CREATION_LATENCY summary
bookie_journal_JOURNAL_CREATION_LATENCY{success="false",quantile="0.5"} NaN
bookie_journal_JOURNAL_CREATION_LATENCY{success="false",quantile="0.75"} NaN
bookie_journal_JOURNAL_CREATION_LATENCY{success="false",quantile="0.95"} NaN
bookie_journal_JOURNAL_CREATION_LATENCY{success="false",quantile="0.99"} NaN
bookie_journal_JOURNAL_CREATION_LATENCY{success="false",quantile="0.999"} NaN
bookie_journal_JOURNAL_CREATION_LATENCY{success="false",quantile="0.9999"} NaN
bookie_journal_JOURNAL_CREATION_LATENCY{success="false",quantile="1.0"} -Infinity

Help understanding both of the above behaviours would be greatly appreciated! :)

@ranimufid changed the title from 'Metrics of type "counter" & with labels of differing values are not shipped to Newrelic' to 'Metrics of type "counter" with labels of differing values are not shipped to Newrelic' Mar 5, 2020
@douglascamata commented Mar 10, 2020

Hi @ranimufid,

On Issue 1:

Do you know if these counters are changing over time?

Counters are stored in New Relic by the variation in the metric between two different runs of the integration. For example, if we get 24684154 on the first run for a given metric and the value is incremented by 1 in the second run, 1 will be sent to New Relic. If the value doesn't change, zeroes will be sent.

I'm working on confirming if this is the intended behaviour.
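
For illustration, here is a minimal sketch of the delta conversion described above. This is a simplified model, not the actual nri-prometheus code:

package main

import "fmt"

// deltaCalculator keeps the last observed value of each time series
// and emits the difference between consecutive scrapes.
type deltaCalculator struct {
	previous map[string]float64
}

func (d *deltaCalculator) delta(series string, current float64) float64 {
	delta := current - d.previous[series]
	d.previous[series] = current
	return delta
}

func main() {
	d := &deltaCalculator{previous: map[string]float64{}}
	// First run: no baseline yet (real code would skip or zero this).
	d.delta(`success="true"`, 24684154)
	// Second run: the counter was incremented by 1, so 1 is sent.
	fmt.Println(d.delta(`success="true"`, 24684155)) // 1
	// A counter that never changes produces zeroes.
	fmt.Println(d.delta(`success="false"`, 0)) // 0
}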

On Issue 2:

NaNs and infinities are not sent to New Relic. This integration uses the go-telemetry-sdk, and the Telemetry SDK specs state the following:

(...) in languages where NaN or Infinity can be represented these values may be stored but can not be correctly marshalled to JSON and thus are dropped when JSON marshalling occurs because it violates the safety of the payload.
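
This is standard behaviour of Go's encoding/json package, which refuses to marshal NaN or infinite floats, as this quick sketch shows:

package main

import (
	"encoding/json"
	"fmt"
	"math"
)

func main() {
	// encoding/json cannot represent NaN or Inf in a JSON payload.
	_, err := json.Marshal(math.NaN())
	fmt.Println(err) // json: unsupported value: NaN
}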

@douglascamata self-assigned this Mar 10, 2020
@douglascamata added the bug, help wanted and support labels Mar 10, 2020
@ranimufid (Author)

Thanks for your response @douglascamata!

Noted on Issue 2.

Issue 1
I can confirm that the metric values change over time:

# Tue Mar 10 15:38:08 UTC 2020
bookie_journal_JOURNAL_SYNC_count{success="false"} 0
bookie_journal_JOURNAL_SYNC_count{success="true"} 210739
# Tue Mar 10 15:38:20 UTC 2020
bookie_journal_JOURNAL_SYNC_count{success="false"} 0
bookie_journal_JOURNAL_SYNC_count{success="true"} 210740

The problem is also that the metric doesn't show up in New Relic at all. I can't find any instance of bookie_journal_JOURNAL_SYNC_count. This is the NRQL I used to verify:

FROM Metric SELECT uniques(metricName) where clusterName='aud-pulsar-testing' and metricName='bookie_journal_JOURNAL_SYNC_count'

@douglascamata

Interesting!

I cannot reproduce it locally using the exact same configuration as you. Do you mind checking for ingest errors using the following query: SELECT count(*) FROM NrIntegrationError WHERE newRelicFeature ='Metrics' facet category, message limit 1000 since 24 hours ago?

@ranimufid (Author)

I get the following message on executing the query you shared:

No events found -- do you have the correct event type and time range?

In my initial attempt, I ran nri-prometheus as a Kubernetes deployment. I've now started it as a Docker container that scrapes some remote endpoints. Sadly, I still observe the same behaviour 😢 This is my new config:

cluster_name: "aud-pulsar-testing"
scrape_duration: "20s"
scrape_timeout: "5s"
verbose: true
insecure_skip_verify: false
scrape_enabled_label: "prometheus.io/scrape"
require_scrape_enabled_label_for_nodes: false

targets:
- description: Pulsar Broker URLs
  urls: ["http://url:8080/metrics/","http://url:8080/metrics/","http://url:8080/metrics/"]
- description: Bookkeeper URLs
  urls: ["http://url:8000/metrics","http://url:8000/metrics","http://url:8000/metrics"]
- description: Zookeeper URLs
  urls: ["http://url:8000/metrics","http://url:8000/metrics","http://url:8000/metrics"]

When attempting to reproduce the issue locally on your end, did you supply your code with the same metrics I shared?

bookie_journal_JOURNAL_SYNC_count{success="false"} 0
bookie_journal_JOURNAL_SYNC_count{success="true"} 210739

Is there anything that I can provide you which can be of help in troubleshooting this issue?

@douglascamata

I'm trying it locally with exactly these two metrics, giving them the counter type, and using a "static file exporter":

############################################
# Prometheus exporter for K8s that serves
# metrics from a plain text file
############################################
apiVersion: apps/v1
kind: Deployment
metadata:
  name: from-file-prometheus-exporter
  labels:
    app: from-file-prometheus-exporter
spec:
  replicas: 1
  selector:
    matchLabels:
      app: from-file-prometheus-exporter
  template:
    metadata:
      labels:
        app: from-file-prometheus-exporter
        prometheus.io/scrape: "true"
    spec:
      containers:
        - name: from-file-prometheus-exporter
          image: python:alpine3.9
          env:
            - name: METRICS_FILE_URL
              value: "<URL_TO_DOWNLOAD_FILE>" # You can use private gist url.
          ports:
            - name: metrics
              containerPort: 8080
          command: ["/bin/sh","-c"]
          # The reason for using a URL instead of a ConfigMap is that the latter has a size limit of 1MB
          args: ["wget $METRICS_FILE_URL -O /etc/from-file-prometheus-exporter/metrics; python -m http.server -b 0.0.0.0 -d /etc/from-file-prometheus-exporter/ 8080"]
          volumeMounts:
            - mountPath: /etc/from-file-prometheus-exporter/
              name: metrics-dir
          readinessProbe:
            httpGet:
              path: /
              port: metrics
            initialDelaySeconds: 10
            periodSeconds: 15
      volumes:
        - name: metrics-dir
          emptyDir: {}

My static file looks like this:

# TYPE bookie_journal_JOURNAL_SYNC_count counter
bookie_journal_JOURNAL_SYNC_count{success="false"} 0
bookie_journal_JOURNAL_SYNC_count{success="true"} 24684154

And my config.yml file in the config map has this:

    cluster_name: "zdcamata-pomi"
    scrape_duration: "20s"
    scrape_timeout: "1m"
    verbose: true
    scrape_enabled_label: "prometheus.io/scrape"
    require_scrape_enabled_label_for_nodes: true
    transformations:
      - description: "General processing rules"
        add_attributes:
          - metric_prefix: ""
            attributes:
              my_extra_attr: "my-value"
        rename_attributes:
          - metric_prefix: ""
            attributes:
              container_name: "containerName"
              pod_name: "podName"
              namespace: "namespaceName"
              node: "nodeName"
              container: "containerName"
              pod: "podName"
              deployment: "deploymentName"
        ignore_metrics:
          - prefixes:
              - go_
              - http_
              - process_

Can you try this, please, and tell me if it works? Also have a look at the logs to see if there is something weird -- it will be in verbose mode.

Signing off until tomorrow's working hours in CET. 👋

@douglascamata

Ah, something else: enabling verbose logs in your current setup and sending them over might help. Remember to redact any information you might not want to share.

@ranimufid (Author) commented Mar 11, 2020

Hey @douglascamata. I set up the static metric exporter like you said and pointed my nri-prometheus Docker container at that static endpoint. Here are my observations:

Take 1
I dumped the entire payload returned by the metrics endpoint into the static file and ended up observing the same behaviour I reported initially: certain metrics do not end up in New Relic.

Take 2
I removed all metrics except the failing ones from the static metrics file, and the desired metrics started showing up in New Relic. My suspicion is that a metric or comment on a line before or after my desired metrics causes nri-prometheus to break. I have attached the full Prometheus payload so you can simulate exactly what nri-prometheus gets from my upstream servers.

From the provided file, the following are the metrics I'd like to see in New Relic, but which don't get shipped:

bookkeeper_server_ADD_ENTRY_count
bookkeeper_server_READ_ENTRY_count
bookkeeper_server_ADD_ENTRY_REQUEST
bookkeeper_server_READ_ENTRY_REQUEST

bookkeeper-metrics.log

@douglascamata

@ranimufid thanks for the update! I'll have a look at this and get back to you soon. Our latest release, v1.3.0, was exactly the one where we upgraded the Go Telemetry SDK for the NaN/Infinity support. There could be something there 🕵

@douglascamata

Aaaand I'm back!

I double checked our (New Relic's) specs on quantized metric types (histograms and percentiles, mapping to Prometheus histograms and summaries), checked the code and spoke to some colleagues.

We are not sending any *_count or *_sum of Summary metrics to New Relic.
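
For context, in the Prometheus exposition format a summary family emits quantile series plus *_count and *_sum series, so if the upstream family is a summary, a series such as bookie_journal_JOURNAL_SYNC_count would be its count component rather than a standalone counter. An illustrative exposition (values made up):

# TYPE bookie_journal_JOURNAL_SYNC summary
bookie_journal_JOURNAL_SYNC{success="true",quantile="0.5"} 0.004
bookie_journal_JOURNAL_SYNC_count{success="true"} 24684154
bookie_journal_JOURNAL_SYNC_sum{success="true"} 102345.6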

We are aware that the support for these metric types is underwhelming. Please note that it is a work in progress and will improve in the future.

@ranimufid (Author)

Thanks for your speedy response @douglascamata! May I ask whether you have a rough plan for when these metrics will be incorporated into nri-prometheus?

That aside, are you aware of any alternatives for getting these metrics shipped to New Relic?

@douglascamata

Unfortunately our roadmap for FY20 (fiscal year 20, starting in April) has not been decided yet, so I don't even have a rough plan to share, sorry. 😞

My recommendation for getting these counts shipped right now very likely isn't practical: it would involve using the Go Telemetry SDK directly to parse the Prometheus metrics at a lower level and send them to New Relic.
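
A minimal sketch of that approach, assuming the API of github.com/newrelic/newrelic-telemetry-sdk-go (exact types and options may differ between SDK versions):

package main

import (
	"context"
	"os"
	"time"

	"github.com/newrelic/newrelic-telemetry-sdk-go/telemetry"
)

func main() {
	// Create a harvester with your insert key.
	harvester, err := telemetry.NewHarvester(
		telemetry.ConfigAPIKey(os.Getenv("NEW_RELIC_INSERT_KEY")),
	)
	if err != nil {
		panic(err)
	}

	// Record the per-interval delta of a counter you parsed yourself
	// from the Prometheus exposition payload.
	harvester.RecordMetric(telemetry.Count{
		Name:       "bookie_journal_JOURNAL_SYNC_count",
		Attributes: map[string]interface{}{"success": "true"},
		Value:      1, // delta between two scrapes
		Timestamp:  time.Now(),
		Interval:   20 * time.Second,
	})

	// Flush immediately instead of waiting for the harvest period.
	harvester.HarvestNow(context.Background())
}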

@douglascamata added the feature request label and removed the bug label Mar 20, 2020
@douglascamata

@ranimufid this should be improved by #54, which is in the v2.0.0-rc1 pre-release. It contains more information about the changes being done to histograms and summaries.

@ranimufid (Author)

Hey @douglascamata , thanks for getting back to me on this!

I've updated my nri-prometheus deployment to use the latest 2.0.0 image; however, I'm still unable to see the following metrics in New Relic:

bookkeeper_server_ADD_ENTRY_count
bookkeeper_server_READ_ENTRY_count

This is the NRQL I used to check:

FROM Metric select uniques(metricName) where integrationName = 'nri-prometheus' and metricName='bookkeeper_server_ADD_ENTRY_count'

It doesn't return any records.

I am, however, now able to see these metrics, so it seems we're somehow on the right track:

bookkeeper_server_ADD_ENTRY_REQUEST
bookkeeper_server_READ_ENTRY_REQUEST
