Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistent results between prometheusexporter and prometheusremotewrite #4975

Closed
albertteoh opened this issue May 8, 2021 · 11 comments
Closed

Comments

@albertteoh
Copy link
Contributor

Describe the bug
When using prometheusremotewrite to export metrics to M3, I'm getting latencies that are over 200 years when queried from M3.

However, when scraping these metrics from prometheus, the latencies look correct.

Am I configuring something incorrectly?

Steps to reproduce

  1. Run opentelemetry-collector-contrib with the config attached in this issue.
  2. Send spans to the collector using omnition/synthetic-load-generator, and with the spanmetrics processor sending metrics to a prometheus exporter.
  3. Have a prometheus server running to scrape metrics from the prometheus exporter.
  4. Within opentelemetry collector, a prometheus receiver will also scrape metrics from the same prometheus exporter above and send it to the configured prometheusremotewrite exporter that will write to an M3 instance.
  5. Setup grafana to query both data sources (prometheus server and M3) and graph both timeseries with the same query.

What did you expect to see?
Identical 95th percentile latencies, or at least close enough to one another.

What did you see instead?
Latencies from M3 were over 200 years, whereas from Prometheus, they were a more sensible ~200ms.

Here are two screenshots of the same query executed against Prometheus and M3 data sources respectively:

Prometheus
Screen Shot 2021-05-08 at 10 30 09 pm

M3
Screen Shot 2021-05-08 at 10 29 54 pm

To reduce the search space by ruling out M3 and spanmetrics processor as possible causes, I also checked the logs (these are from an earlier run):

Here, I log the total latency_count as well as the latency_bucket counts within spanmetrics processor. I've taken logs from two different times, 10 seconds apart and as you can see, the count is consistent with the sum of bucket_counts:

2021-05-08T18:54:20.947+1000    debug   [email protected]/processor.go:258        Latency metrics {"kind": "processor", "name": "spanmetrics", "key": "frontend\u0000/checkout\u0000SPAN_KIND_CLIENT\u0000STATUS_CODE_UNSET", "count": 2, "bucket_count": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0]}

2021-05-08T18:54:30.958+1000    debug   [email protected]/processor.go:258        Latency metrics {"kind": "processor", "name": "spanmetrics", "key": "frontend\u0000/checkout\u0000SPAN_KIND_CLIENT\u0000STATUS_CODE_UNSET", "count": 3, "bucket_count": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 1, 0, 0, 0, 0]}

However, this is the log output from the last metrics pipeline in the config below, i.e.:

    metrics:
      receivers: [prometheus]
      exporters: [prometheusremotewrite, logging]

As you can see the total count is 1 but the bucket count total is 2 + 1 = 3, and so I believe the +Inf tries to account for this discrepancy, resulting in -2 represented as the uint64 equivalent of 18446744073709551614. I have also seen versions in logs where the total count > sum of bucket counts, leading to a "positive" spillover +Inf count.

HistogramDataPoints open-telemetry/opentelemetry-collector#12
Data point labels:
     -> operation: /checkout
     -> service_name: frontend
     -> span_kind: SPAN_KIND_CLIENT
     -> status_code: STATUS_CODE_UNSET
StartTimestamp: 2021-05-08 08:54:22.433 +0000 UTC
Timestamp: 2021-05-08 08:54:32.437 +0000 UTC
Count: 1
Sum: 1708.000000
ExplicitBounds #0: 2.000000
ExplicitBounds open-telemetry/opentelemetry-collector#1: 6.000000
ExplicitBounds open-telemetry/opentelemetry-collector#2: 10.000000
ExplicitBounds open-telemetry/opentelemetry-collector#3: 100.000000
ExplicitBounds open-telemetry/opentelemetry-collector#4: 250.000000
ExplicitBounds open-telemetry/community#39: 300.000000
ExplicitBounds open-telemetry/opentelemetry-collector#6: 400.000000
ExplicitBounds open-telemetry/opentelemetry-collector#7: 800.000000
ExplicitBounds open-telemetry/opentelemetry-collector#8: 1000.000000
ExplicitBounds open-telemetry/opentelemetry-collector#9: 1400.000000
ExplicitBounds open-telemetry/opentelemetry-collector#10: 2000.000000
ExplicitBounds open-telemetry/opentelemetry-collector#11: 5000.000000
ExplicitBounds open-telemetry/opentelemetry-collector#12: 15000.000000
ExplicitBounds open-telemetry/opentelemetry-collector#13: 30000.000000
ExplicitBounds open-telemetry/opentelemetry-collector#14: 120000.000000
ExplicitBounds open-telemetry/opentelemetry-collector#15: 9223372036854.000000
Buckets #0, Count: 0
Buckets open-telemetry/opentelemetry-collector#1, Count: 0
Buckets open-telemetry/opentelemetry-collector#2, Count: 0
Buckets open-telemetry/opentelemetry-collector#3, Count: 0
Buckets open-telemetry/opentelemetry-collector#4, Count: 0
Buckets open-telemetry/community#39, Count: 0
Buckets open-telemetry/opentelemetry-collector#6, Count: 0
Buckets open-telemetry/opentelemetry-collector#7, Count: 0
Buckets open-telemetry/opentelemetry-collector#8, Count: 0
Buckets open-telemetry/opentelemetry-collector#9, Count: 0
Buckets open-telemetry/opentelemetry-collector#10, Count: 2
Buckets open-telemetry/opentelemetry-collector#11, Count: 1
Buckets open-telemetry/opentelemetry-collector#12, Count: 0
Buckets open-telemetry/opentelemetry-collector#13, Count: 0
Buckets open-telemetry/opentelemetry-collector#14, Count: 0
Buckets open-telemetry/opentelemetry-collector#15, Count: 0
Buckets open-telemetry/opentelemetry-collector#16, Count: 18446744073709551614

What version did you use?
Version: opentelemetry-collector-contrib@master

What config did you use?
Config: (e.g. the yaml config file)

receivers:
  prometheus:
    config:
      scrape_configs:
      - job_name: 'atm'
        scrape_interval: 10s
        static_configs:
        - targets: [ "0.0.0.0:8889" ]

  jaeger:
    protocols:
      thrift_http:
        endpoint: "0.0.0.0:14278"

  # Dummy receiver that's never used, because a pipeline is required to have one.
  otlp/spanmetrics:
    protocols:
      grpc:
        endpoint: "localhost:65535"

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"

  logging:
    loglevel: debug

  jaeger:
    endpoint: "localhost:14250"
    insecure: true

  prometheusremotewrite:
    endpoint: "http://localhost:7201/api/v1/prom/remote/write"
    insecure: true

  otlp/spanmetrics:
    endpoint: "localhost:55677"
    insecure: true


processors:
  batch:
  spanmetrics:
    metrics_exporter: prometheus
    # default (in ms): [2, 4, 6, 8, 10, 50, 100, 200, 400, 800, 1000, 1400, 2000, 5000, 10_000, 15_000]
    latency_histogram_buckets: [2ms, 6ms, 10ms, 100ms, 250ms, 300ms, 400ms, 800ms, 1s, 1.4s, 2s, 5s, 15s, 30s, 120s]

extensions:
  health_check:
  pprof:
    endpoint: :1888
  zpages:
    endpoint: :55679

service:
  extensions: [pprof, zpages, health_check]
  pipelines:
    traces:
      receivers: [jaeger]
      processors: [spanmetrics]
      exporters: [jaeger]
    # The exporter name must match the metrics_exporter name.
    # The receiver is just a dummy and never used; added to pass validation requiring at least one receiver in a pipeline.
    metrics/spanmetrics:
      receivers: [otlp/spanmetrics]
      exporters: [prometheus, logging]
    metrics:
      receivers: [prometheus]
      exporters: [prometheusremotewrite, logging]

Environment
OS: MacOS
Compiler(if manually compiled): go 1.16

Additional context
cc @bogdandrutu

@bogdandrutu bogdandrutu transferred this issue from open-telemetry/opentelemetry-collector Aug 30, 2021
@alolita alolita added the comp:prometheus Prometheus related issues label Sep 2, 2021
@ankitnayan
Copy link

@albertteoh Could you make this work? Or any workaround?

@albertteoh
Copy link
Contributor Author

Hi @ankitnayan, my workaround was to filter out any latencies > 24 hours, which isn't nice but it does the job for my use case at least.

@luistilingue
Copy link

luistilingue commented Mar 10, 2022

@albertteoh I have a similar issue, but using Prometheus.

spanMetricsProcessor is creating such bucket le="9.223372036854775e+12":

latency_bucket{http_status_code="200",operation="/health/xxxxx/health/**",service_name="xxxxx",span_kind="SPAN_KIND_SERVER",status_code="STATUS_CODE_UNSET",le="9.223372036854775e+12"} 1

I guess the code of spanMetricsProcessor need to deal with golang number conversion when using float64.

Please have a look at these lines https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/processor/spanmetricsprocessor/processor.go#L140

I've simulated here and the behavior was the same.

So, when Prometheus read that "number" it behavior in such way. I don't know if this doc can help.

@ankitnayan
Copy link

@luistilingue I upgraded to v0.43.0 and it seems to have been fixed. Which version are you using?

@luistilingue
Copy link

@ankitnayan I was using 0.40.0 and I upgraded all my stack (otel collector to 0.46.0, prometheus to 2.33.5, and javaagent to 1.11.1), but the issue still occurs.

hex1848 pushed a commit to hex1848/opentelemetry-collector-contrib that referenced this issue Jun 2, 2022
Bump github.com/klauspost/compress from 1.14.4 to 1.15.0
Bump github.com/shirou/gopsutil/v3 from 3.22.1 to 3.22.2
Bump go.uber.org/multierr from 1.7.0 to 1.8.0
@balintzs
Copy link
Contributor

balintzs commented Oct 19, 2022

Hello all, I believe this is caused by a bug we found in prometheus that causes the +Inf bucket to be added incorrectly, which in turn results in a negative number when converting cumulative datapoints to delta: prometheus/client_golang#1147

We faced an issue whereby New Relic dropped our datapoints because of this. The issue existed with 0.61.0 but became much worse with 0.62.0. We built a custom image updating github.com/prometheus/client_golang to the SHA version (dcea97eee2b3257f34fd3203cb922eedeabb42a6) that contained our fix and the issue disappeared:
Screenshot 2022-10-19 at 12 48 49

cc @TylerHelmuth

@balintzs
Copy link
Contributor

@github-actions
Copy link
Contributor

Pinging code owners: @Aneurysm9. See Adding Labels via Comments if you do not have permissions to add labels yourself.

1 similar comment
@github-actions
Copy link
Contributor

Pinging code owners: @Aneurysm9. See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions
Copy link
Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the Stale label Dec 19, 2022
@github-actions
Copy link
Contributor

This issue has been closed as inactive because it has been stale for 120 days with no activity.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Mar 19, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants