Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix histogram consistency in PrometheusMeterRegistry #5193

Closed
rhatlapa opened this issue Jun 7, 2024 · 7 comments
Closed

Fix histogram consistency in PrometheusMeterRegistry #5193

rhatlapa opened this issue Jun 7, 2024 · 7 comments
Labels
bug A general bug registry: prometheus A Prometheus Registry related issue
Milestone

Comments

@rhatlapa
Copy link

rhatlapa commented Jun 7, 2024

As part of upgrade to spring boot 3.3 with micrometer 1.13.0 we are seeing issue similar to #4988, see the stacktrace below. I wasn't able to create separate reproducer, I can confirm that the issue is in micrometer as when using spring-boot 3.3 with downgraded micrometer to 1.12.6 the issue is no more visible.

Stacktrace:

j.l.IllegalArgumentException: Counts in ClassicHistogramBuckets cannot be negative.
	at i.p.m.m.s.ClassicHistogramBuckets.validate(ClassicHistogramBuckets.java:147)
	at i.p.m.m.s.ClassicHistogramBuckets.sortAndValidate(ClassicHistogramBuckets.java:108)
	at i.p.m.m.s.ClassicHistogramBuckets.of(ClassicHistogramBuckets.java:53)
	at i.m.p.PrometheusMeterRegistry.lambda$addDistributionStatisticSamples$29(PrometheusMeterRegistry.java:504)
	at i.m.p.MicrometerCollector.collect(MicrometerCollector.java:77)
	at i.p.m.m.r.PrometheusRegistry.scrape(PrometheusRegistry.java:72)
	at i.p.m.m.r.PrometheusRegistry.scrape(PrometheusRegistry.java:57)
	at o.s.b.a.m.e.p.PrometheusScrapeEndpoint.scrape(PrometheusScrapeEndpoint.java:59)
	at j.i.r.GeneratedMethodAccessor128.invoke(Unknown Source)
	at j.i.r.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at j.l.reflect.Method.invoke(Method.java:568)
	at o.s.u.ReflectionUtils.invokeMethod(ReflectionUtils.java:281)
	at o.s.b.a.e.i.r.ReflectiveOperationInvoker.invoke(ReflectiveOperationInvoker.java:74)
	at o.s.b.a.e.a.AbstractDiscoveredOperation.invoke(AbstractDiscoveredOperation.java:60)
	at o.s.b.a.e.w.s.AbstractWebMvcEndpointHandlerMapping$ServletWebOperationAdapter.handle(AbstractWebMvcEndpointHandlerMapping.java:327)
	at o.s.b.a.e.w.s.AbstractWebMvcEndpointHandlerMapping$OperationHandler.handle(AbstractWebMvcEndpointHandlerMapping.java:434)
	at j.i.r.GeneratedMethodAccessor110.invoke(Unknown Source)
	at j.i.r.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at j.l.reflect.Method.invoke(Method.java:568)
	at o.s.w.m.s.InvocableHandlerMethod.doInvoke(InvocableHandlerMethod.java:255)
	at o.s.w.m.s.InvocableHandlerMethod.invokeForRequest(InvocableHandlerMethod.java:188)
	at o.s.w.s.m.m.a.ServletInvocableHandlerMethod.invokeAndHandle(ServletInvocableHandlerMethod.java:118)
	at o.s.w.s.m.m.a.RequestMappingHandlerAdapter.invokeHandlerMethod(RequestMappingHandlerAdapter.java:926)
	at o.s.w.s.m.m.a.RequestMappingHandlerAdapter.handleInternal(RequestMappingHandlerAdapter.java:831)
	at o.s.w.s.m.m.AbstractHandlerMethodAdapter.handle(AbstractHandlerMethodAdapter.java:87)
	at o.s.b.a.a.w.s.CompositeHandlerAdapter.handle(CompositeHandlerAdapter.java:58)
	at o.s.w.s.DispatcherServlet.doDispatch(DispatcherServlet.java:1089)
	at o.s.w.s.DispatcherServlet.doService(DispatcherServlet.java:979)
	at o.s.w.s.FrameworkServlet.processRequest(FrameworkServlet.java:1014)
	at o.s.w.s.FrameworkServlet.doGet(FrameworkServlet.java:903)
	... 29 frames truncated

Note our application is standard REST application using Mongo & Feign clients to do requests against external endpoints. It also creates custom metrics with percentile histograms like this:

		meterRegistry.counter(counter_metric_name, tags.stream().collect(Collectors.toList()).increment();
		Timer.builder(timer_metric_name)
				.tags(tags.stream().collect(Collectors.toList()))
				.publishPercentileHistogram()
				.register(meterRegistry).record(duration);

Note: we use Micrometer with prometheus and running on Java 17.0.11

I wasn't able to reproduce the issue outside of the production environment. Any suggestions for a fix or what could have caused it?

@knoobie
Copy link

knoobie commented Jun 7, 2024

I just had the exact same problem; interesting side note: The Exception only occured once while scraping for us - meaning the next scrape 15 seconds later did not have the problem anymore. Not like my other reported issue that breaks scraping completely.

@nadina-f
Copy link

Hello,
After upgrading my project from Spring Boot 3.2.5 to 3.3.0 and micrometer version from 1.12.5 to 1.13.0, I have the same issue.
Theoretically this issue should have bee fixed in 1.13.0 as described here:
d7b9d24, but it's not.

Do you have any plans to release soon, covering also this bug?
Thank you!

@jonatan-ivanov
Copy link
Member

Workaround

It seems this is only an issue with the Prometheus 1.x Client, using the 0.x Client does not seem to have this issue. As a workaround you can downgrade to the 0.x client by using micrometer-registry-prometheus-simpleclient instead of micrometer-registry-prometheus.
Please see the release notes of Micrometer 1.13.0 as well as the Migration Guide. In a nutshell: we upgraded micrometer-registry-prometheus from Prometheus Client 0.x to 1.x but also kept the old client as-is under a different artifact: micrometer-registry-prometheus-simpleclient.

Investigation Notes

Based on the additional data in #5223, thanks for @VladimirZaitsev21, I think the issue is similar to #4988 but it definitely seems like a different one: in #4988, the issue was limited to LongTaskTimer and this seems to be happening for Timer (I guess also for DistributionSummary).

Based on the data from #5223, it seems the issue is caused by inconsistency between count and the last bucket:

HistogramSnapshot{
   count=575660,
   ...
   histogramCounts=[
       ...
       (575661.0 at 5.0E9)
   ]
}

As you can see here, count is 575660 while the last bucket is 575661.0. This should not happen, count should not be less than the last bucket. I'm not sure if we have this situation with the Prometheus Client 0.x (I think we do) but if this occurs, it does not cause a problem since we don't need to calculate the +Inf bucket there, with the Prometheus 1.x client we need to, see:

if (Double.isFinite(histogramCounts[histogramCounts.length - 1].bucket())) {
// ClassicHistogramBuckets is not cumulative
buckets.add(Double.POSITIVE_INFINITY);
counts.add(count - histogramCounts[histogramCounts.length - 1].count());
}

If I need to guess, this is a timing/concurrency issue with the histogram implementation we have for Prometheus and count and the last bucket are sometimes inconsistent (e.g.: scrape happens during a recording and the bucket is already incremented while the counter is not).

@jonatan-ivanov
Copy link
Member

I'm still investigating but it seems like the issue is what I described above (inconsistency between count and buckets that were not surfaced with the 0.x Client), here's a reproducer if you want to play with it:

public class HistogramDemo {
    static PrometheusMeterRegistry registry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);
    static Timer timer = Timer.builder("test").publishPercentileHistogram().register(registry);

    static ExecutorService scrapeExecutor = Executors.newFixedThreadPool(16);
    static ExecutorService recordExecutor = Executors.newFixedThreadPool(16);
    static CountDownLatch latch = new CountDownLatch(1);

    public static void main(String[] args) {
        List<Future<?>> futures = new ArrayList<>();
        for (int i = 0; i < 32; i++) {
            futures.add(scrapeExecutor.submit(HistogramDemo::scrape));
            futures.add(recordExecutor.submit(HistogramDemo::record));
        }
        System.out.println("Tasks submitted, releasing the Kraken...");
        latch.countDown();
        waitForFutures(futures);
        scrapeExecutor.shutdown();
        recordExecutor.shutdown();
        System.out.println(registry.scrape());
    }

    static void record() {
        waitForLatch();
        timer.record(Duration.ofMillis(100));
    }

    static void scrape() {
        waitForLatch();
        registry.scrape();
    }

    static void waitForLatch() {
        try {
            latch.await();
        }
        catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
    }

    static void waitForFutures(List<Future<?>> futures) {
        for (Future<?> future : futures) {
            try {
                future.get();
            }
            catch (Exception e) {
                future.cancel(true);
                System.out.println(e.getMessage());
//                e.printStackTrace();
            }
        }
    }
}

If I also print out the count and the negative buckets, I get negative values for the +Inf bucket.

@jonatan-ivanov jonatan-ivanov changed the title Regression causign j.l.IllegalArgumentException: Counts in ClassicHistogramBuckets cannot be negative. after upgrade to micrometer 1.13.0 Counts in ClassicHistogramBuckets cannot be negative Jun 13, 2024
@jonatan-ivanov jonatan-ivanov added bug A general bug and removed waiting-for-triage labels Jun 13, 2024
@jonatan-ivanov jonatan-ivanov added this to the 1.14.0-M1 milestone Jun 13, 2024
@jonatan-ivanov jonatan-ivanov modified the milestones: 1.14.0-M1, 1.13.2 Jun 13, 2024
@jonatan-ivanov jonatan-ivanov changed the title Counts in ClassicHistogramBuckets cannot be negative Fix histogram consistency in PrometheusMeterRegistry Jun 13, 2024
@jonatan-ivanov
Copy link
Member

I think I fixed this in 796c1e5 (please see the commit message for details). I also added some jcstress tests.

1.13.2-SNAPSHOT is published in our snapshot repo (see readme), could you please try out if you can still see the issue?

@anetamoravcikova-jamf
Copy link

Hello, I see that the issue is fixed, can you please tell us when the fix will be published for production use?

@knoobie
Copy link

knoobie commented Jun 27, 2024

If you click on the milestone (top right) you can find an estimated release date.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug A general bug registry: prometheus A Prometheus Registry related issue
Projects
None yet
Development

No branches or pull requests

6 participants