Fix histogram consistency in PrometheusMeterRegistry #5193
I just had the exact same problem. Interesting side note: the exception only occurred once while scraping for us, meaning the next scrape 15 seconds later did not have the problem anymore. Not like my other reported issue, which breaks scraping completely.
Hello, do you have any plans to release soon, covering this bug as well?
Workaround

It seems this is only an issue with the Prometheus 1.x client; the 0.x client does not seem to have this issue. As a workaround, you can downgrade to the 0.x client (a sketch of what that could look like follows the notes below).

Investigation Notes

Based on the additional data in #5223 (thanks to @VladimirZaitsev21), I think the issue is similar to #4988, but it definitely seems like a different one: in #4988 the issue was limited to a specific case, while based on the data from #5223 this one seems to be caused by an inconsistency between the count and the buckets.

As you can see here (PrometheusMeterRegistry, lines 494 to 498 in f4be539), the count is what ends up in the last bucket. If I need to guess, this is a timing/concurrency issue with the histogram implementation we have for Prometheus: the count and the last bucket are sometimes inconsistent (e.g. a scrape happens during a recording and the bucket is already incremented while the counter is not).
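Regarding the workaround above, here is a minimal sketch of what the 0.x-client setup could look like. It assumes the workaround refers to the old simpleclient-backed registry (package `io.micrometer.prometheus`, shipped as the `micrometer-registry-prometheus-simpleclient` artifact in 1.13.x) rather than the new `io.micrometer.prometheusmetrics` one; that module/package mapping is my assumption, not quoted from the comment.

```java
// Sketch of the 0.x (simpleclient-based) registry setup; assumes the
// micrometer-registry-prometheus-simpleclient artifact is on the classpath.
import io.micrometer.prometheus.PrometheusConfig;
import io.micrometer.prometheus.PrometheusMeterRegistry;

public class SimpleclientRegistryWorkaround {

    public static void main(String[] args) {
        // Meters registered against this registry are exposed via the 0.x simpleclient collector.
        PrometheusMeterRegistry registry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);
        System.out.println(registry.scrape());
    }

}
```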
I'm still investigating, but it seems like the issue is what I described above (an inconsistency between the count and the buckets that was not surfaced with the 0.x client). Here's a reproducer if you want to play with it:

```java
// Imports reconstructed for completeness; assumes the new (1.x) Prometheus client module
// (io.micrometer:micrometer-registry-prometheus, Micrometer 1.13+).
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import io.micrometer.core.instrument.Timer;
import io.micrometer.prometheusmetrics.PrometheusConfig;
import io.micrometer.prometheusmetrics.PrometheusMeterRegistry;

public class HistogramDemo {

    static PrometheusMeterRegistry registry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);

    static Timer timer = Timer.builder("test").publishPercentileHistogram().register(registry);

    static ExecutorService scrapeExecutor = Executors.newFixedThreadPool(16);

    static ExecutorService recordExecutor = Executors.newFixedThreadPool(16);

    static CountDownLatch latch = new CountDownLatch(1);

    public static void main(String[] args) {
        List<Future<?>> futures = new ArrayList<>();
        // Submit concurrent scrape and record tasks; they all block on the latch
        // so that recording and scraping start at the same time.
        for (int i = 0; i < 32; i++) {
            futures.add(scrapeExecutor.submit(HistogramDemo::scrape));
            futures.add(recordExecutor.submit(HistogramDemo::record));
        }
        System.out.println("Tasks submitted, releasing the Kraken...");
        latch.countDown();
        waitForFutures(futures);
        scrapeExecutor.shutdown();
        recordExecutor.shutdown();
        System.out.println(registry.scrape());
    }

    static void record() {
        waitForLatch();
        timer.record(Duration.ofMillis(100));
    }

    static void scrape() {
        waitForLatch();
        registry.scrape();
    }

    static void waitForLatch() {
        try {
            latch.await();
        }
        catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
    }

    static void waitForFutures(List<Future<?>> futures) {
        for (Future<?> future : futures) {
            try {
                future.get();
            }
            catch (Exception e) {
                future.cancel(true);
                System.out.println(e.getMessage());
                // e.printStackTrace();
            }
        }
    }

}
```

If I also print out the count and the buckets, I get negative values for the last bucket.
I think I fixed this in 796c1e5 (please see the commit message for details). I also added some jcstress tests.
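For reference, here is a minimal sketch of the kind of jcstress test that could expose this race. It is not the actual test added in 796c1e5; the class name, the outcome encoding, and the assumption that the inconsistency surfaces as a scrape-time exception are mine.

```java
import java.time.Duration;

import io.micrometer.core.instrument.Timer;
import io.micrometer.prometheusmetrics.PrometheusConfig;
import io.micrometer.prometheusmetrics.PrometheusMeterRegistry;
import org.openjdk.jcstress.annotations.Actor;
import org.openjdk.jcstress.annotations.Expect;
import org.openjdk.jcstress.annotations.JCStressTest;
import org.openjdk.jcstress.annotations.Outcome;
import org.openjdk.jcstress.annotations.State;
import org.openjdk.jcstress.infra.results.I_Result;

// Hypothetical test: one actor records into a percentile-histogram Timer while the other scrapes.
// A failing scrape (e.g. "Counts in ClassicHistogramBuckets cannot be negative") is the forbidden outcome.
// Note: building a full registry per @State instance is heavier than a typical jcstress state.
@JCStressTest
@Outcome(id = "0", expect = Expect.ACCEPTABLE, desc = "Scrape succeeded")
@Outcome(id = "1", expect = Expect.FORBIDDEN, desc = "Scrape failed due to count/bucket inconsistency")
@State
public class ScrapeDuringRecordTest {

    final PrometheusMeterRegistry registry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);

    final Timer timer = Timer.builder("test").publishPercentileHistogram().register(registry);

    @Actor
    public void record() {
        timer.record(Duration.ofMillis(100));
    }

    @Actor
    public void scrape(I_Result r) {
        try {
            registry.scrape();
            r.r1 = 0;
        }
        catch (RuntimeException e) {
            r.r1 = 1;
        }
    }

}
```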
Hello, I see that the issue is fixed. Can you please tell us when the fix will be published for production use?
If you click on the milestone (top right), you can find an estimated release date.
Commit referencing this issue: "…ring boot error: j.l.IllegalArgumentException: Counts in ClassicHistogramBuckets cannot be negative." (refs: micrometer-metrics/micrometer#5193)
As part of the upgrade to Spring Boot 3.3 with Micrometer 1.13.0, we are seeing an issue similar to #4988; see the stacktrace below. I wasn't able to create a separate reproducer, but I can confirm that the issue is in Micrometer: when using Spring Boot 3.3 with Micrometer downgraded to 1.12.6, the issue is no longer visible.
Stacktrace:
Note: our application is a standard REST application using Mongo & Feign clients to make requests against external endpoints. It also creates custom metrics with percentile histograms like this:
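(The reporter's snippet was not captured above; for illustration, here is a minimal sketch of what registering such a percentile-histogram metric typically looks like. The metric name, tag, and registry wiring are hypothetical placeholders, not the reporter's actual metric.)

```java
import java.time.Duration;

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

public class CustomMetrics {

    private final Timer externalCallTimer;

    public CustomMetrics(MeterRegistry registry) {
        // Hypothetical custom metric with a percentile histogram enabled.
        this.externalCallTimer = Timer.builder("external.call.duration")
            .tag("client", "feign")
            .publishPercentileHistogram()
            .register(registry);
    }

    public void recordCall(Duration duration) {
        externalCallTimer.record(duration);
    }

}
```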
Note: we use Micrometer with Prometheus, running on Java 17.0.11.
I wasn't able to reproduce the issue outside of the production environment. Any suggestions for a fix or what could have caused it?