-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[exporter]prometheusremotewrite] Fix data race by introducing pool of batch state #36601
[exporter]prometheusremotewrite] Fix data race by introducing pool of batch state #36601
Conversation
Signed-off-by: Arthur Silva Sens <[email protected]>
Signed-off-by: Arthur Silva Sens <[email protected]> (cherry picked from commit bdeb254)
Signed-off-by: Arthur Silva Sens <[email protected]>
The original issue talked about batch sizes of 100-200k data points, so maybe 10k isn't enough to illustrate the issue? |
Signed-off-by: Arthur Silva Sens <[email protected]> (cherry picked from commit 2dc1870)
I've increased the amount of data used in the benchmarks, still no difference. arthursens$ benchstat main.txt withoutState.txt syncpool.txt
goos: darwin
goarch: arm64
pkg: github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusremotewriteexporter
cpu: Apple M2 Pro
│ main.txt │ withoutState.txt │ syncpool.txt │
│ sec/op │ sec/op vs base │ sec/op vs base │
PushMetricsVaryingMetrics-2 676.0m ± 5% 685.0m ± 8% ~ (p=0.240 n=6) 682.9m ± 6% ~ (p=0.310 n=6)
│ main.txt │ withoutState.txt │ syncpool.txt │
│ B/op │ B/op vs base │ B/op vs base │
PushMetricsVaryingMetrics-2 431.0Mi ± 0% 431.2Mi ± 0% ~ (p=0.937 n=6) 430.9Mi ± 0% ~ (p=0.132 n=6)
│ main.txt │ withoutState.txt │ syncpool.txt │
│ allocs/op │ allocs/op vs base │ allocs/op vs base │
PushMetricsVaryingMetrics-2 4.334M ± 0% 4.334M ± 0% ~ (p=1.000 n=6) ¹ 4.334M ± 0% +0.00% (p=0.002 n=6)
¹ all samples are equal
Individual benchmark runsMain:
Alternative 1 (#36600):
This branch:
|
Seems like a benchmarking problem. Lets try to work with the original author in the other PR before we choose a path forward. |
That's for sure! I'm struggling to understand why the results look that way :/ |
Signed-off-by: Arthur Silva Sens <[email protected]> (cherry picked from commit b5a9c1d)
Okay, I found the problem. I'm creating a large request by creating an OTel metric with millions of Metrics with similar names and attributes within the same ResourceMetrics. This ends up in a very small If we go a bit further and look at the code of opentelemetry-collector-contrib/exporter/prometheusremotewriteexporter/helper.go Lines 47 to 54 in d72bbd2
That happens because To progress the benchmark here, even with the bug, we can diversify the attributes in the original OTel metrics during the benchmark preparation, leading to millions of entries in tsMap. I can even decrease the amount of series again and it still works :) |
Alright, new results comparing arthursens$ benchstat main.txt withoutState.txt syncpool.txt
goos: darwin
goarch: arm64
pkg: github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusremotewriteexporter
cpu: Apple M2 Pro
│ main.txt │ withoutState.txt │ syncpool.txt │
│ sec/op │ sec/op vs base │ sec/op vs base │
PushMetricsVaryingMetrics-2 8.066m ± 5% 13.821m ± 9% +71.36% (p=0.002 n=6) 8.316m ± 6% ~ (p=0.065 n=6)
│ main.txt │ withoutState.txt │ syncpool.txt │
│ B/op │ B/op vs base │ B/op vs base │
PushMetricsVaryingMetrics-2 5.216Mi ± 0% 34.436Mi ± 0% +560.17% (p=0.002 n=6) 5.548Mi ± 0% +6.36% (p=0.002 n=6)
│ main.txt │ withoutState.txt │ syncpool.txt │
│ allocs/op │ allocs/op vs base │ allocs/op vs base │
PushMetricsVaryingMetrics-2 56.02k ± 0% 56.05k ± 0% ~ (p=0.721 n=6) 56.04k ± 0% ~ (p=0.665 n=6) It shows that introducing a pool of states is the better option here, removing the state would introduce a big performance degradation. |
Signed-off-by: Arthur Silva Sens <[email protected]>
Signed-off-by: Arthur Silva Sens <[email protected]>
I believe the "Check codeowners" CI is not related to the changes here, right? |
… batch state (open-telemetry#36601) #### Description This is an alternative for open-telemetry#36524 and open-telemetry#36600 This PR does a couple of things: * Add a test written by @edma2 that shows a data race to the batch state when running multiple consumers. * Add a benchmark for PushMetrics, with options to run with a stable number of metrics or varying metrics. * Fix the data race by introducing a `sync.Pool` of batch states. #### Benchmark results results comparing `main`, open-telemetry#36600 and this PR: ```console arthursens$ benchstat main.txt withoutState.txt syncpool.txt goos: darwin goarch: arm64 pkg: github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusremotewriteexporter cpu: Apple M2 Pro │ main.txt │ withoutState.txt │ syncpool.txt │ │ sec/op │ sec/op vs base │ sec/op vs base │ PushMetricsVaryingMetrics-2 8.066m ± 5% 13.821m ± 9% +71.36% (p=0.002 n=6) 8.316m ± 6% ~ (p=0.065 n=6) │ main.txt │ withoutState.txt │ syncpool.txt │ │ B/op │ B/op vs base │ B/op vs base │ PushMetricsVaryingMetrics-2 5.216Mi ± 0% 34.436Mi ± 0% +560.17% (p=0.002 n=6) 5.548Mi ± 0% +6.36% (p=0.002 n=6) │ main.txt │ withoutState.txt │ syncpool.txt │ │ allocs/op │ allocs/op vs base │ allocs/op vs base │ PushMetricsVaryingMetrics-2 56.02k ± 0% 56.05k ± 0% ~ (p=0.721 n=6) 56.04k ± 0% ~ (p=0.665 n=6) ``` --------- Signed-off-by: Arthur Silva Sens <[email protected]>
Description
This is an alternative for #36524 and #36600
This PR does a couple of things:
sync.Pool
of batch states.Benchmark results
results comparing
main
, #36600 and this PR: