Add alert to notify about duplicate sample/metric ingestion #1688
Conversation
force-pushed from b851545 to 6edc0e9
## Diagnosis
1. If Prometheus is running in HA mode, go to [Prometheus high availability](#prometheus-high-availability)
WDYT about a specific entry in the diagnosis for Prometheus HA running without the labels:

When running a Prometheus HA deployment there is an increase in ingest due to duplicate data being sent to Promscale. We can view the ingest rate of duplicates with:

sum by(job, instance, type) (
  rate(promscale_ingest_duplicates_total{kind="sample"}[5m])
)

This can happen if the Prometheus HA deployment is not configured to decorate the samples with metadata identifying the replica that is pushing the data. In that scenario, two or more Prometheus replicas from the same cluster send exactly the same data points, and since there is no cluster/replica metadata, Promscale doesn't have the information needed to accept the data from just one of them and will try to persist them all.
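To make the scenario above concrete, here is a minimal sketch of the kind of Prometheus configuration being described, where each replica decorates its samples with cluster/replica metadata via external labels. The `cluster` and `__replica__` label names follow the usual remote-write HA convention and are assumptions here, not text quoted from this PR:

```yaml
# Minimal sketch (assumed label names): each Prometheus replica in the HA
# pair advertises which cluster it belongs to and which replica it is, so
# the receiver can keep samples from one leader replica and drop the rest.
global:
  external_labels:
    cluster: prod-eu-1          # identical on every replica of this HA pair
    __replica__: prometheus-0   # unique per replica, e.g. the pod name

remote_write:
  - url: http://promscale:9201/write   # assumed Promscale write endpoint
```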
func registerDuplicates(duplicateSamples int64) {
	metrics.IngestorDuplicates.With(prometheus.Labels{"type": "metric", "kind": "sample"}).Add(float64(duplicateSamples))
	metrics.IngestorDuplicates.With(prometheus.Labels{"type": "metric", "kind": "writes_to_db"}).Inc()
I see that this series with kind="writes_to_db" is not in the refactored code. Was it left out on purpose?
Great catch!
Yes, it is just a duplicate of kind="metrics".
force-pushed from 2e644a7 to 883b9ea
rate(promscale_ingest_duplicates_total{kind="sample"}[5m])
```

If more data points are seen as a result of the above query, follow
I don't think this is right. It sounds like duplicates happen when using HA improperly. While that is one case, the majority of duplicates are due to timeouts in Prometheus remote-write, not HA.

I would suggest having something like this here:

If you see a high rate of duplicate samples, check the Prometheus logs for timeout or batch retry errors. If found, refer to `Tune Prometheus remote-write config`.

Then we need to decide, on a call, what remote-write config we should suggest for mitigation. Let's keep this as a standup agenda item for the next meeting.

After this, we should mention the edge case, which is HA. That might read:

Duplicate samples can also occur with a wrongly configured HA setup. If you are running Prometheus in HA mode, please refer to `Prometheus high availability` for mitigation.
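For orientation only, these are the standard Prometheus `remote_write` knobs that a "Tune Prometheus remote-write config" section would likely discuss; the values below are placeholders for illustration, not the recommendation the team still has to agree on:

```yaml
# Illustrative placeholders only; not a recommendation from this PR.
remote_write:
  - url: http://promscale:9201/write   # assumed Promscale write endpoint
    remote_timeout: 30s                # how long a batch may take before it fails and is retried
    queue_config:
      max_samples_per_send: 3000       # batch size per request
      batch_send_deadline: 5s          # flush a partial batch after this long
      max_shards: 100                  # upper bound on parallel senders
      min_backoff: 100ms               # retry backoff window for failed batches
      max_backoff: 10s
```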
@@ -85,6 +91,12 @@ func (tc *throughputCalc) run() {
	throughput = append(throughput, []interface{}{"metric-metadata/sec", int(metadataRate)}...)
}

duplicateSamplesRate := tc.duplicateSamples.Rate()
Why do we need a rate for duplicates? If the reason is the alert, then we can simply take the rate of the existing duplicates counter.
This is to print the duplicate samples and metrics rate as part of the regular throughput log.
I am not confident about printing this as a regular log. Duplicates are a "warn" case, which is why we previously logged duplicates via a rate-controlled log.Warn.
Having this as part of throughput makes it seem like duplicates are a feature :)
Maybe this is not a big deal. If anyone else has other thoughts, please chime in. Otherwise I will just approve.
Yeah, it is not a feature, but when Promscale users share a log snippet we can easily tell whether they have a duplicate ingestion issue without going through the full log. Also, the duplicates message won't be appended in the normal case!
docs/mixin/dashboards/promscale.json (Outdated)
@@ -1774,7 +1774,7 @@
"uid": "${DS_PROMETHEUS}"
},
"exemplar": true,
"expr": "1 - promscale_sql_database_health_check_errors_total / promscale_sql_database_health_check_total",
"expr": "1 - rate(promscale_sql_database_health_check_errors_total[$__rate_interval] / promscale_sql_database_health_check_total[$health_check_errors_total]",
Where is $health_check_errors_total defined?
This looks like a copy/paste error :)
force-pushed from a850473 to c620e2d
docs/mixin/dashboards/promscale.json (Outdated)
@@ -1774,7 +1774,7 @@
"uid": "${DS_PROMETHEUS}"
},
"exemplar": true,
"expr": "1 - promscale_sql_database_health_check_errors_total / promscale_sql_database_health_check_total",
"expr": "1 - rate(promscale_sql_database_health_check_errors_total[$__rate_interval] / promscale_sql_database_health_check_total[$__rate_interval]",
"expr": "1 - rate(promscale_sql_database_health_check_errors_total[$__rate_interval] / promscale_sql_database_health_check_total[$__rate_interval]", | |
"expr": "1 - rate(promscale_sql_database_health_check_errors_total[$__rate_interval] / promscale_sql_database_health_check_total[$__rate_interval])", |
@arajkumar what's the status of this?
Thanks for the nudge @paulfantom. I will resume this and address @Harkishen-Singh's concerns.
force-pushed from b0a963e to b597f0e
docs/mixin/dashboards/promscale.json (Outdated)
@@ -2513,7 +2513,7 @@
},
"editorMode": "code",
"exemplar": false,
"expr": "max(promscale_sql_database_worker_maintenance_job_long_running_total{namespace=~\"$namespace\"})",
"expr": "max(promscale_sql_database_worker_long_running_maintenance_jobs{namespace=~\"$namespace\"})",
Why change from promscale_sql_database_worker_maintenance_job_long_running_total to promscale_sql_database_worker_long_running_maintenance_jobs and not to promscale_sql_database_worker_maintenance_job_long_running?

IMHO it is better to keep a consistent prefix of promscale_sql_database_worker_maintenance_ to allow easier querying (think {__name__=~"promscale_sql_database_worker_maintenance_.*"}) as well as easier code navigation, since promscale_sql_database_worker_maintenance_job could then be defined in the same place regardless of how many metric registries there are.

That said, I am not tied to having it either way.
force-pushed from b597f0e to 4733835
force-pushed from 4733835 to 1664d74
@paulfantom @Harkishen-Singh PTAL.
force-pushed from d8af20d to 2d330ab
force-pushed from 838b48f to 764c935
This commit does the following:
1. Merge duplicate reporter into throughput reporter
2. Add alert about duplicate sample/metric ingestion
3. Add an e2e test to verify metrics related to duplicates are populated

Signed-off-by: Arunprasad Rajkumar <[email protected]>
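For context, a hedged sketch of what an alerting rule on the new duplicates counter could look like; the rule name, threshold, duration, and labels are illustrative assumptions, not necessarily what this PR adds to the mixin:

```yaml
# Illustrative sketch only; the rule actually shipped by the PR may differ.
groups:
  - name: promscale-ingest-duplicates
    rules:
      - alert: PromscaleDuplicateSampleIngestion
        expr: |
          sum by (job, instance) (
            rate(promscale_ingest_duplicates_total{kind="sample"}[5m])
          ) > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Promscale is ingesting duplicate samples."
          description: "Check Prometheus remote-write retries and HA external labels; see the duplicates runbook for diagnosis."
```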
The latest mixtool linter, which relies on the grafana dashboard-linter pkg, fails when gauge metrics have a name ending with `total` [1].

[1] https://github.com/grafana/dashboard-linter/blob/44d415fb6bdc4d8e6585e514c448174d4de1ff02/lint/rule_target_counter_agg.go#L30

Signed-off-by: Arunprasad Rajkumar <[email protected]>
force-pushed from 764c935 to 6ec101f
Description

This PR does the following,

- Merge duplicate reporter into throughput reporter
- Add alert about duplicate sample/metric ingestion
- Add an e2e test to verify metrics related to duplicates are populated
- Rename gauge metrics ending with `total` to make the linter happy [1]

[1] https://github.com/grafana/dashboard-linter/blob/44d415fb6bdc4d8e6585e514c448174d4de1ff02/lint/rule_target_counter_agg.go#L30

Signed-off-by: Arunprasad Rajkumar [email protected]

Fixes #1687