Add alert to notify about duplicate sample/metric ingestion #1688
Conversation
force-pushed from b851545 to 6edc0e9
## Diagnosis
1. If Prometheus is running in HA mode, go to [Prometheus high availability](#prometheus-high-availability)
WDYT about a specific entry in the diagnosis for Prometheus HA running without the labels:

When running a Prometheus HA deployment there is an increase in ingest due to duplicate data being sent to Promscale. We can view the ingest rate of duplicates with:

sum by(job, instance, type) (
  rate(promscale_ingest_duplicates_total{kind="sample"}[5m])
)

This can happen if the Prometheus HA deployment is not configured to decorate the samples with metadata identifying the replica that is pushing the data. In that scenario, two or more Prometheus replicas from the same cluster send exactly the same data points, and since there is no cluster/replica metadata, Promscale doesn't have the information needed to accept the data from just one of them and will try to persist them all.
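To make the scenario above concrete, here is a minimal sketch of the kind of Prometheus configuration being described, where each replica decorates its samples with cluster/replica metadata via external labels. The `cluster` and `__replica__` label names follow the usual remote-write HA convention and are assumptions here, not text quoted from this PR:

```yaml
# Minimal sketch (assumed label names): each Prometheus replica in the HA
# pair advertises which cluster it belongs to and which replica it is, so
# the receiver can keep samples from one leader replica and drop the rest.
global:
  external_labels:
    cluster: prod-eu-1          # identical on every replica of this HA pair
    __replica__: prometheus-0   # unique per replica, e.g. the pod name

remote_write:
  - url: http://promscale:9201/write   # assumed Promscale write endpoint
```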
func registerDuplicates(duplicateSamples int64) {
	metrics.IngestorDuplicates.With(prometheus.Labels{"type": "metric", "kind": "sample"}).Add(float64(duplicateSamples))
	metrics.IngestorDuplicates.With(prometheus.Labels{"type": "metric", "kind": "writes_to_db"}).Inc()
I see that this series with kind="writes_to_db" is not in the refactored code. Was it left out on purpose?
Great catch!
Yes, it is just a duplicate of kind="metrics".
force-pushed from 2e644a7 to 883b9ea
rate(promscale_ingest_duplicates_total{kind="sample"}[5m])
```

If more data points are seen as a result of the above query, follow
I don't think this is right. It sounds like duplicates happen when using HA improperly. While that is one case, the majority of duplicates are due to timeouts in Prometheus remote-write, not HA.

I would suggest having something like this here:

If you see a high rate of duplicate samples, check the Prometheus logs for timeout or batch retry errors. If found, refer to `Tune Prometheus remote-write config`.

Then we need to decide, on a call, what remote-write config we should suggest for mitigation. Let's keep this as a standup agenda item for the next meeting.

After this, we should mention the edge case, which is HA. That might read:

Duplicate samples can also occur with a wrongly configured HA setup. If you are running Prometheus in HA mode, please refer to `Prometheus high availability` for mitigation.
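For orientation only, these are the standard Prometheus `remote_write` knobs that a "Tune Prometheus remote-write config" section would likely discuss; the values below are placeholders for illustration, not the recommendation the team still has to agree on:

```yaml
# Illustrative placeholders only; not a recommendation from this PR.
remote_write:
  - url: http://promscale:9201/write   # assumed Promscale write endpoint
    remote_timeout: 30s                # how long a batch may take before it fails and is retried
    queue_config:
      max_samples_per_send: 3000       # batch size per request
      batch_send_deadline: 5s          # flush a partial batch after this long
      max_shards: 100                  # upper bound on parallel senders
      min_backoff: 100ms               # retry backoff window for failed batches
      max_backoff: 10s
```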
@@ -85,6 +91,12 @@ func (tc *throughputCalc) run() {
	throughput = append(throughput, []interface{}{"metric-metadata/sec", int(metadataRate)}...)
}

duplicateSamplesRate := tc.duplicateSamples.Rate()
Why do we need a rate for duplicates? If the reason is the alert, then we can simply take the rate of the existing duplicates counter.
This is to print the duplicate samples and metrics rate as part of the regular throughput log.
I am not confident about printing this as a regular log. Duplicates are a "warn" case, which is why we previously logged duplicates via a rate-controlled log.Warn.
Having this as part of throughput makes it seem like duplicates are a feature :)
Maybe this is not a big deal. If anyone else has other thoughts, please chime in. Otherwise I will just approve.
Yeah, it is not a feature, but when Promscale users share a log snippet we can easily tell whether they have a duplicate ingestion issue without going through the full log. Also, the duplicates message won't be appended in the normal case!
docs/mixin/dashboards/promscale.json (Outdated)
@@ -1774,7 +1774,7 @@
"uid": "${DS_PROMETHEUS}"
},
"exemplar": true,
"expr": "1 - promscale_sql_database_health_check_errors_total / promscale_sql_database_health_check_total",
"expr": "1 - rate(promscale_sql_database_health_check_errors_total[$__rate_interval] / promscale_sql_database_health_check_total[$health_check_errors_total]",
Where is $health_check_errors_total defined?
This looks like a copy/paste error :)
force-pushed from a850473 to c620e2d
docs/mixin/dashboards/promscale.json (Outdated)
@@ -1774,7 +1774,7 @@
"uid": "${DS_PROMETHEUS}"
},
"exemplar": true,
"expr": "1 - promscale_sql_database_health_check_errors_total / promscale_sql_database_health_check_total",
"expr": "1 - rate(promscale_sql_database_health_check_errors_total[$__rate_interval] / promscale_sql_database_health_check_total[$__rate_interval]",
"expr": "1 - rate(promscale_sql_database_health_check_errors_total[$__rate_interval] / promscale_sql_database_health_check_total[$__rate_interval]", | |
"expr": "1 - rate(promscale_sql_database_health_check_errors_total[$__rate_interval] / promscale_sql_database_health_check_total[$__rate_interval])", |
@arajkumar what's the status of this?
Thanks for the nudge @paulfantom. I will resume this and address @Harkishen-Singh's concerns.
force-pushed from b0a963e to b597f0e
docs/mixin/dashboards/promscale.json (Outdated)
@@ -2513,7 +2513,7 @@
},
"editorMode": "code",
"exemplar": false,
"expr": "max(promscale_sql_database_worker_maintenance_job_long_running_total{namespace=~\"$namespace\"})",
"expr": "max(promscale_sql_database_worker_long_running_maintenance_jobs{namespace=~\"$namespace\"})",
Why change from promscale_sql_database_worker_maintenance_job_long_running_total to promscale_sql_database_worker_long_running_maintenance_jobs and not to promscale_sql_database_worker_maintenance_job_long_running?

IMHO it is better to keep a consistent prefix of promscale_sql_database_worker_maintenance_ to allow easier querying (think {__name__=~"promscale_sql_database_worker_maintenance_.*"}) as well as easier code navigation, since promscale_sql_database_worker_maintenance_job could then be defined in the same place regardless of how many metric registries there are.

That said, I am not tied to having it either way.
force-pushed from b597f0e to 4733835
force-pushed from 4733835 to 1664d74
@paulfantom @Harkishen-Singh PTAL.
force-pushed from d8af20d to 2d330ab
force-pushed from 838b48f to 764c935
This commit does the following:
1. Merge duplicate reporter into throughput reporter
2. Add alert about duplicate sample/metric ingestion
3. Add an e2e test to verify metrics related to duplicates are populated

Signed-off-by: Arunprasad Rajkumar <[email protected]>
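For context, a hedged sketch of what an alerting rule on the new duplicates counter could look like; the rule name, threshold, duration, and labels are illustrative assumptions, not necessarily what this PR adds to the mixin:

```yaml
# Illustrative sketch only; the rule actually shipped by the PR may differ.
groups:
  - name: promscale-ingest-duplicates
    rules:
      - alert: PromscaleDuplicateSampleIngestion
        expr: |
          sum by (job, instance) (
            rate(promscale_ingest_duplicates_total{kind="sample"}[5m])
          ) > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Promscale is ingesting duplicate samples."
          description: "Check Prometheus remote-write retries and HA external labels; see the duplicates runbook for diagnosis."
```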
The latest mixtool linter, which relies on the grafana dashboard-linter pkg, fails when gauge metrics have a name ending with `total` [1].

[1] https://github.com/grafana/dashboard-linter/blob/44d415fb6bdc4d8e6585e514c448174d4de1ff02/lint/rule_target_counter_agg.go#L30

Signed-off-by: Arunprasad Rajkumar <[email protected]>
force-pushed from 764c935 to 6ec101f
Description

This PR does the following,

- Merge duplicate reporter into throughput reporter
- Add alert about duplicate sample/metric ingestion
- Add an e2e test to verify metrics related to duplicates are populated
- Rename gauge metrics ending with `total` to make the linter happy [1]

[1] https://github.com/grafana/dashboard-linter/blob/44d415fb6bdc4d8e6585e514c448174d4de1ff02/lint/rule_target_counter_agg.go#L30

Signed-off-by: Arunprasad Rajkumar [email protected]

Fixes #1687