This repository has been archived by the owner on Apr 2, 2024. It is now read-only.

Add alert to notify about duplicate sample/metric ingestion #1688

Merged: 2 commits merged into timescale:master from the add-temp-table-metrics branch on Nov 17, 2022

Conversation

@arajkumar (Member) commented Oct 11, 2022

Description

This PR does the following:

  1. Merge the duplicate reporter into the throughput reporter
  2. Add an alert about duplicate sample/metric ingestion (see the sketch after this list)
  3. Add an e2e test to verify that metrics related to duplicates are populated
  4. Rename gauge metrics whose names end with `total`, to make the linter happy [1]

[1] https://github.com/grafana/dashboard-linter/blob/44d415fb6bdc4d8e6585e514c448174d4de1ff02/lint/rule_target_counter_agg.go#L30
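
For readers landing here from the runbook, a rough sketch of what an alerting rule on the duplicates counter could look like; the rule name, threshold, and labels are illustrative assumptions, not the exact rule added by this PR:

```yaml
groups:
  - name: promscale-ingest
    rules:
      # Hypothetical rule: only the promscale_ingest_duplicates_total
      # metric comes from this PR; the alert name, threshold, and
      # duration are illustrative.
      - alert: PromscaleHighDuplicateIngestRate
        expr: |
          sum by (job, instance, type) (
            rate(promscale_ingest_duplicates_total{kind="sample"}[5m])
          ) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: Promscale is ingesting duplicate samples.
```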

Signed-off-by: Arunprasad Rajkumar [email protected]

Fixes #1687

Merge requirements

Please take into account the following non-code changes that you may need to make with your PR:

  • CHANGELOG entry for user-facing changes
  • Updated the relevant documentation

@arajkumar arajkumar self-assigned this Oct 11, 2022
@arajkumar arajkumar requested review from a team as code owners October 11, 2022 09:39
@arajkumar arajkumar force-pushed the add-temp-table-metrics branch from b851545 to 6edc0e9 on October 11, 2022 09:40


## Diagnosis
1. If Prometheus is running in HA mode, go to [Prometheus high availability](#prometheus-high-availability)
@alejandrodnm (Contributor) commented Oct 11, 2022

WDYT about a specific entry in Diagnosis for Prometheus HA running without the labels:

When running a Prometheus HA deployment there's an increase in ingest due to duplicate data being sent to Promscale. We can view the ingest of duplicates with:

```
sum by (job, instance, type) (
  rate(promscale_ingest_duplicates_total{kind="sample"}[5m])
)
```

This could happen if the Prometheus HA deployment is not configured to decorate the samples with metadata from the replica that's pushing the data. In this scenario, two or more Prometheus replicas from the same cluster will be sending the exact same datapoints, and since there's no cluster/replica metadata, Promscale doesn't have the information needed to accept the data from just one of them and will try to persist them all.
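
The decoration described here is typically done through Prometheus `external_labels`. A minimal sketch, assuming the common `cluster`/`__replica__` label convention (check the Promscale HA docs for the exact label names it expects):

```yaml
# Sketch only: give every replica in an HA pair the same cluster label
# and a unique replica label, so the receiver can deduplicate samples.
global:
  external_labels:
    cluster: prom-team-a        # identical on both replicas of the pair
    __replica__: prometheus-0   # unique per replica (e.g. prometheus-1 on the other)
```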


func registerDuplicates(duplicateSamples int64) {
	metrics.IngestorDuplicates.With(prometheus.Labels{"type": "metric", "kind": "sample"}).Add(float64(duplicateSamples))
	metrics.IngestorDuplicates.With(prometheus.Labels{"type": "metric", "kind": "writes_to_db"}).Inc()
A Contributor commented:

I see that this series with kind=writes_to_db is not in the refactored code. Was it left out on purpose?

@arajkumar (Member, Author) replied Oct 11, 2022:

Great catch!
Yes, it is just a duplicate of kind="metrics".

@arajkumar arajkumar force-pushed the add-temp-table-metrics branch 2 times, most recently from 2e644a7 to 883b9ea on October 11, 2022 12:17
rate(promscale_ingest_duplicates_total{kind="sample"}[5m])
```

If more data points are seen as a result of the above query, follow
A Member commented:

I don't think this is right. It sounds like duplicates happen only when HA is used improperly. While that is one case, the majority of duplicates are due to timeouts in Prometheus remote-write, not HA.

I suggest having something like this here:

If you see a high rate of duplicate samples, check the Prometheus logs for timeout or batch retry errors. If found, refer to `Tune Prometheus remote-write config`.

Then we need to decide, on a call, what remote-write config we should suggest for mitigation. Let's keep this on the standup agenda for the next meeting.

After this, we should mention the edge case, which is HA. This might be:

Duplicate samples can also occur with a wrongly configured HA setup. If you are running Prometheus in HA, please refer to `Prometheus high availability` for mitigation.
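
For context on what a `Tune Prometheus remote-write config` section might cover, here is a sketch of the relevant `remote_write` knobs. The endpoint URL and all values are illustrative assumptions, since the thread defers the actual recommendation to a later meeting:

```yaml
remote_write:
  - url: http://promscale:9201/write   # hypothetical endpoint
    remote_timeout: 30s                # raise if writes time out and get retried
    queue_config:
      max_shards: 50                   # parallelism across write queues
      max_samples_per_send: 500        # smaller batches are less likely to time out
      batch_send_deadline: 5s
      min_backoff: 30ms
      max_backoff: 5s
```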

@@ -85,6 +91,12 @@ func (tc *throughputCalc) run() {
throughput = append(throughput, []interface{}{"metric-metadata/sec", int(metadataRate)}...)
}

duplicateSamplesRate := tc.duplicateSamples.Rate()
A Member commented:

Why do we need a rate for duplicates? If the reason is the alert, then we can simply take the rate of the existing duplicates counter.

@arajkumar (Member, Author) replied:

This is to print the duplicate samples and metrics rate as part of the regular throughput log.

A Member replied:

I am not confident about printing this as a regular log. Duplicates fall under "warn", which is why we previously logged duplicates via a rate-controlled log.Warn.

Having this as part of throughput makes it seem like duplicates are a feature :)

Maybe this is not a big deal. If anyone else has other thoughts, please chime in. Otherwise I will just approve.

@arajkumar (Member, Author) replied:

Yeah, it is not a feature, but when Promscale users share a log snippet we can easily tell whether they have a duplicate-ingestion issue without going through the full log. Also, the duplicates message won't be appended in the normal case!

@@ -1774,7 +1774,7 @@
"uid": "${DS_PROMETHEUS}"
},
"exemplar": true,
"expr": "1 - promscale_sql_database_health_check_errors_total / promscale_sql_database_health_check_total",
"expr": "1 - rate(promscale_sql_database_health_check_errors_total[$__rate_interval] / promscale_sql_database_health_check_total[$health_check_errors_total]",
A Contributor commented:

Where is $health_check_errors_total defined?

@arajkumar (Member, Author) replied:

This looks like a copy/paste error :)

@arajkumar arajkumar force-pushed the add-temp-table-metrics branch from a850473 to c620e2d on October 12, 2022 09:32
@@ -1774,7 +1774,7 @@
"uid": "${DS_PROMETHEUS}"
},
"exemplar": true,
"expr": "1 - promscale_sql_database_health_check_errors_total / promscale_sql_database_health_check_total",
"expr": "1 - rate(promscale_sql_database_health_check_errors_total[$__rate_interval] / promscale_sql_database_health_check_total[$__rate_interval]",
A Contributor commented:

Suggested change
"expr": "1 - rate(promscale_sql_database_health_check_errors_total[$__rate_interval] / promscale_sql_database_health_check_total[$__rate_interval]",
"expr": "1 - rate(promscale_sql_database_health_check_errors_total[$__rate_interval] / promscale_sql_database_health_check_total[$__rate_interval])",

@paulfantom (Contributor) commented:

@arajkumar what's the status of this?

@arajkumar (Member, Author) replied:

> @arajkumar what's the status of this?

Thanks for the nudge @paulfantom. I will resume this and address @Harkishen-Singh's concerns.

@arajkumar arajkumar force-pushed the add-temp-table-metrics branch 2 times, most recently from b0a963e to b597f0e on October 31, 2022 09:35
@@ -2513,7 +2513,7 @@
},
"editorMode": "code",
"exemplar": false,
"expr": "max(promscale_sql_database_worker_maintenance_job_long_running_total{namespace=~\"$namespace\"})",
"expr": "max(promscale_sql_database_worker_long_running_maintenance_jobs{namespace=~\"$namespace\"})",
A Contributor commented:

Why change from promscale_sql_database_worker_maintenance_job_long_running_total to promscale_sql_database_worker_long_running_maintenance_jobs and not to promscale_sql_database_worker_maintenance_job_long_running?

IMHO it is better to keep a consistent prefix of promscale_sql_database_worker_maintenance_: it allows easier querying (think `{__name__=~"promscale_sql_database_worker_maintenance_.*"}`) as well as easier code navigation, since promscale_sql_database_worker_maintenance_job could be defined in one place regardless of how many metric registries there are.
That said, I am not tied to either naming.

@arajkumar (Member, Author) commented:

@paulfantom @Harkishen-Singh PTAL.

@arajkumar arajkumar force-pushed the add-temp-table-metrics branch from d8af20d to 2d330ab on November 15, 2022 05:57
@arajkumar arajkumar enabled auto-merge (rebase) November 15, 2022 06:25
@arajkumar arajkumar force-pushed the add-temp-table-metrics branch 2 times, most recently from 838b48f to 764c935 on November 17, 2022 06:01
This commit does the following:
1. Merge duplicate reporter into throughput reporter
2. Add alert about duplicate sample/metric ingestion
3. Add an e2e test to verify metrics related to duplicates are populated

Signed-off-by: Arunprasad Rajkumar <[email protected]>
The latest mixtool linter, which relies on the grafana dashboard-linter pkg, fails when a gauge metric's name ends with `total` [1].

[1] https://github.com/grafana/dashboard-linter/blob/44d415fb6bdc4d8e6585e514c448174d4de1ff02/lint/rule_target_counter_agg.go#L30

Signed-off-by: Arunprasad Rajkumar <[email protected]>
@arajkumar arajkumar force-pushed the add-temp-table-metrics branch from 764c935 to 6ec101f on November 17, 2022 06:02
@arajkumar arajkumar merged commit 6b9a20b into timescale:master Nov 17, 2022
Development

Successfully merging this pull request may close these issues.

Alert and runbook for duplicate samples
5 participants