From 8942ea6734064295db9a16f04cc6e41eafad1132 Mon Sep 17 00:00:00 2001
From: CharlesCheung
Date: Thu, 5 Dec 2024 19:59:18 +0800
Subject: [PATCH 1/5] fix alert rules

---
 ticdc/ticdc-alert-rules.md | 57 ++++++++++---------------------------
 1 file changed, 15 insertions(+), 42 deletions(-)

diff --git a/ticdc/ticdc-alert-rules.md b/ticdc/ticdc-alert-rules.md
index 855819fcc8ae9..29f1e40e4c6aa 100644
--- a/ticdc/ticdc-alert-rules.md
+++ b/ticdc/ticdc-alert-rules.md
@@ -54,20 +54,6 @@ For critical alerts, you need to pay close attention to abnormal monitoring metr

     This alert is similar to replication interruption. See [TiCDC Handles Replication Interruption](/ticdc/troubleshoot-ticdc.md#how-do-i-handle-replication-interruptions).

-### `ticdc_processor_exit_with_error_count`
-
-- Alert rule:
-
-    `changes(ticdc_processor_exit_with_error_count[1m]) > 0`
-
-- Description:
-
-    A replication task reports an error and exits.
-
-- Solution:
-
-    See [TiCDC Handles Replication Interruption](/ticdc/troubleshoot-ticdc.md#how-do-i-handle-replication-interruptions).
-
 ## Warning alerts

 Warning alerts are a reminder for an issue or error.
@@ -86,61 +72,48 @@ Warning alerts are a reminder for an issue or error.

     Collect TiCDC logs to locate the root cause.

-### `cdc_sink_flush_duration_time_more_than_10s`
+### `cdc_no_owner`

 - Alert rule:
-
-    `histogram_quantile(0.9, rate(ticdc_sink_txn_worker_flush_duration[1m])) > 10`
+
+    `sum(rate(ticdc_owner_ownership_counter[240s])) < 0.5`

 - Description:
-
-    It takes a replication task more than 10 seconds to write data to the downstream database.
+
+    There is no owner in the TiCDC cluster for more than 10 minutes.

 - Solution:

-    Check whether there are problems in the downstream database.
-
-### `cdc_processor_checkpoint_tso_no_change_for_1m`
-
-- Alert rule:
-
-    `changes(ticdc_processor_checkpoint_ts[1m]) < 1`
-
-- Description:
-
-    A replication task has not advanced for more than 1 minute.
-
-- Solution:
+    Collect TiCDC logs to locate the root cause.

-    See [TiCDC Handles Replication Interruption](/ticdc/troubleshoot-ticdc.md#how-do-i-handle-replication-interruptions).

-### `ticdc_puller_entry_sorter_sort_bucket`
+### `ticdc_changefeed_meet_error`

 - Alert rule:

-    `histogram_quantile(0.9, rate(ticdc_puller_entry_sorter_sort_bucket{}[1m])) > 1`
+    `(max_over_time(ticdc_owner_status[1m]) == 1 or max_over_time(ticdc_owner_status[1m]) == 6) > 0`

 - Description:
-
-    The delay of TiCDC puller entry sorter is too high.
+
+    A replication task encounters an error.

 - Solution:

-    Collect TiCDC logs to locate the root cause.
+    See [TiCDC Handles Replication Interruption](/ticdc/troubleshoot-ticdc.md#how-do-i-handle-replication-interruptions).

-### `ticdc_puller_entry_sorter_merge_bucket`
+### `ticdc_processor_exit_with_error_count`

 - Alert rule:

-    `histogram_quantile(0.9, rate(ticdc_puller_entry_sorter_merge_bucket{}[1m])) > 1`
+    `changes(ticdc_processor_exit_with_error_count[1m]) > 0`

 - Description:

-    The delay of TiCDC puller entry sorter merge is too high.
+    A replication task reports an error and exits.

 - Solution:

-    Collect TiCDC logs to locate the root cause.
+    See [TiCDC Handles Replication Interruption](/ticdc/troubleshoot-ticdc.md#how-do-i-handle-replication-interruptions).

 ### `tikv_cdc_min_resolved_ts_no_change_for_1m`

From 3b8bef727f56a84bbb0de929cef03059ec0a8197 Mon Sep 17 00:00:00 2001
From: xixirangrang
Date: Thu, 5 Dec 2024 20:37:44 +0800
Subject: [PATCH 2/5] add `ticdc_sink_execution_error`

---
 ticdc/ticdc-alert-rules.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/ticdc/ticdc-alert-rules.md b/ticdc/ticdc-alert-rules.md
index 29f1e40e4c6aa..1395be07aaae9 100644
--- a/ticdc/ticdc-alert-rules.md
+++ b/ticdc/ticdc-alert-rules.md
@@ -143,15 +143,15 @@ Warning alerts are a reminder for an issue or error.

     Collect TiCDC monitoring metrics and TiKV logs to locate the root cause.

-### `ticdc_sink_mysql_execution_error`
+### `ticdc_sink_execution_error`

 - Alert rule:

-    `changes(ticdc_sink_mysql_execution_error[1m]) > 0`
+    `changes(ticdc_sink_execution_error[1m]) > 0`

 - Description:

-    An error occurs when a replication task writes data to the downstream MySQL.
+    An error occurs when a replication task writes data to the downstream.

 - Solution:

From 32c39155841e196869fe6c71b13d17282552207a Mon Sep 17 00:00:00 2001
From: xixirangrang
Date: Fri, 6 Dec 2024 09:23:30 +0800
Subject: [PATCH 3/5] Apply suggestions from code review

Co-authored-by: Grace Cai
---
 ticdc/ticdc-alert-rules.md | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/ticdc/ticdc-alert-rules.md b/ticdc/ticdc-alert-rules.md
index 1395be07aaae9..b0e5903cf979d 100644
--- a/ticdc/ticdc-alert-rules.md
+++ b/ticdc/ticdc-alert-rules.md
@@ -75,16 +75,14 @@ Warning alerts are a reminder for an issue or error.
 ### `cdc_no_owner`

 - Alert rule:
-
     `sum(rate(ticdc_owner_ownership_counter[240s])) < 0.5`

 - Description:
-
     There is no owner in the TiCDC cluster for more than 10 minutes.

 - Solution:

-    Collect TiCDC logs to locate the root cause.
+    Collect TiCDC logs to identify the root cause.


 ### `ticdc_changefeed_meet_error`
@@ -94,7 +92,6 @@ Warning alerts are a reminder for an issue or error.
     `(max_over_time(ticdc_owner_status[1m]) == 1 or max_over_time(ticdc_owner_status[1m]) == 6) > 0`

 - Description:
-
     A replication task encounters an error.

 - Solution:
From 6a173dd9cf7c9befde53c3785c854e18b9b5eb55 Mon Sep 17 00:00:00 2001
From: xixirangrang
Date: Fri, 6 Dec 2024 09:39:46 +0800
Subject: [PATCH 4/5] Update ticdc/ticdc-alert-rules.md

---
 ticdc/ticdc-alert-rules.md | 1 -
 1 file changed, 1 deletion(-)

diff --git a/ticdc/ticdc-alert-rules.md b/ticdc/ticdc-alert-rules.md
index b0e5903cf979d..9beb1a7c3f593 100644
--- a/ticdc/ticdc-alert-rules.md
+++ b/ticdc/ticdc-alert-rules.md
@@ -84,7 +84,6 @@ Warning alerts are a reminder for an issue or error.

     Collect TiCDC logs to identify the root cause.

-
 ### `ticdc_changefeed_meet_error`

 - Alert rule:
From 9da4dc7eea054ceba0124ebab41166bc0e5e1ac6 Mon Sep 17 00:00:00 2001
From: xixirangrang
Date: Fri, 6 Dec 2024 09:43:15 +0800
Subject: [PATCH 5/5] Update ticdc-alert-rules.md

---
 ticdc/ticdc-alert-rules.md | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/ticdc/ticdc-alert-rules.md b/ticdc/ticdc-alert-rules.md
index 9beb1a7c3f593..5526eee3ddd69 100644
--- a/ticdc/ticdc-alert-rules.md
+++ b/ticdc/ticdc-alert-rules.md
@@ -75,9 +75,11 @@ Warning alerts are a reminder for an issue or error.
 ### `cdc_no_owner`

 - Alert rule:
+
     `sum(rate(ticdc_owner_ownership_counter[240s])) < 0.5`

 - Description:
+
     There is no owner in the TiCDC cluster for more than 10 minutes.

 - Solution:
@@ -91,6 +93,7 @@ Warning alerts are a reminder for an issue or error.
     `(max_over_time(ticdc_owner_status[1m]) == 1 or max_over_time(ticdc_owner_status[1m]) == 6) > 0`

 - Description:
+
     A replication task encounters an error.

 - Solution: