[Feature] flink cluster failure alarm&failover #2423

xujiangfeng001 · 2023-03-10T01:52:00Z

Search before asking

I searched the existing feature requests and found no similar requirement.

Description

a. If a cluster is detected as shut down or lost, an alarm is sent.
b. If jobs are running on that cluster, each of them will raise its own alarm, producing a batch of alarms. In this case, the job alarms need to be suppressed.

Usage Scenario

No response

Related issues

No response

Are you willing to submit a PR?

Yes I am willing to submit a PR!

Code of Conduct

I agree to follow this project's Code of Conduct

xujiangfeng001 · 2023-05-06T06:44:11Z

Hello everyone, after discussion, our solution for this issue is as follows:
Requirement:

When a Flink cluster encounters an exception and can no longer run, alert the user and block the alarm notifications of the jobs running in that Flink cluster.

Detailed logic:

1. The Flink cluster implements heartbeat detection and status updates, as detailed in: [ISSUE-2498][Feature] [SubTask] The cluster supports remote and yarn session heartbeat monitoring #2675
2. When an exception occurs in a job, determine whether the job deployment mode is remote, yarn session or k8s session:
   - If not, send the job alarm directly.
   - If so, obtain the Flink cluster status through the Flink Cluster ID of the job:
     - If the Flink cluster status is STOP or LOST, block the job alarm and wait for the Flink cluster alarm.
     - If the Flink cluster status is RUNNING, actively trigger a status update request to refresh the Flink cluster status. If the refresh moves the cluster to STOP or LOST, block the job alarm; if the status is still RUNNING, send the alarm notification for the job.
3. The Flink cluster alarm template reuses the job alarm template and adds one piece of information: the number of affected jobs.
4. Abstract the alarm template code to avoid duplicated code.
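
The decision flow in the detailed logic above can be sketched as follows. This is only a minimal illustration and not the code merged in #2809: the names `ExecutionMode`, `ClusterState`, `should_send_job_alarm` and the `refresh_state` callback are hypothetical, and treating any status other than STOP or LOST like RUNNING is an assumption, since the description only covers those three statuses.

```python
from enum import Enum
from typing import Callable


class ExecutionMode(Enum):
    # Hypothetical names; only the first three modes are bound to a shared cluster.
    REMOTE = "remote"
    YARN_SESSION = "yarn-session"
    K8S_SESSION = "k8s-session"
    YARN_APPLICATION = "yarn-application"


class ClusterState(Enum):
    # The three statuses mentioned in the description.
    RUNNING = "running"
    STOP = "stop"
    LOST = "lost"


# Deployment modes for which the job depends on a separately tracked cluster.
CLUSTER_MODES = {
    ExecutionMode.REMOTE,
    ExecutionMode.YARN_SESSION,
    ExecutionMode.K8S_SESSION,
}

# Statuses that mean the cluster alarm should replace the job alarms.
DOWN_STATES = {ClusterState.STOP, ClusterState.LOST}


def should_send_job_alarm(
    mode: ExecutionMode,
    cached_state: ClusterState,
    refresh_state: Callable[[], ClusterState],
) -> bool:
    """Return True if the job alarm should be sent, False if it is blocked."""
    # Jobs that do not run on a shared cluster alarm directly.
    if mode not in CLUSTER_MODES:
        return True
    # The cluster is already known to be down: wait for the cluster alarm.
    if cached_state in DOWN_STATES:
        return False
    # The cluster looks healthy: force a status refresh before deciding.
    latest_state = refresh_state()
    return latest_state not in DOWN_STATES


if __name__ == "__main__":
    # A session job fails while the cached status is stale: the refresh
    # reveals that the cluster is lost, so the job alarm is blocked.
    blocked = not should_send_job_alarm(
        ExecutionMode.YARN_SESSION, ClusterState.RUNNING, lambda: ClusterState.LOST
    )
    print("job alarm blocked:", blocked)
```

Passing the refresh as a callback keeps the sketch independent of how the status is actually fetched, and the refresh is only triggered when the cached status does not already settle the decision.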

xujiangfeng001 mentioned this issue Mar 10, 2023

[Feature] flink cluster new status tracking #2425

Open

7 tasks

xujiangfeng001 mentioned this issue Jun 20, 2023

[ISSUE-2423][Feature][WIP] flink cluster failure alarm&failover #2809

Merged

wolfboys closed this as completed in #2809 Jul 1, 2023
