[ISSUE-2423][Feature][WIP] flink cluster failure alarm&failover #2809

xujiangfeng001 · 2023-06-20T16:03:58Z

What changes were proposed in this pull request

Issue Number: close #2423

Brief change log

For details, please refer to: #2423 (comment).

This PR completes only the server side; the client-side alert configuration will be completed later.

Verifying this change

This change is a trivial rework / code cleanup without any test coverage.

Does this pull request potentially affect one of the following parts

Dependencies (does it add or upgrade a dependency): (no)

xujiangfeng001 · 2023-06-21T01:26:26Z

Hi @wolfboys @RocMarshal, PTAL.

RocMarshal · 2023-06-21T02:10:42Z

> Hi @wolfboys @RocMarshal, PTAL.

@xujiangfeng001 Thanks for the contribution. I'll check it ASAP.

RocMarshal

Hi @xujiangfeng001, sorry for the late response.
Looks good to me on the whole.
I left a few comments; please let me know what you think.
Thank you.

RocMarshal · 2023-06-26T14:39:55Z

streampark-console/streampark-console-service/src/main/assembly/script/upgrade/pgsql/2.2.0.sql

+    add column "start_time" timestamp(6) collate "pg_catalog"."default",
+    add column "end_time" timestamp(6) collate "pg_catalog"."default",
+    add column "alert_id" int8 collate "pg_catalog"."default";


Do we need to split it into two statements based on PR granularity here?
CC @wolfboys

RocMarshal · 2023-06-26T14:40:24Z

streampark-console/streampark-console-service/src/main/assembly/script/upgrade/mysql/2.2.0.sql

+    add column `start_time` datetime default null comment 'start time',
+    add column `end_time` datetime default null comment 'end time',
+    add column `alert_id` bigint default null comment 'alert id';


Do we need to split it into two statements based on PR granularity here?
CC @wolfboys

I believe that these three statements are all related to the granularity of the flink cluster alarm, so I do not think it is necessary to divide them into multiple statements. If I am wrong, please correct me. CC @wolfboys

RocMarshal · 2023-06-26T15:21:04Z

...park-console/streampark-console-service/src/main/resources/mapper/core/ApplicationMapper.xml

+    <select id="getJobByClusterId" resultType="java.lang.Integer" parameterType="java.lang.Long">
+        SELECT
+            count(1)
+        FROM t_flink_app
+        WHERE flink_cluster_id = #{clusterId}
+            limit 1
+    </select>
+


Do we need to filter by job status here?

I briefly went through the background and logic.
Suppose a job has a status of cancelled but still carries the current cluster's information: will this job be counted if the cluster becomes abnormal?

We should also pick a better select id name here.

Please correct me if I'm wrong.

I have considered this carefully, and it is indeed necessary to filter by status.

I want to count only the jobs whose status is neither add nor cancelled. The reason for excluding those two statuses is as follows.

When this SQL statement runs, a job in any other state can be considered to be running, or preparing to run, in the Flink cluster. However, the two scheduling threads may be out of sync and fail to update the job status in time, so it may not be possible to identify the affected jobs from one specific status.

As for the select id, what do you think of getAffectedJobsByClusterId?

That sounds good~
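
For illustration, the status filter agreed on above could be sketched in Java roughly as follows. This is a minimal in-memory sketch under stated assumptions, not the code in this PR: the AffectedJobFilter class, the Job type, and the JobState values are hypothetical stand-ins for a t_flink_app row and the project's own state enum. Only the name getAffectedJobsByClusterId comes from the discussion.

```java
import java.util.Arrays;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

/** Hypothetical, simplified model; not StreamPark's actual classes. */
final class AffectedJobFilter {

    /** Illustrative states only; the real project defines its own state enum. */
    enum JobState { ADDED, STARTING, RUNNING, CANCELLED }

    /** Minimal stand-in for a row of t_flink_app. */
    static final class Job {
        final long flinkClusterId;
        final JobState state;

        Job(long flinkClusterId, JobState state) {
            this.flinkClusterId = flinkClusterId;
            this.state = state;
        }
    }

    /** States assumed to mean "not on the cluster", so they are skipped. */
    private static final Set<JobState> IGNORED =
            EnumSet.of(JobState.ADDED, JobState.CANCELLED);

    /** Counts jobs bound to the cluster whose state is neither ADDED nor CANCELLED. */
    static int getAffectedJobsByClusterId(List<Job> jobs, long clusterId) {
        return (int) jobs.stream()
                .filter(job -> job.flinkClusterId == clusterId)
                .filter(job -> !IGNORED.contains(job.state))
                .count();
    }

    public static void main(String[] args) {
        List<Job> jobs = Arrays.asList(
                new Job(1L, JobState.RUNNING),
                new Job(1L, JobState.CANCELLED),
                new Job(1L, JobState.STARTING),
                new Job(2L, JobState.RUNNING));
        // Cluster 1 has three jobs, one of them cancelled.
        System.out.println(getAffectedJobsByClusterId(jobs, 1L)); // prints 2
    }
}
```

In a mapper, the same idea would presumably become an extra status condition in the WHERE clause; the exact column name and state codes are left open here because the thread does not state them.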

> Do we need to filter by job status here?

> I briefly went through the background and logic. Suppose a job has a status of cancelled but still carries the current cluster's information: will this job be counted if the cluster becomes abnormal?

> We should also pick a better select id name here.

> Please correct me if I'm wrong.

Hi @RocMarshal @wolfboys, regarding this issue: during implementation I found that, because the application and the Flink cluster are monitored by separate asynchronous threads, the state recorded in the application cannot be used directly to decide whether a job is affected. I may look for a more suitable implementation later. For this PR I plan to keep the original logic: any job deployed in the Flink cluster is treated as an affected job, regardless of its status.

I look forward to your suggestions and responses very much.
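
The rule kept for this PR can be sketched as below. It is only an assumption-laden illustration: ClusterBoundJobs, Job, and recordedState are hypothetical names, and the check works on an in-memory list rather than on t_flink_app. The point it shows is that the recorded state, which may lag behind because the two monitoring threads are not synchronized, plays no part in the decision.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch; the names below are illustrative, not StreamPark's API. */
final class ClusterBoundJobs {

    /** Minimal stand-in for a row of t_flink_app. */
    static final class Job {
        final long flinkClusterId;
        final String recordedState; // may be stale when the monitors are out of sync

        Job(long flinkClusterId, String recordedState) {
            this.flinkClusterId = flinkClusterId;
            this.recordedState = recordedState;
        }
    }

    /** True if any job is bound to the cluster; recordedState is deliberately ignored. */
    static boolean hasAffectedJobs(List<Job> jobs, long clusterId) {
        return jobs.stream().anyMatch(job -> job.flinkClusterId == clusterId);
    }

    public static void main(String[] args) {
        List<Job> jobs = Arrays.asList(
                new Job(1L, "CANCELLED"), // possibly stale, still counted
                new Job(2L, "RUNNING"));
        System.out.println(hasAffectedJobs(jobs, 1L)); // prints true
        System.out.println(hasAffectedJobs(jobs, 3L)); // prints false
    }
}
```

The trade-off is the one raised in the review: a job that really is cancelled is still reported as affected, which errs on the side of alerting too often rather than missing a job.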

xujiangfeng001 · 2023-06-27T16:04:54Z

Hi @RocMarshal, Thank you very much for your review and look forward to your reply.

RocMarshal

Thanks for the comments~

@xujiangfeng001 Would you mind changing the title [Feature][WIP][Service] flink cluster failure alarm&failover to the form [Issue-xxx][Feature] xxx?

I noticed a conflict when rebasing onto the dev branch. Could you resolve it before the next review?
Thank you~

Hi @MonsterChenzhuo, could you help check the alarm logic in k8s mode? Thanks so much.

xujiangfeng001 · 2023-06-29T03:28:34Z

> Thanks for the comments~

> @xujiangfeng001 Would you mind changing the title [Feature][WIP][Service] flink cluster failure alarm&failover to the form [Issue-xxx][Feature] xxx?

> I noticed a conflict when rebasing onto the dev branch. Could you resolve it before the next review? Thank you~

> Hi @MonsterChenzhuo, could you help check the alarm logic in k8s mode? Thanks so much.

Hi @RocMarshal , thank you for your reply. I will finish modifying the code and handling conflicts before the next review.

wolfboys · 2023-07-01T09:41:57Z

Overall it looks good. I'll merge this PR first; there are still some minor problems, which we can address in a follow-up PR.

[Feature] flink cluster failure alarm&failover

41b8bac

github-actions bot added BACKEND BUILD labels Jun 20, 2023

Merge branch 'dev' into ISSUE-2423

5b1efde

[Feature] flink cluster failure alarm&failover

3d1c923

RocMarshal reviewed Jun 26, 2023

RocMarshal reviewed Jun 28, 2023

xujiangfeng001 changed the title ~~[Feature][WIP][Service] flink cluster failure alarm&failover~~ [ISSUE-2423][Feature][WIP] flink cluster failure alarm&failover Jun 29, 2023

xujiangfeng001 and others added 2 commits July 1, 2023 15:43

Merge branch 'dev' into ISSUE-2423

2657605

[Feature] flink cluster failure alarm&failover

9c4c008

wolfboys approved these changes Jul 1, 2023

wolfboys merged commit 72207f4 into apache:dev Jul 1, 2023

RocMarshal mentioned this pull request Jul 2, 2023

[Improve] Flink cluster status monitoring improvement #2826

Merged

xujiangfeng001 deleted the ISSUE-2423 branch July 17, 2023 08:19
