Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ISSUE-2423][Feature][WIP] flink cluster failure alarm&failover #2809

Merged
merged 5 commits into from
Jul 1, 2023

Conversation

xujiangfeng001
Copy link
Contributor

What changes were proposed in this pull request

Issue Number: close #2423

Brief change log

For details, please refer to: #2423 (comment).

This PR only completes the server part, and the client configuration alert will be completed later.

Verifying this change

This change is a trivial rework / code cleanup without any test coverage.

Does this pull request potentially affect one of the following parts

  • Dependencies (does it add or upgrade a dependency): (no)

@xujiangfeng001
Copy link
Contributor Author

Hi @wolfboys @RocMarshal, PTAL.

@RocMarshal
Copy link
Contributor

Hi @wolfboys @RocMarshal, PTAL.

@xujiangfeng001 Thanks for the contribution. I'll check it ASAP.

Copy link
Contributor

@RocMarshal RocMarshal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, @xujiangfeng001 , sorry for the late response.
Looks good to me on the whole.
I left a few of comments, please let me know what's your opinion.
Thank you.

Comment on lines +61 to +63
add column "start_time" timestamp(6) collate "pg_catalog"."default",
add column "end_time" timestamp(6) collate "pg_catalog"."default",
add column "alert_id" int8 collate "pg_catalog"."default";
Copy link
Contributor

@RocMarshal RocMarshal Jun 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to split it into two statements based on PR granularity here?
CC @wolfboys

Comment on lines +48 to +50
add column `start_time` datetime default null comment 'start time',
add column `end_time` datetime default null comment 'end time',
add column `alert_id` bigint default null comment 'alert id';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to split it into two statements based on PR granularity here?
CC @wolfboys

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that these three statements are all related to the granularity of the flink cluster alarm, so I do not think it is necessary to divide them into multiple statements. If I am wrong, please correct me. CC @wolfboys

Comment on lines 136 to 143
<select id="getJobByClusterId" resultType="java.lang.Integer" parameterType="java.lang.Long">
SELECT
count(1)
FROM t_flink_app
WHERE flink_cluster_id = #{clusterId}
limit 1
</select>

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Do we need to filter the status of the job here ?

I sorted out briefly the background and logic.
Assuming there is a job with a status of cancelled but current cluster information, will this job be counted if the cluster is abnormal?

  • We'd better to get a better select-id name here.

Please correct me if I'm wrong.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have carefully considered here and it is indeed necessary to filter the status.

I want to filter the job status that is not add or cancelled. I need to explain why it is necessary to filter out tasks that are not add or cancelled:

Because during the execution of this SQL statement, jobs in other states can be considered running or preparing to run in the flink cluster, but it may be due to the issue of two scheduling threads being out of sync,unable to update the job status in a timely manner, it may not be possible to determine the affected jobs based on a certain status.

What do you think of getAffectedJobsByClusterId regarding select id.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds good~

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Do we need to filter the status of the job here ?

I sorted out briefly the background and logic. Assuming there is a job with a status of cancelled but current cluster information, will this job be counted if the cluster is abnormal?

  • We'd better to get a better select-id name here.

Please correct me if I'm wrong.

Hi @RocMarshal @wolfboys , Regarding this issue, I found in the implementation that due to the asynchronous monitoring of the application and flink cluster threads, it is not possible to directly determine whether it is an affected job based on the state in the application. I may be looking for a more suitable implementation method. I plan to maintain the original logic regarding this PR. If it is a job deployed in the flink cluster, it will be defined as an affected job regardless of its status.

I look forward to your suggestions and responses very much.

@xujiangfeng001
Copy link
Contributor Author

Hi @RocMarshal, Thank you very much for your review and look forward to your reply.

Copy link
Contributor

@RocMarshal RocMarshal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx for the comments~

@xujiangfeng001 Would you mind changing the head title [Feature][WIP][Service] flink cluster failure alarm&failover like [Issue-xxx][Feature] xxx ?

I notice here's conflict when rebasing the dev branch, would you like to resolve it before the next review ?
Thank you~

Hi, @MonsterChenzhuo Could you help to check the alarm logic on k8s mode ? thx so much.

@xujiangfeng001 xujiangfeng001 changed the title [Feature][WIP][Service] flink cluster failure alarm&failover [ISSUE-2423][Feature][WIP] flink cluster failure alarm&failover Jun 29, 2023
@xujiangfeng001
Copy link
Contributor Author

Thx for the comments~

@xujiangfeng001 Would you mind changing the head title [Feature][WIP][Service] flink cluster failure alarm&failover like [Issue-xxx][Feature] xxx ?

I notice here's conflict when rebasing the dev branch, would you like to resolve it before the next review ? Thank you~

Hi, @MonsterChenzhuo Could you help to check the alarm logic on k8s mode ? thx so much.

Hi @RocMarshal , thank you for your reply. I will finish modifying the code and handling conflicts before the next review.

@wolfboys
Copy link
Member

wolfboys commented Jul 1, 2023

Overall it looks good, I'll merged this pr first, there are still some minor problems, We can re-submit pr for improvement.

@wolfboys wolfboys merged commit 72207f4 into apache:dev Jul 1, 2023
@xujiangfeng001 xujiangfeng001 deleted the ISSUE-2423 branch July 17, 2023 08:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature] flink cluster failure alarm&failover
3 participants