-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ISSUE-2423][Feature][WIP] flink cluster failure alarm&failover #2809
Conversation
Hi @wolfboys @RocMarshal, PTAL. |
@xujiangfeng001 Thanks for the contribution. I'll check it ASAP. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, @xujiangfeng001 , sorry for the late response.
Looks good to me on the whole.
I left a few of comments, please let me know what's your opinion.
Thank you.
add column "start_time" timestamp(6) collate "pg_catalog"."default", | ||
add column "end_time" timestamp(6) collate "pg_catalog"."default", | ||
add column "alert_id" int8 collate "pg_catalog"."default"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to split it into two statements based on PR granularity here?
CC @wolfboys
add column `start_time` datetime default null comment 'start time', | ||
add column `end_time` datetime default null comment 'end time', | ||
add column `alert_id` bigint default null comment 'alert id'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to split it into two statements based on PR granularity here?
CC @wolfboys
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe that these three statements are all related to the granularity of the flink cluster
alarm, so I do not think it is necessary to divide them into multiple statements. If I am wrong, please correct me. CC @wolfboys
<select id="getJobByClusterId" resultType="java.lang.Integer" parameterType="java.lang.Long"> | ||
SELECT | ||
count(1) | ||
FROM t_flink_app | ||
WHERE flink_cluster_id = #{clusterId} | ||
limit 1 | ||
</select> | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Do we need to filter the status of the job here ?
I sorted out briefly the background and logic.
Assuming there is a job with a status of cancelled but current cluster information, will this job be counted if the cluster is abnormal?
- We'd better to get a better select-id name here.
Please correct me if I'm wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have carefully considered here and it is indeed necessary to filter the status.
I want to filter the job status that is not add
or cancelled
. I need to explain why it is necessary to filter out tasks that are not add
or cancelled
:
Because during the execution of this SQL statement, jobs in other states can be considered running or preparing to run in the flink cluster
, but it may be due to the issue of two scheduling threads being out of sync,unable to update the job status in a timely manner, it may not be possible to determine the affected jobs based on a certain status.
What do you think of getAffectedJobsByClusterId
regarding select id.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds good~
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Do we need to filter the status of the job here ?
I sorted out briefly the background and logic. Assuming there is a job with a status of cancelled but current cluster information, will this job be counted if the cluster is abnormal?
- We'd better to get a better select-id name here.
Please correct me if I'm wrong.
Hi @RocMarshal @wolfboys , Regarding this issue, I found in the implementation that due to the asynchronous monitoring of the application
and flink cluster
threads, it is not possible to directly determine whether it is an affected job based on the state in the application. I may be looking for a more suitable implementation method. I plan to maintain the original logic regarding this PR. If it is a job deployed in the flink cluster
, it will be defined as an affected job regardless of its status.
I look forward to your suggestions and responses very much.
Hi @RocMarshal, Thank you very much for your review and look forward to your reply. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thx for the comments~
@xujiangfeng001 Would you mind changing the head title [Feature][WIP][Service] flink cluster failure alarm&failover
like [Issue-xxx][Feature] xxx
?
I notice here's conflict when rebasing the dev
branch, would you like to resolve it before the next review ?
Thank you~
Hi, @MonsterChenzhuo Could you help to check the alarm logic on k8s
mode ? thx so much.
Hi @RocMarshal , thank you for your reply. I will finish modifying the code and handling conflicts before the next review. |
Overall it looks good, I'll merged this pr first, there are still some minor problems, We can re-submit pr for improvement. |
What changes were proposed in this pull request
Issue Number: close #2423
Brief change log
For details, please refer to: #2423 (comment).
This PR only completes the server part, and the client configuration
alert
will be completed later.Verifying this change
This change is a trivial rework / code cleanup without any test coverage.
Does this pull request potentially affect one of the following parts