[Improve] Flink cluster status monitoring improvement #2826

wolfboys · 2023-07-01T17:27:36Z

[Improve] Flink cluster status monitoring improvement

wolfboys · 2023-07-02T10:16:51Z

cc @xujiangfeng001 @RocMarshal PTAL, thx

RocMarshal

Thanks @wolfboys and @xujiangfeng001 for driving this PR.
I left a few comments; PTAL when you have time.
Thank you~

RocMarshal · 2023-07-02T14:53:27Z

...nsole-service/src/main/java/org/apache/streampark/console/core/task/FlinkClusterWatcher.java

-        return ClusterState.RUNNING;
-      default:
-        return ClusterState.STOPPED;
+    if (status == FinalApplicationStatus.UNDEFINED) {


When the YARN application has been accepted by the YARN ResourceManager, the FinalApplicationStatus is also UNDEFINED, but the Flink cluster is not RUNNING yet.

This case was mentioned in #2541 (comment)

I also have questions about this logic and further discussion is needed. @xujiangfeng001

xujiangfeng001 · 2023-07-07T05:47:14Z

This was my oversight: I did not update the corresponding logic when implementing this part. The check here does not need to look at the application's finalStatus at all; the Flink cluster should be considered running only when the YARN application state is RUNNING.

public enum YarnApplicationState {
  /** Application which was just created. */
  NEW,
  /** Application which is being saved. */
  NEW_SAVING,
  /** Application which has been submitted. */
  SUBMITTED,
  /** Application has been accepted by the scheduler */
  ACCEPTED,
  /** Application which is currently running. */
  RUNNING,
  /** Application which finished successfully. */
  FINISHED,
  /** Application which failed. */
  FAILED,
  /** Application which was terminated by a user or admin. */
  KILLED
}
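The rule in the reply above can be sketched as follows, assuming a simplified model. `YarnAppState` only mirrors Hadoop's `YarnApplicationState` so the snippet compiles without Hadoop on the classpath, and `SketchClusterState`, its `STARTING` value, and `ClusterStateMapper` are hypothetical stand-ins, not StreamPark's `ClusterState`. Only the application state is consulted, because `FinalApplicationStatus` stays `UNDEFINED` both while an application is ACCEPTED and while it is RUNNING, so it cannot tell the two apart.

```java
import java.util.EnumSet;
import java.util.Set;

// Mirrors org.apache.hadoop.yarn.api.records.YarnApplicationState (sketch only).
enum YarnAppState { NEW, NEW_SAVING, SUBMITTED, ACCEPTED, RUNNING, FINISHED, FAILED, KILLED }

// Hypothetical, simplified stand-in for ClusterState; STARTING is an assumption of this sketch.
enum SketchClusterState { STARTING, RUNNING, STOPPED, FAILED }

final class ClusterStateMapper {
  // States in which YARN has not started running the application yet.
  private static final Set<YarnAppState> PENDING = EnumSet.of(
      YarnAppState.NEW, YarnAppState.NEW_SAVING, YarnAppState.SUBMITTED, YarnAppState.ACCEPTED);

  private ClusterStateMapper() {}

  // Derives the cluster state from the application state only; FinalApplicationStatus is ignored.
  static SketchClusterState fromYarnState(YarnAppState state) {
    if (state == YarnAppState.RUNNING) {
      return SketchClusterState.RUNNING;
    }
    if (PENDING.contains(state)) {
      // e.g. ACCEPTED: final status is UNDEFINED, but the cluster is not up yet
      return SketchClusterState.STARTING;
    }
    if (state == YarnAppState.FAILED) {
      return SketchClusterState.FAILED;
    }
    // FINISHED and KILLED: the session cluster is gone.
    return SketchClusterState.STOPPED;
  }
}
```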

RocMarshal · 2023-07-02T14:55:11Z

...park-console/streampark-console-service/src/main/resources/mapper/core/ApplicationMapper.xml

+    <select id="countJobsByClusterId" resultType="java.lang.Integer" parameterType="java.lang.Long">
        select
            count(1)
        from t_flink_app


Do we need to filter the status of the application here? #2809 (comment)

Thanks for your feedback. I agree with you; I think we should filter by the application's status.
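To illustrate the filtering being discussed, here is an in-memory Java sketch rather than the actual MyBatis query. `FlinkApp`, `AppState`, `ClusterJobCounter`, and the choice of which states count as active are all hypothetical; which application states should count is exactly the open question in this thread.

```java
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

// Hypothetical application states; the real StreamPark enum differs.
enum AppState { ADDED, STARTING, RUNNING, CANCELED, FAILED, FINISHED }

// Hypothetical projection of a t_flink_app row.
record FlinkApp(long id, long clusterId, AppState state) {}

final class ClusterJobCounter {
  // States assumed to mean "this job is still using the cluster".
  static final Set<AppState> ACTIVE = EnumSet.of(AppState.STARTING, AppState.RUNNING);

  private ClusterJobCounter() {}

  // Counts only active jobs bound to the cluster, not every row that carries its id.
  static long countActiveJobs(List<FlinkApp> apps, long clusterId) {
    return apps.stream()
        .filter(app -> app.clusterId() == clusterId)
        .filter(app -> ACTIVE.contains(app.state()))
        .count();
  }
}
```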

xujiangfeng001 · 2023-07-03T02:24:36Z

It is difficult to solve this by querying the data table directly. I think we need a cache that records the running jobs of each Flink cluster.
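A minimal sketch of such a cache, assuming it is kept in memory and updated whenever a job's state changes. The class and method names are hypothetical, not StreamPark code. Each update runs inside `ConcurrentHashMap.compute`/`computeIfPresent`, so adding and removing jobs for the same cluster cannot race and leave an orphaned set behind.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical registry: cluster id -> ids of the applications currently running on it.
final class ClusterJobRegistry {
  private final Map<Long, Set<Long>> runningJobs = new ConcurrentHashMap<>();

  // Record that an application started running on a cluster (atomic per cluster id).
  void onJobRunning(long clusterId, long appId) {
    runningJobs.compute(clusterId, (id, jobs) -> {
      Set<Long> updated = (jobs == null) ? ConcurrentHashMap.newKeySet() : jobs;
      updated.add(appId);
      return updated;
    });
  }

  // Record that an application stopped (canceled, failed, finished) or left the cluster.
  void onJobStopped(long clusterId, long appId) {
    runningJobs.computeIfPresent(clusterId, (id, jobs) -> {
      jobs.remove(appId);
      return jobs.isEmpty() ? null : jobs; // returning null removes the empty entry
    });
  }

  // Answered from memory instead of querying t_flink_app.
  int countRunningJobs(long clusterId) {
    Set<Long> jobs = runningJobs.get(clusterId);
    return (jobs == null) ? 0 : jobs.size();
  }
}
```

A real implementation would also need to rebuild this map on startup, since it is lost when the console restarts.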

xujiangfeng001 · 2023-07-03T02:26:53Z

Hi @wolfboys @RocMarshal, I left a few comments. Please take a look when you have time.

streampark-console/streampark-console-service/src/main/assembly/script/schema/mysql-schema.sql

wolfboys · 2023-07-10T01:21:24Z

cc @RocMarshal @xujiangfeng001 PTAL

xujiangfeng001 · 2023-07-10T02:46:27Z

...nsole-service/src/main/java/org/apache/streampark/console/core/task/FlinkClusterWatcher.java

-    ClusterState state = getClusterStateFromFlinkAPI(flinkCluster);
-    if (ClusterState.isRunningState(state)) {
+  public ClusterState getClusterState(FlinkCluster flinkCluster) {
+    ClusterState state = FAILED_STATES.getIfPresent(flinkCluster.getId());


I think there is a problem with the logic in this method: in YARN Session mode it never requests the cluster status via jobmanagerUrl; only the YARN REST API is queried. This does not match our original intention in designing the jobmanagerUrl field.
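A minimal sketch of the order of checks this comment argues for: try the configured jobmanagerUrl first, including in YARN Session mode, and fall back to the YARN REST API only when there is no URL or it gives no answer. The probes are passed in as functions so the sketch stays self-contained. `SessionClusterProbe`, `SessionCluster`, `ProbeState`, and both probe functions are hypothetical, not StreamPark's API. A real `restProbe` might call the Flink JobManager's `/overview` REST endpoint, and a real `yarnProbe` the ResourceManager's `/ws/v1/cluster/apps/{appid}` endpoint.

```java
import java.util.Optional;
import java.util.function.Function;

// Hypothetical, simplified stand-in for ClusterState.
enum ProbeState { RUNNING, STOPPED, UNKNOWN }

// Hypothetical view of the two FlinkCluster fields this sketch needs.
record SessionCluster(long id, String jobManagerUrl, String yarnAppId) {}

final class SessionClusterProbe {
  // Each probe returns empty when its endpoint is unreachable or gives no usable answer.
  private final Function<String, Optional<ProbeState>> restProbe; // keyed by JobManager URL
  private final Function<String, Optional<ProbeState>> yarnProbe; // keyed by YARN application id

  SessionClusterProbe(Function<String, Optional<ProbeState>> restProbe,
                      Function<String, Optional<ProbeState>> yarnProbe) {
    this.restProbe = restProbe;
    this.yarnProbe = yarnProbe;
  }

  ProbeState getClusterState(SessionCluster cluster) {
    // 1. Prefer the jobManagerUrl when one is configured, also in YARN Session mode.
    String url = cluster.jobManagerUrl();
    if (url != null && !url.isBlank()) {
      Optional<ProbeState> viaRest = restProbe.apply(url);
      if (viaRest.isPresent()) {
        return viaRest.get();
      }
    }
    // 2. Fall back to the YARN REST API when there is no URL or it did not answer.
    String appId = cluster.yarnAppId();
    if (appId != null) {
      return yarnProbe.apply(appId).orElse(ProbeState.UNKNOWN);
    }
    return ProbeState.UNKNOWN;
  }
}
```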

xujiangfeng001 · 2023-07-10T02:49:11Z

Hi @wolfboys, this generally looks good. I left a comment; if I'm mistaken, please correct me.

RocMarshal

Thanks @wolfboys & @xujiangfeng001.
LGTM +1.

[Improve] Flink cluster status monitoring improvement

5eea064

github-actions bot added the BACKEND label Jul 1, 2023

wolfboys added 4 commits July 2, 2023 01:40

FlinkRESTAPIWatcher improvement

8f87839

import package improvement

a5a934c

String join improve

d21a36a

ddl improve

2e31a61

github-actions bot added the BUILD label Jul 1, 2023

Flink cluster state monitoring improve

6a9d1d3

flink rest api watcher improvement

2afc503

RocMarshal reviewed Jul 2, 2023

xujiangfeng001 reviewed Jul 7, 2023

flink cluster jobManagerUrl improvement

de74103

xujiangfeng001 reviewed Jul 10, 2023

RocMarshal approved these changes Jul 11, 2023

wolfboys merged commit 9f82570 into dev Jul 11, 2023

wolfboys deleted the cluster-state branch July 11, 2023 15:53
