Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Improve] Flink cluster status monitoring improvement #2826

Merged
merged 8 commits into from
Jul 11, 2023
Merged

Conversation

wolfboys
Copy link
Member

@wolfboys wolfboys commented Jul 1, 2023

[Improve] Flink cluster status monitoring improvement

@github-actions github-actions bot added the BUILD label Jul 1, 2023
@wolfboys
Copy link
Member Author

wolfboys commented Jul 2, 2023

cc @xujiangfeng001 @RocMarshal PTAL, thx

Copy link
Contributor

@RocMarshal RocMarshal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for @wolfboys @xujiangfeng001 driving this pr.
I left a few of comments, PTAL in your free time.
Thank you~

return ClusterState.RUNNING;
default:
return ClusterState.STOPPED;
if (status == FinalApplicationStatus.UNDEFINED) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the yarn application was accepted by yarn resourcemanager, the FinalApplicationStatus is UNDEFINED too, but the flink cluster isn't RUNNING

This case was mentioned in #2541 (comment)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the yarn application was accepted by yarn resourcemanager, the FinalApplicationStatus is UNDEFINED too, but the flink cluster isn't RUNNING

This case was mentioned in #2541 (comment)

I also have questions about this logic and further discussion is needed. @xujiangfeng001

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is my negligence. The corresponding logic was not modified during the implementation of this piece of content. The logic here does not need to determine the finalStatus of the application, only the application state needs to be determined as running to consider the flink cluster as running.

public enum YarnApplicationState {
  /** Application which was just created. */
  NEW,

  /** Application which is being saved. */
  NEW_SAVING,

  /** Application which has been submitted. */
  SUBMITTED,

  /** Application has been accepted by the scheduler */
  ACCEPTED,

  /** Application which is currently running. */
  RUNNING,

  /** Application which finished successfully. */
  FINISHED,

  /** Application which failed. */
  FAILED,

  /** Application which was terminated by a user or admin. */
  KILLED
}

Comment on lines +136 to 139
<select id="countJobsByClusterId" resultType="java.lang.Integer" parameterType="java.lang.Long">
select
count(1)
from t_flink_app
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to filter the status of the application here? #2809 (comment)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to filter the status of the application here? #2809 (comment)

Thanks for your feedback. I agree with you. I think We should filter the status of the application

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to filter the status of the application here? #2809 (comment)

Thanks for your feedback. I agree with you. I think We should filter the status of the application

It is difficult to directly query the data table for this issue. I think we need a cache to record the running jobs of each flink cluster.

@xujiangfeng001
Copy link
Contributor

Hi @wolfboys @RocMarshal ,I left a few of comments. Please take a look at it when you have time.

Comment on lines +136 to 139
<select id="countJobsByClusterId" resultType="java.lang.Integer" parameterType="java.lang.Long">
select
count(1)
from t_flink_app
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to filter the status of the application here? #2809 (comment)

Thanks for your feedback. I agree with you. I think We should filter the status of the application

It is difficult to directly query the data table for this issue. I think we need a cache to record the running jobs of each flink cluster.

return ClusterState.RUNNING;
default:
return ClusterState.STOPPED;
if (status == FinalApplicationStatus.UNDEFINED) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is my negligence. The corresponding logic was not modified during the implementation of this piece of content. The logic here does not need to determine the finalStatus of the application, only the application state needs to be determined as running to consider the flink cluster as running.

public enum YarnApplicationState {
  /** Application which was just created. */
  NEW,

  /** Application which is being saved. */
  NEW_SAVING,

  /** Application which has been submitted. */
  SUBMITTED,

  /** Application has been accepted by the scheduler */
  ACCEPTED,

  /** Application which is currently running. */
  RUNNING,

  /** Application which finished successfully. */
  FINISHED,

  /** Application which failed. */
  FAILED,

  /** Application which was terminated by a user or admin. */
  KILLED
}

@wolfboys
Copy link
Member Author

cc @RocMarshal @xujiangfeng001 PTAL

ClusterState state = getClusterStateFromFlinkAPI(flinkCluster);
if (ClusterState.isRunningState(state)) {
public ClusterState getClusterState(FlinkCluster flinkCluster) {
ClusterState state = FAILED_STATES.getIfPresent(flinkCluster.getId());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is a problem with the logic in this method: here, in Yarn Session mode, it won't go jobmanagerUrl request status, only yarn restful api requests will be processed. This is not in line with our original intention of designing the jobmanagerUrl field.

@xujiangfeng001
Copy link
Contributor

Hi @wolfboys ,this generally looks good, I left a comment. If I make a mistake, please correct me.

Copy link
Contributor

@RocMarshal RocMarshal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @wolfboys & @xujiangfeng001 .
LGTM +1.

@wolfboys wolfboys merged commit 9f82570 into dev Jul 11, 2023
@wolfboys wolfboys deleted the cluster-state branch July 11, 2023 15:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants