Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CELEBORN-1215][FOLLOWUP] Improve PausePushDataTime and PausePushDataAndReplicateTime metric calculation logic #3069

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

vastian180
Copy link

What changes were proposed in this pull request?

As title, improve PausePushDataTime、PausePushDataAndReplicateTime、pausePushDataCounter metric calculation logic.

Why are the changes needed?

During a stress test, it was found that the pausePushDataAndReplicateTime metric value was 55.1 years, which is obviously abnormal. As shown in the figure below.
image

The reason is as follows:
In the process of ServingState transition: NONE PAUSED -> PAUSE PUSH -> PAUSE PUSH AND REPLICATE
The pausePushDataAndReplicateStartTime was not correctly assigned.
When trimCounter >= forceAppendPauseSpentTimeThreshold or ServingState changes from PAUSE PUSH AND REPLICATE -> NONE PAUSED , the appendPauseSpentTime method is executed to update pausePushDataAndReplicateTime.
The execution logic is pausePushDataAndReplicateTime += System.currentTimeMillis() - -1L, which will be displayed as 55.1 years. System.currentTimeMillis()/1000/3600/24/365.

Similarly, in the process of ServingState transition: NONE PAUSED -> PAUSE PUSH AND REPLICATE -> PAUSE PUSH , the pausePushDataStartTime was not correctly assigned.
When trimCounter >= forceAppendPauseSpentTimeThreshold or ServingState changes from PAUSE PUSH -> NONE PAUSED, the appendPauseSpentTime method is executed to update pausePushDataTime, which will be displayed as 55.1 years.

Modify the logic of pausePushDataCounter:
The PAUSE PUSH AND REPLICATE state includes the worker stopping receiving pushData.
Therefore:
When NONE PAUSED -> PAUSE PUSH AND REPLICATE: pausePushDataCounter needs to be increased.
When PAUSE PUSH AND REPLICATE -> PAUSE PUSH: pausePushDataCounter does not need to be increased.

Does this PR introduce any user-facing change?

NO

How was this patch tested?

Celeborn Dashboard
image
MemoryManagerSuite#[CELEBORN-882] Test MemoryManager check memory thread logic

@vastian180
Copy link
Author

Supplement the ServingState change log on the test machine, corresponding to the machine with pausePushDataAndReplicate shown in Figure 2 (the blue line):

25/01/16 22:26:22,681 INFO [worker-memory-manager-checker] MemoryManager: Serving state transformed from NONE_PAUSED to PUSH_PAUSED
25/01/16 22:26:25,830 INFO [worker-memory-manager-checker] MemoryManager: Serving state transformed from PUSH_PAUSED to PUSH_AND_REPLICATE_PAUSED
25/01/16 22:26:47,176 INFO [worker-memory-manager-checker] MemoryManager: Serving state transformed from PUSH_AND_REPLICATE_PAUSED to PUSH_PAUSED
25/01/16 22:26:47,846 INFO [worker-memory-manager-checker] MemoryManager: Serving state transformed from PUSH_PAUSED to NONE_PAUSED
25/01/16 22:28:57,435 INFO [worker-memory-manager-checker] MemoryManager: Serving state transformed from NONE_PAUSED to PUSH_PAUSED
25/01/16 22:29:01,098 INFO [worker-memory-manager-checker] MemoryManager: Serving state transformed from PUSH_PAUSED to NONE_PAUSED

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant