Detect hang by tensor util #1448

Merged

majieyue (Collaborator) commented Jan 21, 2025

What changes were proposed in this pull request?

Upgrade the metric module to provide a metric trace API.
Add a metric collector thread to the diagnosis manager.
Add the --detect-hang-by-xpu-util parameter to restart the job when xPU utilization drops to zero.
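
To illustrate the collector-thread idea from the list above, here is a minimal sketch of a background sampler. It is not DLRover's actual implementation: the class name `MetricCollector`, the `sample_fn` callback, and the parameters `interval_s` and `max_samples` are all hypothetical names chosen for this example.

```python
import threading
import time
from collections import deque
from typing import Callable, Deque, List, Tuple


class MetricCollector:
    """Hypothetical background sampler; not DLRover's real class."""

    def __init__(
        self,
        sample_fn: Callable[[], float],  # assumed to return a utilization value
        interval_s: float = 1.0,
        max_samples: int = 600,
    ) -> None:
        self._sample_fn = sample_fn
        self._interval_s = interval_s
        # Bounded buffer of (timestamp, value) pairs; old samples are dropped.
        self._samples: Deque[Tuple[float, float]] = deque(maxlen=max_samples)
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        self._thread.join()

    def _run(self) -> None:
        while not self._stop.is_set():
            value = self._sample_fn()
            with self._lock:
                self._samples.append((time.time(), value))
            # Sleep for the interval, but wake early if stop() is called.
            self._stop.wait(self._interval_s)

    def snapshot(self) -> List[Tuple[float, float]]:
        """Return a copy of the collected samples, oldest first."""
        with self._lock:
            return list(self._samples)
```

Using `Event.wait` instead of `time.sleep` lets `stop()` interrupt the loop promptly, which keeps shutdown fast even with long sampling intervals.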

Why are the changes needed?

Use basic tensor utilization (GPU) or NPU utilization (NPU) to detect job hangs.
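
As a rough sketch of how utilization-based hang detection might work, the function below flags a hang when every sample in a recent time window is at or below a threshold. This is an assumption-laden illustration, not the logic in this PR: the function name `is_hang`, the `(timestamp, value)` sample layout, and the `downtime_s` and `threshold` parameters are hypothetical.

```python
from typing import Sequence, Tuple


def is_hang(
    samples: Sequence[Tuple[float, float]],  # (timestamp, utilization), oldest first
    now: float,
    downtime_s: float,
    threshold: float = 0.0,
) -> bool:
    """Return True if utilization stayed <= threshold for the whole window.

    Hypothetical helper: the window is the last `downtime_s` seconds before `now`.
    """
    window_start = now - downtime_s
    # Without history reaching back to the window start, we cannot conclude
    # that utilization was low for the entire window.
    if not samples or samples[0][0] > window_start:
        return False
    recent = [value for ts, value in samples if ts >= window_start]
    return bool(recent) and all(value <= threshold for value in recent)
```

Requiring the history to cover the full window avoids a false positive right after startup, when only a few low samples have been collected.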

Does this PR introduce any user-facing change?

Yes. This PR adds --xpu_type, --hang_detect and --hang_downtime to the job args.

How was this patch tested?

Unit tests (UT).

Ma Jie Yue added 2 commits January 21, 2025 22:30
The agent reports the worker's local xPU type (GPU or NPU) to the master, and the master starts the metric collector in the diagnosis manager.

codecov bot commented Jan 21, 2025

Codecov Report

Attention: Patch coverage is 85.85859% with 70 lines in your changes missing coverage. Please review.

Project coverage is 81.62%. Comparing base (03c965f) to head (a47a4e5).
Report is 1 commits behind head on master.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...rover/python/master/diagnosis/diagnosis_manager.py | 75.00% | 22 Missing ⚠️ |
| dlrover/python/common/metric/monitor.py | 66.12% | 21 Missing ⚠️ |
| dlrover/python/common/metric/context.py | 79.03% | 13 Missing ⚠️ |
| dlrover/python/master/main.py | 0.00% | 8 Missing ⚠️ |
| dlrover/python/common/metric/metric.py | 90.00% | 2 Missing ⚠️ |
| dlrover/python/master/dist_master.py | 60.00% | 2 Missing ⚠️ |
| dlrover/python/elastic_agent/torch/training.py | 50.00% | 1 Missing ⚠️ |
| dlrover/python/master/node/dist_job_manager.py | 75.00% | 1 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1448      +/-   ##
==========================================
+ Coverage   81.53%   81.62%   +0.09%     
==========================================
  Files         240      240              
  Lines       23592    24045     +453     
==========================================
+ Hits        19235    19626     +391     
- Misses       4357     4419      +62     


majieyue requested a review from nash635 as a code owner January 26, 2025 11:20
majieyue changed the title from "[WIP] Detect hang by tensor util" to "Detect hang by tensor util" Jan 26, 2025
BalaBalaYi (Collaborator) left a comment

LGTM

majieyue merged commit 2ce0e02 into intelligent-machine-learning:master Jan 27, 2025
10 checks passed
2 participants