Detect hang by tensor util #1448
The agent reports the worker's local XPU type (GPU or NPU) to the master. The master then starts a metric collector in the diagnosis manager.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1448      +/-   ##
==========================================
+ Coverage   81.53%   81.62%   +0.09%
==========================================
  Files         240      240
  Lines       23592    24045     +453
==========================================
+ Hits        19235    19626     +391
- Misses       4357     4419      +62
LGTM
Merged commit 2ce0e02 into intelligent-machine-learning:master
What changes were proposed in this pull request?
Upgrade the metric module to provide a metric trace API.
Add a metric collector thread in the diagnosis manager.
Add the --detect-hang-by-xpu-util parameter to restart the job when XPU utilization drops to zero.
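For readers unfamiliar with the pattern, the collector thread described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the class name `MetricCollector`, the `sampler` callable, and the `trace()` method are all hypothetical names chosen for the example.

```python
import threading
import time
from collections import deque


class MetricCollector:
    """Hypothetical collector that polls a sampler on a background thread."""

    def __init__(self, sampler, interval_secs=1.0, max_samples=60):
        # sampler: assumed callable returning the current XPU utilization.
        self._sampler = sampler
        self._interval = interval_secs
        # Bounded buffer so the trace never grows without limit.
        self._samples = deque(maxlen=max_samples)
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def _run(self):
        while not self._stop.is_set():
            value = self._sampler()
            with self._lock:
                self._samples.append((time.time(), value))
            # Event.wait doubles as an interruptible sleep.
            self._stop.wait(self._interval)

    def trace(self):
        """Return a snapshot of the collected (timestamp, value) pairs."""
        with self._lock:
            return list(self._samples)
```

Using `threading.Event.wait` instead of `time.sleep` lets `stop()` return promptly rather than waiting out a full polling interval.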
Why are the changes needed?
Use basic tensor utilization (GPU) or NPU utilization (NPU) to detect job hangs.
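As a rough sketch of the idea, a job can be treated as hung when every utilization sample in a recent window is zero. The function name `is_hung`, the `(timestamp, util)` sample layout, and the use of seconds for the window are assumptions made for this example and may not match the actual implementation.

```python
def is_hung(samples, now, downtime_secs, threshold=0.0):
    """Return True if utilization stayed at or below threshold for the window.

    samples: list of (timestamp_secs, util) pairs sorted by timestamp
             (hypothetical layout, assumed for this sketch).
    now: current time in seconds.
    downtime_secs: length of the window that must be idle.
    """
    window_start = now - downtime_secs
    # Require history reaching back to the window start; otherwise a job
    # that has only just started would be flagged as hung.
    if not samples or samples[0][0] > window_start:
        return False
    recent = [util for ts, util in samples if ts >= window_start]
    return bool(recent) and all(util <= threshold for util in recent)
```

For example, with samples taken every 60 seconds, `is_hung(samples, now=360, downtime_secs=300)` is true only if every sample from t=60 onward is zero and at least one sample exists at or before t=60.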
Does this PR introduce any user-facing change?
Yes. Adds --xpu_type, --hang_detect, and --hang_downtime to the job args.
How was this patch tested?
Unit tests (UT).