New precheck procedure to enhance stability. #1453

Open
wants to merge 14 commits into
base: master

BalaBalaYi
Collaborator

What changes were proposed in this pull request?

  1. Pre-check operator API definition.
  2. Design doc.
  3. Base implementation of the pre-check procedure, covering both master and worker.
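
As a rough illustration of the shape a pre-check operator API could take, the sketch below defines an abstract operator and a small runner that a master might call before training starts. Every name in it (`PreCheckResult`, `PreCheckOperator`, `run_pre_checks`, the `enabled` flag) is hypothetical and not taken from this PR; the actual definitions are in the code and design doc of this PR.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List


@dataclass
class PreCheckResult:
    # Hypothetical result type: a pass/fail flag plus a readable reason.
    passed: bool
    reason: str = ""


class PreCheckOperator(ABC):
    """Hypothetical base class: each operator encapsulates one check."""

    @abstractmethod
    def check(self) -> PreCheckResult:
        ...


class AlwaysPassOperator(PreCheckOperator):
    # Trivial operator used only to exercise the runner.
    def check(self) -> PreCheckResult:
        return PreCheckResult(passed=True)


class FailingOperator(PreCheckOperator):
    # Operator that simulates a failed check.
    def check(self) -> PreCheckResult:
        return PreCheckResult(passed=False, reason="node not ready")


def run_pre_checks(
    operators: List[PreCheckOperator], enabled: bool = True
) -> PreCheckResult:
    # When the feature is switched off (for example through a job
    # argument), skip every check and report success.
    if not enabled:
        return PreCheckResult(passed=True, reason="pre-check disabled")
    for op in operators:
        result = op.check()
        if not result.passed:
            # Stop at the first failing operator and surface its reason.
            return result
    return PreCheckResult(passed=True)
```

Running the operators in order and stopping at the first failure keeps the reported reason unambiguous; whether the real procedure retries or aggregates failures is not stated in this description.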

Why are the changes needed?

For details, please see the design document in the current PR.

Does this PR introduce any user-facing change?

Users can enable or disable the pre-check function through job args. For details, please see the development document in the current PR.

How was this patch tested?

Unit tests and a simple training job.

BalaBalaYi added this to the v0.5.0 milestone Jan 26, 2025
BalaBalaYi self-assigned this Jan 26, 2025

codecov bot commented Jan 26, 2025

Codecov Report

Attention: Patch coverage is 92.07921% with 16 lines in your changes missing coverage. Please review.

Project coverage is 81.59%. Comparing base (03c965f) to head (3ace0da).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...rover/python/master/diagnosis/diagnosis_manager.py | 89.18% | 4 Missing ⚠️ |
| ...rover/python/master/diagnosis/precheck_operator.py | 91.17% | 3 Missing ⚠️ |
| dlrover/python/elastic_agent/master_client.py | 33.33% | 2 Missing ⚠️ |
| dlrover/python/master/servicer.py | 60.00% | 2 Missing ⚠️ |
| dlrover/python/master/args.py | 88.88% | 1 Missing ⚠️ |
| dlrover/python/master/diagnosis/diagnosis.py | 50.00% | 1 Missing ⚠️ |
| dlrover/python/master/node/job_context.py | 88.88% | 1 Missing ⚠️ |
| dlrover/python/tests/test_pre_check_operator.py | 94.11% | 1 Missing ⚠️ |
| dlrover/trainer/torch/elastic_run.py | 88.88% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1453      +/-   ##
==========================================
+ Coverage   81.53%   81.59%   +0.06%     
==========================================
  Files         240      242       +2     
  Lines       23592    23788     +196     
==========================================
+ Hits        19235    19411     +176     
- Misses       4357     4377      +20     
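
The project coverage percentages in the diff can be recomputed from its Hits and Lines rows. The sketch below does this; the helper name `coverage_pct` is hypothetical, and truncating to two decimals (instead of rounding to nearest) is an assumption that reproduces the figures shown in the report.

```python
import math


def coverage_pct(hits: int, lines: int) -> float:
    # Assumption: the percentage is truncated to two decimals,
    # which matches the values printed in the diff above.
    return math.floor(hits / lines * 10000) / 100


base = coverage_pct(19235, 23592)  # counts from the "master" column
head = coverage_pct(19411, 23788)  # counts from the "#1453" column
delta = round(head - base, 2)

print(base, head, delta)
```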
