[Backport 2.x] Properly designate model state for actively training models when nodes crash or leave cluster #1348

Merged
ryanbogan merged 2 commits into 2.x from backport/backport-1317-to-2.x
Dec 13, 2023

Conversation

ryanbogan
Member

Description

Models currently get stuck in the TRAINING state when a node crashes or leaves the cluster. Because models in the TRAINING state are write-blocked, they cannot be deleted even though no training is actually running. This PR sets such models to their proper state (either ZOMBIE or FAILED) when a node crashes or leaves the cluster, so that the zombie models can be deleted.
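
The interaction between the write block and the state change can be sketched as follows. This is a minimal illustration, not the plugin's code: the class `StuckModelSketch`, the `Model` type, the `canDelete` and `relabelOrphaned` helpers, and the node and model ids are all hypothetical. Only the state names TRAINING, ZOMBIE and FAILED come from the description above, and the sketch deliberately leaves the choice between ZOMBIE and FAILED to the caller.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustrative sketch only; none of these types are the plugin's real classes.
class StuckModelSketch {

    // Hypothetical subset of model states; only these three names appear in the PR description.
    enum ModelState { TRAINING, ZOMBIE, FAILED }

    // Hypothetical model entry: an id, the node assumed to be training it, and its current state.
    static final class Model {
        final String id;
        final String trainingNodeId;
        ModelState state;

        Model(String id, String trainingNodeId, ModelState state) {
            this.id = id;
            this.trainingNodeId = trainingNodeId;
            this.state = state;
        }
    }

    // Write block: a model still marked TRAINING may not be deleted.
    static boolean canDelete(Model model) {
        return model.state != ModelState.TRAINING;
    }

    // Relabels every TRAINING model whose node is no longer live and returns the ids it changed.
    // The caller supplies the replacement state; this sketch does not decide between ZOMBIE and FAILED.
    static List<String> relabelOrphaned(List<Model> models, Set<String> liveNodeIds, ModelState replacement) {
        List<String> changed = new ArrayList<>();
        for (Model model : models) {
            if (model.state == ModelState.TRAINING && !liveNodeIds.contains(model.trainingNodeId)) {
                model.state = replacement;
                changed.add(model.id);
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        List<Model> models = new ArrayList<>();
        models.add(new Model("model-a", "node-1", ModelState.TRAINING));
        models.add(new Model("model-b", "node-2", ModelState.TRAINING));

        // Assume node-1 has left the cluster, so only node-2 is still live.
        List<String> changed = relabelOrphaned(models, Set.of("node-2"), ModelState.FAILED);

        System.out.println("relabeled: " + changed);
        System.out.println("model-a deletable: " + canDelete(models.get(0)));
        System.out.println("model-b deletable: " + canDelete(models.get(1)));
    }
}
```

Before the relabeling step, both models are blocked from deletion; afterwards only the model whose node is still live remains blocked, which is the behaviour the fix is after.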

Backport of #1317

Issues Resolved

#837

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed as per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@ryanbogan added the v2.12.0 and Bug Fixes labels Dec 12, 2023

codecov bot commented Dec 12, 2023

Codecov Report

Attention: 27 lines in your changes are missing coverage. Please review.

Comparison is base (06d52d5) 85.19% compared to head (43ad00e) 85.13%.

| Files | Patch % | Lines |
|---|---|---|
| .../knn/training/TrainingJobClusterStateListener.java | 81.15% | 10 Missing and 3 partials ⚠️ |
| ...org/opensearch/knn/training/TrainingJobRunner.java | 22.22% | 6 Missing and 1 partial ⚠️ |
| ...java/org/opensearch/knn/indices/ModelMetadata.java | 88.88% | 1 Missing and 3 partials ⚠️ |
| .../main/java/org/opensearch/knn/index/IndexUtil.java | 66.66% | 1 Missing and 1 partial ⚠️ |
| ...plugin/transport/TrainingModelTransportAction.java | 66.66% | 1 Missing ⚠️ |

Additional details and impacted files
@@             Coverage Diff              @@
##                2.x    #1348      +/-   ##
============================================
- Coverage     85.19%   85.13%   -0.06%     
- Complexity     1194     1219      +25     
============================================
  Files           155      156       +1     
  Lines          4903     5012     +109     
  Branches        459      475      +16     
============================================
+ Hits           4177     4267      +90     
- Misses          528      540      +12     
- Partials        198      205       +7     


@ryanbogan merged commit fda94bc into 2.x Dec 13, 2023
87 of 89 checks passed
@ryanbogan deleted the backport/backport-1317-to-2.x branch December 13, 2023 17:25
Labels
Bug Fixes, v2.12.0
3 participants