distribution training jobs with mindspore/tensorflow/pytorch cannot stop #1163

Closed
jxfruit opened this issue Nov 26, 2020 · 2 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

jxfruit commented Nov 26, 2020

Is this a BUG REPORT or FEATURE REQUEST?:

Uncomment only one, leave it on its own line:

/kind bug

What happened:
Distributed training jobs with MindSpore/TensorFlow/PyTorch never stop. Only the master reaches Completed status; the workers keep running regardless of the master's status. After a few minutes the master goes into CrashLoopBackOff.
The master's log looks fine, with no errors or warnings.

What you expected to happen:
I am confused about how to stop the jobs.

How to reproduce it (as minimally and precisely as possible):
Here is a zip file containing the YAML, training code, and training data:
demo.zip

Related issues: volcano-sh/devices#10 and #1136.

Anything else we need to know?:

Environment:

  • Volcano Version: 1.0.1
  • Kubernetes version (use kubectl version): 1.17.3
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release): ubuntu18.04
  • Kernel (e.g. uname -a): Linux dls1 4.15.0-123-generic #126-Ubuntu SMP Wed Oct 21 09:40:11 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools: CUDA 10.1, driver version 418.87.00
  • Others:
volcano-sh-bot added the kind/bug label Nov 26, 2020
@william-wang
Member

@jxfruit I reproduced the issue. The "restartPolicy: Never" field in the master task of your Volcano job is misconfigured: it is missing two spaces of indentation, so it does not sit at the level where it belongs.
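
For illustration, here is a minimal sketch of where "restartPolicy" would normally sit in a Volcano job. This is not the file from demo.zip; the job name, container name, and image are hypothetical placeholders, and only the indentation of "restartPolicy" is the point. If the field ends up outside the pod spec, the pod falls back to the Kubernetes default of "Always", which would be consistent with the master being restarted after it finishes.

```yaml
# Hypothetical fragment for illustration only, not the reporter's actual YAML.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: example-job              # hypothetical name
spec:
  tasks:
    - name: master
      replicas: 1
      template:
        spec:
          containers:
            - name: trainer            # hypothetical container name
              image: example/trainer   # hypothetical image
          # restartPolicy is a field of the pod spec, so it must line up
          # with "containers". With two fewer spaces it becomes a sibling
          # of "spec" under "template" and is no longer part of the pod spec.
          restartPolicy: Never
```

The same indentation applies to the pod template of every other task in the job.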


jxfruit commented Nov 28, 2020

That's right, thanks.

jxfruit closed this as completed Nov 28, 2020