distribution training jobs with mindspore/tensorflow/pytorch cannot stop #1163

jxfruit · 2020-11-26T13:31:41Z

Is this a BUG REPORT or FEATURE REQUEST?:

Uncomment only one, leave it on its own line:

/kind bug

What happened:
Distributed training jobs with MindSpore/TensorFlow/PyTorch never stop. Only the master reaches the Completed status; the workers keep running regardless of the master's status. After a few minutes, the master goes into the CrashLoopBackOff status.

The master's log looks fine, with no errors or warnings.

What you expected to happen:
I am not sure how to stop the jobs.

How to reproduce it (as minimally and precisely as possible):
Here is a zip file containing the YAML, the training code, and the training data:
demo.zip

Related issues: volcano-sh/devices#10 and #1136.

Anything else we need to know?:

Environment:

Volcano Version: 1.0.1
Kubernetes version (use kubectl version): 1.17.3
Cloud provider or hardware configuration:
OS (e.g. from /etc/os-release): ubuntu18.04
Kernel (e.g. uname -a): Linux dls1 4.15.0-123-generic #126-Ubuntu SMP Wed Oct 21 09:40:11 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Install tools: CUDA 10.1, driver version 418.87.00
Others:


william-wang · 2020-11-28T03:33:00Z

@jxfruit I reproduced the issue. The "restartPolicy: Never" field in the master task of your Volcano job is misconfigured: two spaces of indentation are missing before "restartPolicy: Never".
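
For anyone hitting the same symptom, here is a minimal sketch of where the field normally sits in a Volcano job. It is not the YAML from demo.zip: the job, container, and image names are hypothetical, and the failure mode noted in the last comment is an assumption based on the "two spaces missing" description above.

```yaml
# Illustrative sketch only; names marked "hypothetical" are not from demo.zip.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: example-training-job                 # hypothetical
spec:
  schedulerName: volcano
  minAvailable: 1
  tasks:
    - name: master
      replicas: 1
      template:
        spec:
          containers:
            - name: trainer                  # hypothetical
              image: example/trainer:latest  # hypothetical
          # restartPolicy is a pod-level field, so it sits at the same
          # indentation as "containers", inside the pod spec.
          restartPolicy: Never
        # Assumed failure mode: with two fewer spaces the key would land here,
        # next to "spec" instead of inside it, and would not apply to the pod.
```

If the field does not take effect, the pod presumably falls back to the Kubernetes default restart policy of Always, so the master container is restarted each time it exits. That would be consistent with the CrashLoopBackOff status reported above.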

jxfruit · 2020-11-28T07:34:40Z

That's right, thanks.

volcano-sh-bot added the kind/bug label Nov 26, 2020

jxfruit closed this as completed Nov 28, 2020
