Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug
What happened:
Distribution training with mindspore/tensorflow/pytorch cannot stop, only master can be a completed status, and the worker will be always running whatever the master status is. The master will be CrashLoopBackOff status in a few minutes.
The log of master is good, no error or warning.
What you expected to happen:
I expected the whole job (master and workers) to stop once the master finishes; I am confused about how to stop the jobs.
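For reference, a minimal sketch of how this expectation is usually expressed with Volcano's lifecycle policies, assuming the standard Volcano Job API (TaskCompleted event, CompleteJob action); the task names and images here are illustrative, not taken from the attached demo:

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: dist-training             # illustrative name
spec:
  minAvailable: 3
  schedulerName: volcano
  tasks:
    - replicas: 1
      name: master
      policies:
        - event: TaskCompleted    # when the master task completes...
          action: CompleteJob     # ...mark the whole job Completed, stopping the workers
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: master
              image: training:latest   # illustrative image
    - replicas: 2
      name: worker
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: worker
              image: training:latest   # illustrative image
```

With a policy like this on the master task, the workers do not need to exit on their own: the job controller terminates them once the master finishes.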
How to reproduce it (as minimally and precisely as possible):
Here is a zip file containing the YAML, the training code, and the training data: demo.zip
Related issues: volcano-sh/devices#10, #1136
Anything else we need to know?:

Environment:
- Kubernetes version (kubectl version): 1.17.3
- OS (uname -a): Linux dls1 4.15.0-123-generic #126-Ubuntu SMP Wed Oct 21 09:40:11 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

@jxfruit Reproduced the issue. The "restartPolicy: Never" field in the master task of the Volcano job is misconfigured: two spaces of indentation are missing before "restartPolicy: Never".
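For illustration, a minimal sketch of the corrected master task fragment, with "restartPolicy: Never" indented so that it sits under the pod template's spec (everything except the field placement is illustrative, not taken from the attached demo):

```yaml
tasks:
  - replicas: 1
    name: master
    template:
      spec:
        restartPolicy: Never      # must be nested under template.spec; the broken YAML
                                  # indented it two spaces short, attaching it to the wrong object
        containers:
          - name: master
            image: training:latest   # illustrative image
```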