MPIJob doesn't support exitcode restartPolicy #1768
Comments
Currently it looks like if the MPIJob restart policy is set to exitcode, the launcher pod restart policy will be set to Never. When the launcher fails, the operator will only set the MPIJob status to Restarting and will not delete the launcher pod. Therefore, the launcher will remain in an Error state and the MPIJob will remain in a Restarting state.
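To make the report concrete, here is a minimal Python sketch of the decision described above. It is not the operator's code: `is_retryable_exit_code`, `Decision`, `on_launcher_failure`, and the `recreate_launcher` flag are hypothetical names for illustration. The `>= 128` threshold for retryable exit codes and the `Failed` branch for permanent errors are assumptions based on how ExitCode is generally described for the other frameworks; the issue itself does not state them.

```python
from dataclasses import dataclass


def is_retryable_exit_code(exit_code: int) -> bool:
    """Assumption: exit codes >= 128 (signal-style exits) are retryable."""
    return exit_code >= 128


@dataclass(frozen=True)
class Decision:
    job_condition: str         # condition the operator would set on the MPIJob
    delete_launcher_pod: bool  # whether the failed launcher pod is removed for re-creation


def on_launcher_failure(exit_code: int, recreate_launcher: bool) -> Decision:
    """Hypothetical model of handling a failed launcher under ExitCode.

    recreate_launcher=False models the behavior reported in this issue:
    the job is marked Restarting, but the pod (restartPolicy Never) stays.
    """
    if not is_retryable_exit_code(exit_code):
        # Assumption: a permanent error fails the job without a retry.
        return Decision("Failed", False)
    return Decision("Restarting", recreate_launcher)


reported = on_launcher_failure(137, recreate_launcher=False)
expected = on_launcher_failure(137, recreate_launcher=True)
print(reported)  # job says Restarting, launcher pod is left in Error
print(expected)  # job says Restarting, launcher pod is deleted so it can be re-created
```

In the reported case the job condition and the pod state disagree: nothing restarts the launcher because its own pod restart policy is Never and the operator does not delete it.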
@johnugeorge @tenzen-y Do we need to fix this issue, or wait until we merge the v2 operator into the training-operator?
IIRC, MPIJob doesn't support it. ref: #1479
I would like to do it in both v1 and v2.
We should wait for responses from other owners.
/kind feature
cc: @zw0610
@johnugeorge @zw0610 What do you think about supporting ExitCode restartPolicy in MPIJob v1, similar to other frameworks?
@tenzen-y We should add exitcode restartPolicy in v1 as well to be consistent.
/assign
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Hi @Syulin7, did you get a chance to work on this?
@andreyvelich Sorry for the late reply (I forgot to check the message). I would like to work on this.
I'm not sure whether this is the right place to discuss it, but I'm confused about the future of the two versions of the mpi-operator. I'm aware that mpi-operator v2 (kubeflow/mpi-operator) will not be ported into the training-operator, so will we continue to develop both versions, or do we have a new plan?
My stance has always been that users should transition to v2beta1 (eventually it should become v2) and we should deprecate (and eventually remove) the v1 implementation. But yeah, v2's architecture is somewhat different from the rest of the training operators, so it would be quite some effort to put it in the same repo. And there are some folks that prefer having it separately so that they can have a light installation of the mpi-operator only.
@kuizhiqing @alculquicondor I think we should discuss the future of MPI Operator in one of our AutoML and Training WG Community meetings.
I unfortunately don't have the bandwidth to drive such an effort. But I'm happy to collaborate on a plan if someone else takes the lead.
Ideally, we should migrate the v2 implementation to the training operator, then remove the v1 implementation from the training-operator to reduce maintenance costs. However, we cannot do that immediately because there are many issues in the training operator (e.g., inconsistent job conditions, not using a headless svc, and so on). So I think it would be better to mark the v1 implementation as deprecated, stop adding new features to it, and provide only bug fixes. We would then suggest that users who want the new features use the mpi-operator.
Let's move the discussion about deprecation to a new issue #1906
/remove-area 1.7.0
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.
not yet.
launcher log:
yaml:
Does MPIJob support exitcode restart policy?