MPIJob doesn't support exitcode restartPolicy #1768

Closed
shadowdsp opened this issue Mar 2, 2023 · 25 comments

@shadowdsp

  1. Add restart policy ExitCode for the launcher.
  2. Delete one of the running workers; the launcher fails with exit code 137.
  3. The worker is then re-created, but the launcher never restarts.

launcher log:

[tensorflow-mnist-launcher:00001] Warning: could not find environment variable "LD_LIBRARY_PATH"
+ POD_NAME=tensorflow-mnist-worker-1
+ shift
+ /opt/kube/kubectl exec tensorflow-mnist-worker-1 -- /bin/sh -c     PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ;   /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "3314941952" -mca ess_base_vpid 2 -mca ess_base_num_procs "3" -mca orte_node_regex "tensorflow-mnist-launcher,tensorflow-mnist-worker-[1:0-1]@0(3)" -mca orte_hnp_uri "3314941952.0;tcp://10.244.48.135:48777" -mca pml "ob1" -mca btl "^openib" -mca plm "rsh" --tree-spawn -mca orte_parent_uri "3314941952.0;tcp://10.244.48.135:48777" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca hwloc_base_binding_policy "none" -mca rmaps_base_mapping_policy "slot" -mca pmix "^s1,s2,cray,isolated"
+ POD_NAME=tensorflow-mnist-worker-0
+ shift
+ /opt/kube/kubectl exec tensorflow-mnist-worker-0 -- /bin/sh -c     PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ;   /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "3314941952" -mca ess_base_vpid 1 -mca ess_base_num_procs "3" -mca orte_node_regex "tensorflow-mnist-launcher,tensorflow-mnist-worker-[1:0-1]@0(3)" -mca orte_hnp_uri "3314941952.0;tcp://10.244.48.135:48777" -mca pml "ob1" -mca btl "^openib" -mca plm "rsh" --tree-spawn -mca orte_parent_uri "3314941952.0;tcp://10.244.48.135:48777" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca hwloc_base_binding_policy "none" -mca rmaps_base_mapping_policy "slot" -mca pmix "^s1,s2,cray,isolated"
--------------------------------------------------------------------------
ORTE has lost communication with a remote daemon.

  HNP daemon   : [[50582,0],0] on node tensorflow-mnist-launcher
  Remote daemon: [[50582,0],1] on node tensorflow-mnist-worker-0

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
command terminated with exit code 137
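(Editorial aside: exit code 137 from `kubectl exec` means the remote process was killed by a signal, since codes above 128 conventionally encode 128 + the signal number; 137 - 128 = 9 = SIGKILL. A minimal sketch of decoding such a code, not part of the operator:)

```python
# Sketch (not operator code): decode a container exit code.
# Codes above 128 conventionally mean "killed by signal (code - 128)".
import signal

def describe_exit_code(code: int) -> str:
    """Return a short human-readable description of a container exit code."""
    if code == 0:
        return "success"
    if code > 128:
        sig = code - 128
        return f"killed by signal {sig} ({signal.Signals(sig).name})"
    return f"application error {code}"

print(describe_exit_code(137))  # killed by signal 9 (SIGKILL)
```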

yaml:

apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
  name: tensorflow-mnist
spec:
  slotsPerWorker: 1
  runPolicy:
    cleanPodPolicy: Running
    backoffLimit: 3
  mpiReplicaSpecs:
    Launcher:
      restartPolicy: ExitCode
      replicas: 1
      template:
        spec:
          containers:
          - image: horovod/horovod:0.20.0-tf2.3.0-torch1.6.0-mxnet1.5.0-py3.7-cpu
            name: mpi
            command:
            - mpirun
            args:
            - -np
            - "2"
            - --allow-run-as-root
            - -bind-to
            - none
            - -map-by
            - slot
            - -x
            - LD_LIBRARY_PATH
            - -x
            - PATH
            - -mca
            - pml
            - ob1
            - -mca
            - btl
            - ^openib
            - python
            - /examples/tensorflow2_mnist.py
            resources:
              limits:
                cpu: 1
                memory: 2Gi
    Worker:
      restartPolicy: ExitCode
      replicas: 2
      template:
        spec:
          containers:
          - image: horovod/horovod:0.20.0-tf2.3.0-torch1.6.0-mxnet1.5.0-py3.7-cpu
            name: mpi
            resources:
              limits:
                cpu: 2
                memory: 4Gi

Does MPIJob support exitcode restart policy?

@Syulin7
Contributor

Syulin7 commented Mar 3, 2023

Currently, if the MPIJob restartPolicy is set to ExitCode, the launcher pod's restartPolicy is set to Never. When the launcher fails, the operator only sets the MPIJob status to Restarting and does not delete the launcher pod. As a result, the launcher remains in an Error state and the MPIJob stays in Restarting.

$ kubectl get mpijob
NAME                   AGE   STATE
mpi-tensorflow-mnist   98s   Restarting
$ kubectl get pod
NAME                            READY   STATUS       RESTARTS   AGE
mpi-tensorflow-mnist-launcher   0/1     Error        0          109s
mpi-tensorflow-mnist-worker-0   1/1     Running      0          61s
mpi-tensorflow-mnist-worker-1   1/1     Running      0          109s
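(For context, a hedged sketch of what `restartPolicy: ExitCode` is intended to mean. Per the training-operator documentation, exit codes 1-127 count as permanent errors and 128-255, typically signal deaths such as 137 = SIGKILL, as retryable errors:)

```python
# Illustrative sketch of the documented ExitCode restart-policy semantics;
# this is not the controller's actual code.
def should_restart_on_exit_code(exit_code: int) -> bool:
    """With restartPolicy=ExitCode, only "retryable" codes (128-255,
    i.e. killed-by-signal) should trigger a restart; 1-127 are treated
    as permanent application errors and 0 as success."""
    return 128 <= exit_code <= 255

print(should_restart_on_exit_code(137))  # True: launcher killed by SIGKILL
print(should_restart_on_exit_code(1))    # False: permanent application error
```

Under these semantics the launcher's exit code 137 should have triggered a restart, which is exactly what the controller fails to do here.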

@johnugeorge @tenzen-y Do we need to fix this issue, or wait until we merge the v2 operator into the training-operator?

@tenzen-y
Member

tenzen-y commented Mar 3, 2023

IIRC, MPIJob doesn't support restartPolicy=ExitCode in either v1 or v2.
However, it might be better to implement the same logic in both controllers, since we haven't decided whether to add the v2 controller to the training operator.

ref: #1479

@alculquicondor @terrytangyuan WDYT?

@Syulin7
Contributor

Syulin7 commented Mar 3, 2023

I would like to do it in both v1 and v2.

@tenzen-y
Member

tenzen-y commented Mar 3, 2023

We should wait for responses from other owners.

@tenzen-y
Member

tenzen-y commented Mar 3, 2023

/kind feature

@tenzen-y
Member

tenzen-y commented Mar 3, 2023

cc: @zw0610

@tenzen-y
Member

tenzen-y commented Mar 21, 2023

@johnugeorge @zw0610 What do you think about supporting ExitCode restartPolicy in MPIJob v1, similar to other frameworks?

@johnugeorge
Member

@tenzen-y We should add exitcode restartPolicy in v1 as well to be consistent

@tenzen-y
Member

> @tenzen-y We should add exitcode restartPolicy in v1 as well to be consistent

Agree. @Syulin7 Do you have enough bandwidth to work on this?

@Syulin7
Contributor

Syulin7 commented May 18, 2023

/assign
I will do it this week.

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@andreyvelich
Member

Hi @Syulin7, did you get a chance to work on this?
Thank you for your contributions!

@Syulin7
Contributor

Syulin7 commented Sep 5, 2023

> Hi @Syulin7, did you get a chance to work on this? Thank you for your contributions!

@andreyvelich Sorry for the late reply (I forgot to check the message). I would like to work on this.

@kuizhiqing
Member

I'm not sure if this is the right place to discuss this, but I'm confused about the future of the two versions of the mpi-operator. I'm aware that mpi-operator v2 (kubeflow/mpi-operator) will not be ported into the training-operator, so will we continue to develop both versions, or do we have a new plan?

@tenzen-y @johnugeorge @andreyvelich @terrytangyuan

@alculquicondor

My stance has always been that users should transition to v2beta1 (eventually it should become v2) and we should deprecate (and eventually remove) the v1 implementation.

But yeah, v2's architecture is somewhat different from the rest of the training operators, so it would be quite some effort to put it in the same repo. And there are some folks that prefer having it separately so that they can have a light installation of the mpi-operator only.

@andreyvelich
Member

@kuizhiqing @alculquicondor I think we should discuss the future of the MPI Operator in one of our AutoML and Training WG community meetings.
Please propose a time slot that works best for you (11am UTC or 5pm UTC).

@alculquicondor

I unfortunately don't have the bandwidth to drive such effort. But I'm happy to collaborate on a plan if someone else takes the lead.

@tenzen-y
Member

tenzen-y commented Sep 7, 2023

Ideally, we should migrate the v2 implementation to the training operator and then remove the v1 implementation to reduce maintenance costs. However, we can't take that path immediately because there are many issues in the training operator (e.g. inconsistent job conditions, not using a headless svc, and so on). So I think it would be better to mark the v1 implementation as deprecated, stop adding new features to it, and only provide bug fixes. We would then point users who want new features to the mpi-operator.

@kuizhiqing @johnugeorge @alculquicondor WDYT?

@tenzen-y
Member

tenzen-y commented Sep 7, 2023

@terrytangyuan

@alculquicondor

Let's move the discussion about deprecation to a new issue #1906

@tenzen-y
Member

/remove-area 1.7.0

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.


github-actions bot commented Jan 2, 2024

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

github-actions bot closed this as completed Jan 2, 2024
@shadowdsp
Author

> IIRC, MPIJob doesn't support restartPolicy=ExitCode in either v1 or v2. However, it might be better to implement the same logic in both controllers, since we haven't decided whether to add the v2 controller to the training operator.
>
> ref: #1479
>
> @alculquicondor @terrytangyuan WDYT?

Hi @tenzen-y, does v2 support the retry policy now?

@tenzen-y
Member

> IIRC, MPIJob doesn't support restartPolicy=ExitCode in either v1 or v2. However, it might be better to implement the same logic in both controllers, since we haven't decided whether to add the v2 controller to the training operator.
>
> ref: #1479
>
> @alculquicondor @terrytangyuan WDYT?
>
> Hi @tenzen-y, does v2 support the retry policy now?

Not yet.
