MPIJob doesn't support exitcode restartPolicy #1768

Closed
shadowdsp opened this issue Mar 2, 2023 · 25 comments

@shadowdsp

  1. Add restart policy ExitCode for the launcher.
  2. Delete one of the running workers; the launcher fails with exit code 137.
  3. The worker is then re-created, but the launcher never restarts.

launcher log:

[tensorflow-mnist-launcher:00001] Warning: could not find environment variable "LD_LIBRARY_PATH"
+ POD_NAME=tensorflow-mnist-worker-1
+ shift
+ /opt/kube/kubectl exec tensorflow-mnist-worker-1 -- /bin/sh -c     PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ;   /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "3314941952" -mca ess_base_vpid 2 -mca ess_base_num_procs "3" -mca orte_node_regex "tensorflow-mnist-launcher,tensorflow-mnist-worker-[1:0-1]@0(3)" -mca orte_hnp_uri "3314941952.0;tcp://10.244.48.135:48777" -mca pml "ob1" -mca btl "^openib" -mca plm "rsh" --tree-spawn -mca orte_parent_uri "3314941952.0;tcp://10.244.48.135:48777" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca hwloc_base_binding_policy "none" -mca rmaps_base_mapping_policy "slot" -mca pmix "^s1,s2,cray,isolated"
+ POD_NAME=tensorflow-mnist-worker-0
+ shift
+ /opt/kube/kubectl exec tensorflow-mnist-worker-0 -- /bin/sh -c     PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ;   /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "3314941952" -mca ess_base_vpid 1 -mca ess_base_num_procs "3" -mca orte_node_regex "tensorflow-mnist-launcher,tensorflow-mnist-worker-[1:0-1]@0(3)" -mca orte_hnp_uri "3314941952.0;tcp://10.244.48.135:48777" -mca pml "ob1" -mca btl "^openib" -mca plm "rsh" --tree-spawn -mca orte_parent_uri "3314941952.0;tcp://10.244.48.135:48777" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca hwloc_base_binding_policy "none" -mca rmaps_base_mapping_policy "slot" -mca pmix "^s1,s2,cray,isolated"
--------------------------------------------------------------------------
ORTE has lost communication with a remote daemon.

  HNP daemon   : [[50582,0],0] on node tensorflow-mnist-launcher
  Remote daemon: [[50582,0],1] on node tensorflow-mnist-worker-0

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
command terminated with exit code 137
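(Editorial aside: exit code 137 from `kubectl exec` means the remote process was killed by a signal, since codes above 128 conventionally encode 128 + the signal number; 137 - 128 = 9 = SIGKILL. A minimal sketch of decoding such a code, not part of the operator:)

```python
# Sketch (not operator code): decode a container exit code.
# Codes above 128 conventionally mean "killed by signal (code - 128)".
import signal

def describe_exit_code(code: int) -> str:
    """Return a short human-readable description of a container exit code."""
    if code == 0:
        return "success"
    if code > 128:
        sig = code - 128
        return f"killed by signal {sig} ({signal.Signals(sig).name})"
    return f"application error {code}"

print(describe_exit_code(137))  # killed by signal 9 (SIGKILL)
```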

yaml:

apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
  name: tensorflow-mnist
spec:
  slotsPerWorker: 1
  runPolicy:
    cleanPodPolicy: Running
    backoffLimit: 3
  mpiReplicaSpecs:
    Launcher:
      restartPolicy: ExitCode
      replicas: 1
      template:
        spec:
          containers:
          - image: horovod/horovod:0.20.0-tf2.3.0-torch1.6.0-mxnet1.5.0-py3.7-cpu
            name: mpi
            command:
            - mpirun
            args:
            - -np
            - "2"
            - --allow-run-as-root
            - -bind-to
            - none
            - -map-by
            - slot
            - -x
            - LD_LIBRARY_PATH
            - -x
            - PATH
            - -mca
            - pml
            - ob1
            - -mca
            - btl
            - ^openib
            - python
            - /examples/tensorflow2_mnist.py
            resources:
              limits:
                cpu: 1
                memory: 2Gi
    Worker:
      restartPolicy: ExitCode
      replicas: 2
      template:
        spec:
          containers:
          - image: horovod/horovod:0.20.0-tf2.3.0-torch1.6.0-mxnet1.5.0-py3.7-cpu
            name: mpi
            resources:
              limits:
                cpu: 2
                memory: 4Gi

Does MPIJob support exitcode restart policy?

@Syulin7
Contributor

Syulin7 commented Mar 3, 2023

Currently, if the MPIJob restartPolicy is set to ExitCode, the launcher pod's restartPolicy is set to Never. When the launcher fails, the operator only sets the MPIJob status to Restarting and does not delete the launcher pod. As a result, the launcher remains in an Error state and the MPIJob stays in Restarting.

$ kubectl get mpijob
NAME                   AGE   STATE
mpi-tensorflow-mnist   98s   Restarting
$ kubectl get pod
NAME                            READY   STATUS       RESTARTS   AGE
mpi-tensorflow-mnist-launcher   0/1     Error        0          109s
mpi-tensorflow-mnist-worker-0   1/1     Running      0          61s
mpi-tensorflow-mnist-worker-1   1/1     Running      0          109s
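(For context, a hedged sketch of what `restartPolicy: ExitCode` is intended to mean. Per the training-operator documentation, exit codes 1-127 count as permanent errors and 128-255, typically signal deaths such as 137 = SIGKILL, as retryable errors:)

```python
# Illustrative sketch of the documented ExitCode restart-policy semantics;
# this is not the controller's actual code.
def should_restart_on_exit_code(exit_code: int) -> bool:
    """With restartPolicy=ExitCode, only "retryable" codes (128-255,
    i.e. killed-by-signal) should trigger a restart; 1-127 are treated
    as permanent application errors and 0 as success."""
    return 128 <= exit_code <= 255

print(should_restart_on_exit_code(137))  # True: launcher killed by SIGKILL
print(should_restart_on_exit_code(1))    # False: permanent application error
```

Under these semantics the launcher's exit code 137 should have triggered a restart, which is exactly what the controller fails to do here.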

@johnugeorge @tenzen-y Do we need to fix this issue, or wait until we merge the v2 operator into the training-operator?

@tenzen-y
Member

tenzen-y commented Mar 3, 2023

IIRC, MPIJob doesn't support restartPolicy=ExitCode in either v1 or v2.
However, it might be better to implement the same logic in both controllers, since we haven't decided whether to add the v2 controller to the training operator.

ref: #1479

@alculquicondor @terrytangyuan WDYT?

@Syulin7
Contributor

Syulin7 commented Mar 3, 2023

I would like to do it in both v1 and v2.

@tenzen-y
Member

tenzen-y commented Mar 3, 2023

We should wait for responses from other owners.

@tenzen-y
Member

tenzen-y commented Mar 3, 2023

/kind feature

@tenzen-y
Member

tenzen-y commented Mar 3, 2023

cc: @zw0610

@tenzen-y
Member

tenzen-y commented Mar 21, 2023

@johnugeorge @zw0610 What do you think about supporting ExitCode restartPolicy in MPIJob v1, similar to other frameworks?

@johnugeorge
Member

@tenzen-y We should add exitcode restartPolicy in v1 as well to be consistent

@tenzen-y
Member

> @tenzen-y We should add exitcode restartPolicy in v1 as well to be consistent

Agree. @Syulin7 Do you have enough bandwidth to work on this?

@Syulin7
Contributor

Syulin7 commented May 18, 2023

/assign
I will do it this week.

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@andreyvelich
Member

Hi @Syulin7, did you get a chance to work on this?
Thank you for your contributions!

@Syulin7
Contributor

Syulin7 commented Sep 5, 2023

> Hi @Syulin7, did you get a chance to work on this? Thank you for your contributions!

@andreyvelich Sorry for the late reply (I forgot to check the message). I would like to work on this.

@kuizhiqing
Member

I'm not sure if this is the right place to discuss this, but I'm confused about the future of the two versions of the mpi-operator. I'm aware that mpi-operator v2 (kubeflow/mpi-operator) will not be ported into the training-operator, so will we continue to develop both versions, or do we have a new plan?

@tenzen-y @johnugeorge @andreyvelich @terrytangyuan

@alculquicondor

My stance has always been that users should transition to v2beta1 (eventually it should become v2) and we should deprecate (and eventually remove) the v1 implementation.

But yeah, v2's architecture is somewhat different from the rest of the training operators, so it would be quite some effort to put it in the same repo. And there are some folks that prefer having it separately so that they can have a light installation of the mpi-operator only.

@andreyvelich
Member

@kuizhiqing @alculquicondor I think we should discuss the future of the MPI Operator in one of our AutoML and Training WG community meetings.
Please propose a time slot that works best for you (11am UTC or 5pm UTC).

@alculquicondor

I unfortunately don't have the bandwidth to drive such effort. But I'm happy to collaborate on a plan if someone else takes the lead.

@tenzen-y
Member

tenzen-y commented Sep 7, 2023

Ideally, we should migrate the v2 implementation to the training operator and then remove the v1 implementation to reduce maintenance costs. However, we can't take that path immediately because there are many issues in the training operator (e.g. inconsistent job conditions, not using a headless svc, and so on). So I think it would be better to mark the v1 implementation as deprecated, stop adding new features to it, and only provide bug fixes. We would then point users who want new features to the mpi-operator.

@kuizhiqing @johnugeorge @alculquicondor WDYT?

@tenzen-y
Member

tenzen-y commented Sep 7, 2023

@terrytangyuan

@alculquicondor

Let's move the discussion about deprecation to a new issue #1906

@tenzen-y
Member

/remove-area 1.7.0

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.


github-actions bot commented Jan 2, 2024

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

github-actions bot closed this as completed Jan 2, 2024
@shadowdsp
Author

> IIRC, MPIJob doesn't support restartPolicy=ExitCode in either v1 or v2. However, it might be better to implement the same logic in both controllers, since we haven't decided whether to add the v2 controller to the training operator.
>
> ref: #1479
>
> @alculquicondor @terrytangyuan WDYT?

Hi @tenzen-y, does v2 support the retry policy now?

@tenzen-y
Member

> IIRC, MPIJob doesn't support restartPolicy=ExitCode in either v1 or v2. However, it might be better to implement the same logic in both controllers, since we haven't decided whether to add the v2 controller to the training operator.
>
> ref: #1479
>
> @alculquicondor @terrytangyuan WDYT?
>
> Hi @tenzen-y, does v2 support the retry policy now?

Not yet.
