Merge Kubeflow/common to training operator #1714

johnugeorge · 2022-12-28T06:58:08Z

Kubeflow/common was initially created to have a common library for all frameworks in separate repos. Since we have already merged all operators into a single training operator, do we really need a separate common repository now?

Motivation:

Many changes to the training operator require changes in the common code as well, which adds a huge maintenance overhead. (Ref: Changing label selector type for HPA common#197, update k8s dependencies to 1.25 common#198, Remove deprecated labels common#200)
Maintaining a separate kubeflow/common repo and keeping it in sync with the training operator is an overhead of its own (also considering CI and testing).

Related: #1703

/cc @terrytangyuan @zw0610 @gaocegege @Jeffwan

gaocegege · 2022-12-28T07:52:34Z

/cc @alculquicondor

terrytangyuan · 2022-12-28T14:39:42Z

I agree that we can consider merging it now that all operators are in the training-operator repo. MPI operator v1 still uses it separately, so we should put some thought into that as well.

zw0610 · 2022-12-30T02:22:18Z

I agree the common repository should be merged and the scripts in the common repository should be further simplified to make the training-operator easier to understand for new developers.

johnugeorge · 2022-12-31T11:51:03Z

@alculquicondor Can you share your thoughts on this wrt mpi operator dependency?

terrytangyuan · 2023-01-01T01:54:25Z

From past conversations with @alculquicondor, it seems like there's a preference to keep the v2 MPI controller separate from training-operator because of some use cases for non-ML users. Is that still the case? Perhaps v2 can be migrated to another project under K8s SIGs to make it more generic for all users, and v1 can stay in training-operator to serve ML use cases only. I believe it should be fine if we keep the "Kubeflow Authors" copyright notices. Once Kubeflow is accepted by CNCF, everything belongs to CNCF anyway, so I don't have any legal concerns.

Any thoughts from others? Also cc @ahg-g

alculquicondor · 2023-01-03T14:05:34Z

FWIW, the only dependencies that mpi-operator has on common are constants. So one potential starting point is to move all the logic to training-operator, but keep the constants.

Yes, I have heard HPC users express the desire for a lightweight controller for MPI jobs.
However, I would not advocate for two competing implementations. There should be just one, and Kubeflow should probably just embed it.

kubernetes-sigs would be a good fit. SIG Apps already had interest in hosting the project. But we would have to wait until the ownership goes to CNCF.

andreyvelich · 2023-01-03T15:46:55Z

I like this idea.
I remember previously we wanted to keep common repo separate because some users have their own internal Training Operator CRDs based on this library.
@johnugeorge @gaocegege @terrytangyuan @alculquicondor Are we aware of any users that are actively using this repository ?
Maybe we should stay in touch with them.

terrytangyuan · 2023-01-04T13:56:26Z

My team at Ant Group was using it but it's no longer the case. I don't know any other teams that are relying on this repo.

@alculquicondor I think MPI operator repo can keep those constants itself so that we can remove the dependency on Kubeflow/common at once. WDYT?

alculquicondor · 2023-01-04T15:07:32Z

Would it be too bad to make mpi-operator depend on the constants from training-operator?

terrytangyuan · 2023-01-04T16:39:17Z

> Would it be too bad to make mpi-operator depend on the constants from training-operator?

I am not sure that's a good idea, as training-operator might be too heavy and there may be dependency conflicts. We only rely on a very minimal set of constants, so they should be easy to maintain.

johnugeorge · 2023-01-04T18:13:02Z

I agree with @terrytangyuan. I think it is not worth the overhead of managing dependencies just so mpi-operator can use a minimal set of constants (if that is the concern). I leave it to you, @alculquicondor.

Can folks give +1 to merging kubeflow/common into kubeflow/training-operator? We can take this up after creating the release branch for KF release 1.7 in late January.

terrytangyuan · 2023-01-04T18:13:52Z

+1

alculquicondor · 2023-01-04T18:21:39Z

I guess the common repo would just be archived, and we can still depend on it as long as we don't need new constants (likely to be true for a long time).

+1

andreyvelich · 2023-01-05T14:54:39Z

+1
cc @kubeflow/wg-training-leads

gaocegege · 2023-01-06T00:59:51Z

LGTM 👍
/lgtm

gaocegege · 2023-01-06T01:43:31Z

> @johnugeorge @gaocegege @terrytangyuan @alculquicondor Are we aware of any users that are actively using this repository ?

Some enterprise users may use it to write their own operators.

But I think they will pin to a specific version.

johnugeorge · 2023-01-07T16:45:28Z

Based on the responses, kubeflow/common will be merged into kubeflow/training-operator after the 1.7 release branch is created.

alculquicondor · 2023-02-03T16:40:02Z

Was a release branch opened?

tenzen-y · 2023-02-03T21:22:53Z

> Was a release branch opened?

@alculquicondor Yes, we opened the RC.0 branch: https://github.com/kubeflow/training-operator/tree/v1.6-branch.
However, we will freeze work in the kubeflow/common repository until the final release is cut.

kubeflow/common#209 (comment)

alculquicondor · 2023-02-23T16:12:32Z

Are we still on freeze?

johnugeorge · 2023-02-23T16:59:26Z

@alculquicondor The final release is not cut yet. I expect it to happen within a week.

tenzen-y · 2023-05-15T20:11:37Z

/assign @johnugeorge

johnugeorge · 2023-08-07T13:11:41Z

Closing this as kubeflow/common is merged into training-operator

Aunpuncode · 2023-12-11T06:27:09Z

Kubeflow/common was initially created to have a common library for all frameworks in separate repos. Since we have already merged all operators into a single training operator, do we really need a separate common repository now?

Motivation:

Many changes required for training operator requires changes in common code as well. This adds a huge overhead in maintenance. (Ref: Changing label selector type for HPA common#197, update k8s dependencies to 1.25 common#198, Remove deprecated labels common#200)

Maintenance of separate kubeflow/common repo and keeping things in sync with training operator is an overhead (Also considering CI and testing)

Related: #1703

/cc @terrytangyuan @zw0610 @gaocegege @Jeffwan

This was referenced Jan 10, 2023

Add job suspend semantics kubeflow/common#196

Open

fix https://github.com/kubeflow/training-operator/issues/1704 #1705

Merged

Support coscheduling plugin #1722

Closed

tenzen-y mentioned this issue Jan 13, 2023

Introduce batch/v1 Job with Indexed completion mode #1718

Open

This was referenced Jan 17, 2023

Fully consolidate tf-operator to training-operator #1727

Closed

Refactor mpijob-controller #1728

Open

Add MaxConcurrentReconciles to JobControllerConfiguration kubeflow/common#205

Open

tenzen-y mentioned this issue Jan 17, 2023

[feature] Add reconciler.v1 package along with controller.v1 kubeflow/common#140

Open

tenzen-y mentioned this issue Jan 27, 2023

Support suspend semantics for MPIJob kubeflow/mpi-operator#511

Merged

tenzen-y mentioned this issue Mar 30, 2023

Support queue-related logic with kube-queue #1519

Closed

google-oss-prow bot assigned johnugeorge May 15, 2023

tenzen-y mentioned this issue May 15, 2023

Support new scheduler-plugins API group #1769

Closed

johnugeorge added the area/1.7.0 label May 17, 2023

johnugeorge changed the title from "Discuss: Merge Kubeflow/common to training operator" to "Merge Kubeflow/common to training operator" May 22, 2023

This was referenced May 22, 2023

[Release] Training operator 1.7.0 release #1809

Closed

Merge kubeflow/common to training-operator #1813

Merged

johnugeorge closed this as completed Aug 7, 2023
