Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merge Kubeflow/common to training operator #1714

Closed
Tracked by #1809
johnugeorge opened this issue Dec 28, 2022 · 24 comments
Closed
Tracked by #1809

Merge Kubeflow/common to training operator #1714

johnugeorge opened this issue Dec 28, 2022 · 24 comments
Assignees

Comments

@johnugeorge
Copy link
Member

Kubeflow/common was initially created to have a common library for all frameworks in separate repos. Since we have already merged all operators into a single training operator, do we really need a separate common repository now?

Motivation:

  1. Many changes required for training operator requires changes in common code as well. This adds a huge overhead in maintenance. (Ref: Changing label selector type for HPA common#197, update k8s dependencies to 1.25 common#198, Remove deprecated labels common#200)
  2. Maintenance of separate kubeflow/common repo and keeping things in sync with training operator is an overhead (Also considering CI and testing)

Related: #1703

/cc @terrytangyuan @zw0610 @gaocegege @Jeffwan

@gaocegege
Copy link
Member

/cc @alculquicondor

@terrytangyuan
Copy link
Member

I agree that we can consider merging it now that all operators are in training-operator repo. MPI operator v1 still uses this separately and we should put into some thoughts as well.

@zw0610
Copy link
Member

zw0610 commented Dec 30, 2022

I agree the common repository should be merged and the scripts in the common repository should be further simplified to make the training-operator easier to understand for new developers.

@johnugeorge
Copy link
Member Author

@alculquicondor Can you share your thoughts on this wrt mpi operator dependency?

@terrytangyuan
Copy link
Member

terrytangyuan commented Jan 1, 2023

From past conversations with @alculquicondor, it seems like there's preference to keep v2 MPI controller separate from training-operator due to some use cases to non-ML users. Is that still the case? Perhaps v2 can be migrated to another project under K8s sigs to make it more generic to all users and v1 will stay in training-operator to be suitable for ML use cases only. I believe it should be fine if we keep the copyright notices "Kubeflow Authors". Once Kubeflow is accepted by CNCF, everything belongs to CNCF anyways so I don't have any legal concerns.

Any thoughts from others? Also cc @ahg-g

@alculquicondor
Copy link

FWIIW, the only dependencies that mpi-operator has with common is just constants. So one potential starting point is to move all the logic to training-operator, but keep the constants.

Yes, I have heard the desire of having a lightweight controller for MPI jobs from HPC users.
However, I would not advocate for 2 competing implementations. There should just be one and kubeflow should probably just embed it.

kubernetes-sigs would be a good fit. SIG Apps already had interest in hosting the project. But we would have to wait until the ownership goes to CNCF.

@andreyvelich
Copy link
Member

I like this idea.
I remember previously we wanted to keep common repo separate because some users have their own internal Training Operator CRDs based on this library.
@johnugeorge @gaocegege @terrytangyuan @alculquicondor Are we aware of any users that are actively using this repository ?
Maybe we should stay in touch with them.

@terrytangyuan
Copy link
Member

My team at Ant Group was using it but it's no longer the case. I don't know any other teams that are relying on this repo.

@alculquicondor I think MPI operator repo can keep those constants itself so that we can remove the dependency on Kubeflow/common at once. WDYT?

@alculquicondor
Copy link

Would it be too bad to make mpi-operator depend on the constants from training-operator?

@terrytangyuan
Copy link
Member

Would it be too bad to make mpi-operator depend on the constants from training-operator?

I am not sure if that's a good idea as training-operator might be too heavy and there may be conflicts in dependencies. We only reply on very minimal set of constants so they should be easy to maintain.

@johnugeorge
Copy link
Member Author

I agree with @terrytangyuan . I think, It is not worth managing overhead in handling dependencies for mpi-operator to use minimal set of constants.(If that is the concern). I leave it to you @alculquicondor

Can folks give +1 to merge kubeflow/common to kubeflow/training-operator? We can take this up after creating release branch for KF release 1.7 during late Jan

@terrytangyuan
Copy link
Member

+1

@alculquicondor
Copy link

I guess the common repo would just be archived and we can still depend on it while we don't need new constants (likely to be true for a long time).

+1

@andreyvelich
Copy link
Member

+1
cc @kubeflow/wg-training-leads

@gaocegege
Copy link
Member

LGTM 👍
/lgtm

@gaocegege
Copy link
Member

@johnugeorge @gaocegege @terrytangyuan @alculquicondor Are we aware of any users that are actively using this repository ?

Some enterprise users may use it to write their own operators.

But I think they will lock on a specified version.

@johnugeorge
Copy link
Member Author

Based on the responses, kubeflow/common will be merged into kubeflow/training-operator post 1.7 release branch creation.

@alculquicondor
Copy link

Was a release branch opened?

@tenzen-y
Copy link
Member

tenzen-y commented Feb 3, 2023

Was a release branch opened?

@alculquicondor Yes, we opened RC.0 branch; https://github.com/kubeflow/training-operator/tree/v1.6-branch.
However, we will freeze to work in the kubeflow/common repository until cutting a final release.

kubeflow/common#209 (comment)

@alculquicondor
Copy link

are we still on freeze?

@johnugeorge
Copy link
Member Author

@alculquicondor Final release is not cut yet. I am expecting it to happen in a week

@tenzen-y
Copy link
Member

/assign @johnugeorge

@johnugeorge johnugeorge changed the title Discuss: Merge Kubeflow/common to training operator Merge Kubeflow/common to training operator May 22, 2023
@johnugeorge
Copy link
Member Author

Closing this as kubeflow/common is merged into training-operator

@Aunpuncode
Copy link

Kubeflow/common was initially created to have a common library for all frameworks in separate repos. Since we have already merged all operators into a single training operator, do we really need a separate common repository now?

Motivation:

  1. Many changes required for training operator requires changes in common code as well. This adds a huge overhead in maintenance. (Ref: Changing label selector type for HPA common#197, update k8s dependencies to 1.25 common#198, Remove deprecated labels common#200)
  2. Maintenance of separate kubeflow/common repo and keeping things in sync with training operator is an overhead (Also considering CI and testing)

Related: #1703

/cc @terrytangyuan @zw0610 @gaocegege @Jeffwan

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants