Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KEP-2170: Deploy JobSet in kubeflow-system namespace #2388

Merged
merged 6 commits into from
Jan 27, 2025

Conversation

andreyvelich
Copy link
Member

@andreyvelich andreyvelich commented Jan 14, 2025

Part of: #2170

Depends on: kubernetes-sigs/jobset#751, we should update the JobSet version once this fix is released.

I updated the Kubeflow Training V2 manifests to deploy JobSet in the kubeflow-system namespace. Also, given that the Training Runtimes need to be installed after the manager is deployed and is running, I think we should split out overlays between these two:

  • manager
  • runtimes

/cc @kannon92 @kubeflow/wg-training-leads @kubeflow/wg-manifests-leads @kubeflow/release-team @Electronic-Waste @Doris-xm @astefanutti @seanlaii @deepanker13 @saileshd1402 @vsoch

@google-oss-prow google-oss-prow bot requested review from a team, Electronic-Waste and deepanker13 January 14, 2025 18:47
Copy link

@andreyvelich: GitHub didn't allow me to request PR reviews from the following users: seanlaii, saileshd1402, astefanutti, kubeflow/release-team, doris-xm, vsoch, kannon92, kubeflow/wg-manifests-leads.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

Depends on: kubernetes-sigs/jobset#752.
Part of: #2170

I updated the Kubeflow Training V2 manifests to deploy JobSet in the kubeflow-system namespace.

Also, given that the Training Runtimes need to be installed after the manager is deployed and is running, I think we should split out overlays between these two:

  • manager
  • runtimes

Cluster operators need to install manager first, and after than deploy runtimes.

/cc @kannon92 @kubeflow/wg-training-leads @kubeflow/wg-manifests-leads @kubeflow/release-team @Electronic-Waste @Doris-xm @astefanutti @seanlaii @deepanker13 @saileshd1402 @vsoch

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@andreyvelich
Copy link
Member Author

/hold for kubernetes-sigs/jobset#751

- ../../base/manager
- ../../base/rbac
- ../../base/webhook
- https://github.com/kubernetes-sigs/jobset/releases/download/v0.7.2/manifests.yaml
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would that make sense to group the JobSet related resources into a dedicated directory / overlay?
It could make things more modular and accommodate environments where the JobSet API / controller is already installed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, we can make it similar to KFP and move it into third-party/jobset/kustomization.yaml.
So the manifests will looks like this:

/v2/base/...
/v2/overlays/...
/v2/third-party/...

@kubeflow/wg-training-leads WDYT about it ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andreyvelich SGTM:)

Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
@coveralls
Copy link

coveralls commented Jan 26, 2025

Pull Request Test Coverage Report for Build 12977326722

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 100.0%

Totals Coverage Status
Change from base Build 12908411340: 0.0%
Covered Lines: 85
Relevant Lines: 85

💛 - Coveralls

@andreyvelich
Copy link
Member Author

I updated JobSet to v0.7.3.
We should be ready to merge this PR.
/cc @tenzen-y @astefanutti @Electronic-Waste @kannon92 @seanlaii

Copy link

@andreyvelich: GitHub didn't allow me to request PR reviews from the following users: kannon92, seanlaii.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

I updated JobSet to v0.7.3.
We should be ready to merge this PR.
/cc @tenzen-y @astefanutti @Electronic-Waste @kannon92 @seanlaii

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@@ -0,0 +1,3 @@
apiVersion: config.jobset.x-k8s.io/v1alpha1
kind: Configuration
namespace: kubeflow-system
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need this.

https://github.com/kubernetes-sigs/jobset/blob/release-7.1/api/config/v1alpha1/defaults.go#L44

If jobset is not configured with a namespace option, it will take the namespace from the serviceaccount.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, good point.
Even without my PR, it will take namespace from the SA.

Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
@@ -1,3 +1,2 @@
apiVersion: config.jobset.x-k8s.io/v1alpha1
kind: Configuration
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think just using the default configuration would be sufficient.

https://github.com/kubernetes-sigs/jobset/blob/main/config/components/manager/controller_manager_config.yaml

I guess a nil configuration should be fine but it looks a bit weird.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either way this is just a nit.

everything LGTM to me now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I just keep this empty config, so in the future we can configure it.
I think, we can enable leader election in the future, if we want to have it by default.

@andreyvelich
Copy link
Member Author

/hold cancel
/approve

Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@andreyvelich
Copy link
Member Author

/assign @tenzen-y @Electronic-Waste

Copy link
Member

@Electronic-Waste Electronic-Waste left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically LGTM. Thanks for this @andreyvelich! Just a question for you.

Also cc👀 @Doris-xm. Currently jobset team has cherry-picked this PR kubernetes-sigs/jobset#719 into their newest release branch, unblocking your work #2382.

Comment on lines +16 to +17
patchesStrategicMerge:
- patches/jobset_remove_namespace.yaml # Remove namespace from the JobSet release manifests.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we also patch those namespace related configurations outside of the .metadata field? Like: https://github.com/kubeflow/training-operator/pull/2382/files#r1914094979

It seems that we also need to patch those webhook configurations to ensure jobsetworking correctly:)

Copy link
Member Author

@andreyvelich andreyvelich Jan 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed that even without this config, the kustomize correctly patches the JobSet's manifests with:

  clientConfig:
    service:
      name: jobset-webhook-service
      namespace: kubeflow-system
      path: /validate--v1-pod

I think, it works without this configuration since the JobSet manifests includes this service which tells kustomize that namespace in clientConfig needs to be patched.

apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/component: webhook
    app.kubernetes.io/created-by: jobset
    app.kubernetes.io/instance: webhook-service
    app.kubernetes.io/managed-by: kustomize
    app.kubernetes.io/name: service
    app.kubernetes.io/part-of: jobset
  name: jobset-webhook-service
  namespace: jobset-system
spec:
  ports:
    - port: 443
      protocol: TCP
      targetPort: 9443
  selector:
    control-plane: controller-manager

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM!

@Electronic-Waste
Copy link
Member

/hold for the question

@Electronic-Waste
Copy link
Member

/lgtm
/hold cancel

@google-oss-prow google-oss-prow bot merged commit ee1b691 into kubeflow:master Jan 27, 2025
56 checks passed
@andreyvelich andreyvelich deleted the change-v2-manifests branch January 27, 2025 11:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants