Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: Volcano job support scale up and down #782

Merged
merged 4 commits into from
May 8, 2020

Conversation

hzxuzhonghu
Copy link
Collaborator

@volcano-sh-bot volcano-sh-bot requested a review from k82cn April 24, 2020 08:45
@volcano-sh-bot
Copy link
Contributor

@hzxuzhonghu: GitHub didn't allow me to request PR reviews from the following users: zrss.

Note that only volcano-sh members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @k82cn @zrss

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@volcano-sh-bot volcano-sh-bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Apr 24, 2020
However, only when the job is not started, the initialization is run.
So we need a way to know whether it is a scale up/down event that triggered this round of sync.

The way I propose is to add a new event `JobUpdatedEvent` to indicate that the job is updated(here only cares about the scale up/down).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When Pod was created/deleted, how /when to handle configmap? It's better to highlight the time sequence for the user.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add a new OnJobUpdate method to the plugin

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add a new OnJobUpdate method to the plugin

So?

we need to highlight when create pod, when configmap was updated and which pod will be deleted when scale down.


### Admission webhook

Should prevent invalid mutating Job Spec on the fly. In this proposal, we only allow replicas update. Any other spec changes will be prohibited.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's our expected behaviour if minMember & replicas does not match?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

invalid update, the api calling will fail

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's document it, if it's the case

## Motivation

Currently, Volcano does not support Job update. It is not allowed to update the `Job.Spec` on the fly.
However, users like ModelArts want to dynamically adjust Job's replicas according to the cluster idle resources
Copy link
Member

@k82cn k82cn Apr 26, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's not only for ModelArts; AFAIK, several ML framework already support elastic model, e.g. https://github.com/pytorch/elastic

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will read about that for more context

@k82cn
Copy link
Member

k82cn commented Apr 26, 2020

/cc @Jeffwan

@volcano-sh-bot volcano-sh-bot requested a review from Jeffwan April 26, 2020 01:37
@k82cn k82cn added area/controllers kind/documentation Categorizes issue or PR as related to documentation. labels Apr 26, 2020
@k82cn k82cn added this to the v1.0 milestone Apr 26, 2020
@volcano-sh-bot volcano-sh-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 29, 2020
@k82cn
Copy link
Member

k82cn commented Apr 30, 2020

LGTM overall :)

@zrss , please help to confirm whether that's behavour cover your cases :)


2. create pods when scale up

3. delete pods when scale down
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does pod of volcano job has corresponding headless service?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is similar to the statefulset pods, we can not delete the headless service when deleting the pods. But we will update the accessible hostfile.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious, Any reason headless service can not be deleted?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we just scale down, there are still some tasks maybe ps exist for tensoflow job. BTW, the headless service is deleted when the job completes or deleted.

And accordingly add a new action `UpdateJobAction` to run `UpdateJob` function. And the overall workflow is:
![workflow](images/Job-scale-up-down.PNG)

To scale up/down on the fly, Volcano should be responsible to notify the original pods the current status, including the hosts of all the pods.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we probably need more update categories here.

  • pod template change
  • job replica change
  • job manifest change like annotation

I assume only specific change will trigger UpdateJobAction

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These kinds of updates are prohibited.

@@ -0,0 +1,103 @@
# Volcano Job scale up and down
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume the scope of this proposal is to make sure controller can response to the job scale up and down. no scale up/down decision need to be made here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, correct.

@volcano-sh-bot volcano-sh-bot added the lgtm Indicates that a PR is ready to be merged. label May 6, 2020
@volcano-sh-bot volcano-sh-bot removed the lgtm Indicates that a PR is ready to be merged. label May 7, 2020
@zrss
Copy link
Contributor

zrss commented May 8, 2020

lgtm, sorry for a long delay ... @hzxuzhonghu @k82cn

@k82cn
Copy link
Member

k82cn commented May 8, 2020

/lgtm

@volcano-sh-bot volcano-sh-bot added the lgtm Indicates that a PR is ready to be merged. label May 8, 2020
@k82cn
Copy link
Member

k82cn commented May 8, 2020

/approve

@volcano-sh-bot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hzxuzhonghu, Jeffwan, k82cn

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 8, 2020
@volcano-sh-bot volcano-sh-bot merged commit 927e1d9 into volcano-sh:master May 8, 2020
@hzxuzhonghu hzxuzhonghu deleted the scale-up branch May 8, 2020 11:03
hzxuzhonghu added a commit that referenced this pull request May 8, 2020
…2-origin-release-0.4

Automated cherry pick of #782: Support scale up and down
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/controllers kind/documentation Categorizes issue or PR as related to documentation. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants