Support user defined resource quota. #1432

Closed
BalaBalaYi opened this issue Jan 10, 2025 · 1 comment · Fixed by #1438

@BalaBalaYi
Collaborator

Is your feature request related to a problem? Please describe.
We need to be able to specify the master's resource quota.

Describe the solution you'd like
We could allow the master pod spec to be set in the ElasticJob in the same way as a worker pod spec. The user could then configure the command and resources of the master pod.

@BalaBalaYi added the todo and feature labels Jan 10, 2025
@BalaBalaYi added this to the v0.5.0 milestone Jan 10, 2025
@workingloong
Collaborator

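Users can now specify the master's resources in the same way as a worker's, because replicaSpecs is a map keyed by replica type, and the master pod is built from the template stored under the master key.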

ReplicaSpecs map[commonv1.ReplicaType]*ReplicaSpec `json:"replicaSpecs"`

pod := common.NewPod(
	job, &job.Spec.ReplicaSpecs[ReplicaTypeJobMaster].Template, masterName,
)

For example:

---
apiVersion: elastic.iml.github.io/v1alpha1
kind: ElasticJob
metadata:
  name: fine-tuning-llama2
  namespace: dlrover
spec:
  distributionStrategy: AllreduceStrategy
  optimizeMode: single-job
  replicaSpecs:
    worker:
      replicas: 2
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: main
              # yamllint disable-line rule:line-length
              image: registry.cn-hangzhou.aliyuncs.com/intell-ai/dlrover:llama-finetuning
              imagePullPolicy: Always
              command:
                - /bin/bash
                - -c
                - "dlrover-run --nnodes=$NODE_NUM \
                  --nproc_per_node=1 --max_restarts=1  \
                  ./examples/pytorch/llama2/fine_tuning.py  \
                  ./examples/pytorch/llama2/btc_tweets_sentiment.json"
              resources:
                limits:
                  cpu: "8"
                  memory: 16Gi
                  nvidia.com/gpu: 1  # optional
                requests:
                  cpu: "4"
                  memory: 16Gi
                  nvidia.com/gpu: 1  # optional
    dlrover-master:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: main
              # yamllint disable-line rule:line-length
              image: easydl/elasticjob-controller:master
              imagePullPolicy: Always
              command:
                - /bin/bash
                - -c
                - "python -m dlrover.python.master.main --xxx"
              resources:
                limits:
                  cpu: "1"
                  memory: 2Gi
                  nvidia.com/gpu: 1  # optional
                requests:
                  cpu: "1"
                  memory: 2Gi
                  nvidia.com/gpu: 1  # optional
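
The lookup described above can be sketched in isolation. This is a minimal illustration, not the controller's actual code: the types below are simplified stand-ins for the real API types, the `masterTemplate` helper and the fallback template are hypothetical, and the `"dlrover-master"` and `"worker"` key values are assumed from the YAML example. It shows why a nil check is worth having: indexing a Go map with a missing key yields a nil pointer, so dereferencing it to reach `.Template` would panic when the user omits the master entry.

```go
package main

import "fmt"

// ReplicaType and ReplicaSpec are simplified stand-ins for the real API types.
type ReplicaType string

const (
	// Key values assumed from the YAML example in this thread.
	ReplicaTypeJobMaster ReplicaType = "dlrover-master"
	ReplicaTypeWorker    ReplicaType = "worker"
)

// PodTemplate is a hypothetical, flattened view of a pod template.
type PodTemplate struct {
	Command     []string
	CPULimit    string
	MemoryLimit string
}

type ReplicaSpec struct {
	Replicas int
	Template PodTemplate
}

// masterTemplate (hypothetical helper) returns the user-defined master
// template when the map has a non-nil master entry, and the fallback otherwise.
// The boolean reports whether the user-defined template was used.
func masterTemplate(specs map[ReplicaType]*ReplicaSpec, fallback PodTemplate) (PodTemplate, bool) {
	spec, ok := specs[ReplicaTypeJobMaster]
	if !ok || spec == nil {
		return fallback, false
	}
	return spec.Template, true
}

func main() {
	// Illustrative default used only when no master entry is given.
	fallback := PodTemplate{CPULimit: "2", MemoryLimit: "4Gi"}

	specs := map[ReplicaType]*ReplicaSpec{
		ReplicaTypeWorker: {Replicas: 2, Template: PodTemplate{CPULimit: "8", MemoryLimit: "16Gi"}},
		ReplicaTypeJobMaster: {Template: PodTemplate{
			Command:     []string{"/bin/bash", "-c", "python -m dlrover.python.master.main"},
			CPULimit:    "1",
			MemoryLimit: "2Gi",
		}},
	}

	tmpl, userDefined := masterTemplate(specs, fallback)
	fmt.Println(userDefined, tmpl.CPULimit, tmpl.MemoryLimit)

	// Without a master entry the fallback is returned instead of panicking.
	delete(specs, ReplicaTypeJobMaster)
	tmpl, userDefined = masterTemplate(specs, fallback)
	fmt.Println(userDefined, tmpl.CPULimit, tmpl.MemoryLimit)
}
```

Whether the real controller falls back to a default or rejects a job without a master entry is not stated in this thread; the fallback here is only one possible design.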
