Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature] Make Replicas optional #1120

Closed
1 of 2 tasks
nvtkaszpir opened this issue May 27, 2023 · 5 comments · Fixed by #1443
Closed
1 of 2 tasks

[Feature] Make Replicas optional #1120

nvtkaszpir opened this issue May 27, 2023 · 5 comments · Fixed by #1443
Labels
1.0 enhancement New feature or request

Comments

@nvtkaszpir
Copy link

Search before asking

  • I had searched in the issues and found no similar feature requirement.

Description

When you use gitops (ArgoCD /Flux) deployments for raycluster with autoscaler enabled, then there is an option to have constant fight of the ArgoCD and autoscaler on managing custom resource replicas, resulting in pods created/terminated.

Use case

Example scenario to reproduce the issue:

  • have ArgoCD to manage resources in namespace from git repo, enforce resource change to be in sync with repo via watching changes mechanism (such as recreate removed resources, revert changes to the state that is in the git repo)
  • define raycluster with autoscaling enabled, replicas 0, min replicas 0, max replicas 2, ensure head node does not take any workload (cpus:0, gpus:0)
  • commit changes to the repo, cluster is created
  • schedule any ray job to be done, this triggers autoscaler to take action
  • autoscaler changes raycluster replicas to desired state, let say 2, kuberay/autoscaler (not sure which one) spawns new pods
  • ArgoCD sees change to raycluster replicas and notices that its value differs from what is defined in git repo, so it reverts back replicas to 0, this causes the pods to be terminated
  • autoscaler detects the pod number is incorrect, and changes replicas to desired state, creating new pods...
  • and it loops forever, new pods are quickly scheduled to be terminated and no workload is done.

Solving this can be done in two ways:

  • not using gitops tools, but then it that may be impossible due to company policy
  • not using autoscaler - setting strict number of replicas or having equal min/max will not trigger the issue
  • ensuring gitops tools ignore changes related to certain fields, in here replicas, for example https://argo-cd.readthedocs.io/en/stable/user-guide/diffing/#application-level-configuration
  • in CRD not requiting replicas to be set (right now it is), thus three way merge that gitops tools usually do allow to have certain fields to be changed and not triggering reverting its state.

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@nvtkaszpir nvtkaszpir added the enhancement New feature or request label May 27, 2023
@manifest
Copy link

manifest commented Jun 4, 2023

We use Flux CD and have got the same issue.

The solution to this problem seems rather simple, spec.replicas must not be a required field of RayCluster CRD.

This way Flux CD will not be reverting changes made by the autoscaler on every reconciliation. There is a similar issue with Flux CD and HPA.

@manifest
Copy link

manifest commented Jun 4, 2023

I can make a PR if the change is OK. This change will require to set the default value of spec.replicas to 0 (now it is 1).

@nvtkaszpir
Copy link
Author

for ArgoCD in app config it is required to set up ignore differences (https://argo-cd.readthedocs.io/en/stable/user-guide/diffing/) such as:

apiVersion: argoproj.io/v1alpha
kind: Application
spec:
...
  ignoreDifferences:
  - kind: RayCluster
    group: ray.io
    jqPathExpressions:
    - .spec.workerGroupSpecs[].replicas
    - .spec.workerGroupSpecs[].scaleStrategy

@kevin85421 kevin85421 self-assigned this Jun 9, 2023
@kevin85421
Copy link
Member

Hi @manifest, would you be willing to file a pull request? If you don't have the bandwidth, I can take it. We have already added this item to our internal roadmap. Thanks!

@kevin85421 kevin85421 removed their assignment Jun 9, 2023
@manifest
Copy link

manifest commented Jun 9, 2023

I've made a PR. I'm not sure how it should be tested, since tests are failing on the master branch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
1.0 enhancement New feature or request
Projects
None yet
3 participants