Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Gateway: fair share scheduling and resource limitation #570

Merged
merged 26 commits into from
Jun 5, 2023

Conversation

IceKhan13
Copy link
Member

@IceKhan13 IceKhan13 commented May 18, 2023

Summary

Gateway: fair share scheduling and resource limitation

Details and comments

Closes #254

all_diagrams-Gateway_ job scheduling drawio

@IceKhan13 IceKhan13 added enhancement New feature or request project: gateway Label to identify features related with gateway project labels May 18, 2023
@IceKhan13 IceKhan13 added this to the 0.2 milestone May 18, 2023
@psschwei
Copy link
Collaborator

What happened to the blue box? 😃

@IceKhan13
Copy link
Member Author

It felt like this blue color was screaming at me :)

@IceKhan13
Copy link
Member Author

@psschwei and @akihikokuroda

Only problem I have now is pvcs, like we discussed with Paul.

https://github.com/Qiskit-Extensions/quantum-serverless/blob/issue-254/limits/infrastructure/helm/quantumserverless/templates/pvcs.yaml

this is running fine within rancher desktop. What should we do with that?

@IceKhan13
Copy link
Member Author

Other than that implementation is more-or-less ready. Now each user has separate cluster which is killed if resources are not used. Spinning up and killing clusters is super fast (less that couple of seconds).

Now we can limit number of cluster that are running at the same time and limit amount of jobs per user.

While you are reviewing this I'm going to do a small stress tests on scheduler and maybe add db indexes to speed up fair-share scheduing

@IceKhan13 IceKhan13 marked this pull request as ready for review May 26, 2023 15:02
@psschwei
Copy link
Collaborator

Only problem I have now is pvcs, like we discussed with Paul.

For now, I'd just say that we're going to implement PVCs as part of #463 ... as a first step, we can just add an EmptyDir, which will let us test e2e on k8s. I don't think that's blocking for this PR, can just assume we'll implement later...

Full review to follow later today 😄

@psschwei
Copy link
Collaborator

psschwei commented May 26, 2023

@IceKhan13 somewhat random question: will it be possible to run the gateway without the scheduler or will it (the scheduler) be required going forward?

@IceKhan13
Copy link
Member Author

IceKhan13 commented May 31, 2023

Do we still need to create Ray cluster in the hem chart?

Nope, I will disable it

Copy link
Collaborator

@akihikokuroda akihikokuroda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Great work!

@akihikokuroda
Copy link
Collaborator

I'm thinking if we can refactor 3 functions in ray.py using [from openshift.dynamic import DynamicClient](https://github.com/openshift/openshift-restclient-python/#create-a-route) instead of the kuberay apiserver after this PR is merged. We can have more flexibility in defining the ray nodes including PVC, InitContainer and sidecar container. WDYT @IceKhan13 @psschwei

@psschwei
Copy link
Collaborator

@akihikokuroda so essentially using the dynamic client to let us use yaml to create the object(s), right?

@akihikokuroda
Copy link
Collaborator

@psschwei Yes, we can use it to create raycluster CRD instance for the Kuberay operator.

@psschwei
Copy link
Collaborator

Cool, that's what I thought (but the link to a section called creating a route confused me a little bit, so just wanted to make sure)... I think in general I'm good with that, especially if it gives us access to things we need and makes it easier to create/configure clusters...

@IceKhan13
Copy link
Member Author

I like this approach even more, Aki!

Let's do it in a follow up PR and we will be able to add volumes and init containers! And we can actually remove api server then from deployment

Copy link
Collaborator

@psschwei psschwei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Food for thought: given that Kuberay doesn't currently support PVCs, should we try something like minio in the docker compose (rather than a volume)? Though perhaps we can get around that with one of the various python clients, so perhaps a moot point?

Copy link
Collaborator

@psschwei psschwei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

edit: PVC permissions are fine, but the gateway-scheduler is failing:

Traceback (most recent call last):
gateway_scheduler  |   File "/usr/local/lib/python3.9/site-packages/django/db/backends/utils.py", line 89, in _execute
gateway_scheduler  |     return self.cursor.execute(sql, params)
gateway_scheduler  | psycopg2.errors.UndefinedColumn: column api_job.env_vars does not exist
gateway_scheduler  | LINE 1: ...", "api_job"."program_id", "api_job"."arguments", "api_job"....

@IceKhan13
Copy link
Member Author

I will check asap, but problem is docker image did not applied migrations. Therefore db does not have this column

@akihikokuroda
Copy link
Collaborator

I have issue installing via helm.

Error: INSTALLATION FAILED: YAML parse error on quantum-serverless/charts/gateway/templates/deployment.yaml: error converting YAML to JSON: yaml: line 17: did not find expected key

@akihikokuroda akihikokuroda self-requested a review June 1, 2023 15:59
@IceKhan13
Copy link
Member Author

I have issue installing via helm.

Fixed. Need to workout selector labels later.

edit: PVC permissions are fine, but the gateway-scheduler is failing:

Checked twice. Cannot reproduce it. Did you build new gateway image?
Can you confirm that gateway logs has this migrations lines in logs?

Screenshot 2023-06-01 at 2 03 06 PM

@IceKhan13 IceKhan13 requested a review from psschwei June 1, 2023 18:03
Copy link
Collaborator

@psschwei psschwei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turns out my pull of your branch was missing the most recent commit(s)... it's working now that I've properly fetched all of them

dockerfile: infrastructure/docker/Dockerfile-gateway
entrypoint: "./scripts/scheduler.sh"
environment:
- DEBUG=0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we set DEBUG=1 in the regular gateway, should we do the same here (or set the gateway to DEBUG=0)?

image: icr.io/quantum-public/quantum-serverless-gateway:${VERSION:-0.1.0}
entrypoint: "./scripts/scheduler.sh"
environment:
- DEBUG=0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same nit re: debug value... maybe we turn debug off for this one and on for the dev one?

Copy link
Collaborator

@akihikokuroda akihikokuroda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Woking fine now, thanks!

return HttpResponse("Hi! I'm ready.")
else:
return HttpResponse(status=503)
return HttpResponse("Hi! I'm ready.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not urgent so I can open an issue for this, but definitely something that we will need to have to pass the security checks. In this end-point the team will ask us to check the status of the database too. So this check is not enough.

Copy link
Member

@Tansito Tansito left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For a first iteration I agree with the implementation. I left a couple of questions but they don't affect to the final result. Amazing work @IceKhan13 🎉 (thanks to @psschwei and @akihikokuroda too 🙏 ).

For production I would like to do a couple of suggestions to the code. Let me open a discussion and we can comment there how to advance with this 😄

gateway/entrypoint.sh Outdated Show resolved Hide resolved
gateway/api/signals.py Outdated Show resolved Hide resolved
gateway/api/ray.py Outdated Show resolved Hide resolved
gateway/api/ray.py Outdated Show resolved Hide resolved
@akihikokuroda
Copy link
Collaborator

akihikokuroda commented Jun 2, 2023

@IceKhan13 I created the refactor PR to this branch. If you prefer, I can create a new PR to main branch after this got merged. Thanks!

It seems that I still need to work with the tests :-(

@IceKhan13
Copy link
Member Author

@akihikokuroda let's merge your PR in a follow up, otherwise this already getting to big (already got 😅 )

Thank you all for great review!

@IceKhan13 IceKhan13 merged commit 081e70f into main Jun 5, 2023
@IceKhan13 IceKhan13 deleted the issue-254/limits branch June 5, 2023 13:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request project: gateway Label to identify features related with gateway project
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Users: A way to limit number of jobs per user
4 participants