-
Notifications
You must be signed in to change notification settings - Fork 34
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Gateway: fair share scheduling and resource limitation #570
Conversation
…rogram to db to schedule in later in separate process
What happened to the blue box? 😃 |
It felt like this blue color was screaming at me :) |
Only problem I have now is pvcs, like we discussed with Paul. this is running fine within rancher desktop. What should we do with that? |
Other than that implementation is more-or-less ready. Now each user has separate cluster which is killed if resources are not used. Spinning up and killing clusters is super fast (less that couple of seconds). Now we can limit number of cluster that are running at the same time and limit amount of jobs per user. While you are reviewing this I'm going to do a small stress tests on scheduler and maybe add db indexes to speed up fair-share scheduing |
For now, I'd just say that we're going to implement PVCs as part of #463 ... as a first step, we can just add an EmptyDir, which will let us test e2e on k8s. I don't think that's blocking for this PR, can just assume we'll implement later... Full review to follow later today 😄 |
infrastructure/helm/quantumserverless/charts/gateway/templates/deployment.yaml
Show resolved
Hide resolved
infrastructure/helm/quantumserverless/charts/gateway/templates/job.yaml
Outdated
Show resolved
Hide resolved
@IceKhan13 somewhat random question: will it be possible to run the gateway without the scheduler or will it (the scheduler) be required going forward? |
Nope, I will disable it |
infrastructure/helm/quantumserverless/charts/gateway/templates/deployment.yaml
Outdated
Show resolved
Hide resolved
infrastructure/helm/quantumserverless/charts/gateway/templates/deployment.yaml
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Great work!
I'm thinking if we can refactor 3 functions in ray.py using |
@akihikokuroda so essentially using the dynamic client to let us use yaml to create the object(s), right? |
@psschwei Yes, we can use it to create raycluster CRD instance for the Kuberay operator. |
Cool, that's what I thought (but the link to a section called creating a route confused me a little bit, so just wanted to make sure)... I think in general I'm good with that, especially if it gives us access to things we need and makes it easier to create/configure clusters... |
I like this approach even more, Aki! Let's do it in a follow up PR and we will be able to add volumes and init containers! And we can actually remove api server then from deployment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Food for thought: given that Kuberay doesn't currently support PVCs, should we try something like minio in the docker compose (rather than a volume)? Though perhaps we can get around that with one of the various python clients, so perhaps a moot point?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
edit: PVC permissions are fine, but the gateway-scheduler is failing:
Traceback (most recent call last):
gateway_scheduler | File "/usr/local/lib/python3.9/site-packages/django/db/backends/utils.py", line 89, in _execute
gateway_scheduler | return self.cursor.execute(sql, params)
gateway_scheduler | psycopg2.errors.UndefinedColumn: column api_job.env_vars does not exist
gateway_scheduler | LINE 1: ...", "api_job"."program_id", "api_job"."arguments", "api_job"....
I will check asap, but problem is docker image did not applied migrations. Therefore db does not have this column |
I have issue installing via helm.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Turns out my pull of your branch was missing the most recent commit(s)... it's working now that I've properly fetched all of them
dockerfile: infrastructure/docker/Dockerfile-gateway | ||
entrypoint: "./scripts/scheduler.sh" | ||
environment: | ||
- DEBUG=0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: we set DEBUG=1 in the regular gateway, should we do the same here (or set the gateway to DEBUG=0)?
image: icr.io/quantum-public/quantum-serverless-gateway:${VERSION:-0.1.0} | ||
entrypoint: "./scripts/scheduler.sh" | ||
environment: | ||
- DEBUG=0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same nit re: debug value... maybe we turn debug off for this one and on for the dev one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Woking fine now, thanks!
return HttpResponse("Hi! I'm ready.") | ||
else: | ||
return HttpResponse(status=503) | ||
return HttpResponse("Hi! I'm ready.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not urgent so I can open an issue for this, but definitely something that we will need to have to pass the security checks. In this end-point the team will ask us to check the status of the database too. So this check is not enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For a first iteration I agree with the implementation. I left a couple of questions but they don't affect to the final result. Amazing work @IceKhan13 🎉 (thanks to @psschwei and @akihikokuroda too 🙏 ).
For production I would like to do a couple of suggestions to the code. Let me open a discussion and we can comment there how to advance with this 😄
@IceKhan13 I created the refactor PR to this branch. If you prefer, I can create a new PR to main branch after this got merged. Thanks! It seems that I still need to work with the tests :-( |
@akihikokuroda let's merge your PR in a follow up, otherwise this already getting to big (already got 😅 ) Thank you all for great review! |
Summary
Gateway: fair share scheduling and resource limitation
Details and comments
Closes #254