Add GPU nodepools to CarbonPlan's Azure cluster #931
Terraform LGTM. Will need an entry in the profile list targeting this node pool, and might also need some extra work to make sure the drivers are available inside the image - maybe the pangeo ML image already has those? I use the |
Yeah, I was going to do this in a separate PR as I think it's too complicated to do terraform things and JupyterHub things in one go. Updating the profile list is being tracked in #930
This hub is using their own image: |
Makes sense!
Ah cool! However, I looked up what would need to be done, and it isn't just a user-level change - see https://docs.microsoft.com/en-us/azure/aks/gpu-cluster. Either we need to get Azure to use a different base image for GPU, or set up an additional daemonset to install the driver on the node. This is because the NVIDIA driver isn't actually open source. This is unfortunately true on almost all Kubernetes providers.
From hashicorp/terraform-provider-azurerm#6793 it looks like the custom base image option might not be available to us, and we'd need to deploy the daemonset. Maybe it can be part of the support chart and enabled with a flag? Doesn't need to be part of this PR tho! |
@yuvipanda This was the same conclusion I was coming to as well! I'm just going to push a few more small changes to this PR that will allow us to apply labels to specific nodepools, like we can with the GKE terraform config, and taints too. So I can deploy this with the recommended |
@yuvipanda Pushed those updates and added new tf plan output to top comment. LMK what you think! \o/ |
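For readers following along, per-nodepool labels and taints could be wired up roughly as below. This is a minimal sketch, not the PR's actual diff: the variable names `cluster_id` and `notebook_nodes`, the `nb` name prefix, and the resource label `notebook` are all hypothetical. Only the `azurerm_kubernetes_cluster_node_pool` resource and its `node_labels` / `node_taints` arguments come from the azurerm provider itself.

```hcl
# Hypothetical input: ID of the existing AKS cluster (defined elsewhere).
variable "cluster_id" {
  type = string
}

# Hypothetical input: one entry per extra nodepool, keyed by a short name.
variable "notebook_nodes" {
  type = map(object({
    vm_size = string
    count   = number
    labels  = map(string)  # Kubernetes node labels for this pool
    taints  = list(string) # taints in "key=value:Effect" form
  }))
  default = {}
}

resource "azurerm_kubernetes_cluster_node_pool" "notebook" {
  for_each = var.notebook_nodes

  # Linux nodepool names must be short, lowercase and alphanumeric.
  name                  = "nb${each.key}"
  kubernetes_cluster_id = var.cluster_id
  vm_size               = each.value.vm_size
  node_count            = each.value.count

  # Applied only to this pool, so GPU nodes can be selected and
  # tolerated explicitly without affecting the other pools.
  node_labels = each.value.labels
  node_taints = each.value.taints
}
```

A matching variable value might look like the following. The VM size and label are placeholders, and the taint `sku=gpu:NoSchedule` is the one used in the AKS GPU documentation's example, which is assumed here rather than taken from this PR:

```hcl
notebook_nodes = {
  "gpu" = {
    vm_size = "Standard_NC6s_v3"             # placeholder GPU size
    count   = 1
    labels  = { "example.org/gpu" = "true" } # hypothetical label
    taints  = ["sku=gpu:NoSchedule"]
  }
}
```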
LGTM!
Ran into this error on apply; the main cluster may need upgrading:
|
Commit |
This has been successfully applied to the cluster! 🎉 🚀 |
Thanks @sgibson91 and @yuvipanda for documenting all of these steps as well :-) |
This PR is the first step towards #930 and provides GPU machines to notebook and dask workers. It also adds support for adding labels and taints to specific nodepools in a similar manner to what is implemented in our GCP config.
NOTE: Due to #890, I had to run a bespoke `terraform plan` command that only targeted the cluster and nodepools.

Full `terraform plan` command (with some escaping to make `[]` and `"` work with my shell):

Full output: