Highlights
- Toolkit adds support for the A4 machine family, including blueprints and documentation for GKE and Slurm.
- DWS Flex support introduced for GKE.
- Support for persistent Slurm controller state.
What's Changed
Key New Features 🎉
- Support DWS Flex on GKE by @SwarnaBharathiMantena in #3636
- GCSFuse cache enabled a3-mega blueprint by @koallison in #3460
- Add controller save state disk by @alyssa-sm in #3661
- Add GKE support for the A4 machine family, with blueprints and documentation, by @SwarnaBharathiMantena and @annuay-google in #3703 #3704 #3702 #3718 #3705 #3656 #3657 #3719
- Add A4 Slurm blueprints by @harshthakkar01 in #3709
Module Improvements 🔨
- Support Kueue 0.10.1 by @annuay-google in #3620
- Enable queued provisioning on GKE node pool by @SwarnaBharathiMantena in #3582
- Update GPU and disk definitions for A4 by @SwarnaBharathiMantena in #3657
Improvements 🛠
- Remove explicitly stated mtu in gke-a3-ultragpu as default mtu has sa… by @parulbajaj01 in #3619
- Add cuda-toolkit to a3ultra-jbvms blueprint by @RachaelSTamakloe in #3615
- Add missing indexes and a3U documentation in readme for gke blueprints by @parulbajaj01 in #3599
- Add variable for k8s service account and remove hardcoded value by @parulbajaj01 in #3634
- Set defaults for GPU driver, disk type and Jobset version for A3U blueprints by @annuay-google in #3679
- Standardize naming prefixes for kubernetes network objects by @parulbajaj01 in #3644
- Remove autoscaling max nodes from A3H and A3M tests by @parulbajaj01 in #3696
- A4 GKE integration test by @annuay-google in #3718
Deprecations 💤
- Deprecate GKE topology scheduler by @annuay-google in #3678
Version Updates ⏫
- Update NeMo framework examples to 24.12 by @akiki-liang0 in #3616
- Pin to latest TPG v6.20.0 minor release by @abbas1902 in #3669
Bug fixes 🐞
- Fixes HPL benchmark test due to WARMUP_END_PROG environment variable. by @samskillman in #3631
- Increase google and google-beta provider versions for GKE cluster by @annuay-google in #3635
- Fix guest accelerator (broken for GKE) by @annuay-google in #3656
- Enable NVIDIA DCGM in A3 Ultra Slurm blueprint by @tpdownes in #3673
- Fix HTCondor config by @lemaitre-aneo in #3664
- Fix ordering of local SSD mounting and docker by @tpdownes in #3682
Full Changelog: v1.46.1...v1.47.0