This repository has been archived by the owner on Mar 21, 2024. It is now read-only.
Lazily initialize the per-device attribute caches, because CUDA context creation is expensive and adds up with large CUDA binaries on machines with many GPUs. This was making PyTorch slow and consuming lots of memory.

To implement this, I added an atomic status flag to each entry in the cache. Each entry is in one of three states: empty, initializing, and ready. Progression between states happens linearly.

Also:
- Add `cub::DeviceCount` and `cub::DeviceCountUncached`, caching abstractions for `cudaGetDeviceCount`.
- Make `cub::SwitchDevice` avoid setting/resetting the device if the current device is the same as the target device.

Bug 2884640

Reviewed-by: Michał 'Griwes' Dominiak <[email protected]>