This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

Lazily initialize the per-device attribute caches, because CUDA context
creation is expensive and adds up with large CUDA binaries on machines with
many GPUs. This was making PyTorch slow and consuming lots of memory.

To implement this, I added an atomic status flag to each entry in the cache.
Each entry is in one of three states: empty, initializing, and ready.
Entries progress through these states in order, and never move backwards.

Also:
- Add `cub::DeviceCount` and `cub::DeviceCountUncached`, caching
  abstractions for `cudaGetDeviceCount`.
- Make `cub::SwitchDevice` avoid setting/resetting the device if the current
  device is the same as the target device.

Bug 2884640

Reviewed-by: Michał 'Griwes' Dominiak <[email protected]>
brycelelbach committed Mar 11, 2020
1 parent 6552e4d commit bac2060
Showing 1 changed file with 230 additions and 73 deletions.