This repository has been archived by the owner on Mar 21, 2024. It is now read-only.
Lazily initialize the per-device attribute caches, because CUDA context creation is expensive and adds up with large CUDA binaries on machines with many GPUs. This was making PyTorch slow and consuming lots of memory.

To implement this, I added an atomic status flag to each entry in the cache. Each entry is in one of three states: empty, initializing, and ready. Progression between states happens linearly.

Also:
- Add `cub::DeviceCount` and `cub::DeviceCountUncached`, caching abstractions for `cudaGetDeviceCount`.
- Make `cub::SwitchDevice` avoid setting/resetting the device if the current device is the same as the target device.

Bug 2884640

Reviewed-by: Michał 'Griwes' Dominiak <[email protected]>