[Bugfix][NCCL] Release NCCL thread_local resources in destructor (#17078)

Prior to this commit, allocations performed by `ncclCommInitRank` had no corresponding call to `ncclCommDestroy`. While the `CCLThreadLocalContext::Clear` method does call `ncclCommDestroy`, nothing ever called `Clear`.

On worker processes, the failure to call `ncclCommDestroy` typically had little effect. Any destruction would occur shortly before the process exits, and the OS reclaims the resources when the process terminates.

However, worker0 of a Disco session is a separate thread, rather than a separate process. While this allows it to easily receive data from the controller thread, resources allocated by worker0 are not reclaimed by the OS until the entire process terminates. As a result, `CCLThreadLocalContext` leaked GPU memory: the resources allocated by the `ncclCommInitRank` call at the start of each `tvm.runtime.disco.ProcessSession` were never deallocated. GPU memory usage grew by about 1 gigabyte for each `ProcessSession`.

This commit updates `CCLThreadLocalContext` to have a destructor that calls the `Clear` method. For worker0, the destructor runs when the thread is joined to the main thread.
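
The pattern described above (a thread-local context whose destructor calls `Clear`) can be illustrated with a minimal, self-contained sketch. This is not the TVM implementation: `FakeComm`, `FakeCommInit`, `FakeCommDestroy`, `ThreadLocalContext`, and `LiveCommsAfterWorker` are hypothetical stand-ins, and a simple counter takes the place of the GPU resources that a real `ncclCommInitRank` call would allocate.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical stand-in for NCCL state: counts communicators that are alive.
static std::atomic<int> live_comms{0};

// Hypothetical placeholder for a communicator handle (not ncclComm_t).
struct FakeComm {};

// Stand-in for an init call that allocates a resource.
FakeComm* FakeCommInit() {
  ++live_comms;
  return new FakeComm();
}

// Stand-in for the matching destroy call.
void FakeCommDestroy(FakeComm* comm) {
  assert(comm != nullptr);
  delete comm;
  --live_comms;
}

// Hypothetical analogue of a per-thread context that owns a communicator.
class ThreadLocalContext {
 public:
  FakeComm* comm = nullptr;

  void Init() {
    if (comm == nullptr) comm = FakeCommInit();
  }

  // Releases the communicator; safe to call more than once.
  void Clear() {
    if (comm != nullptr) {
      FakeCommDestroy(comm);
      comm = nullptr;
    }
  }

  // Without this destructor, the resource would outlive the thread
  // and stay allocated until the whole process exits.
  ~ThreadLocalContext() { Clear(); }

  static ThreadLocalContext* Get() {
    thread_local ThreadLocalContext ctx;  // one instance per thread
    return &ctx;
  }
};

// Runs a worker *thread* that initializes its context, joins it,
// and reports how many communicators are still alive afterwards.
int LiveCommsAfterWorker() {
  std::thread worker([] { ThreadLocalContext::Get()->Init(); });
  worker.join();  // thread-local destructors run as the worker thread exits
  return live_comms.load();
}
```

In this sketch, removing `~ThreadLocalContext()` would leave `live_comms` at 1 after the worker thread is joined, which mirrors the leak described for worker0. With the destructor in place, the counter returns to 0 once the thread exits.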