Out of Memory when working with Distributed for Small Matrices #2548

Closed
jarbus opened this issue Nov 7, 2024 · 2 comments · Fixed by #2583
Labels
bug Something isn't working

Comments

@jarbus
Contributor

jarbus commented Nov 7, 2024

Describe the bug

I'm performing matrix multiplication on a GPU across multiple workers using Distributed. I limit CUDA's memory usage with the environment variables described in the documentation, but oddly, these variables only seem to take effect for large 1024x1024 matrices. The code below appears to ignore the memory restrictions, uses all available GPU memory, and eventually OOMs, due to what I believe is a race condition. Changing the 128 to 1024 in the following code results in each process restricting itself to roughly 2x the memory limit (around 10%), which prevents an OOM on my machine.

To reproduce

using Distributed

# Per-worker environment: hard-limit GPU memory to 5% and disable the memory pool.
env = [
    "JULIA_CUDA_HARD_MEMORY_LIMIT" => "5%",
    "JULIA_CUDA_MEMORY_POOL" => "none"
]

n_workers = 6
addprocs(n_workers, env=env)

@everywhere begin
    using CUDA
    function matrix_multiply_on_gpu(worker_id)
        A = CUDA.rand(Float32, 128, 128)
        B = CUDA.rand(Float32, 128, 128)
        C = A * B
        return sum(C)
    end
end

# Repeatedly multiply small matrices on every worker; GPU memory usage keeps growing.
for i in 1:100_000
    pmap(matrix_multiply_on_gpu, 1:n_workers)
end
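For reference, a quick way to see what each worker actually reports is below. This is just a minimal sketch, assuming the workers from the snippet above are still alive; it only calls CUDA.memory_status() on every worker.

# Sketch: print each worker's view of GPU memory so the effect (or lack of effect)
# of JULIA_CUDA_HARD_MEMORY_LIMIT can be observed while the loop above runs.
for pid in workers()
    remotecall_fetch(pid) do
        println("memory status on worker ", pid, ":")
        CUDA.memory_status()
    end
end
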
Manifest.toml

Using an environment with only CUDA.jl#master installed; I will update with a Manifest.toml if needed.

Expected behavior

I expect each process to limit itself to 5% of the GPU memory, or at least some maximum amount.

Version info

Details on Julia:


Julia Version 1.11.1
Commit 8f5b7ca12ad (2024-10-16 10:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 64 × AMD Ryzen Threadripper PRO 5975WX 32-Cores
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, znver3)
Threads: 1 default, 0 interactive, 1 GC (on 64 virtual cores)
Environment:
  LD_LIBRARY_PATH = 

Details on CUDA:

CUDA runtime 12.6, artifact installation
CUDA driver 12.3
NVIDIA driver 545.23.8

CUDA libraries: 
- CUBLAS: 12.6.3
- CURAND: 10.3.7
- CUFFT: 11.3.0
- CUSOLVER: 11.7.1
- CUSPARSE: 12.5.4
- CUPTI: 2024.3.2 (API 24.0.0)
- NVML: 12.0.0+545.23.8

Julia packages: 
- CUDA: 5.5.2
- CUDA_Driver_jll: 0.10.3+0
- CUDA_Runtime_jll: 0.15.3+0

Toolchain:
- Julia: 1.11.1
- LLVM: 16.0.6

1 device:
  0: NVIDIA RTX A4000 (sm_86, 1.528 GiB / 15.992 GiB available)

Additional context

This is for a research project, where I want to distribute a workload across many processes on multiple machines, each process utilizing a small amount of one of the GPUs.

@maleadt
Member

maleadt commented Dec 10, 2024

Sorry for the slow response. See the linked PR for a fix. The problem is reproducible without Distributed:

ENV["JULIA_CUDA_HARD_MEMORY_LIMIT"] = "5%"
using CUDA

function matrix_multiply_on_gpu(worker_id)
    CUDA.memory_status()

    @async(begin
        A = CUDA.rand(Float32, 128, 128)
        B = CUDA.rand(Float32, 128, 128)

        C = A * B

        sum(C)
    end) |> fetch
end

while true
    matrix_multiply_on_gpu(1)
end

This also highlights the underlying issue: Distributed uses a new task for each pmap invocation, while CUDA.jl caches CUBLAS handles at the task level. In fact, this isn't strictly a CUDA.jl issue; the problem is that Julia doesn't free up dead tasks quickly enough for the GPU memory to recover. In #2583, I add a GC.gc(false) call at a strategic place in the handle cache lookup, hopefully fixing this problem without too much performance overhead.
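As a rough illustration of that pattern (a sketch only: HandleCache, check_out!, and check_in! are made-up names for this comment, not CUDA.jl's actual API), a cache of idle library handles with a GC fallback looks roughly like this:

# Illustrative sketch: when no idle handle is available, nudge the GC so
# finalizers of finished tasks can return their handles before a new one
# gets allocated.
struct HandleCache{H}
    idle::Vector{H}
    lock::ReentrantLock
end
HandleCache{H}() where {H} = HandleCache{H}(H[], ReentrantLock())

function check_out!(cache::HandleCache{H}, create) where {H}
    handle = lock(cache.lock) do
        isempty(cache.idle) ? nothing : pop!(cache.idle)
    end
    if handle === nothing
        # No idle handle: run a quick incremental collection, then check once
        # more before creating a fresh handle.
        GC.gc(false)
        handle = lock(cache.lock) do
            isempty(cache.idle) ? create() : pop!(cache.idle)
        end
    end
    return handle
end

function check_in!(cache::HandleCache{H}, handle::H) where {H}
    lock(cache.lock) do
        push!(cache.idle, handle)
    end
    return
end

The GC.gc(false) in the lookup path is the part that corresponds to #2583: a fresh task that doesn't find a cached handle first gives the GC a chance to reclaim handles from dead tasks.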

@jarbus
Contributor Author

jarbus commented Dec 10, 2024

Thanks @maleadt! Better late than never; this is a really helpful fix for me, and I suspect it will also solve OOM quirks I've been seeing elsewhere.
