Out of Memory when working with Distributed for Small Matrices #2548

Closed
jarbus opened this issue Nov 7, 2024 · 2 comments · Fixed by #2583
Labels
bug Something isn't working

Comments

@jarbus
Contributor

jarbus commented Nov 7, 2024

Describe the bug

I'm performing matrix multiplication on a GPU across multiple workers using Distributed. I limit CUDA's memory usage with the environment variables described in the documentation, but oddly, these variables only seem to take effect for large 1024x1024 matrices. The code below appears to ignore the memory restrictions, uses all available GPU memory, and eventually OOMs, due to what I believe is a race condition. Changing the 128 to 1024 in the following code results in each process restricting itself to roughly 2x the memory limit (around 10%), which prevents an OOM on my machine.

To reproduce

using Distributed

# Per-worker environment: hard-limit GPU memory to 5% and disable the memory pool.
env = [
    "JULIA_CUDA_HARD_MEMORY_LIMIT" => "5%",
    "JULIA_CUDA_MEMORY_POOL" => "none"
]

n_workers = 6
addprocs(n_workers, env=env)

@everywhere begin
    using CUDA
    function matrix_multiply_on_gpu(worker_id)
        A = CUDA.rand(Float32, 128, 128)
        B = CUDA.rand(Float32, 128, 128)
        C = A * B
        return sum(C)
    end
end

# Repeatedly multiply small matrices on every worker; GPU memory usage keeps growing.
for i in 1:100_000
    pmap(matrix_multiply_on_gpu, 1:n_workers)
end
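For reference, a quick way to see what each worker actually reports is below. This is just a minimal sketch, assuming the workers from the snippet above are still alive; it only calls CUDA.memory_status() on every worker.

# Sketch: print each worker's view of GPU memory so the effect (or lack of effect)
# of JULIA_CUDA_HARD_MEMORY_LIMIT can be observed while the loop above runs.
for pid in workers()
    remotecall_fetch(pid) do
        println("memory status on worker ", pid, ":")
        CUDA.memory_status()
    end
end
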
Manifest.toml

Using an environment with only CUDA.jl#master installed; I will update with a Manifest.toml if needed.

Expected behavior

I expect each process to limit itself to 5% of the GPU memory, or at least some maximum amount.

Version info

Details on Julia:


Julia Version 1.11.1
Commit 8f5b7ca12ad (2024-10-16 10:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 64 × AMD Ryzen Threadripper PRO 5975WX 32-Cores
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, znver3)
Threads: 1 default, 0 interactive, 1 GC (on 64 virtual cores)
Environment:
  LD_LIBRARY_PATH = 

Details on CUDA:

CUDA runtime 12.6, artifact installation
CUDA driver 12.3
NVIDIA driver 545.23.8

CUDA libraries: 
- CUBLAS: 12.6.3
- CURAND: 10.3.7
- CUFFT: 11.3.0
- CUSOLVER: 11.7.1
- CUSPARSE: 12.5.4
- CUPTI: 2024.3.2 (API 24.0.0)
- NVML: 12.0.0+545.23.8

Julia packages: 
- CUDA: 5.5.2
- CUDA_Driver_jll: 0.10.3+0
- CUDA_Runtime_jll: 0.15.3+0

Toolchain:
- Julia: 1.11.1
- LLVM: 16.0.6

1 device:
  0: NVIDIA RTX A4000 (sm_86, 1.528 GiB / 15.992 GiB available)

Additional context

This is for a research project, where I want to distribute a workload across many processes on multiple machines, each process utilizing a small amount of one of the GPUs.

@maleadt
Member

maleadt commented Dec 10, 2024

Sorry for the slow response. See the linked PR for a fix. The problem is reproducible without Distributed:

ENV["JULIA_CUDA_HARD_MEMORY_LIMIT"] = "5%"
using CUDA

function matrix_multiply_on_gpu(worker_id)
    CUDA.memory_status()

    @async(begin
        A = CUDA.rand(Float32, 128, 128)
        B = CUDA.rand(Float32, 128, 128)

        C = A * B

        sum(C)
    end) |> fetch
end

while true
    matrix_multiply_on_gpu(1)
end

This also highlights the underlying issue: Distributed uses a new task for each pmap invocation, while CUDA.jl caches CUBLAS handles at the task level. In fact, this isn't strictly a CUDA.jl issue; the problem is that Julia doesn't free up dead tasks quickly enough for the GPU memory to recover. In #2583, I add a GC.gc(false) call at a strategic place in the handle cache lookup, hopefully fixing this problem without too much performance overhead.
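As a rough illustration of that pattern (a sketch only: HandleCache, check_out!, and check_in! are made-up names for this comment, not CUDA.jl's actual API), a cache of idle library handles with a GC fallback looks roughly like this:

# Illustrative sketch: when no idle handle is available, nudge the GC so
# finalizers of finished tasks can return their handles before a new one
# gets allocated.
struct HandleCache{H}
    idle::Vector{H}
    lock::ReentrantLock
end
HandleCache{H}() where {H} = HandleCache{H}(H[], ReentrantLock())

function check_out!(cache::HandleCache{H}, create) where {H}
    handle = lock(cache.lock) do
        isempty(cache.idle) ? nothing : pop!(cache.idle)
    end
    if handle === nothing
        # No idle handle: run a quick incremental collection, then check once
        # more before creating a fresh handle.
        GC.gc(false)
        handle = lock(cache.lock) do
            isempty(cache.idle) ? create() : pop!(cache.idle)
        end
    end
    return handle
end

function check_in!(cache::HandleCache{H}, handle::H) where {H}
    lock(cache.lock) do
        push!(cache.idle, handle)
    end
    return
end

The GC.gc(false) in the lookup path is the part that corresponds to #2583: a fresh task that doesn't find a cached handle first gives the GC a chance to reclaim handles from dead tasks.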

@jarbus
Contributor Author

jarbus commented Dec 10, 2024

Thanks @maleadt! Better late than never; this is a really helpful fix for me, and I suspect it will also solve OOM quirks I've been seeing elsewhere.
