
Address RMMAllocator error for UCX Endoscopy Tool Tracking application #604

Merged
3 commits merged into nvidia-holoscan:main on Jan 22, 2025

Conversation

@mocsharp (Contributor) commented Nov 27, 2024

This PR addresses an RMMAllocator crash that occurs when running the application with realtime set to false (i.e., the video is played back as fast as possible):

```
[error] [rmm_allocator.cpp:190] Unexpected error while allocating memory [00007]('video_replayer_allocator') : std::bad_alloc: out_of_memory: RMM failure at:bazel-out/k8-opt/bin/external/rmm/_virtual_includes/rmm/rmm/mr/device/pool_memory_resource.hpp:424: Maximum pool size exceeded
[error] [memory_buffer.hpp:79] video_replayer_allocator Failed to allocate 1229760 size of memory of type 1. Error code: GXF_FAILURE
```

As @grlee77 suggested, the fix is to explicitly set the RMMAllocator's initial and maximum pool sizes.
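A minimal sketch of what such a change can look like in an application's compose() method, assuming Holoscan's RMMAllocator parameter names (device_memory_initial_size / device_memory_max_size); the class name and sizes below are illustrative placeholders, not this PR's actual diff:

```cpp
#include <holoscan/holoscan.hpp>

class EndoscopyReplayerApp : public holoscan::Application {
 public:
  void compose() override {
    using namespace holoscan;
    // Bound the RMM pool explicitly instead of relying on the defaults, so
    // playing frames as fast as possible cannot exceed the maximum pool size.
    // The sizes are placeholders, not the values chosen in this PR.
    auto allocator = make_resource<RMMAllocator>(
        "video_replayer_allocator",
        Arg("device_memory_initial_size", std::string("64MB")),
        Arg("device_memory_max_size", std::string("128MB")));
    // ... pass `allocator` to the replayer/format-converter operators as usual.
  }
};
```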

This PR also adds the ability to enable or disable data flow benchmarking through the application's configuration.
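Roughly, such a switch could be wired up as below; the config key name "benchmarking", the config file path, and the exact from_config()/track() usage are assumptions for illustration, not necessarily the PR's code (EndoscopyReplayerApp refers to the hypothetical class from the sketch above):

```cpp
#include <holoscan/holoscan.hpp>

int main() {
  auto app = holoscan::make_application<EndoscopyReplayerApp>();
  app->config("endoscopy_tool_tracking.yaml");  // hypothetical config path

  // Read an assumed boolean key that toggles data flow benchmarking.
  const bool benchmarking = app->from_config("benchmarking").as<bool>();

  holoscan::DataFlowTracker* tracker = nullptr;
  if (benchmarking) { tracker = &app->track(); }  // enable data flow tracking

  app->run();

  if (tracker) { tracker->print(); }  // print latency statistics when enabled
  return 0;
}
```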

@mocsharp mocsharp requested a review from grlee77 November 27, 2024 18:20
@mocsharp mocsharp self-assigned this Nov 27, 2024
@tbirdso (Contributor) commented Dec 2, 2024

@mocsharp could you please provide more details on the error and root cause? Are the RMMAllocator crash and benchmarking option related?

@mocsharp (Contributor, Author) commented Dec 5, 2024

> @mocsharp could you please provide more details on the error and root cause? Are the RMMAllocator crash and benchmarking option related?

Updated description.

@tbirdso (Contributor) left a comment

One question on defaults; code looks good otherwise. Will wait for approval from @grlee77 to merge.

@tbirdso (Contributor) commented Dec 12, 2024

ping @grlee77 for review

@jjomier (Contributor) commented Jan 21, 2025

@grlee77 can you please review?

@grlee77 (Contributor) left a comment
Thanks. The change looks good to me. I will add an additional description motivating why it may be necessary.

@mocsharp, please either update the default value of benchmarking to false in the YAML or change the comment describing the default so the two are consistent. I will go ahead and approve, so you can merge after that without requesting another review.

@grlee77 (Contributor) commented Jan 22, 2025

> @mocsharp could you please provide more details on the error and root cause? Are the RMMAllocator crash and benchmarking option related?

@tbirdso:
Copying some additional context from a prior discussion outside of GitHub. Item 2 below explains why an allocator may need somewhat more memory capacity than one would naively estimate from the tensors used in a single compute call.

In practice there can be some degree of "parallel" operation of operators even with the GreedyScheduler, due to how CUDA kernels are launched asynchronously from the host. Although the greedy scheduler can only call compute on one operator at a time, the compute method may return immediately after a CUDA kernel is launched, allowing a subsequent compute method to be called while the work from that previous kernel is still executing on the GPU. The ability to return once the kernel is launched is good from a performance standpoint, but it can lead to a couple of potential issues that may confuse application authors:

  1. Times reported for individual operators by Data Flow Tracking or GXF's job statistics (HOLOSCAN_ENABLE_GXF_JOB_STATISTICS=true) reflect the time spent in compute, which may be misleadingly short if the launched kernels continue running on the GPU after compute has returned. For instance, if a subsequent operator then performs an operation such as a device->host copy that requires synchronization, the remainder of the kernel computation time from the previous operator's compute call would show up in that downstream operator's compute time.
  2. If compute returns while computation is still being done on a tensor, an upstream operator is then free to be scheduled again. If that upstream operator was using a BlockMemoryPool or another pool, the memory from the prior compute call would still be in use, so the operator needs space to allocate a second tensor on top of the original one. This means the author has to set a larger number of required blocks (e.g. 2x as many) than they would otherwise have estimated; see the sizing sketch after this list.
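As a small illustration of the sizing implication in item 2, assuming Holoscan's BlockMemoryPool parameters (storage_type / block_size / num_blocks) and using the frame size from the error log above as a placeholder, something like the following could be set up inside compose(); the pool name, enum value, and block count are assumptions, not this PR's configuration:

```cpp
// Sketch only: reserve two blocks per tensor so one block can still be held
// by a kernel launched from the previous compute() call while the next
// compute() call allocates a fresh block.
auto pool = make_resource<holoscan::BlockMemoryPool>(
    "tool_tracking_pool",                                         // hypothetical name
    holoscan::Arg("storage_type", static_cast<int32_t>(1)),       // assumed: 1 = device memory
    holoscan::Arg("block_size", static_cast<uint64_t>(1229760)),  // bytes per frame, from the log
    holoscan::Arg("num_blocks", static_cast<uint64_t>(2)));       // 2x the naive single-block estimate
```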

I think v2.6 had some changes to InferenceOp where the operator's compute can return while the GPU kernels are still running (due to the asynchronous launch of CUDA kernels), so this may be more likely to occur in newer releases than it was previously.

@mocsharp mocsharp force-pushed the vchang/ucx_ett_rmm_settings branch from 70e91f7 to 4ca32c7 on January 22, 2025 18:32
@mocsharp mocsharp force-pushed the vchang/ucx_ett_rmm_settings branch from 4ca32c7 to dc7ded8 on January 22, 2025 18:33
@tbirdso (Contributor) commented Jan 22, 2025

Thanks all, looks good!

@tbirdso tbirdso merged commit 5497c26 into nvidia-holoscan:main Jan 22, 2025
3 checks passed