
Address RMMAllocator error for UCX Endoscopy Tool Tracking application #604

Merged
3 commits merged into nvidia-holoscan:main on Jan 22, 2025

Conversation

@mocsharp (Contributor) commented Nov 27, 2024

This PR addresses an RMMAllocator crash that occurs when running the application with realtime set to false (i.e., the video is played back as fast as possible):

```
[error] [rmm_allocator.cpp:190] Unexpected error while allocating memory [00007]('video_replayer_allocator') : std::bad_alloc: out_of_memory: RMM failure at:bazel-out/k8-opt/bin/external/rmm/_virtual_includes/rmm/rmm/mr/device/pool_memory_resource.hpp:424: Maximum pool size exceeded
[error] [memory_buffer.hpp:79] video_replayer_allocator Failed to allocate 1229760 size of memory of type 1. Error code: GXF_FAILURE
```

As @grlee77 suggested, the fix is to explicitly set the RMMAllocator's initial and maximum pool sizes.
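A minimal sketch of what such a change can look like in an application's compose() method, assuming Holoscan's RMMAllocator parameter names (device_memory_initial_size / device_memory_max_size); the class name and sizes below are illustrative placeholders, not this PR's actual diff:

```cpp
#include <holoscan/holoscan.hpp>

class EndoscopyReplayerApp : public holoscan::Application {
 public:
  void compose() override {
    using namespace holoscan;
    // Bound the RMM pool explicitly instead of relying on the defaults, so
    // playing frames as fast as possible cannot exceed the maximum pool size.
    // The sizes are placeholders, not the values chosen in this PR.
    auto allocator = make_resource<RMMAllocator>(
        "video_replayer_allocator",
        Arg("device_memory_initial_size", std::string("64MB")),
        Arg("device_memory_max_size", std::string("128MB")));
    // ... pass `allocator` to the replayer/format-converter operators as usual.
  }
};
```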

This PR also adds the ability to enable or disable data flow benchmarking through the application's configuration.
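Roughly, such a switch could be wired up as below; the config key name "benchmarking", the config file path, and the exact from_config()/track() usage are assumptions for illustration, not necessarily the PR's code (EndoscopyReplayerApp refers to the hypothetical class from the sketch above):

```cpp
#include <holoscan/holoscan.hpp>

int main() {
  auto app = holoscan::make_application<EndoscopyReplayerApp>();
  app->config("endoscopy_tool_tracking.yaml");  // hypothetical config path

  // Read an assumed boolean key that toggles data flow benchmarking.
  const bool benchmarking = app->from_config("benchmarking").as<bool>();

  holoscan::DataFlowTracker* tracker = nullptr;
  if (benchmarking) { tracker = &app->track(); }  // enable data flow tracking

  app->run();

  if (tracker) { tracker->print(); }  // print latency statistics when enabled
  return 0;
}
```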

@mocsharp mocsharp requested a review from grlee77 November 27, 2024 18:20
@mocsharp mocsharp self-assigned this Nov 27, 2024
@tbirdso (Contributor) commented Dec 2, 2024

@mocsharp could you please provide more details on the error and root cause? Are the RMMAllocator crash and benchmarking option related?

@mocsharp (Contributor, Author) commented Dec 5, 2024

> @mocsharp could you please provide more details on the error and root cause? Are the RMMAllocator crash and benchmarking option related?

Updated description.

@tbirdso (Contributor) left a comment

One question on defaults; code looks good otherwise. Will wait for approval from @grlee77 to merge.

@tbirdso (Contributor) commented Dec 12, 2024

ping @grlee77 for review

@jjomier (Contributor) commented Jan 21, 2025

@grlee77 can you please review?

@grlee77 (Contributor) left a comment
Thanks. The change looks good to me. I will add an additional description motivating why it may be necessary.

@mocsharp, please either update the default value of benchmarking to false in the YAML or change the comment describing the default so the two are consistent. I will go ahead and approve, so you can merge after that without requesting another review.

@grlee77 (Contributor) commented Jan 22, 2025

> @mocsharp could you please provide more details on the error and root cause? Are the RMMAllocator crash and benchmarking option related?

@tbirdso:
Copying some additional context from a prior discussion outside of GitHub. Item 2 below explains why an allocator may need somewhat more memory capacity than one would naively estimate from the tensors used in a single compute call.

In practice there can be some degree of "parallel" operation of operators even with the GreedyScheduler, due to how CUDA kernels are launched asynchronously from the host. Although the greedy scheduler can only call compute on one operator at a time, the compute method may return immediately after a CUDA kernel is launched, allowing a subsequent compute method to be called while the work from that previous kernel is still executing on the GPU. The ability to return once the kernel is launched is good from a performance standpoint, but it can lead to a couple of potential issues that may confuse application authors:

  1. Times reported for individual operators by Data Flow Tracking or GXF's job statistics (HOLOSCAN_ENABLE_GXF_JOB_STATISTICS=true) reflect the time spent in compute, which may be misleadingly short if the launched kernels continue running on the GPU after compute has returned. For instance, if a subsequent operator then performs an operation such as a device->host copy that requires synchronization, the remainder of the kernel computation time from the previous operator's compute call would show up in that downstream operator's compute time.
  2. If compute returns while computation is still being done on a tensor, an upstream operator is then free to be scheduled again. If that upstream operator was using a BlockMemoryPool or another pool, the memory from the prior compute call would still be in use, so the operator needs space to allocate a second tensor on top of the original one. This means the author has to set a larger number of required blocks (e.g. 2x as many) than they would otherwise have estimated; see the sizing sketch after this list.
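As a small illustration of the sizing implication in item 2, assuming Holoscan's BlockMemoryPool parameters (storage_type / block_size / num_blocks) and using the frame size from the error log above as a placeholder, something like the following could be set up inside compose(); the pool name, enum value, and block count are assumptions, not this PR's configuration:

```cpp
// Sketch only: reserve two blocks per tensor so one block can still be held
// by a kernel launched from the previous compute() call while the next
// compute() call allocates a fresh block.
auto pool = make_resource<holoscan::BlockMemoryPool>(
    "tool_tracking_pool",                                         // hypothetical name
    holoscan::Arg("storage_type", static_cast<int32_t>(1)),       // assumed: 1 = device memory
    holoscan::Arg("block_size", static_cast<uint64_t>(1229760)),  // bytes per frame, from the log
    holoscan::Arg("num_blocks", static_cast<uint64_t>(2)));       // 2x the naive single-block estimate
```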

I think v2.6 had some changes to InferenceOp where the operator's compute can return while the GPU kernels are still running (due to the asynchronous launch of CUDA kernels), so this may be more likely to occur in newer releases than it was previously.

@mocsharp mocsharp force-pushed the vchang/ucx_ett_rmm_settings branch from 70e91f7 to 4ca32c7 on January 22, 2025 18:32
@mocsharp mocsharp force-pushed the vchang/ucx_ett_rmm_settings branch from 4ca32c7 to dc7ded8 on January 22, 2025 18:33
@tbirdso (Contributor) commented Jan 22, 2025

Thanks all, looks good!

@tbirdso tbirdso merged commit 5497c26 into nvidia-holoscan:main Jan 22, 2025
3 checks passed