Avoid overallocation when underlying allocation is guaranteed to be sufficiently aligned #881

Open
msimberg opened this issue May 15, 2024 · 2 comments

msimberg commented May 15, 2024

Is your feature request related to a problem? Please describe.

The underlying allocator may already provide sufficient alignment, but aligned_allocate always overallocates to guarantee the alignment, even when that is not necessary:

std::size_t total_bytes{ size + m_alignment };

This wastes (a bit of) memory, and may cause performance issues with some MPI libraries.

Describe the solution you'd like

Avoid overallocation if the underlying allocator provides sufficiently aligned allocations.
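
To make the idea concrete, here is a minimal sketch of the size decision. The names `bytes_to_request` and `guaranteed_alignment` are hypothetical and not part of Umpire; the sketch assumes the caller knows what alignment the underlying allocator promises and that all alignments are powers of two.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helpers for illustration only; not Umpire's API.
inline bool is_power_of_two(std::size_t x)
{
  return x != 0 && (x & (x - 1)) == 0;
}

// Returns how many bytes to request from the underlying allocator.
// `guaranteed_alignment` is the alignment the underlying allocator is
// assumed to promise for every pointer it returns.
inline std::size_t bytes_to_request(std::size_t size,
                                    std::size_t alignment,
                                    std::size_t guaranteed_alignment)
{
  assert(is_power_of_two(alignment));
  assert(is_power_of_two(guaranteed_alignment));

  // For powers of two, a pointer aligned to the larger value is also
  // aligned to the smaller one, so no padding is needed.
  if (guaranteed_alignment >= alignment) {
    return size;
  }

  // Otherwise fall back to padding by the alignment, as in the line quoted above.
  return size + alignment;
}
```

Under these assumptions, a pool alignment of 16 on top of an allocator that guarantees, say, 256-byte alignment would request exactly `size` bytes, and only a pool alignment larger than the guarantee would still pay for the padding.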

Describe alternatives you've considered

Allow controlling alignment of backing buffers separately from alignment of user-facing allocations. The latter should probably never be larger than the former.
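
A rough sketch of what two separate alignment parameters could look like. `PoolAlignment`, `is_valid`, and `round_up` are hypothetical names made up for illustration, not an existing Umpire interface; powers of two are assumed throughout.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical configuration for illustration only; not Umpire's API.
struct PoolAlignment {
  std::size_t allocation_alignment;  // alignment of user-facing allocations
  std::size_t buffer_alignment;      // alignment of backing buffers
};

inline bool is_power_of_two(std::size_t x)
{
  return x != 0 && (x & (x - 1)) == 0;
}

// The user-facing alignment should never be larger than the backing one.
inline bool is_valid(const PoolAlignment& a)
{
  return is_power_of_two(a.allocation_alignment) &&
         is_power_of_two(a.buffer_alignment) &&
         a.allocation_alignment <= a.buffer_alignment;
}

// Round `bytes` up to the next multiple of a power-of-two `alignment`.
inline std::size_t round_up(std::size_t bytes, std::size_t alignment)
{
  assert(is_power_of_two(alignment));
  return (bytes + alignment - 1) & ~(alignment - 1);
}
```

With something like `PoolAlignment{16, 4096}`, backing buffer sizes would be rounded with `buffer_alignment` while individual allocations keep the cheaper `allocation_alignment`.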

Additional context

This is really a feature request that comes from investigating what may be a bug in Cray MPICH, but I wanted to report it here as well since I think Umpire could in some situations do a better job (or I'm simply unaware of the knobs that Umpire has for controlling this, so looking for input in any case).

In our application we use Umpire's QuickPool to pool allocations of GPU buffers. QuickPool uses aligned_allocate to allocate backing buffers from e.g. CUDA, so if I ask for a 1 GiB buffer, QuickPool allocates 1 GiB plus the alignment (16 by default) to guarantee that the allocation is aligned. It turns out that with GPU-aware MPI, communicating a buffer whose size isn't page-aligned (I think this is the requirement, but I'm still looking into the details) makes performance drop considerably. I'm separately reporting this issue to HPE.
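
The size arithmetic can be checked with a small sketch. The helper names are hypothetical, and the 4 KiB page size used in the usage note is only an assumption for illustration; the actual requirement of the MPI library is still unclear.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helpers for illustration only.

// Size of the backing allocation when padding by the alignment,
// mirroring `size + m_alignment`.
inline std::size_t padded_size(std::size_t size, std::size_t alignment)
{
  return size + alignment;
}

// True if `bytes` is a whole number of `unit`-sized pages.
inline bool is_multiple_of(std::size_t bytes, std::size_t unit)
{
  assert(unit != 0);
  return bytes % unit == 0;
}
```

For a 1 GiB request, `is_multiple_of(std::size_t{1} << 30, 4096)` holds, but after `padded_size(..., 16)` the backing buffer is 16 bytes past a page boundary and the check fails.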

I could set the alignment of the QuickPool to the page size to get an appropriately sized backing buffer, but if I understand correctly then all allocations on top of that will also have page-sized alignment, which is excessive for small allocations and can end up wasting a lot of memory. From what I can tell DynamicPoolList behaves the same as QuickPool (is there a reason to prefer one or the other by the way?).
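
To put a number on "excessive for small allocations", a small sketch assuming every allocation size is rounded up to the pool alignment. The helper names are hypothetical, and the 4096-byte page size in the usage note is an assumption.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helpers for illustration only.

// Round `bytes` up to the next multiple of a power-of-two `alignment`.
inline std::size_t round_up(std::size_t bytes, std::size_t alignment)
{
  assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
  return (bytes + alignment - 1) & ~(alignment - 1);
}

// Total bytes consumed by `count` allocations of `bytes_each`
// when each one is rounded up to `alignment`.
inline std::size_t total_footprint(std::size_t count,
                                   std::size_t bytes_each,
                                   std::size_t alignment)
{
  return count * round_up(bytes_each, alignment);
}
```

For 1000 allocations of 64 bytes, `total_footprint(1000, 64, 16)` is 64000 bytes, while `total_footprint(1000, 64, 4096)` is 4096000 bytes, i.e. 64 times as much.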

Is there a way already to control the alignment of the backing buffers and "real" allocations on top of it separately? Is there another pool that we could use to get the behaviour we want?

Just out of curiosity, since I couldn't find it: where is the code that ensures a QuickPool allocation starts at the correct alignment? I see the size is adjusted here:

const std::size_t rounded_bytes{aligned_round_up(bytes)};

Edit: I realize this probably happens by construction. If the backing buffers have sufficient alignment and all the allocations have aligned sizes, they'll be guaranteed to start aligned as well.
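
The "by construction" argument can be sketched as follows. This is not QuickPool's actual code: `carve` is a hypothetical stand-in that hands out consecutive blocks from an aligned base address, and `round_up` is a stand-in for a rounding helper like the one quoted above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helpers for illustration only; not QuickPool's implementation.

// Round `bytes` up to the next multiple of a power-of-two `alignment`.
inline std::size_t round_up(std::size_t bytes, std::size_t alignment)
{
  assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
  return (bytes + alignment - 1) & ~(alignment - 1);
}

// Hand out consecutive blocks starting at `base`, rounding each size up.
// Returns the start address of every block.
inline std::vector<std::uintptr_t> carve(std::uintptr_t base,
                                         const std::vector<std::size_t>& sizes,
                                         std::size_t alignment)
{
  assert(base % alignment == 0);  // the backing buffer must start aligned

  std::vector<std::uintptr_t> starts;
  std::uintptr_t next = base;
  for (std::size_t s : sizes) {
    starts.push_back(next);
    // aligned start + multiple of alignment = aligned next start
    next += round_up(s, alignment);
  }
  return starts;
}
```

Every start address is the aligned base plus a sum of multiples of the alignment, so each block starts aligned without any per-allocation pointer adjustment.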

Thanks for your help!

@msimberg
Contributor Author

Ping. Just checking if this is something interesting to you? I may be inclined to attempt implementing one of the options above if it sounds good to you. We're currently still stuck with the workaround where we have to overallocate all allocations by 2 MiB (the large page size on Grace CPUs).

@davidbeckingsale
Member

@msimberg we would definitely be interested in a fix to avoid over-allocating the underlying buffers, and if it would be useful, adjusting the pool to take two alignment parameters (one for the allocations, and one for the buffers). Thanks!
