WIP: Aggregate received data into a large allocation #3453
Conversation
As we need to index into the object and `DeviceBuffer` currently lacks a way to index it, go ahead and coerce `DeviceBuffer`s to Numba `DeviceNDArray`s. This way we can be sure we will be able to index it when needed.
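As a rough illustration of what that coercion can look like (a minimal sketch, assuming RMM's `DeviceBuffer` exposes `__cuda_array_interface__` and using Numba's `cuda.as_cuda_array` for a zero-copy view; this is not necessarily the exact code in the diff):

```python
from numba import cuda
import rmm

# Allocate an opaque DeviceBuffer, which has no __getitem__ of its own ...
dbuf = rmm.DeviceBuffer(size=16)

# ... and take a zero-copy Numba DeviceNDArray view over it via
# __cuda_array_interface__, which does support indexing/slicing.
darr = cuda.as_cuda_array(dbuf)

# The single allocation can now be indexed or carved into smaller views.
first_half = darr[:8]
second_half = darr[8:]
```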
Interesting idea. I would be curious to see what the performance differences are like. My hope would be that this is not necessary with TCP, and wouldn't have much of an effect with UCX if we're using RMM, but I wouldn't be surprised if my intuition was wrong here.
The issue is RMM uses CNMEM under-the-hood, which is known to have issues freeing lots of memory (like small chunks). Keith mentioned this and added some numbers to back it up ( rapidsai/ucx-py#402 ). So this effectively cuts down on the number of times free needs to be called per send, which should improve overall memory allocation performance. Of course some of this occurs on the critical path, but it also happens in other places too 😉

As to TCP, it really depends on how the Python memory manager works (given we are using `bytearray`). That said, we might want to change that to optionally use NumPy to benefit from the HUGEPAGES work ( numpy/numpy#14216 ), at which point we would want to know how NumPy handles memory management. My guess is it would be biased towards handling large allocations efficiently, but I could always be wrong about this 🙂 In any event this is largely orthogonal and should be saved for a separate discussion thread (feel free to open one if it is of interest). For now TCP is just handy for fixing bugs in my approach 😄

I can't say at this point. Still sorting out bugs.

It's worth noting that this may affect things outside of communication itself, as it changes how allocations are done, which will affect other allocations that may not be in the communication path (like spilling or creating intermediates, for example). So a simple benchmark of UCX with/without this change is likely to miss many things or possibly focus on the wrong things. This is all conjecture, but I do want to make sure we are aware of the forest and not just one tree in it. 🙂

To summarize, the idea of the work here is to cut down the number of times alloc/free are called by some factor and then see how overall workflows are impacted.

Anyways, I probably wouldn't follow this too closely for now as it is still WIP. Trying not to distract people while things are still in an early stage 😉
> To summarize, the idea of the work here is to cut down the number of times alloc/free are called by some factor and then see how overall workflows are impacted.
+1. I look forward to seeing how this performs.
Checking in here. Is this still an active effort? No worries either way, I'm just going through old PRs and seeing what we can close out.
I have had my hands full, but yes this remains important.
Closing out in favor of PR ( #3732 ).
Somewhat related to the approach that was tried in PR ( #5487 ), except that looked at …
Note: This is still being worked on. So there may be bugs, test failures, bad behavior, etc. At this stage, all of these should be expected. You have been warned. 😉
As we have seen some issues due to churn and fragmentation from small allocations, this tries to allocate one large buffer up front before receiving. This buffer is then split up into small views onto that larger allocation. Once this setup is done we receive into all of these views. The views are then passed to deserialization much as the smaller buffers were before.
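As a rough illustration of the host-side idea (a sketch only, not the PR's actual code; the `frame_sizes` values here are made up for the example):

```python
# Hypothetical frame lengths taken from the message header.
frame_sizes = [96, 1024, 4096]

# One large allocation up front ...
big_buffer = memoryview(bytearray(sum(frame_sizes)))

# ... carved into one view per frame.
frames = []
offset = 0
for size in frame_sizes:
    frames.append(big_buffer[offset:offset + size])
    offset += size

# Each view is then filled in place during receive (e.g. recv_into for TCP)
# and handed to deserialization just like individually allocated buffers
# would have been before.
```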
We take views onto the data so that send and deserialization methods can work on data they are familiar with. A follow-up to this work might be to figure out how deserialization and receive can work on larger buffers to start with, avoiding this split of the buffer into views. Though the intent is to not require this at this stage and only follow up on it as needed later.
To help illustrate the motivation for this change, assume we are receiving a DataFrame. Previously we would create multiple allocations for each column along with their relevant internal data (masks, indices, etc.). With this change, we should make one host (and one device) allocation for the whole DataFrame and then fill out its contents.
As a bonus of performing this allocation up front and parceling it into the relevant views, we are able to get rid of the old looping behavior around each receive and instead hand all of these to asyncio. This could potentially (though this is yet to be verified) get more out of the available bandwidth and benefit from things like non-blocking sends.
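A rough sketch of what handing the receives to asyncio could look like (illustrative only; `read_into` below is a hypothetical coroutine standing in for the comm's receive-into-view call, not an actual distributed API):

```python
import asyncio

async def read_into(view):
    """Hypothetical stand-in for a receive that fills one view in place."""
    ...

async def receive_frames(views):
    # Rather than awaiting each receive in a serial loop, schedule them all
    # and let asyncio interleave them as data arrives.
    await asyncio.gather(*(read_into(view) for view in views))
```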
Have added this change to both TCP and UCX: the former to aid in testing and discussion, the latter because it is of interest for improving performance in that use case. Though hopefully both will get a boost from this work.