What's the update on RapidsShuffleManager to resolve the bottleneck of waiting to acquire the semaphore? #5650
My SQL contains three tables and two joins. Using the SQL tab of the Spark Web UI, I analyzed the details of the query, listed the time cost of each node of the task within the last Spark stage, and compared them with CPU mode; I attached the results here. It seems that GPU operators like Sort and HashAggregate are quite a bit faster than the corresponding CPU operators. However, GPU mode contains additional operators like GpuShuffleCoalesce and GpuCoalesceBatches, and these two operators cost so much time that they slow down the overall performance.
The metrics above do not indicate GpuShuffleCoalesce and GpuCoalesceBatches are directly the issue. For both operations, the

Regarding the GPU semaphore, which version of the RAPIDS Accelerator are you using? We have added an explicit metric for time spent waiting for the GPU semaphore to many SQL UI nodes, available in release 21.12 and later. It is disabled by default to keep the number of metrics manageable for the driver, but it can be enabled in those releases by setting

We have also made strides in recent releases to avoid holding the GPU semaphore while not actively processing data on the GPU (e.g. #4588 and #4476). There are still instances where this can happen, and the problem is quite complex. For example, it would be relatively easy to always release the GPU semaphore when performing shuffle I/O or other host-based operations, but doing so while data remains in GPU memory allows new tasks to start adding data to GPU memory, which can easily lead to an OOM or heavy thrashing with excessive memory spill. The thrashing can be so bad that it can be faster to hold onto the semaphore and prevent too many tasks from trying to use the GPU simultaneously, even though that seems wasteful at first glance. It all depends on how fast the network is, how fast the local disks are, how much memory new tasks will add to the GPU before the I/O completes, and so on. It is a tricky problem with many variables that are difficult to predict.
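The exact configuration key is elided above, so as a hedged sketch only (assuming the standard RAPIDS Accelerator plugin class and the `spark.rapids.sql.metrics.level` setting documented for these releases), enabling the more detailed metrics might look like this:

```scala
// Minimal sketch: turning on the more verbose SQL metrics (which include the
// GPU semaphore wait metric) for RAPIDS Accelerator 21.12+.
// Assumption: spark.rapids.sql.metrics.level is the setting referred to above.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("rapids-semaphore-metrics")
  .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
  .config("spark.rapids.sql.metrics.level", "DEBUG") // default keeps metrics lighter for the driver
  .getOrCreate()
```

With the more verbose metric level enabled, the semaphore wait time should show up on the affected SQL UI nodes.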
I updated the RAPIDS Accelerator to version 22.04 and ran the SQL again.
Work is planned in the 22.08 release cycle to look for improvement opportunities related to semaphore acquisition: #4568.
There's quite a bit of double-counting in this calculation. For example, the The
yea, nice comment, great thx @jlowe
Now that I have updated the RAPIDS Accelerator from 21.10 to 22.04, I found that the collect batch time in GpuShuffleCoalesce and GpuCoalesceBatches disappeared, and yes, the GPU semaphore wait time comes in at 3.5s, roughly 3.2s out of 5s, nearly 60% of the task time cost. I list the details here. So the question is: is the bottleneck the GPU semaphore or the shuffle read? BTW, I didn't use the RapidsShuffleManager here, as we tested it before and the result was worse; maybe after the 22.08 release we could give it a try?
Clearly semaphore wait is the main bottleneck for this task, but it's difficult to say for sure whether shuffle is being performed while the semaphore is held (generally undesirable when it occurs, but it does happen in some cases). Note that semaphore wait in itself is not necessarily something that needs to be eliminated at all costs, as the semaphore's intent is to prevent situations where adding more concurrent tasks to the GPU may lead to an out-of-memory error on the GPU. For example, if you have very many concurrent tasks configured for an executor (e.g. 256 cores) but a relatively small GPU (e.g. a T4 with only 16GB), then it is fully expected to see a relatively high semaphore wait time, since most tasks will be waiting their turn to use the GPU. If all 256 concurrent tasks tried to use the T4 at the same time, it's very likely the GPU would run out of memory.

There are a couple of ways to help reduce semaphore wait time. The first is seeing if you can run more concurrent tasks on the GPU by configuring
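As a minimal sketch of that first suggestion (the config key is elided above; `spark.rapids.sql.concurrentGpuTasks` is my assumption for the setting being referred to, and the numbers are illustrative only):

```scala
// Sketch: letting more tasks share the GPU concurrently.
// spark.rapids.sql.concurrentGpuTasks controls how many tasks may hold the
// GPU semaphore at once; raising it trades semaphore wait for GPU memory pressure.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("rapids-concurrent-gpu-tasks")
  .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
  .config("spark.executor.cores", "16")                // CPU task slots per executor (illustrative)
  .config("spark.rapids.sql.concurrentGpuTasks", "4")  // tasks allowed on the GPU at once (illustrative)
  .getOrCreate()
```

Raising the concurrent GPU task count reduces semaphore wait but increases GPU memory pressure, so it is worth increasing gradually while watching for spill or OOM.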
The main bottleneck during this portion of the query is waiting to acquire the semaphore. However, the tasks owning the semaphore are not making full use of the GPU: most of their time is spent decompressing shuffle data on the CPU and copying it down to the GPU. They need to own the GPU semaphore during this phase because they are placing data onto the GPU, and the whole point of the GPU semaphore is to prevent too many tasks from placing data onto the GPU at the same time and exhausting the GPU's memory.
Essentially the main bottleneck in that stage is dealing with the shuffle data and transfer to the GPU, because that's what's taking so long for the tasks holding the GPU semaphore to release it. Once the shuffle data is loaded on the GPU, the rest of the stage processing is quite fast. The RapidsShuffleManager was designed explicitly to target this shuffle problem, as it tries to keep shuffle targets in GPU memory and not rely on the CPU for compression/decompression which can be a bottleneck. Unfortunately there are a number of issues with RapidsShuffleManager that prevent it from working well in all situations, but we're actively working on improving it. Our goal is to eventually have that shuffle manager be the preferred shuffle when using the RAPIDS Accelerator, even if the cluster does not have RDMA-capable hardware.
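For reference, a hedged sketch of opting into the RapidsShuffleManager; the shim-qualified class name below (`spark321`) is an assumption for Spark 3.2.1 and must be matched to your actual Spark build:

```scala
// Sketch: enabling the RapidsShuffleManager. The class name is shim-specific,
// so the "spark321" package below is illustrative, not authoritative.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("rapids-shuffle-manager")
  .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
  .config("spark.shuffle.manager", "com.nvidia.spark.rapids.spark321.RapidsShuffleManager")
  .getOrCreate()
```

Note that `spark.shuffle.manager` must be set before the SparkContext is created, so in practice it is usually passed via `spark-submit --conf` rather than changed in an existing session.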
If you add more GPUs (and thus more executors, since an executor with the plugin can only control 1 GPU), yes, performance should be improved to a point. This would be similar to adding executors to a CPU job. If you have enough GPUs in your cluster so that CPU_cores_per_executor == concurrent_GPU_tasks then no task will ever wait on the GPU semaphore.
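To make the arithmetic concrete, here is an illustrative sizing sketch (the numbers are hypothetical, not taken from this issue):

```scala
// Back-of-the-envelope sizing for GPU semaphore contention.
val executorCores      = 16 // spark.executor.cores (illustrative)
val concurrentGpuTasks = 4  // spark.rapids.sql.concurrentGpuTasks (illustrative)
val maxTasksWaiting    = executorCores - concurrentGpuTasks // up to 12 tasks queued on the semaphore
// When executorCores == concurrentGpuTasks, no task ever waits on the semaphore,
// at the cost of needing enough GPU memory for all of those tasks at once.
```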
Originally posted by @jlowe in #5394 (comment)