
What is the status of RapidsShuffleManager for resolving the bottleneck of waiting to acquire the GPU semaphore? #5650

Open
cfangplus opened this issue May 26, 2022 · 8 comments
Labels
performance (A performance related task/issue), question (Further information is requested)

Comments

@cfangplus

The main bottleneck during this portion of the query is waiting to acquire the semaphore. However the tasks owning the semaphore are not making full use of the GPU. The main portion of time they are spending is decompressing shuffle data on the CPU and copying it down to the GPU. They need to own the GPU semaphore during this phase because they are placing data onto the GPU. The whole point of the GPU semaphore is to prevent too many tasks from placing data onto the GPU at the same time and exhausting the GPU's memory.

Essentially the main bottleneck in that stage is dealing with the shuffle data and transfer to the GPU, because that's what's taking so long for the tasks holding the GPU semaphore to release it. Once the shuffle data is loaded on the GPU, the rest of the stage processing is quite fast. The RapidsShuffleManager was designed explicitly to target this shuffle problem, as it tries to keep shuffle targets in GPU memory and not rely on the CPU for compression/decompression which can be a bottleneck. Unfortunately there are a number of issues with RapidsShuffleManager that prevent it from working well in all situations, but we're actively working on improving it. Our goal is to eventually have that shuffle manager be the preferred shuffle when using the RAPIDS Accelerator, even if the cluster does not have RDMA-capable hardware.

If I use 2 GPUs, like T4s, will performance be significantly improved?

If you add more GPUs (and thus more executors, since an executor with the plugin can only control 1 GPU), yes, performance should be improved to a point. This would be similar to adding executors to a CPU job. If you have enough GPUs in your cluster so that CPU_cores_per_executor == concurrent_GPU_tasks then no task will ever wait on the GPU semaphore.

Originally posted by @jlowe in #5394 (comment)
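
Purely as an illustration of that sizing rule (not part of the original thread): a minimal Scala sketch, with hypothetical values, where the CPU cores per executor match spark.rapids.sql.concurrentGpuTasks so that every task slot can hold the GPU semaphore at once. Note that executor-level settings such as spark.executor.cores normally need to be supplied at application submission rather than in an already-running session.

```scala
import org.apache.spark.sql.SparkSession

// Sizing sketch only: 4 cores per executor and 4 concurrent GPU tasks means
// no task ever has to wait on the GPU semaphore. The values are illustrative.
val spark = SparkSession.builder()
  .appName("gpu-semaphore-sizing-sketch") // hypothetical app name
  .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
  .config("spark.executor.cores", "4")
  .config("spark.rapids.sql.concurrentGpuTasks", "4")
  .getOrCreate()
```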

@cfangplus cfangplus changed the title to What is the status of RapidsShuffleManager for resolving the bottleneck of waiting to acquire the GPU semaphore? May 26, 2022
@cfangplus
Author

cfangplus commented May 26, 2022

My SQL query involves three tables and two joins. Using the SQL tab of the Spark Web UI, I analyzed the details of the query, listed the time cost of each node in the last Spark stage, and compared them with CPU mode; the results are attached here. It seems that GPU operators like Sort and HashAggregate are quite a bit faster than the corresponding CPU operators. However, GPU mode contains additional operators like GpuShuffleCoalesce and GpuCoalesceBatches, and these two operators take so much time that they slow down the overall performance.
I have not used NVIDIA Nsight Systems yet, but from the earlier discussion in #5394 (comment), I guess that acquiring the GPU semaphore is the bottleneck. So my question is: how can I resolve this performance problem, and what is the status of the RapidsShuffleManager?
[screenshot: time costs]

@jlowe
Contributor

jlowe commented May 26, 2022

The metrics above do not indicate GpuShuffleCoalesce and GpuCoalesceBatches are directly the issue. For both operations, the concat batch time total is the time spent actually doing the operation that node requires (i.e.: concatenating multiple batches together into a larger batch), and in both cases, the time is pretty low (even less than a millisecond in many cases). The collect batch time total metric is the time that node spent waiting for input -- in other words, waiting for the node above it in the plan to produce output. With the concat metric being low and the collect time being high, that indicates the time was spent waiting for shuffle (or whatever node precedes it in the query plan).

Regarding the GPU semaphore, which version of the RAPIDS Accelerator are you using? We have added an explicit metric for time spent waiting for the GPU semaphore in many SQL UI nodes which is available in release 21.12 and after. It is disabled by default to keep the number of metrics manageable for the driver, but it can be enabled in those releases by setting spark.rapids.sql.metrics.level=DEBUG.
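
As a hedged sketch of that setting (the app name is a placeholder; only the metrics-level config comes from the comment above):

```scala
import org.apache.spark.sql.SparkSession

// Enable the DEBUG metrics level (available in release 21.12 and later) so the
// GPU semaphore wait metric shows up in the SQL UI nodes.
val spark = SparkSession.builder()
  .appName("rapids-debug-metrics-sketch") // hypothetical app name
  .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
  .config("spark.rapids.sql.metrics.level", "DEBUG")
  .getOrCreate()
```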

We have also made strides in recent releases to avoid holding the GPU semaphore while not actively processing data on the GPU (e.g.: #4588 and #4476). There are still instances where this can happen, and the problem is quite complex. For example, it would be relatively easy to always release the GPU semaphore when performing shuffle I/O or other host-based operations, but doing so while data remains in GPU memory allows new tasks to start adding data to GPU memory and it can easily lead to an OOM or heavy thrashing with excessive memory spill. The thrashing can be so bad that it can be faster to hold onto the semaphore and prevent too many tasks from trying to use the GPU simultaneously even though it seems wasteful at first glance. It all depends on how fast the network is, how fast local disks are, how much memory new tasks will add to the GPU before the I/O completes, etc. etc. It is a tricky problem with many variables that are difficult to predict.

@jlowe jlowe added the ? - Needs Triage (Need team to review and classify) label May 26, 2022
@cfangplus
Author

I updated the RAPIDS Accelerator to version 22.04 and ran the SQL again.
I found that the collect batch time in GpuShuffleCoalesce and GpuCoalesceBatches disappeared, and yes, the GPU semaphore wait time comes in at 3.2s/5s = 60%. Here are some of the Spark properties: spark.rapids.sql.concurrentGpuTasks = 4 while spark.executor.cores = 12.

@sameerz sameerz added the question (Further information is requested) label and removed the ? - Needs Triage (Need team to review and classify) label May 31, 2022
@mattahrens
Collaborator

Work is planned in the 22.08 release cycle to look for improvement opportunities related to semaphore acquisition: #4568.

@mattahrens mattahrens added the performance (A performance related task/issue) label May 31, 2022
@cfangplus
Author

cfangplus commented Jun 8, 2022

I have another question about the SQL query mentioned above, which contains three tables and two joins. I attached the physical plan here.
[screenshot: sql physical plan]
The table shown above provides the time cost for each plan node within the last stage, which contains the two join operations. As you can see, we can estimate the total time of each task for this stage: 738ms + 827ms + 32ms + 22ms + 1.6s + 3.5s + 10ms + 3.5s + 48ms + 52ms + 5.1s + 1.7s + 2ms + 1.6s + 1.6s = 20.329s. However, the Spark Web UI shows that the task time within the last stage is nearly 5s. Why?
[screenshot: median task time]

@jlowe
Contributor

jlowe commented Jun 8, 2022

we can estimate the total time of each task for this stage, 738ms + 827ms + 32ms + 22ms + 1.6s + 3.5s + 10ms + 3.5s + 48ms + 52ms + 5.1s + 1.7s + 2ms + 1.6s + 1.6s = 20.329s

There's quite a bit of double-counting in this calculation. For example, the collect batch time total metric for GpuCoalesceBatches is timing how long it took to collect all the input batches, which is effectively timing all the nodes above this node in the stage rather than something specific to this node. In this example, that includes the time for the filter, running window, sort, and shuffle coalesce. In general, if you're interested in the time associated with just what one node in the plan is doing, you should ignore any "collect time" metrics as that is timing how long it took other nodes to produce output rather than this node. There are other metrics, like build time total and stream time total, that similarly can end up timing other nodes. That's why the time spent in build time + stream time for the join towards the end of the stage ends up very close to the total task time.

The op time total metric will focus on just how long a single node is taking to perform its operation separate from time spent waiting for input iterators.

@cfangplus
Author

Yes, nice comment, many thanks @jlowe.
Another question; as you replied before:

With the concat metric being low and the collect time being high, that indicates the time was spent waiting for shuffle (or whatever node precedes it in the query plan).

Now I have updated the RAPIDS Accelerator from 21.10 to 22.04. I found that the collect batch time in GpuShuffleCoalesce and GpuCoalesceBatches disappeared, and yes, the GPU semaphore wait time comes in at 3.5s, which is about 3.2s/5s, nearly a 60% proportion of the task time. I list the details here. So the question is: is the bottleneck the GPU semaphore or the shuffle read? By the way, I did not use the RapidsShuffleManager here, since we tested it before and the result was worse; maybe after the 22.08 release we could give it another try?
[screenshot: semaphore wait metrics]

@jlowe
Contributor

jlowe commented Jun 13, 2022

So the question is, the bottleneck is GPU semaphore or shuffle read?

Clearly semaphore wait is the main bottleneck for this task, but it's difficult to say for sure whether there is shuffle being performed while the semaphore is being held (generally undesirable when this occurs but it does happen in some cases).
An Nsight trace would be able to show whether the semaphore is being held while shuffle is being performed.

Note that semaphore wait in itself is not necessarily something that needs to be eliminated at all costs, as the semaphore's intent is to prevent situations where adding more concurrent tasks to the GPU may lead to an out-of-memory error on the GPU. For example, if you have very many concurrent tasks configured for an executor (e.g.: 256 cores) but a relatively small GPU (e.g.: T4 with only 16GB) then it is fully expected to see a relatively high semaphore wait time since most tasks will be waiting their turn to use the GPU. If all 256 concurrent tasks tried to use the T4 at the same time, it's very likely the GPU will run out of memory.

There are a couple of ways to help reduce semaphore wait time. The first is seeing if you can run more concurrent tasks on the GPU by raising spark.rapids.sql.concurrentGpuTasks. The more concurrent tasks that are allowed on the GPU, the lower the average semaphore wait time per task, but setting this too high can lead to GPU OOM errors. Another way to reduce the wait is to reduce the number of CPU cores per executor. This runs fewer tasks concurrently per executor, which can hurt query performance for portions that run only on the CPU, but it also reduces CPU memory pressure from having so many tasks trying to run simultaneously, which can occasionally make things faster overall.
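
A hedged sketch of those two knobs; the numbers are made up and should be tuned against your own GPU memory and workload, and executor-level settings are normally supplied at submit time:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative tuning sketch:
// - raising concurrentGpuTasks lets more tasks hold the semaphore at once
//   (too high risks GPU OOM)
// - lowering executor cores reduces how many tasks queue behind the semaphore
val spark = SparkSession.builder()
  .appName("semaphore-tuning-sketch") // hypothetical app name
  .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
  .config("spark.executor.cores", "8")                // e.g. down from 12
  .config("spark.rapids.sql.concurrentGpuTasks", "4") // raise cautiously
  .getOrCreate()
```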
