[QST] I need someone to help explain the shuffle read code in rapids (thanks very much) #5384

JustPlay · 2020-09-28T13:13:59Z

JustPlay
Sep 28, 2020

What is your question?

Is there anyone who can help explain the whole shuffle execution process (in particular the shuffle read)

When i see the rapids code，i found the following scala files are related to shuffle (shuffle read), but i can not make it clear
about what actually will happen (when doing shuffle read), for example the code's call-chain is not clear to me，i think i need some detailed info

Thanks

GpuPartitioning.scala
GpuColumnarBatchSerializer.scala
GpuShuffleExchangeExec.scala
GpuCustomShuffleReaderExec.scala
ShuffledBatchRDD.scala
GpuCoalesceBatches.scala

Answered by revans2

Sep 28, 2020

@andygrove please correct me if I get anything wrong in relation to AQE.
@abellina please correct anything I get wrong for the UCX based shuffle.

There are two different shuffle instances in the plugin.

The first one is based off of Spark's sql shuffle. The SQL shuffle is in turn based off of the RDD shuffle.

In the default RDD shuffle is the SortShuffleManager. For a sort the user is able to control how the data is serialized, if the data is sorted or not before it is shuffled, and how the data is partitioned. All of this is controlled by a ShuffleDependency.

In the SQL shuffle for Spark this is all wrapped in/controlled by the ShuffleExchangeExec. It will perform the partitioning ahead …

View full answer

revans2 · 2020-09-28T14:14:31Z

revans2
Sep 28, 2020
Maintainer

@andygrove please correct me if I get anything wrong in relation to AQE.
@abellina please correct anything I get wrong for the UCX based shuffle.

There are two different shuffle instances in the plugin.

The first one is based off of Spark's sql shuffle. The SQL shuffle is in turn based off of the RDD shuffle.

In the default RDD shuffle is the SortShuffleManager. For a sort the user is able to control how the data is serialized, if the data is sorted or not before it is shuffled, and how the data is partitioned. All of this is controlled by a ShuffleDependency.

In the SQL shuffle for Spark this is all wrapped in/controlled by the ShuffleExchangeExec. It will perform the partitioning ahead of time and output an RDD[(Int, UnsafeRow)] to the shuffle, where the int is the partition the data is supposed to go to, it also disables the sort functionality in shuffle. The shuffle will then get the partition number for each entry and send it to the serializer. The serializer will then throw away the partition int and write out the UnsafeRow prepended with a length. There have been a few changes on the backend for this but generally this then goes through compression and is written out as a part of a stream with one file per downstream partition. After the task completes all of these small files and read back in and written out as a single file. This is to reduce the total number of files.

The ShuffledRowRDD is just a special RDD that helps with all of this.

On the reading side a thread pool is launched to try and read in the different partitions. This glosses over the meta-data exchange that happens so the readers know where the data is. But the thread pool will read in different batches and place them in a queue to be consumed. It is rather complex because there is throttling involved to avoid DDOS attacks on the servers involved, and there is failure and retry logic on the shuffles as well. Essentially the data is read back in, deserialized, and sent to a Queue that an RDD wraps so the rest of the down stream processing can consume the data.

When AQE is enabled a single shuffle may need to be modified. A regular shuffle might turn into a broadcast shuffle, of skewed joins might cause some parts of the shuffle to act like a broadcast while others act more like a regular shuffle. This is all handled by CustomShuffleReaderExec

The GPU versions of all of these act in the same way.

GpuShuffleExchangeExec takes the place of ShuffleExchangeExec. It uses GpuPartitioning.scala to partition the data as desired by the SQL query, along with ShuffledBatchRDD and the GpuColumnarBatchSerializer. A GpuCustomShuffleReaderExec handles AQE transformations of the shuffle data as needed.

The main differences that you have to be aware of are.

We are shuffling columnar data instead of row based data so it is an RDD[Int, ColumnarBatch) that we output instead of UnsafeRow. This also means that metrics for the shuffle will look differently because the data will compress differently, etc.
We need to think a lot about data movement between the CPU and the GPU. Partitioning the data often results in really small chunks. If the data is also already columnar, and we have a separate memory buffer per column, we can end up with a very large number of memory copy calls to move data between the CPU and the GPU, which results in a lot of overhead. So instead we copy the data to the host first, which is on the order of O(number of columns), and then slice the data on the CPU as we serialize it. This is handled by using the class SlicedGpuColumnVector, which despite its name is already on the CPU. It was named before we we did a lot of refactoring and should be renamed. When reading the data we have separated out the metadata from the actual data payload so copying it to the GPU is on the order of O(number of partitions). Which is not ideal, but it does work fairly well.
We try to avoid overloading the GPU. This is the GpuSemaphore that we are trying to improve how it works in relation to fetching shuffle data, and there is a lot more work we need to put into this.
We try to make downstream processing more efficient. This is GpuCoalesceBatches After a shuffle the data is in relatively small batches. So we stitch them back together into larger batches so we can process them with less overhead. We have been through a few iterations on this, but right now it is trying to hit a target data size. It also serves a dual role with the other shuffle implementation. If we didn't do any batch coalesce the batch sizes would get exponentially smaller after each shuffle stage until we were operating on them a row at a time.

The second shuffle implementation is based off UCX and tries to avoid the CPU whenever possible. It replaces the default SortShuffleManager with our own, but it talks with the GpuShuffleExechangeExec to know if an RDD is for data on the GPU or on the CPU and falls back to SortShuffleManager for all CPU data.

For the GPU data it bypasses the serialization, writing the data to disk multiple times, and the compression. Instead it will split the data up into ContiguousTables as a part of the partitioning. This is the one place where which shuffle manager we are using matters to the GpuShuffleExchangeExec. After that the shuffle tries to keep the data cached in GPU memory. If we run out of GPU memory then we will spill to host memory, and if we hit limits there, then we start to spill to disk. We also will transfer the data using UCX. This lets us do RDMA transfers where ever possible, and even use NVLink connections where available.

We have recently started to add in GPU compression support for this too. We batch compress the outgoing buffers before registering them with a GPU cache manager, similar in concept to the CPU version in Spark, but implemented very differently. The main thing to be aware of here is that the GPU does compression/decompression well if there is a lot of data to process. It is not as good with small amounts of data. For this reason we have opted to not decompress the data until it gets to the first GpuCoalesceBatch instance. In these cases the compressed data is represented using a GpuCompressedColumnVector This lets us batch decompress the data instead of trying to do each small part separately. But to make this work we have to guarantee that there will always be a coalesce right after the shuffle.

It is a bit ugly and could use some clean up especially because of the coupling between the different parts to work properly.

I hope that this helps. At some point when the design is in less flux I need to write this up with diagrams and all so it can be a part of our documentation.

0 replies

abellina · 2020-09-28T15:24:18Z

abellina
Sep 28, 2020
Collaborator

Looks good to me @revans2

0 replies

andygrove · 2020-09-28T16:06:47Z

andygrove
Sep 28, 2020

Looks good to me too @revans2

0 replies

sameerz · 2020-09-29T20:27:17Z

sameerz
Sep 29, 2020
Maintainer

Please reopen if you still have questions.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[QST] I need someone to help explain the shuffle read code in rapids (thanks very much) #5384

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 4 comments

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

[QST] I need someone to help explain the shuffle read code in rapids (thanks very much) #5384

JustPlay Sep 28, 2020

Replies: 4 comments

revans2 Sep 28, 2020 Maintainer

abellina Sep 28, 2020 Collaborator

andygrove Sep 28, 2020

sameerz Sep 29, 2020 Maintainer

JustPlay
Sep 28, 2020

revans2
Sep 28, 2020
Maintainer

abellina
Sep 28, 2020
Collaborator

andygrove
Sep 28, 2020

sameerz
Sep 29, 2020
Maintainer