Fix two potential OOM issues in GPU aggregate. #11908
base: branch-25.02
Conversation
The first fix takes nested literals into account when calculating the output size for pre-split. The second fix uses the correct size for the buffer size comparison when collecting the next bundle of batches in aggregate. Signed-off-by: Firestarman <[email protected]>
currentSize += bucket.map(_.sizeInBytes).sum
toAggregateBuckets += bucket
var keepGoing = true
while (batchesByBucket.nonEmpty && keepGoing) {
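For context, a minimal, self-contained sketch of the corrected pattern; `BatchLike`, `collectNextBundle`, and the byte budget are illustrative stand-ins, not the plugin's actual types:

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for a spillable batch; only its byte size matters here.
case class BatchLike(sizeInBytes: Long)

// Collect buckets until the accumulated *byte* size would exceed the budget.
def collectNextBundle(
    batchesByBucket: ArrayBuffer[ArrayBuffer[BatchLike]],
    targetMergeBatchSize: Long): ArrayBuffer[ArrayBuffer[BatchLike]] = {
  val toAggregateBuckets = ArrayBuffer.empty[ArrayBuffer[BatchLike]]
  var currentSize = 0L
  var keepGoing = true
  while (batchesByBucket.nonEmpty && keepGoing) {
    // The bug was comparing an element count (last.size()) against a byte
    // budget; the fix sums the actual buffer sizes in bytes instead.
    val bucketBytes = batchesByBucket.last.map(_.sizeInBytes).sum
    keepGoing = currentSize + bucketBytes <= targetMergeBatchSize
    if (keepGoing) {
      currentSize += bucketBytes
      toAggregateBuckets += batchesByBucket.remove(batchesByBucket.size - 1)
    }
  }
  toAggregateBuckets
}
```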
We may need a separate PR for this, as it is unrelated to the description in #11903.
Thx for review. @binmahone Could you help file an issue for this? Then I will follow your suggestion.
You can also consider modifying the description in #11903 :-)
Updated
case ArrayType(elemType, hasNullElem) =>
  val numElems = pickRowNum(litVal.asInstanceOf[ArrayData].numElements())
  // A GPU array literal requires only one column as the child
  estimateLitAdditionSize(hasNullElem, hasOffset(elemType), numElems)
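For reference, a hedged sketch of what such a metadata-size estimate computes, following the cuDF column layout (a validity bitmask is roughly one bit per row; offset buffers carry one extra 4-byte entry). The helper name and the absence of padding are illustrative simplifications, not the plugin's actual implementation:

```scala
// Illustrative estimate of the extra (non-data) buffers a column needs:
// a validity bitmask when nullable, and INT32 offsets for variable-width
// or list-like data. Real cuDF pads these buffers; this sketch does not.
def estimateMetaSize(nullable: Boolean, hasOffsets: Boolean, numRows: Int): Long = {
  val validityBytes = if (nullable) (numRows + 7) / 8 else 0   // ~1 bit per row
  val offsetBytes = if (hasOffsets) (numRows + 1) * 4L else 0L // INT32 offsets
  validityBytes + offsetBytes
}
```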
Is there a risk of overestimation here?
My concern is more about accuracy. For fixed-width types we were able to estimate the size almost exactly. For items with offsets we could not, and we estimated values more on the small end of things. We were being conservative. For literal values we should be able to get a value that is almost exactly the right size. We can test this for many different literal values and see how close we end up getting.

My concern is for highly nested types. We recurse for `calcLitValueSize`, but not for `estimateLitAdditionSize`. So if we have literal values that are highly nested (like an array of arrays), then the estimate is going to be wrong.
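To make the concern concrete, a small worked example with assumed values: for a literal like `Array(Array(1L, 2L), Array(3L))` repeated over many output rows, a non-recursive estimate counts the outer list's offsets but misses the inner list column's offsets entirely:

```scala
// Assumed: 1000 output rows, each repeating a 2-element array-of-arrays literal.
val numRows = 1000
val outerOffsetBytes = (numRows + 1) * 4L   // seen by a flat estimate
val innerRows = numRows * 2                 // two inner arrays per output row
val innerOffsetBytes = (innerRows + 1) * 4L // missed without recursion
println(s"flat estimate misses ~$innerOffsetBytes bytes of inner offsets")
```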
I have done the refactor to calculate the size more exactly. I followed the type definitions from https://github.com/rapidsai/cudf/blob/a0487be669326175982c8bfcdab4d61184c88e27/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md#list-columns
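From that guide's LIST layout, a list column's size can be computed bottom-up; a worked example, assuming `LIST<INT32>` with rows `[[1,2],[3],[]]`:

```scala
// LIST<INT32> with rows [[1,2],[3],[]]: 3 rows, 3 total elements.
//   offsets: (rows + 1) INT32s -> [0, 2, 3, 3] = 4 * 4 bytes = 16
//   child:   one INT32 column holding all elements = 3 * 4 bytes = 12
//   validity (only if nullable): ceil(rows / 8) bytes, padded in practice
val rows = 3
val elems = 3
val offsetBytes = (rows + 1) * 4L
val childBytes = elems * 4L
assert(offsetBytes + childBytes == 28L)
```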
build
This needs a lot of testing in different cases to be sure that it works properly. Especially for nested types.
val pickRowNum: Int => Int = rowNum => if (litVal == null) 0 else rowNum
litType match {
  case ArrayType(elemType, hasNullElem) =>
    val numElems = pickRowNum(litVal.asInstanceOf[ArrayData].numElements())
We need a null check here for a null array literal value.
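A minimal sketch of the null-safe guard being requested (the name is illustrative; the actual fix landed as part of the refactor mentioned below):

```scala
import org.apache.spark.sql.catalyst.util.ArrayData

// Only touch ArrayData when the literal value is non-null; a null array
// literal contributes zero elements.
def safeNumElements(litVal: Any): Int =
  if (litVal == null) 0 else litVal.asInstanceOf[ArrayData].numElements()
```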
Updated by the refactor.
    estimateLitAdditionSize(f.nullable, hasOffset(f.dataType), childrenNumRows)
  ).sum
case MapType(keyType, valType, hasNullValue) =>
  val mapRowsNum = pickRowNum(litVal.asInstanceOf[MapData].numElements())
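For context on the map case: cuDF represents a map as `LIST<STRUCT<key, value>>`, so a map literal's size combines list-level offsets with the key and value child columns. A hedged sketch of that accounting, with illustrative names and without cuDF's buffer padding:

```scala
// Per the LIST<STRUCT<key, value>> representation: one list row per output
// row, plus validity for the value child when values may be null. Key and
// value data sizes are computed separately by the recursive value-size pass.
def mapLiteralMetaSize(entriesPerRow: Int, numRows: Int, hasNullValue: Boolean): Long = {
  val listOffsetBytes = (numRows + 1) * 4L
  val totalEntries = entriesPerRow * numRows
  val valueValidityBytes = if (hasNullValue) (totalEntries + 7) / 8 else 0
  listOffsetBytes + valueValidityBytes
}
```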
This too needs a check for litVal being null before this runs.
Updated by the refactor.
case StringType => lit.asInstanceOf[UTF8String].numBytes()
case BinaryType => lit.asInstanceOf[Array[Byte]].length
case ArrayType(elemType, _) =>
  lit.asInstanceOf[ArrayData].array.map(calcLitValueSize(_, elemType)).sum
This appears to not work on UnsafeArrayData. If I test with the following, it fails. (I opened up the API to be public so I could test things manually.)

scala> PreProjectSplitIterator.calcMemorySizeForLiteral(ArrayData.toArrayData(Array(1L)), ArrayType(LongType), 100)
java.lang.UnsupportedOperationException: Not supported on UnsafeArrayData.
  at org.apache.spark.sql.catalyst.expressions.UnsafeArrayData.array(UnsafeArrayData.java:103)
  at com.nvidia.spark.rapids.PreProjectSplitIterator$.calcLitValueSize(basicPhysicalOperators.scala:381)
  at com.nvidia.spark.rapids.PreProjectSplitIterator$.calcMemorySizeForLiteral(basicPhysicalOperators.scala:328)
  ... 51 elided
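A hedged sketch of an implementation-agnostic traversal that avoids `ArrayData.array` (which `UnsafeArrayData` does not support) by using indexed access via `ArrayData.foreach`; `calcSize` stands in for the PR's recursive helper:

```scala
import org.apache.spark.sql.catalyst.util.ArrayData
import org.apache.spark.sql.types.DataType

// Works for GenericArrayData and UnsafeArrayData alike: foreach uses
// positional getters rather than materializing a backing array.
def sumElementSizes(arr: ArrayData, elemType: DataType,
    calcSize: (Any, DataType) => Long): Long = {
  var total = 0L
  arr.foreach(elemType, (_, elem) => total += calcSize(elem, elemType))
  total
}
```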
Nice catch, thx a lot, updated.
Thx for the review. I am doing a small refactor for the meta size calculation part... Will update it once done.
Signed-off-by: Firestarman <[email protected]>
Updated for early review; I will run some tests next.
Will work on this next.
Signed-off-by: Firestarman <[email protected]>
Signed-off-by: Firestarman <[email protected]>
Looks good to me. I still would like to see some tests (at least some simple unit tests) to validate that it is doing the right thing and that we are covering corner cases like nulls, and arrays of arrays.
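A sketch of the kind of corner-case checks being asked for, reusing the `calcMemorySizeForLiteral` entry point shown earlier in this thread (assuming it stays public as in the manual test above); exact byte counts are deliberately not asserted since padding rules may differ:

```scala
import org.apache.spark.sql.catalyst.util.ArrayData
import org.apache.spark.sql.types._

// Cover a null literal, a flat array, and an array of arrays.
val cases: Seq[(Any, DataType)] = Seq(
  (null, ArrayType(LongType)),
  (ArrayData.toArrayData(Array(1L, 2L, 3L)), ArrayType(LongType)),
  (ArrayData.toArrayData(Array(
      ArrayData.toArrayData(Array(1L)),
      ArrayData.toArrayData(Array(2L, 3L)))),
    ArrayType(ArrayType(LongType)))
)
cases.foreach { case (lit, t) =>
  val size = PreProjectSplitIterator.calcMemorySizeForLiteral(lit, t, 100)
  assert(size >= 0L, s"estimate for $t should be non-negative")
}
```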
Sure, working on the tests now...
Signed-off-by: Firestarman <[email protected]>
Added some tests, also found some bugs and fixed them.
build
Closes #11903
The first fix takes nested literals into account when calculating the output size for pre-split. See the linked issue above for more details.

The second fix uses the correct size for the buffer size comparison when collecting the next bundle of batches in aggregate. The value returned by `batchesByBucket.last.size()` is not the actual buffer size in bytes, but the number of elements in an array, so it cannot be used for the buffer size comparison.

I verified this PR locally with the toy query and it works well.