
Fix collect_set_on_nested_type tests failed #8783

Merged — 6 commits merged into NVIDIA:branch-23.08 on Jul 28, 2023

Conversation

thirtiseven
Collaborator

Fixes #8716

The collect_set function currently does not support NaNs in struct[Array(Double)] or struct[Array(Float)] types: it treats NaNs as not equal to each other, see #6079.

However, the current integration tests can generate such data by chance, leading to test failures.

This PR prevents datagen from generating NaNs when testing `collect_set` on nested types.
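The idea behind the fix can be sketched roughly as follows. Note that `gen_float_no_nans` is a hypothetical stand-in for the integration tests' no-NaN float data generator, not the actual spark-rapids datagen code:

```python
import random

# Illustrative sketch: draw floats from a sampler that never emits NaN,
# so nested collect_set inputs stay NaN-free during testing.
# This is a stand-in for the real datagen, not the plugin's test code.
def gen_float_no_nans(rng):
    while True:
        v = rng.uniform(-1e9, 1e9)
        if v == v:  # NaN is the only float for which v != v
            return v

rng = random.Random(42)
values = [gen_float_no_nans(rng) for _ in range(1000)]
assert all(v == v for v in values)  # no NaNs generated
```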

@thirtiseven thirtiseven requested a review from ttnghia July 24, 2023 09:03
@thirtiseven thirtiseven self-assigned this Jul 24, 2023
@thirtiseven
Collaborator Author

build

@ttnghia
Collaborator

ttnghia commented Jul 24, 2023

What will happen if we feed in struct[Array(Double)] or struct[Array(Float)] types? Will it fall back to the CPU, or something else?

Contributor

@jlowe jlowe left a comment


This doesn't look like the proper fix to resolve the test failure. The test is failing because we don't match the CPU for some corner cases. The fix isn't to ignore those corner cases in the test but rather to correct the code so those corner cases can pass the test. If we cannot fix the corner cases in a timely manner, then we need to fall back to the CPU for those corner cases. In this case, that means falling back to the CPU if the nested types contain floats or doubles anywhere in the type tree.
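The fallback condition described here (a float or double anywhere in the type tree) can be sketched as a recursive check. `TypeNode` below is a hypothetical stand-in for Spark's `DataType` hierarchy, not the actual plugin code:

```python
from dataclasses import dataclass, field

# Minimal model of a nested SQL type tree; a stand-in for Spark's DataType.
@dataclass
class TypeNode:
    name: str                      # e.g. "struct", "array", "double", "int"
    children: list = field(default_factory=list)

def contains_floats(t: TypeNode) -> bool:
    """Return True if a float/double appears anywhere in the type tree."""
    if t.name in ("float", "double"):
        return True
    return any(contains_floats(c) for c in t.children)

# struct<array<double>> should trigger a CPU fallback
nested = TypeNode("struct", [TypeNode("array", [TypeNode("double")])])
assert contains_floats(nested)
assert not contains_floats(TypeNode("array", [TypeNode("int")]))
```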

@thirtiseven
Collaborator Author

This doesn't look like the proper fix to resolve the test failure. The test is failing because we don't match the CPU for some corner cases.

OK, thanks. I didn't notice that the hasNans config from the original PR had been removed, so it really is a bug now.

As this comment said, for non-nested floats and doubles, NaN values are considered unequal, but when collecting sets of nested arrays, the CPU treats NaNs as equal.

This incompatibility was controlled by spark.rapids.sql.hasNans, but we removed this config later.

What will happen if we feed in struct[Array(Double)] or struct[Array(Float)] types? Will it fall back to the CPU, or something else?

So right now collect_set produces different results from the CPU, without any fallback.

I haven't found out why Spark behaves like this yet; the Spark code for collect_set looks like it simply adds data into a HashSet and then converts it to an Array. I guess it is due to some special handling of NaNs in Spark. If it is complicated to follow or needs cuDF support, I will fall back to the CPU in this case and file another issue.
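The two comparison behaviors in play (NaN unequal to itself under IEEE 754 comparison, versus set semantics that treat identical NaNs as equal) can be illustrated in plain Python. This is a sketch of the semantics only, not Spark's actual code path:

```python
import struct

a, b = float("nan"), float("nan")

# IEEE 754 comparison: NaN is never equal to anything, not even itself.
assert a != b and a != a

# A bitwise comparison treats NaNs with identical payloads as equal, which
# is how set-style deduplication can collapse NaNs in nested data.
def bits(x):
    return struct.pack(">d", x)

assert bits(a) == bits(b)
```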

@thirtiseven
Collaborator Author

Some test results:

```
>>> from pyspark.sql.functions import collect_set
>>> df1 = spark.createDataFrame([(1.0,), (float("nan"),), (float("nan"),)], ["value"])
>>> df1.agg(collect_set("value")).show(truncate=False)
23/07/25 18:07:49 WARN GpuOverrides:
*Exec <ProjectExec> will run on GPU
  *Expression <Alias> cast(collect_set(value)#4 as string) AS collect_set(value)#7 will run on GPU
    *Expression <Cast> cast(collect_set(value)#4 as string) will run on GPU
  *Exec <ObjectHashAggregateExec> will run on GPU
    *Expression <AggregateExpression> collect_set(value#0, 0, 0) will run on GPU
      *Expression <CollectSet> collect_set(value#0, 0, 0) will run on GPU
    *Expression <Alias> collect_set(value#0, 0, 0)#2 AS collect_set(value)#4 will run on GPU
    *Exec <ShuffleExchangeExec> will run on GPU
      *Partitioning <SinglePartition$> will run on GPU
      *Exec <ObjectHashAggregateExec> will run on GPU. The data type of following expressions will be converted in GPU runtime: buf#10: Converted BinaryType to ArrayType(DoubleType,false)
        *Expression <AggregateExpression> partial_collect_set(value#0, 0, 0) will run on GPU
          *Expression <CollectSet> collect_set(value#0, 0, 0) will run on GPU
        ! <RDDScanExec> cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.RDDScanExec
          @Expression <AttributeReference> value#0 could run on GPU

+------------------+
|collect_set(value)|
+------------------+
|[NaN, NaN, 1.0]   |
+------------------+

>>> df2 = spark.createDataFrame([([1.0, float("nan")],), ([1.0, float("nan")],), ([1.0, 2.0],)], ["value"])
>>> df2.agg(collect_set("value")).show(truncate=False)
23/07/25 18:07:57 WARN GpuOverrides:
*Exec <ProjectExec> will run on GPU
  *Expression <Alias> cast(collect_set(value)#28 as string) AS collect_set(value)#31 will run on GPU
    *Expression <Cast> cast(collect_set(value)#28 as string) will run on GPU
  *Exec <ObjectHashAggregateExec> will run on GPU
    *Expression <AggregateExpression> collect_set(value#24, 0, 0) will run on GPU
      *Expression <CollectSet> collect_set(value#24, 0, 0) will run on GPU
    *Expression <Alias> collect_set(value#24, 0, 0)#26 AS collect_set(value)#28 will run on GPU
    *Exec <ShuffleExchangeExec> will run on GPU
      *Partitioning <SinglePartition$> will run on GPU
      *Exec <ObjectHashAggregateExec> will run on GPU. The data type of following expressions will be converted in GPU runtime: buf#34: Converted BinaryType to ArrayType(ArrayType(DoubleType,true),false)
        *Expression <AggregateExpression> partial_collect_set(value#24, 0, 0) will run on GPU
          *Expression <CollectSet> collect_set(value#24, 0, 0) will run on GPU
        ! <RDDScanExec> cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.RDDScanExec
          @Expression <AttributeReference> value#24 could run on GPU

+------------------------------------+
|collect_set(value)                  |
+------------------------------------+
|[[1.0, NaN], [1.0, 2.0], [1.0, NaN]]|
+------------------------------------+

>>> spark.conf.set("spark.rapids.sql.enabled", "false")
>>> df1.agg(collect_set("value")).show(truncate=False)
+------------------+
|collect_set(value)|
+------------------+
|[1.0, NaN, NaN]   |
+------------------+

>>> df2.agg(collect_set("value")).show(truncate=False)
+------------------------+
|collect_set(value)      |
+------------------------+
|[[1.0, 2.0], [1.0, NaN]]|
+------------------------+
```
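The CPU results above can be imitated in plain Python: deduplicating nested arrays while treating NaN as equal to NaN yields two rows instead of three. This is an illustrative model of the observed semantics, not Spark's implementation:

```python
import math

def dedup_arrays(rows):
    # Build hashable keys where every NaN maps to one canonical token,
    # imitating NaN-equal set semantics over nested arrays.
    seen, out = set(), []
    for arr in rows:
        key = tuple("NaN" if math.isnan(v) else v for v in arr)
        if key not in seen:
            seen.add(key)
            out.append(arr)
    return out

rows = [[1.0, float("nan")], [1.0, float("nan")], [1.0, 2.0]]
assert len(dedup_arrays(rows)) == 2  # [1.0, NaN] is kept only once
```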

@sameerz sameerz added the bug Something isn't working label Jul 25, 2023
@thirtiseven
Collaborator Author

Updated the code to fall back to the CPU if there is a double/float anywhere in the nested type, and filed issue #8808.

```python
@allow_non_gpu('ObjectHashAggregateExec', 'ShuffleExchangeExec', 'CollectSet')
@pytest.mark.parametrize('data_gen', _gen_data_for_collect_set_op_floats, ids=idfn)
def test_hash_groupby_collect_set_fallback_on_nested_floats(data_gen):
    assert_gpu_and_cpu_are_equal_collect(
```
Collaborator


Please use assert_gpu_fallback_collect instead. It will help verify that we actually fell back on what we expect.

Collaborator Author


Ok, done.

```python
@allow_non_gpu('ObjectHashAggregateExec', 'ShuffleExchangeExec', 'CollectSet')
@pytest.mark.parametrize('data_gen', _gen_data_for_collect_set_op_floats, ids=idfn)
def test_hash_reduction_collect_set_fallback_on_nested_floats(data_gen):
    assert_gpu_and_cpu_are_equal_collect(
```
Collaborator


Same comment here. If the test is for a fallback, we should use the fallback verification API.

```python
('c_struct_array_2', RepeatSeqGen(StructGen([
    ['c0', struct_array_gen_no_nans], ['c1', int_gen]]), length=14)),
('c_array_struct', RepeatSeqGen(ArrayGen(all_basic_struct_gen_no_nan), length=15)),
    ['c0', struct_array_gen_no_floats], ['c1', int_gen]]), length=14)),
```
Collaborator


nit: Can we verify that we fall back for window operations too?

Collaborator Author


Added.

@thirtiseven
Collaborator Author

build

Signed-off-by: Haoyang Li <[email protected]>
@thirtiseven
Collaborator Author

build

@jlowe jlowe merged commit dc3d67e into NVIDIA:branch-23.08 Jul 28, 2023
thirtiseven added a commit to thirtiseven/spark-rapids that referenced this pull request Aug 2, 2023
thirtiseven added a commit to thirtiseven/spark-rapids that referenced this pull request Aug 2, 2023
jlowe pushed a commit that referenced this pull request Aug 9, 2023
* Match Spark's NaN handling in collect_set

Signed-off-by: Haoyang Li <[email protected]>

* Revert "Fix collect_set_on_nested_type tests failed (#8783)"

* clean up and add comments

* remove xfail

* update cudfmergesets too

* edit comment

* remove related nan datagens

* clean up

* clean up


Signed-off-by: Haoyang Li <[email protected]>
@thirtiseven thirtiseven deleted the collect_set_fix branch August 18, 2023 02:43
Labels
bug Something isn't working

Successfully merging this pull request may close these issues.

[BUG] test_hash_groupby_collect_set_on_nested_type and test_hash_reduction_collect_set_on_nested_type failed
5 participants