[SPARK-34794][SQL] Fix lambda variable name issues in nested DataFrame functions #32424

maropu · 2021-05-03T15:03:01Z

What changes were proposed in this pull request?

To fix lambda variable name issues in nested DataFrame functions, this PR modifies code to use a global counter for LambdaVariables names created by higher order functions.

This is the rework of #31887. Closes #31887.
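A minimal sketch of the global-counter idea, assuming a JVM-wide `AtomicInteger`. The object name `FreshLambdaNames` and the method `freshName` are hypothetical and are not taken from the Spark code base; only the use of an `AtomicInteger` and the `x_0` / `x_1` style of names come from this PR.

```scala
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical helper, not Spark's actual code: hands out lambda variable
// names that are unique within the JVM by appending a global counter value.
object FreshLambdaNames {
  private val nextId = new AtomicInteger(0)

  // Successive calls with the same prefix return distinct names,
  // for example "x_0" and then "x_1" on a fresh counter.
  def freshName(prefix: String): String =
    s"${prefix}_${nextId.getAndIncrement()}"
}
```

Because every call consumes a new counter value, an inner lambda can no longer receive the same variable name as the lambda that encloses it.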

Why are the changes needed?

This moves away from the current hard-coded variable names, which break on nested function calls. There is currently a bug where nested transforms in particular fail: the inner lambda variable shadows the outer one.

For this query:

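import org.apache.spark.sql.Column
import org.apache.spark.sql.{functions => f}
// Assumes a spark-shell session, where spark.implicits._ (toDF, $"...") is already in scope.

val df = Seq(
    (Seq(1, 2, 3), Seq("a", "b", "c"))
).toDF("numbers", "letters")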

df.select(
    f.flatten(
        f.transform(
            $"numbers",
            (number: Column) => { f.transform(
                $"letters",
                (letter: Column) => { f.struct(
                    number.as("number"),
                    letter.as("letter")
                ) }
            ) }
        )
    ).as("zipped")
).show(10, false)

This is the current (incorrect) output:

+------------------------------------------------------------------------+
|zipped                                                                  |
+------------------------------------------------------------------------+
|[{a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}]|
+------------------------------------------------------------------------+

And this is the correct output after fix:

+------------------------------------------------------------------------+
|zipped                                                                  |
+------------------------------------------------------------------------+
|[{1, a}, {1, b}, {1, c}, {2, a}, {2, b}, {2, c}, {3, a}, {3, b}, {3, c}]|
+------------------------------------------------------------------------+
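The two outputs above can be reproduced with a toy model in plain Scala. This is only an illustration of name-based shadowing under the assumption that a lambda body looks its variables up by name; `ShadowingDemo`, `zip`, and the `Env` alias are hypothetical and do not reflect how Spark resolves lambda variables internally.

```scala
object ShadowingDemo {
  // Toy model (not Spark code): bindings visible to a lambda body, keyed by name.
  type Env = Map[String, Any]

  // Evaluates the nested transform. outerName and innerName are the names
  // assigned to the two lambda variables; the struct body reads both by name.
  def zip(
      numbers: Seq[Int],
      letters: Seq[String],
      outerName: String,
      innerName: String): Seq[(Any, Any)] =
    numbers.flatMap { n =>
      val outerEnv: Env = Map(outerName -> n)
      letters.map { l =>
        // If both names are equal, the inner binding overwrites the outer one.
        val innerEnv: Env = outerEnv + (innerName -> l)
        (innerEnv(outerName), innerEnv(innerName))
      }
    }
}
```

Calling `ShadowingDemo.zip(Seq(1, 2, 3), Seq("a", "b", "c"), "x", "x")` yields the `(a, a), (b, b), (c, c)` pattern of the incorrect output, while distinct names such as `"x_0"` and `"x_1"` yield the expected pairs.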

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added a new test to DataFrameFunctionsSuite.

maropu · 2021-05-03T23:41:08Z

cc: @HyukjinKwon @ueshin

ueshin

LGTM.

ueshin · 2021-05-04T18:15:16Z

sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala

+        transform($"letters", (letter: Column) =>
+          struct(number, letter))))),
+        Seq(Row(Seq(Row(1, "a"), Row(1, "b"), Row(2, "a"), Row(2, "b"))))


nit: style. should be 2-space indent?

SparkQA · 2021-05-05T01:59:14Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42672/

SparkQA · 2021-05-05T01:59:15Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42672/

…e functions

Closes #32424 from maropu/pr31887.

Lead-authored-by: dsolow
Co-authored-by: Takeshi Yamamuro
Co-authored-by: dmsolow
Signed-off-by: Takeshi Yamamuro
(cherry picked from commit f550e03)
Signed-off-by: Takeshi Yamamuro

maropu · 2021-05-05T03:47:43Z

GA passed. Merged to master/3.1/3.0. Thank you for the review, @ueshin ~

SparkQA · 2021-05-05T05:08:35Z

Test build #138151 has finished for PR 32424 at commit 0164e0f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-05-05T06:26:31Z

Test build #138165 has finished for PR 32424 at commit ea961ee.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2021-05-12T03:48:20Z

Why is it a problem only in the Scala API? How about the SQL API?

HyukjinKwon · 2021-05-12T03:59:36Z

BTW, it has the same problem in Python and R too. @ueshin and I are working on them as follow-ups.

HyukjinKwon · 2021-05-12T04:01:07Z

Since R and Python ones are merged into 3.1, I will create separate JIRAs:

https://issues.apache.org/jira/browse/SPARK-35381
https://issues.apache.org/jira/browse/SPARK-35382

maropu · 2021-05-12T04:47:16Z

Why is it a problem only in the Scala API? How about the SQL API?

In SQL, the user-specified parameter names are used as they are, so the same issue cannot happen:

scala> val df = Seq((Seq(1,2,3), Seq("a", "b", "c"))).toDF("numbers", "letters")
scala> df.selectExpr("""
     |     FLATTEN(
     |         TRANSFORM(
     |             numbers,
     |             number -> TRANSFORM(
     |                 letters,
     |                 letter -> (number AS number, letter AS letter)
     |             )
     |         )
     |     ) AS zipped
     | """).explain(true)

== Analyzed Logical Plan ==
zipped: array<struct<number:int,letter:string>>
Project [flatten(transform(numbers#7, lambdafunction(transform(letters#8, lambdafunction(named_struct(number, lambda number#14, letter, lambda letter#15), lambda letter#15, false)), lambda number#14, false))) AS zipped#13]
                                                                                                                                                           ^^^^^^^^^^^^^^^^^^          ^^^^^^^^^^^^^^^
+- Project [_1#2 AS numbers#7, _2#3 AS letters#8]
   +- LocalRelation [_1#2, _2#3]

On the other hand, in the DataFrame APIs, the same parameter names (x, y, and z) were used in every lambda function, so name conflicts could happen:

scala> df.select(
     |     flatten(
     |         transform(
     |             $"numbers",
     |             (number: Column) => { transform(
     |                 $"letters",
     |                 (letter: Column) => { struct(
     |                     number.as("number"),
     |                     letter.as("letter")
     |                 ) }
     |             ) }
     |         )
     |     ).as("zipped")
     | ).explain(true)

== Analyzed Logical Plan ==
zipped: array<struct<number:int,letter:string>>
Project [flatten(transform(numbers#7, lambdafunction(transform(letters#8, lambdafunction(struct(number, lambda x_0#20, letter, lambda x_1#21), lambda x_1#21, false)), lambda x_0#20, false))) AS zipped#19]
                                                                                                                                               ^^^^^^^^^^^^^^          ^^^^^^^^^^^^^^^
+- Project [_1#2 AS numbers#7, _2#3 AS letters#8]
   +- LocalRelation [_1#2, _2#3]

maropu · 2021-05-12T04:47:46Z

BTW, it has the same problem in Python and R too. @ueshin and I are working on them as follow-ups.

Ur, I missed that. Thank you, @HyukjinKwon @ueshin

…er functions at R APIs

### What changes were proposed in this pull request?

This PR fixes the same issue as #32424

```r
df <- sql("SELECT array(1, 2, 3) as numbers, array('a', 'b', 'c') as letters")
collect(select(
  df,
  array_transform("numbers", function(number) {
    array_transform("letters", function(latter) {
      struct(alias(number, "n"), alias(latter, "l"))
    })
  })
))
```

**Before:**

```
... a, a, b, b, c, c, a, a, b, b, c, c, a, a, b, b, c, c
```

**After:**

```
... 1, a, 1, b, 1, c, 2, a, 2, b, 2, c, 3, a, 3, b, 3, c
```

### Why are the changes needed?

To produce the correct results.

### Does this PR introduce _any_ user-facing change?

Yes, it fixes the results to be correct as mentioned above.

### How was this patch tested?

Manually tested as above, and unit test was added.

Closes #32517 from HyukjinKwon/SPARK-35381.

Authored-by: Hyukjin Kwon
Signed-off-by: Hyukjin Kwon

…er functions at R APIs

Closes #32517 from HyukjinKwon/SPARK-35381.

Authored-by: Hyukjin Kwon
Signed-off-by: Hyukjin Kwon
(cherry picked from commit ecb48cc)
Signed-off-by: Hyukjin Kwon

…rame functions in Python APIs

### What changes were proposed in this pull request?

This PR fixes the same issue as #32424.

```py
from pyspark.sql.functions import flatten, struct, transform
df = spark.sql("SELECT array(1, 2, 3) as numbers, array('a', 'b', 'c') as letters")
df.select(flatten(
    transform(
        "numbers",
        lambda number: transform(
            "letters",
            lambda letter: struct(number.alias("n"), letter.alias("l"))
        )
    )
).alias("zipped")).show(truncate=False)
```

**Before:**

```
+------------------------------------------------------------------------+
|zipped                                                                  |
+------------------------------------------------------------------------+
|[{a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}]|
+------------------------------------------------------------------------+
```

**After:**

```
+------------------------------------------------------------------------+
|zipped                                                                  |
+------------------------------------------------------------------------+
|[{1, a}, {1, b}, {1, c}, {2, a}, {2, b}, {2, c}, {3, a}, {3, b}, {3, c}]|
+------------------------------------------------------------------------+
```

### Why are the changes needed?

To produce the correct results.

### Does this PR introduce _any_ user-facing change?

Yes, it fixes the results to be correct as mentioned above.

### How was this patch tested?

Added a unit test as well as manually.

Closes #32523 from ueshin/issues/SPARK-35382/nested_higher_order_functions.

Authored-by: Takuya UESHIN
Signed-off-by: Hyukjin Kwon

…rame functions in Python APIs

Closes #32523 from ueshin/issues/SPARK-35382/nested_higher_order_functions.

Authored-by: Takuya UESHIN
Signed-off-by: Hyukjin Kwon
(cherry picked from commit 17b59a9)
Signed-off-by: Hyukjin Kwon

…e functions

Closes apache#32424 from maropu/pr31887.

Lead-authored-by: dsolow
Co-authored-by: Takeshi Yamamuro
Co-authored-by: dmsolow
Signed-off-by: Takeshi Yamamuro
(cherry picked from commit f550e03)
Signed-off-by: Takeshi Yamamuro
(cherry picked from commit 6df4ec0)
Signed-off-by: Dongjoon Hyun


dsolow and others added 4 commits May 3, 2021 22:09

Add AtomicInteger to make var names unique

f1bc30d

added test to verify nested transform

2cd874a

fixed typo

e9a398b

Fix

0164e0f

github-actions bot added the SQL label May 3, 2021

ueshin approved these changes May 4, 2021

review

ea961ee

maropu closed this in f550e03 May 5, 2021

HyukjinKwon mentioned this pull request May 12, 2021

[SPARK-35381][R] Fix lambda variable name issues in nested higher order functions at R APIs #32517

Closed

ueshin mentioned this pull request May 12, 2021

[SPARK-35382][PYTHON] Fix lambda variable name issues in nested DataFrame functions in Python APIs. #32523

Closed

viirya mentioned this pull request Jun 3, 2021

[SPARK-35580][SQL] Implement canonicalized method for HigherOrderFunction #32735

Closed

beliefer mentioned this pull request Mar 7, 2023

[SPARK-42562][CONNECT] UnresolvedNamedLambdaVariable in python do not need unique names #40287

Closed
