[SPARK-34794][SQL] Fix lambda variable name issues in nested DataFrame functions #32424
Conversation
cc: @HyukjinKwon @ueshin
LGTM.
transform($"letters", (letter: Column) =>
struct(number, letter))))),
Seq(Row(Seq(Row(1, "a"), Row(1, "b"), Row(2, "a"), Row(2, "b"))))
nit: style. should be 2-space indent?
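For reference, a hedged sketch of the full shape of the assertion the fragment above belongs to; it assumes a QueryTest-based suite (such as `DataFrameFunctionsSuite`) where `spark`, the test implicits (`toDF`, `$`), and `checkAnswer` are already in scope:

```scala
import org.apache.spark.sql.{Column, Row}
import org.apache.spark.sql.functions.{flatten, struct, transform}

// Nested transform: pair every element of "numbers" with every element of "letters".
val df = Seq((Seq(1, 2), Seq("a", "b"))).toDF("numbers", "letters")

checkAnswer(
  df.select(flatten(
    transform($"numbers", (number: Column) =>
      transform($"letters", (letter: Column) =>
        struct(number, letter))))),
  Seq(Row(Seq(Row(1, "a"), Row(1, "b"), Row(2, "a"), Row(2, "b")))))
```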
Kubernetes integration test starting
Kubernetes integration test status failure
…e functions

### What changes were proposed in this pull request?

To fix lambda variable name issues in nested DataFrame functions, this PR modifies code to use a global counter for `LambdaVariables` names created by higher order functions.

This is the rework of #31887. Closes #31887.

### Why are the changes needed?

This moves away from the current hard-coded variable names which break on nested function calls. There is currently a bug where nested transforms in particular fail (the inner variable shadows the outer variable).

For this query:

```
val df = Seq(
  (Seq(1,2,3), Seq("a", "b", "c"))
).toDF("numbers", "letters")

df.select(
  f.flatten(
    f.transform(
      $"numbers",
      (number: Column) => {
        f.transform(
          $"letters",
          (letter: Column) => {
            f.struct(
              number.as("number"),
              letter.as("letter")
            )
          }
        )
      }
    )
  ).as("zipped")
).show(10, false)
```

This is the current (incorrect) output:

```
+------------------------------------------------------------------------+
|zipped                                                                  |
+------------------------------------------------------------------------+
|[{a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}]|
+------------------------------------------------------------------------+
```

And this is the correct output after fix:

```
+------------------------------------------------------------------------+
|zipped                                                                  |
+------------------------------------------------------------------------+
|[{1, a}, {1, b}, {1, c}, {2, a}, {2, b}, {2, c}, {3, a}, {3, b}, {3, c}]|
+------------------------------------------------------------------------+
```

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added the new test in `DataFrameFunctionsSuite`.

Closes #32424 from maropu/pr31887.

Lead-authored-by: dsolow <[email protected]>
Co-authored-by: Takeshi Yamamuro <[email protected]>
Co-authored-by: dmsolow <[email protected]>
Signed-off-by: Takeshi Yamamuro <[email protected]>
(cherry picked from commit f550e03)
Signed-off-by: Takeshi Yamamuro <[email protected]>
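To make the "global counter" idea above concrete, here is a minimal, hypothetical sketch; the object and method names are illustrative and not necessarily the ones used in the actual patch:

```scala
import java.util.concurrent.atomic.AtomicLong

// Illustrative only: a process-wide counter appends a unique suffix to every
// generated lambda variable name, so an inner higher-order function can no
// longer shadow the variable of the one enclosing it.
object FreshLambdaNames {
  private val counter = new AtomicLong(0L)

  def freshVarName(base: String): String = s"${base}_${counter.getAndIncrement()}"
}

// The outer and inner lambdas now bind distinct names instead of both using "x":
val outerName = FreshLambdaNames.freshVarName("x")  // e.g. "x_0"
val innerName = FreshLambdaNames.freshVarName("x")  // e.g. "x_1"
```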
GA passed. Merged to master/3.1/3.0. Thank you for the review, @ueshin ~
Test build #138151 has finished for PR 32424 at commit
Test build #138165 has finished for PR 32424 at commit
Why is it a problem only in the Scala API? How about the SQL API?
BTW, it has the same problem in Python and R too. @ueshin and I are working on them as follow-ups.
Since the R and Python ones are merged into 3.1, I will create separate JIRAs: https://issues.apache.org/jira/browse/SPARK-35381
In SQL, since user-specified param names are used as they are, the same issue cannot happen. On the other hand, in DataFrame APIs the same hard-coded param names are reused for nested higher-order functions, which is what causes the collision.
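For example, a minimal sketch of the SQL form of the same nested transform, where the lambda parameter names are chosen by the user:

```scala
// Because the user names the parameters (`n` and `l`) explicitly, the inner
// lambda cannot shadow the outer one, so the SQL API is unaffected.
spark.sql("""
  SELECT flatten(
    transform(array(1, 2, 3), n ->
      transform(array('a', 'b', 'c'), l -> struct(n, l)))) AS zipped
""").show(false)
```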
Ur, I missed that. Thank you, @HyukjinKwon @ueshin
…er functions at R APIs

### What changes were proposed in this pull request?

This PR fixes the same issue as #32424:

```r
df <- sql("SELECT array(1, 2, 3) as numbers, array('a', 'b', 'c') as letters")
collect(select(
  df,
  array_transform("numbers", function(number) {
    array_transform("letters", function(latter) {
      struct(alias(number, "n"), alias(latter, "l"))
    })
  })
))
```

**Before:**

```
... a, a, b, b, c, c, a, a, b, b, c, c, a, a, b, b, c, c
```

**After:**

```
... 1, a, 1, b, 1, c, 2, a, 2, b, 2, c, 3, a, 3, b, 3, c
```

### Why are the changes needed?

To produce the correct results.

### Does this PR introduce _any_ user-facing change?

Yes, it fixes the results to be correct as mentioned above.

### How was this patch tested?

Manually tested as above, and unit test was added.

Closes #32517 from HyukjinKwon/SPARK-35381.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
…rame functions in Python APIs

### What changes were proposed in this pull request?

This PR fixes the same issue as #32424.

```py
from pyspark.sql.functions import flatten, struct, transform

df = spark.sql("SELECT array(1, 2, 3) as numbers, array('a', 'b', 'c') as letters")
df.select(flatten(
    transform(
        "numbers",
        lambda number: transform(
            "letters",
            lambda letter: struct(number.alias("n"), letter.alias("l"))
        )
    )
).alias("zipped")).show(truncate=False)
```

**Before:**

```
+------------------------------------------------------------------------+
|zipped                                                                  |
+------------------------------------------------------------------------+
|[{a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}]|
+------------------------------------------------------------------------+
```

**After:**

```
+------------------------------------------------------------------------+
|zipped                                                                  |
+------------------------------------------------------------------------+
|[{1, a}, {1, b}, {1, c}, {2, a}, {2, b}, {2, c}, {3, a}, {3, b}, {3, c}]|
+------------------------------------------------------------------------+
```

### Why are the changes needed?

To produce the correct results.

### Does this PR introduce _any_ user-facing change?

Yes, it fixes the results to be correct as mentioned above.

### How was this patch tested?

Added a unit test as well as manually.

Closes #32523 from ueshin/issues/SPARK-35382/nested_higher_order_functions.

Authored-by: Takuya UESHIN <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
What changes were proposed in this pull request?
To fix lambda variable name issues in nested DataFrame functions, this PR modifies code to use a global counter for `LambdaVariables` names created by higher order functions. This is the rework of #31887. Closes #31887.
Why are the changes needed?
This moves away from the current hard-coded variable names, which break on nested function calls. There is currently a bug where nested transforms in particular fail (the inner variable shadows the outer variable).
For this query:
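```
val df = Seq(
  (Seq(1,2,3), Seq("a", "b", "c"))
).toDF("numbers", "letters")

df.select(
  f.flatten(
    f.transform(
      $"numbers",
      (number: Column) => {
        f.transform(
          $"letters",
          (letter: Column) => {
            f.struct(
              number.as("number"),
              letter.as("letter")
            )
          }
        )
      }
    )
  ).as("zipped")
).show(10, false)
```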
This is the current (incorrect) output:
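```
+------------------------------------------------------------------------+
|zipped                                                                  |
+------------------------------------------------------------------------+
|[{a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}]|
+------------------------------------------------------------------------+
```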
And this is the correct output after fix:
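```
+------------------------------------------------------------------------+
|zipped                                                                  |
+------------------------------------------------------------------------+
|[{1, a}, {1, b}, {1, c}, {2, a}, {2, b}, {2, c}, {3, a}, {3, b}, {3, c}]|
+------------------------------------------------------------------------+
```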
Does this PR introduce any user-facing change?
No
How was this patch tested?
Added the new test in `DataFrameFunctionsSuite`.