Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-34794][SQL] Fix lambda variable name issues in nested DataFrame functions #32424

Closed
wants to merge 5 commits into from

Conversation

maropu
Copy link
Member

@maropu maropu commented May 3, 2021

What changes were proposed in this pull request?

To fix lambda variable name issues in nested DataFrame functions, this PR modifies code to use a global counter for LambdaVariables names created by higher order functions.

This is the rework of #31887. Closes #31887.

Why are the changes needed?

This moves away from the current hard-coded variable names which break on nested function calls. There is currently a bug where nested transforms in particular fail (the inner variable shadows the outer variable)

For this query:

val df = Seq(
    (Seq(1,2,3), Seq("a", "b", "c"))
).toDF("numbers", "letters")

df.select(
    f.flatten(
        f.transform(
            $"numbers",
            (number: Column) => { f.transform(
                $"letters",
                (letter: Column) => { f.struct(
                    number.as("number"),
                    letter.as("letter")
                ) }
            ) }
        )
    ).as("zipped")
).show(10, false)

This is the current (incorrect) output:

+------------------------------------------------------------------------+
|zipped                                                                  |
+------------------------------------------------------------------------+
|[{a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}]|
+------------------------------------------------------------------------+

And this is the correct output after fix:

+------------------------------------------------------------------------+
|zipped                                                                  |
+------------------------------------------------------------------------+
|[{1, a}, {1, b}, {1, c}, {2, a}, {2, b}, {2, c}, {3, a}, {3, b}, {3, c}]|
+------------------------------------------------------------------------+

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added the new test in DataFrameFunctionsSuite.

@github-actions github-actions bot added the SQL label May 3, 2021
@maropu
Copy link
Member Author

maropu commented May 3, 2021

cc: @HyukjinKwon @ueshin

Copy link
Member

@ueshin ueshin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Comment on lines 3637 to 3639
transform($"letters", (letter: Column) =>
struct(number, letter))))),
Seq(Row(Seq(Row(1, "a"), Row(1, "b"), Row(2, "a"), Row(2, "b"))))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: style. should be 2-space indent?

@SparkQA
Copy link

SparkQA commented May 5, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42672/

@SparkQA
Copy link

SparkQA commented May 5, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42672/

@maropu maropu closed this in f550e03 May 5, 2021
maropu added a commit that referenced this pull request May 5, 2021
…e functions

### What changes were proposed in this pull request?

To fix lambda variable name issues in nested DataFrame functions, this PR modifies code to use a global counter for `LambdaVariables` names created by higher order functions.

This is the rework of #31887. Closes #31887.

### Why are the changes needed?

 This moves away from the current hard-coded variable names which break on nested function calls. There is currently a bug where nested transforms in particular fail (the inner variable shadows the outer variable)

For this query:
```
val df = Seq(
    (Seq(1,2,3), Seq("a", "b", "c"))
).toDF("numbers", "letters")

df.select(
    f.flatten(
        f.transform(
            $"numbers",
            (number: Column) => { f.transform(
                $"letters",
                (letter: Column) => { f.struct(
                    number.as("number"),
                    letter.as("letter")
                ) }
            ) }
        )
    ).as("zipped")
).show(10, false)
```
This is the current (incorrect) output:
```
+------------------------------------------------------------------------+
|zipped                                                                  |
+------------------------------------------------------------------------+
|[{a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}]|
+------------------------------------------------------------------------+
```
And this is the correct output after fix:
```
+------------------------------------------------------------------------+
|zipped                                                                  |
+------------------------------------------------------------------------+
|[{1, a}, {1, b}, {1, c}, {2, a}, {2, b}, {2, c}, {3, a}, {3, b}, {3, c}]|
+------------------------------------------------------------------------+
```

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added the new test in `DataFrameFunctionsSuite`.

Closes #32424 from maropu/pr31887.

Lead-authored-by: dsolow <[email protected]>
Co-authored-by: Takeshi Yamamuro <[email protected]>
Co-authored-by: dmsolow <[email protected]>
Signed-off-by: Takeshi Yamamuro <[email protected]>
(cherry picked from commit f550e03)
Signed-off-by: Takeshi Yamamuro <[email protected]>
maropu added a commit that referenced this pull request May 5, 2021
…e functions

### What changes were proposed in this pull request?

To fix lambda variable name issues in nested DataFrame functions, this PR modifies code to use a global counter for `LambdaVariables` names created by higher order functions.

This is the rework of #31887. Closes #31887.

### Why are the changes needed?

 This moves away from the current hard-coded variable names which break on nested function calls. There is currently a bug where nested transforms in particular fail (the inner variable shadows the outer variable)

For this query:
```
val df = Seq(
    (Seq(1,2,3), Seq("a", "b", "c"))
).toDF("numbers", "letters")

df.select(
    f.flatten(
        f.transform(
            $"numbers",
            (number: Column) => { f.transform(
                $"letters",
                (letter: Column) => { f.struct(
                    number.as("number"),
                    letter.as("letter")
                ) }
            ) }
        )
    ).as("zipped")
).show(10, false)
```
This is the current (incorrect) output:
```
+------------------------------------------------------------------------+
|zipped                                                                  |
+------------------------------------------------------------------------+
|[{a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}]|
+------------------------------------------------------------------------+
```
And this is the correct output after fix:
```
+------------------------------------------------------------------------+
|zipped                                                                  |
+------------------------------------------------------------------------+
|[{1, a}, {1, b}, {1, c}, {2, a}, {2, b}, {2, c}, {3, a}, {3, b}, {3, c}]|
+------------------------------------------------------------------------+
```

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added the new test in `DataFrameFunctionsSuite`.

Closes #32424 from maropu/pr31887.

Lead-authored-by: dsolow <[email protected]>
Co-authored-by: Takeshi Yamamuro <[email protected]>
Co-authored-by: dmsolow <[email protected]>
Signed-off-by: Takeshi Yamamuro <[email protected]>
(cherry picked from commit f550e03)
Signed-off-by: Takeshi Yamamuro <[email protected]>
@maropu
Copy link
Member Author

maropu commented May 5, 2021

GA passed. Merged to master/3.1/3.0. Thank you for the review, @ueshin ~

@SparkQA
Copy link

SparkQA commented May 5, 2021

Test build #138151 has finished for PR 32424 at commit 0164e0f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 5, 2021

Test build #138165 has finished for PR 32424 at commit ea961ee.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

Why it's a problem only in scala API? how about SQL API?

@HyukjinKwon
Copy link
Member

BTW, it has the same problem in Python and R too. I and @ueshin are working on them as followups.

@HyukjinKwon
Copy link
Member

HyukjinKwon commented May 12, 2021

Since R and Python ones are merged into 3.1, I will create separate JIRAs:

https://issues.apache.org/jira/browse/SPARK-35381
https://issues.apache.org/jira/browse/SPARK-35382

@maropu
Copy link
Member Author

maropu commented May 12, 2021

Why it's a problem only in scala API? how about SQL API?

In SQL, since user-specified param names are used as they are, the same issue cannot happen;

scala> val df = Seq((Seq(1,2,3), Seq("a", "b", "c"))).toDF("numbers", "letters")
scala> df.selectExpr("""
     |     FLATTEN(
     |         TRANSFORM(
     |             numbers,
     |             number -> TRANSFORM(
     |                 letters,
     |                 letter -> (number AS number, letter AS letter)
     |             )
     |         )
     |     ) AS zipped
     | """).explain(true)

== Analyzed Logical Plan ==
zipped: array<struct<number:int,letter:string>>
Project [flatten(transform(numbers#7, lambdafunction(transform(letters#8, lambdafunction(named_struct(number, lambda number#14, letter, lambda letter#15), lambda letter#15, false)), lambda number#14, false))) AS zipped#13]
                                                                                                                                                           ^^^^^^^^^^^^^^^^^^          ^^^^^^^^^^^^^^^
+- Project [_1#2 AS numbers#7, _2#3 AS letters#8]
   +- LocalRelation [_1#2, _2#3]

On the other hand, In DataFame APIs, the same param names (x, y, and z) were used in lambda functions, so the name conflict could happen;

scala> df.select(
     |     flatten(
     |         transform(
     |             $"numbers",
     |             (number: Column) => { transform(
     |                 $"letters",
     |                 (letter: Column) => { struct(
     |                     number.as("number"),
     |                     letter.as("letter")
     |                 ) }
     |             ) }
     |         )
     |     ).as("zipped")
     | ).explain(true)

== Analyzed Logical Plan ==
zipped: array<struct<number:int,letter:string>>
Project [flatten(transform(numbers#7, lambdafunction(transform(letters#8, lambdafunction(struct(number, lambda x_0#20, letter, lambda x_1#21), lambda x_1#21, false)), lambda x_0#20, false))) AS zipped#19]
                                                                                                                                               ^^^^^^^^^^^^^^          ^^^^^^^^^^^^^^^
+- Project [_1#2 AS numbers#7, _2#3 AS letters#8]
   +- LocalRelation [_1#2, _2#3]

@maropu
Copy link
Member Author

maropu commented May 12, 2021

BTW, it has the same problem in Python and R too. I and @ueshin are working on them as followups.

Ur, I missed that. Thank you, @HyukjinKwon @ueshin

HyukjinKwon added a commit that referenced this pull request May 12, 2021
…er functions at R APIs

### What changes were proposed in this pull request?

This PR fixes the same issue as #32424

```r
df <- sql("SELECT array(1, 2, 3) as numbers, array('a', 'b', 'c') as letters")
collect(select(
  df,
  array_transform("numbers", function(number) {
    array_transform("letters", function(latter) {
      struct(alias(number, "n"), alias(latter, "l"))
    })
  })
))
```

**Before:**

```
... a, a, b, b, c, c, a, a, b, b, c, c, a, a, b, b, c, c
```

**After:**

```
... 1, a, 1, b, 1, c, 2, a, 2, b, 2, c, 3, a, 3, b, 3, c
```

### Why are the changes needed?

To produce the correct results.

### Does this PR introduce _any_ user-facing change?

Yes, it fixes the results to be correct as mentioned above.

### How was this patch tested?

Manually tested as above, and unit test was added.

Closes #32517 from HyukjinKwon/SPARK-35381.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
HyukjinKwon added a commit that referenced this pull request May 12, 2021
…er functions at R APIs

This PR fixes the same issue as #32424

```r
df <- sql("SELECT array(1, 2, 3) as numbers, array('a', 'b', 'c') as letters")
collect(select(
  df,
  array_transform("numbers", function(number) {
    array_transform("letters", function(latter) {
      struct(alias(number, "n"), alias(latter, "l"))
    })
  })
))
```

**Before:**

```
... a, a, b, b, c, c, a, a, b, b, c, c, a, a, b, b, c, c
```

**After:**

```
... 1, a, 1, b, 1, c, 2, a, 2, b, 2, c, 3, a, 3, b, 3, c
```

To produce the correct results.

Yes, it fixes the results to be correct as mentioned above.

Manually tested as above, and unit test was added.

Closes #32517 from HyukjinKwon/SPARK-35381.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit ecb48cc)
Signed-off-by: Hyukjin Kwon <[email protected]>
HyukjinKwon pushed a commit that referenced this pull request May 13, 2021
…rame functions in Python APIs

### What changes were proposed in this pull request?

This PR fixes the same issue as #32424.

```py
from pyspark.sql.functions import flatten, struct, transform
df = spark.sql("SELECT array(1, 2, 3) as numbers, array('a', 'b', 'c') as letters")
df.select(flatten(
    transform(
        "numbers",
        lambda number: transform(
            "letters",
            lambda letter: struct(number.alias("n"), letter.alias("l"))
        )
    )
).alias("zipped")).show(truncate=False)
```

**Before:**

```
+------------------------------------------------------------------------+
|zipped                                                                  |
+------------------------------------------------------------------------+
|[{a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}]|
+------------------------------------------------------------------------+
```

**After:**

```
+------------------------------------------------------------------------+
|zipped                                                                  |
+------------------------------------------------------------------------+
|[{1, a}, {1, b}, {1, c}, {2, a}, {2, b}, {2, c}, {3, a}, {3, b}, {3, c}]|
+------------------------------------------------------------------------+
```

### Why are the changes needed?

To produce the correct results.

### Does this PR introduce _any_ user-facing change?

Yes, it fixes the results to be correct as mentioned above.

### How was this patch tested?

Added a unit test as well as manually.

Closes #32523 from ueshin/issues/SPARK-35382/nested_higher_order_functions.

Authored-by: Takuya UESHIN <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
HyukjinKwon pushed a commit that referenced this pull request May 13, 2021
…rame functions in Python APIs

### What changes were proposed in this pull request?

This PR fixes the same issue as #32424.

```py
from pyspark.sql.functions import flatten, struct, transform
df = spark.sql("SELECT array(1, 2, 3) as numbers, array('a', 'b', 'c') as letters")
df.select(flatten(
    transform(
        "numbers",
        lambda number: transform(
            "letters",
            lambda letter: struct(number.alias("n"), letter.alias("l"))
        )
    )
).alias("zipped")).show(truncate=False)
```

**Before:**

```
+------------------------------------------------------------------------+
|zipped                                                                  |
+------------------------------------------------------------------------+
|[{a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}]|
+------------------------------------------------------------------------+
```

**After:**

```
+------------------------------------------------------------------------+
|zipped                                                                  |
+------------------------------------------------------------------------+
|[{1, a}, {1, b}, {1, c}, {2, a}, {2, b}, {2, c}, {3, a}, {3, b}, {3, c}]|
+------------------------------------------------------------------------+
```

### Why are the changes needed?

To produce the correct results.

### Does this PR introduce _any_ user-facing change?

Yes, it fixes the results to be correct as mentioned above.

### How was this patch tested?

Added a unit test as well as manually.

Closes #32523 from ueshin/issues/SPARK-35382/nested_higher_order_functions.

Authored-by: Takuya UESHIN <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit 17b59a9)
Signed-off-by: Hyukjin Kwon <[email protected]>
flyrain pushed a commit to flyrain/spark that referenced this pull request Sep 21, 2021
…e functions

### What changes were proposed in this pull request?

To fix lambda variable name issues in nested DataFrame functions, this PR modifies code to use a global counter for `LambdaVariables` names created by higher order functions.

This is the rework of apache#31887. Closes apache#31887.

### Why are the changes needed?

 This moves away from the current hard-coded variable names which break on nested function calls. There is currently a bug where nested transforms in particular fail (the inner variable shadows the outer variable)

For this query:
```
val df = Seq(
    (Seq(1,2,3), Seq("a", "b", "c"))
).toDF("numbers", "letters")

df.select(
    f.flatten(
        f.transform(
            $"numbers",
            (number: Column) => { f.transform(
                $"letters",
                (letter: Column) => { f.struct(
                    number.as("number"),
                    letter.as("letter")
                ) }
            ) }
        )
    ).as("zipped")
).show(10, false)
```
This is the current (incorrect) output:
```
+------------------------------------------------------------------------+
|zipped                                                                  |
+------------------------------------------------------------------------+
|[{a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}]|
+------------------------------------------------------------------------+
```
And this is the correct output after fix:
```
+------------------------------------------------------------------------+
|zipped                                                                  |
+------------------------------------------------------------------------+
|[{1, a}, {1, b}, {1, c}, {2, a}, {2, b}, {2, c}, {3, a}, {3, b}, {3, c}]|
+------------------------------------------------------------------------+
```

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added the new test in `DataFrameFunctionsSuite`.

Closes apache#32424 from maropu/pr31887.

Lead-authored-by: dsolow <[email protected]>
Co-authored-by: Takeshi Yamamuro <[email protected]>
Co-authored-by: dmsolow <[email protected]>
Signed-off-by: Takeshi Yamamuro <[email protected]>
(cherry picked from commit f550e03)
Signed-off-by: Takeshi Yamamuro <[email protected]>
(cherry picked from commit 6df4ec0)
Signed-off-by: Dongjoon Hyun <[email protected]>
flyrain pushed a commit to flyrain/spark that referenced this pull request Sep 21, 2021
…er functions at R APIs

This PR fixes the same issue as apache#32424

```r
df <- sql("SELECT array(1, 2, 3) as numbers, array('a', 'b', 'c') as letters")
collect(select(
  df,
  array_transform("numbers", function(number) {
    array_transform("letters", function(latter) {
      struct(alias(number, "n"), alias(latter, "l"))
    })
  })
))
```

**Before:**

```
... a, a, b, b, c, c, a, a, b, b, c, c, a, a, b, b, c, c
```

**After:**

```
... 1, a, 1, b, 1, c, 2, a, 2, b, 2, c, 3, a, 3, b, 3, c
```

To produce the correct results.

Yes, it fixes the results to be correct as mentioned above.

Manually tested as above, and unit test was added.

Closes apache#32517 from HyukjinKwon/SPARK-35381.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit ecb48cc)
Signed-off-by: Hyukjin Kwon <[email protected]>
flyrain pushed a commit to flyrain/spark that referenced this pull request Sep 21, 2021
…rame functions in Python APIs

### What changes were proposed in this pull request?

This PR fixes the same issue as apache#32424.

```py
from pyspark.sql.functions import flatten, struct, transform
df = spark.sql("SELECT array(1, 2, 3) as numbers, array('a', 'b', 'c') as letters")
df.select(flatten(
    transform(
        "numbers",
        lambda number: transform(
            "letters",
            lambda letter: struct(number.alias("n"), letter.alias("l"))
        )
    )
).alias("zipped")).show(truncate=False)
```

**Before:**

```
+------------------------------------------------------------------------+
|zipped                                                                  |
+------------------------------------------------------------------------+
|[{a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}]|
+------------------------------------------------------------------------+
```

**After:**

```
+------------------------------------------------------------------------+
|zipped                                                                  |
+------------------------------------------------------------------------+
|[{1, a}, {1, b}, {1, c}, {2, a}, {2, b}, {2, c}, {3, a}, {3, b}, {3, c}]|
+------------------------------------------------------------------------+
```

### Why are the changes needed?

To produce the correct results.

### Does this PR introduce _any_ user-facing change?

Yes, it fixes the results to be correct as mentioned above.

### How was this patch tested?

Added a unit test as well as manually.

Closes apache#32523 from ueshin/issues/SPARK-35382/nested_higher_order_functions.

Authored-by: Takuya UESHIN <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit 17b59a9)
Signed-off-by: Hyukjin Kwon <[email protected]>
fishcus pushed a commit to fishcus/spark that referenced this pull request Jan 12, 2022
…e functions

### What changes were proposed in this pull request?

To fix lambda variable name issues in nested DataFrame functions, this PR modifies code to use a global counter for `LambdaVariables` names created by higher order functions.

This is the rework of apache#31887. Closes apache#31887.

### Why are the changes needed?

 This moves away from the current hard-coded variable names which break on nested function calls. There is currently a bug where nested transforms in particular fail (the inner variable shadows the outer variable)

For this query:
```
val df = Seq(
    (Seq(1,2,3), Seq("a", "b", "c"))
).toDF("numbers", "letters")

df.select(
    f.flatten(
        f.transform(
            $"numbers",
            (number: Column) => { f.transform(
                $"letters",
                (letter: Column) => { f.struct(
                    number.as("number"),
                    letter.as("letter")
                ) }
            ) }
        )
    ).as("zipped")
).show(10, false)
```
This is the current (incorrect) output:
```
+------------------------------------------------------------------------+
|zipped                                                                  |
+------------------------------------------------------------------------+
|[{a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}]|
+------------------------------------------------------------------------+
```
And this is the correct output after fix:
```
+------------------------------------------------------------------------+
|zipped                                                                  |
+------------------------------------------------------------------------+
|[{1, a}, {1, b}, {1, c}, {2, a}, {2, b}, {2, c}, {3, a}, {3, b}, {3, c}]|
+------------------------------------------------------------------------+
```

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added the new test in `DataFrameFunctionsSuite`.

Closes apache#32424 from maropu/pr31887.

Lead-authored-by: dsolow <[email protected]>
Co-authored-by: Takeshi Yamamuro <[email protected]>
Co-authored-by: dmsolow <[email protected]>
Signed-off-by: Takeshi Yamamuro <[email protected]>
(cherry picked from commit f550e03)
Signed-off-by: Takeshi Yamamuro <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants