
[SPARK-29664][PYTHON][SQL] Column.getItem behavior is not consistent with Scala #26351

Closed
wants to merge 4 commits

Conversation

imback82
Contributor

What changes were proposed in this pull request?

This PR changes the behavior of Column.getItem to call Column.getItem on the Scala side instead of Column.apply.
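
For illustration, a minimal sketch of the kind of Python-side change this implies (hypothetical; the actual diff and helper code may differ, though _jc is PySpark's handle to the underlying JVM Column):

def getItem(self, key):
    # Hypothetical sketch, not the actual diff.
    # Before, this method delegated to self[key], which goes through Scala
    # Column.apply and silently accepts another Column as the key.
    # Now it unwraps a Column key to its JVM counterpart and forwards to
    # Scala Column.getItem, which builds a Literal from the key and
    # therefore rejects Column arguments, just like Scala does.
    jkey = key._jc if isinstance(key, Column) else key
    return Column(self._jc.getItem(jkey))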

Why are the changes needed?

The current behavior is not consistent with that of Scala.

In PySpark:

from pyspark.sql.functions import create_map, lit, col

df = spark.range(2)
map_col = create_map(lit(0), lit(100), lit(1), lit(200))
df.withColumn("mapped", map_col.getItem(col('id'))).show()
# +---+------+
# | id|mapped|
# +---+------+
# |  0|   100|
# |  1|   200|
# +---+------+

In Scala:

import org.apache.spark.sql.functions.{col, lit, map}

val df = spark.range(2)
val map_col = map(lit(0), lit(100), lit(1), lit(200))
// The following getItem results in the following exception, which is the right behavior:
// java.lang.RuntimeException: Unsupported literal type class org.apache.spark.sql.Column id
//  at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:78)
//  at org.apache.spark.sql.Column.getItem(Column.scala:856)
//  ... 49 elided
df.withColumn("mapped", map_col.getItem(col("id"))).show

Does this PR introduce any user-facing change?

Yes. If the user wants to pass a Column object to getItem, they now need to use the indexing operator to achieve the previous behavior.

df = spark.range(2)
map_col = create_map(lit(0), lit(100), lit(1), lit(200))
df.withColumn("mapped", map_col[col('id'))].show()
# +---+------+
# | id|mapped|
# +---+------+
# |  0|   100|
# |  1|   200|
# +---+------+

How was this patch tested?

Existing tests.

@imback82
Contributor Author

cc: @HyukjinKwon

@SparkQA

SparkQA commented Oct 31, 2019

Test build #113051 has finished for PR 26351 at commit b0c4896.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 31, 2019

Test build #113053 has finished for PR 26351 at commit 097f212.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@viirya viirya left a comment


Being consistent sounds OK. A concern is that this is a behavior change that could break user code.

@HyukjinKwon
Member

HyukjinKwon commented Nov 1, 2019

Yeah, true. The workaround is pretty easy here (use col[key] instead). I think it should be fine since we're in Spark 3.

@@ -23,6 +23,8 @@
from pyspark.sql.utils import AnalysisException
from pyspark.testing.sqlutils import ReusedSQLTestCase

from py4j.protocol import Py4JJavaError
Member


Can you move py4j over pyspark per pep8 (https://www.python.org/dev/peps/pep-0008/#imports)?
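
For reference, the grouping PEP 8 suggests here (third-party py4j above the pyspark imports) would look roughly like:

from py4j.protocol import Py4JJavaError

from pyspark.sql.utils import AnalysisException
from pyspark.testing.sqlutils import ReusedSQLTestCase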

Contributor Author


Fixed. Thanks!

Member

@HyukjinKwon HyukjinKwon left a comment


LGTM otherwise.

@SparkQA

SparkQA commented Nov 1, 2019

Test build #113062 has finished for PR 26351 at commit b6d2f74.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Merged to master.
