
[SPARK-29664][PYTHON][SQL] Column.getItem behavior is not consistent with Scala #26351

Closed
wants to merge 4 commits

Conversation

imback82
Contributor

What changes were proposed in this pull request?

This PR changes the behavior of Column.getItem to call Column.getItem on the Scala side instead of Column.apply.
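
For illustration, a minimal sketch of the kind of Python-side change this implies (hypothetical; the actual diff and helper code may differ, though _jc is PySpark's handle to the underlying JVM Column):

def getItem(self, key):
    # Hypothetical sketch, not the actual diff.
    # Before, this method delegated to self[key], which goes through Scala
    # Column.apply and silently accepts another Column as the key.
    # Now it unwraps a Column key to its JVM counterpart and forwards to
    # Scala Column.getItem, which builds a Literal from the key and
    # therefore rejects Column arguments, just like Scala does.
    jkey = key._jc if isinstance(key, Column) else key
    return Column(self._jc.getItem(jkey))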

Why are the changes needed?

The current behavior is not consistent with that of Scala.

In PySpark:

from pyspark.sql.functions import create_map, lit, col

df = spark.range(2)
map_col = create_map(lit(0), lit(100), lit(1), lit(200))
df.withColumn("mapped", map_col.getItem(col('id'))).show()
# +---+------+
# | id|mapped|
# +---+------+
# |  0|   100|
# |  1|   200|
# +---+------+

In Scala:

import org.apache.spark.sql.functions.{col, lit, map}

val df = spark.range(2)
val map_col = map(lit(0), lit(100), lit(1), lit(200))
// The following getItem results in the following exception, which is the right behavior:
// java.lang.RuntimeException: Unsupported literal type class org.apache.spark.sql.Column id
//  at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:78)
//  at org.apache.spark.sql.Column.getItem(Column.scala:856)
//  ... 49 elided
df.withColumn("mapped", map_col.getItem(col("id"))).show

Does this PR introduce any user-facing change?

Yes. If the user wants to pass a Column object to getItem, they now need to use the indexing operator to achieve the previous behavior.

df = spark.range(2)
map_col = create_map(lit(0), lit(100), lit(1), lit(200))
df.withColumn("mapped", map_col[col('id'))].show()
# +---+------+
# | id|mapped|
# +---+------+
# |  0|   100|
# |  1|   200|
# +---+------+

How was this patch tested?

Existing tests.

@imback82
Contributor Author

cc: @HyukjinKwon

@SparkQA

SparkQA commented Oct 31, 2019

Test build #113051 has finished for PR 26351 at commit b0c4896.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 31, 2019

Test build #113053 has finished for PR 26351 at commit 097f212.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@viirya viirya left a comment


Being consistent sounds OK. A concern is that this is a behavior change that could break user code.

@HyukjinKwon
Member

HyukjinKwon commented Nov 1, 2019

Yeah, true. The workaround is pretty easy here (use col[key] instead). I think it should be fine since we're in Spark 3.

@@ -23,6 +23,8 @@
from pyspark.sql.utils import AnalysisException
from pyspark.testing.sqlutils import ReusedSQLTestCase

from py4j.protocol import Py4JJavaError
Member


Can you move py4j over pyspark per pep8 (https://www.python.org/dev/peps/pep-0008/#imports)?
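
For reference, the grouping PEP 8 suggests here (third-party py4j above the pyspark imports) would look roughly like:

from py4j.protocol import Py4JJavaError

from pyspark.sql.utils import AnalysisException
from pyspark.testing.sqlutils import ReusedSQLTestCase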

Contributor Author


Fixed. Thanks!

Member

@HyukjinKwon HyukjinKwon left a comment


LGTM otherwise.

@SparkQA

SparkQA commented Nov 1, 2019

Test build #113062 has finished for PR 26351 at commit b6d2f74.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Merged to master.
