[SPARK-39760][PYTHON] Support Varchar in PySpark #37173

Closed
wants to merge 6 commits

Conversation

zhengruifeng
Contributor

@zhengruifeng zhengruifeng commented Jul 13, 2022

What changes were proposed in this pull request?

Support Varchar in PySpark

Why are the changes needed?

function parity

Does this PR introduce any user-facing change?

Yes, a new data type is supported.
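
For reference, a minimal sketch of the user-facing API (method names follow the existing PySpark atomic types; the outputs shown in comments are illustrative):

from pyspark.sql.types import StructField, StructType, VarcharType

v = VarcharType(10)  # parameterized by a maximum length
v.simpleString()     # 'varchar(10)'

# Usable anywhere other atomic types are, e.g. in a schema:
schema = StructType([StructField("name", VarcharType(10), True)])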

How was this patch tested?

1. Added a UT.
2. Manually checked against the Scala side:

In [1]: from pyspark.sql.types import *
   ...: from pyspark.sql.functions import *
   ...: 
   ...: df = spark.createDataFrame([(1,), (11,)], ["value"])
   ...: ret = df.select(col("value").cast(VarcharType(10))).collect()
   ...: 
22/07/13 17:17:07 WARN CharVarcharUtils: The Spark cast operator does not support char/varchar type and simply treats them as string type. Please use string type directly to avoid confusion. Otherwise, you can set spark.sql.legacy.charVarcharAsString to true, so that Spark treat them as string type as same as Spark 3.0 and earlier

In [2]: schema = StructType([StructField("a", IntegerType(), True), (StructField("v", VarcharType(10), True))])
   ...: description = "this a table created via Catalog.createTable()"
   ...: table = spark.catalog.createTable("tab3_via_catalog", schema=schema, description=description)
   ...: table.schema
   ...: 
Out[2]: StructType([StructField('a', IntegerType(), True), StructField('v', StringType(), True)])
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._

scala> val df = spark.range(0, 10).selectExpr(" id AS value")
df: org.apache.spark.sql.DataFrame = [value: bigint]

scala> val ret = df.select(col("value").cast(VarcharType(10))).collect()
22/07/13 17:28:56 WARN CharVarcharUtils: The Spark cast operator does not support char/varchar type and simply treats them as string type. Please use string type directly to avoid confusion. Otherwise, you can set spark.sql.legacy.charVarcharAsString to true, so that Spark treat them as string type as same as Spark 3.0 and earlier
ret: Array[org.apache.spark.sql.Row] = Array([0], [1], [2], [3], [4], [5], [6], [7], [8], [9])

scala> 

scala> val schema = StructType(StructField("a", IntegerType, true) :: (StructField("v", VarcharType(10), true) :: Nil))
schema: org.apache.spark.sql.types.StructType = StructType(StructField(a,IntegerType,true),StructField(v,VarcharType(10),true))

scala> val description = "this a table created via Catalog.createTable()"
description: String = this a table created via Catalog.createTable()

scala> val table = spark.catalog.createTable("tab3_via_catalog", source="json", schema=schema, description=description, options=Map.empty[String, String])
table: org.apache.spark.sql.DataFrame = [a: int, v: string]

scala> table.schema
res0: org.apache.spark.sql.types.StructType = StructType(StructField(a,IntegerType,true),StructField(v,StringType,true))
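
As a side note, the spark.sql.legacy.charVarcharAsString flag mentioned in the warnings above restores the pre-3.1 behavior. A hedged sketch, back in the PySpark session from In [1] above (effect as described by the warning text, not re-verified here):

spark.conf.set("spark.sql.legacy.charVarcharAsString", "true")
df.select(col("value").cast(VarcharType(10))).printSchema()
# root
#  |-- value: string (nullable = true)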

@zhengruifeng zhengruifeng changed the title [SPARK-39760][PYTHON][WIP] Support Varchar in PySpark [SPARK-39760][PYTHON] Support Varchar in PySpark Jul 14, 2022
@zhengruifeng zhengruifeng marked this pull request as ready for review July 14, 2022 05:15
@zhengruifeng
Contributor Author

gentle ping @HyukjinKwon @cloud-fan

Parameters
----------
length : int
the length limitation. Data writing will fail if the input
Member

nit: indent?
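
The truncated docstring above refers to write-time length enforcement. A hedged SQL-side illustration (the table name is made up, and the exact error message differs by version):

spark.sql("CREATE TABLE varchar_demo (v VARCHAR(3)) USING parquet")
spark.sql("INSERT INTO varchar_demo VALUES ('abc')")   # fits within 3 characters, succeeds
spark.sql("INSERT INTO varchar_demo VALUES ('abcd')")  # exceeds the limit, fails at write time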

@@ -1659,8 +1698,8 @@ def verify_acceptable_types(obj: Any) -> None:
new_msg("%s can not accept object %r in type %s" % (dataType, obj, type(obj)))
)

if isinstance(dataType, StringType):
# StringType can work with any types
if isinstance(dataType, StringType) or isinstance(dataType, VarcharType):
Member

nit: isinstance(dataType, (StringType, VarcharType))?

Contributor Author

nice!

@@ -181,6 +182,29 @@ class StringType(AtomicType, metaclass=DataTypeSingleton):
pass


class VarcharType(AtomicType):
Member

metaclass=DataTypeSingleton?

Contributor Author

I tried this first, but it causes an initialization error due to the __call__ in DataTypeSingleton:
a type mixed in with DataTypeSingleton must support a constructor that takes no parameters.

Member

Makes sense! Let me follow up with a parametric DataTypeSingleton then.
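
For illustration, one way a parametric singleton can be built in Python (a hedged sketch, not the actual follow-up implementation; all names here are hypothetical):

class ParametricSingleton(type):
    # Metaclass that caches one instance per (class, constructor-args) pair.
    _instances: dict = {}

    def __call__(cls, *args):
        key = (cls, args)
        if key not in ParametricSingleton._instances:
            ParametricSingleton._instances[key] = super().__call__(*args)
        return ParametricSingleton._instances[key]

class VarcharDemo(metaclass=ParametricSingleton):
    def __init__(self, length: int):
        self.length = length

assert VarcharDemo(10) is VarcharDemo(10)      # same parameter -> same instance
assert VarcharDemo(10) is not VarcharDemo(20)  # different parameter -> distinct instances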


self.assertTrue(v2 is not v1)
self.assertNotEqual(v1, v2)
v3 = VarcharType(10)
self.assertEqual(v1, v3)
Member

Shall we check if v1 is v3?

Member

v1 is v3 should be True after the parametric singleton is introduced. I will adjust that in the follow-up.

@xinrong-meng
Member

LGTM! Thanks!

Member

@HyukjinKwon HyukjinKwon left a comment

We'll have to fix the code for when spark.sql.execution.arrow.pyspark.enabled is enabled, and for Py4J + Python UDFs too. But we can do that separately. See also https://issues.apache.org/jira/browse/SPARK-37275

@zhengruifeng
Contributor Author

Merged to master, thank you all!

@HyukjinKwon I will send a follow-up PR for arrow/py4j/udf, thanks for pointing it out!

@zhengruifeng zhengruifeng deleted the py_add_varchar branch July 18, 2022 07:58
@cloud-fan
Contributor

Late LGTM. Do we want to support CharType in PySpark?

@zhengruifeng
Contributor Author

@cloud-fan I think so; let me add it in the near future.

@HyukjinKwon Btw, I guess we may not need to add extra support for PyArrow or Py4J + Python UDFs, because it seems there is no dedicated class for char/varchar instances in Scala or Python (pandas/NumPy/built-in), and they are treated as string internally.

There is also a warning message in CharVarcharUtils if we try to cast to char/varchar:
"The Spark cast operator does not support char/varchar type and simply treats them as string type. Please use string type directly to avoid confusion."
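
A hedged sketch of what "treated as string internally" means in practice, consistent with the transcripts above:

df = spark.createDataFrame([(1,)], ["value"])
ret = df.select(col("value").cast(VarcharType(10)).alias("v"))
ret.schema                 # StructType([StructField('v', StringType(), True)])
type(ret.collect()[0][0])  # <class 'str'>: a plain Python string, no varchar wrapper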
