[SPARK-39760][PYTHON] Support Varchar in PySpark
### What changes were proposed in this pull request?
Support Varchar in PySpark.

### Why are the changes needed?
Function parity with the Scala side.

### Does this PR introduce _any_ user-facing change?
Yes, a new data type (`VarcharType`) is supported.

### How was this patch tested?
1. Added UT;
2. Manually checked against the Scala side:

```python
In [1]: from pyspark.sql.types import *
   ...: from pyspark.sql.functions import *
   ...:
   ...: df = spark.createDataFrame([(1,), (11,)], ["value"])
   ...: ret = df.select(col("value").cast(VarcharType(10))).collect()
   ...:
22/07/13 17:17:07 WARN CharVarcharUtils: The Spark cast operator does not support char/varchar type and simply treats them as string type. Please use string type directly to avoid confusion. Otherwise, you can set spark.sql.legacy.charVarcharAsString to true, so that Spark treat them as string type as same as Spark 3.0 and earlier

In [2]: schema = StructType([StructField("a", IntegerType(), True), (StructField("v", VarcharType(10), True))])
   ...: description = "this a table created via Catalog.createTable()"
   ...: table = spark.catalog.createTable("tab3_via_catalog", schema=schema, description=description)
   ...: table.schema
   ...:
Out[2]: StructType([StructField('a', IntegerType(), True), StructField('v', StringType(), True)])
```

```scala
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._

scala> val df = spark.range(0, 10).selectExpr(" id AS value")
df: org.apache.spark.sql.DataFrame = [value: bigint]

scala> val ret = df.select(col("value").cast(VarcharType(10))).collect()
22/07/13 17:28:56 WARN CharVarcharUtils: The Spark cast operator does not support char/varchar type and simply treats them as string type. Please use string type directly to avoid confusion. Otherwise, you can set spark.sql.legacy.charVarcharAsString to true, so that Spark treat them as string type as same as Spark 3.0 and earlier
ret: Array[org.apache.spark.sql.Row] = Array([0], [1], [2], [3], [4], [5], [6], [7], [8], [9])

scala> val schema = StructType(StructField("a", IntegerType, true) :: (StructField("v", VarcharType(10), true) :: Nil))
schema: org.apache.spark.sql.types.StructType = StructType(StructField(a,IntegerType,true),StructField(v,VarcharType(10),true))

scala> val description = "this a table created via Catalog.createTable()"
description: String = this a table created via Catalog.createTable()

scala> val table = spark.catalog.createTable("tab3_via_catalog", source="json", schema=schema, description=description, options=Map.empty[String, String])
table: org.apache.spark.sql.DataFrame = [a: int, v: string]

scala> table.schema
res0: org.apache.spark.sql.types.StructType = StructType(StructField(a,IntegerType,true),StructField(v,StringType,true))
```

Closes apache#37173 from zhengruifeng/py_add_varchar.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
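For readers who want to try the new type directly, here is a minimal, self-contained sketch (an illustration assuming a PySpark build that includes this change; `simpleString()` and the `length` attribute follow the standard PySpark data-type conventions):

```python
# Minimal sketch of the new VarcharType in PySpark.
# Assumes a Spark build that includes this change.
from pyspark.sql.types import StructField, StructType, IntegerType, VarcharType

vt = VarcharType(10)          # varchar with a maximum length of 10 characters
print(vt.simpleString())      # -> varchar(10)
print(vt.length)              # -> 10

# Varchar fields can be declared in schemas, e.g. for Catalog.createTable().
schema = StructType([
    StructField("a", IntegerType(), True),
    StructField("v", VarcharType(10), True),
])
print(schema.simpleString())  # -> struct<a:int,v:varchar(10)>
```

Note that, as the transcripts above show, `cast` and `Catalog.createTable()` still materialize the column as `StringType`; the varchar length is honored at schema-declaration time.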