[SPARK-39760][PYTHON] Support Varchar in PySpark #37173
Conversation
Force-pushed from 2460a9f to b1577a2
gentle ping @HyukjinKwon @cloud-fan
python/pyspark/sql/types.py (outdated)
```
Parameters
----------
length : int
    the length limitation. Data writing will fail if the input
```
nit: indent?
python/pyspark/sql/types.py (outdated)
```diff
@@ -1659,8 +1698,8 @@ def verify_acceptable_types(obj: Any) -> None:
                 new_msg("%s can not accept object %r in type %s" % (dataType, obj, type(obj)))
             )

-    if isinstance(dataType, StringType):
+    if isinstance(dataType, StringType) or isinstance(dataType, VarcharType):
         # StringType can work with any types
```
nit: isinstance(dataType, (StringType, VarcharType))?
nice!
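For context, with either spelling VarcharType takes the same branch as StringType, so schema verification accepts any value. A minimal sketch of what that implies, assuming `_make_type_verifier` keeps the pass-through behavior the hunk above shows:

```python
from pyspark.sql.types import VarcharType, _make_type_verifier

# Assumption: like StringType, VarcharType skips the acceptable-types check,
# so verification passes for any object; the length limit is only enforced
# when the data is written out.
verify = _make_type_verifier(VarcharType(5))
verify("longer than five characters")  # passes: no length check here
verify(123)                            # also passes: any type is accepted
```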
```diff
@@ -181,6 +182,29 @@ class StringType(AtomicType, metaclass=DataTypeSingleton):
     pass


+class VarcharType(AtomicType):
```
metaclass=DataTypeSingleton?
I first tried this, but it causes an initialization error due to the __call__ in DataTypeSingleton: a type mixed in with DataTypeSingleton should support a constructor without parameters.
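For readers following the thread: the failure is easy to reproduce. The metaclass below is a simplified copy of DataTypeSingleton from python/pyspark/sql/types.py; its __call__ forwards no constructor arguments, so any parameterized type blows up at construction time.

```python
# Simplified from pyspark/sql/types.py: one cached instance per class,
# and __call__ accepts no constructor arguments.
class DataTypeSingleton(type):
    _instances = {}

    def __call__(cls):
        if cls not in cls._instances:
            cls._instances[cls] = super(DataTypeSingleton, cls).__call__()
        return cls._instances[cls]


class VarcharType(metaclass=DataTypeSingleton):
    def __init__(self, length):
        self.length = length


VarcharType(10)  # TypeError: __call__() takes 1 positional argument but 2 were given
```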
Makes sense! Let me follow up with a parametric DataTypeSingleton then.
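One plausible shape for that follow-up (a hypothetical sketch, not the actual follow-up code): key the instance cache by the constructor arguments as well as the class.

```python
# Hypothetical "parametric singleton": cache per (class, args) so that
# VarcharType(10) is VarcharType(10) while VarcharType(10) is not VarcharType(20).
class ParametricSingleton(type):
    _instances = {}

    def __call__(cls, *args):
        key = (cls, args)
        if key not in cls._instances:
            cls._instances[key] = super(ParametricSingleton, cls).__call__(*args)
        return cls._instances[key]


class VarcharType(metaclass=ParametricSingleton):
    def __init__(self, length):
        self.length = length


assert VarcharType(10) is VarcharType(10)
assert VarcharType(10) is not VarcharType(20)
```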
```python
self.assertTrue(v2 is not v1)
self.assertNotEqual(v1, v2)
v3 = VarcharType(10)
self.assertEqual(v1, v3)
```
Shall we check if v1 is v3?
v1 is v3 should be True after the parametric singleton is introduced. I will adjust that together in the follow-up.
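Spelled out, the semantics the test above pins down today (equality comes from DataType.__eq__, which compares class and attributes):

```python
from pyspark.sql.types import VarcharType

v1 = VarcharType(10)
v2 = VarcharType(20)
v3 = VarcharType(10)

assert v1 != v2      # different lengths are unequal
assert v1 == v3      # same length compares equal
assert v1 is not v3  # distinct objects today; becomes `v1 is v3` once the
                     # parametric singleton lands
```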
Force-pushed from e0a9a51 to a220009
Force-pushed from a220009 to 4889243
LGTM! Thanks!
We'll have to fix the code when spark.sql.execution.arrow.pyspark.enabled is enabled, and Py4J + Python UDFs too. But we can do it separately. See also https://issues.apache.org/jira/browse/SPARK-37275
Merged to master, thank you all! @HyukjinKwon I will send a follow-up PR for arrow/py4j/udf, thanks for pointing that out!
late LGTM, do we want to support
@cloud-fan I think so, let me add it in the near future. @HyukjinKwon Btw, I guess we may not need extra support for pyarrow or py4j + Python UDFs, because there seems to be no class for char/varchar instances in Scala or Python (pandas/numpy/built-in); they are treated as string internally. There is also a warning message if we want to cast to char/varchar, in
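As an illustration of "treated as string internally" (assumed behavior; the exact warning text and read schema are not confirmed here): a value cast to varchar comes back to Python as a plain str with a string schema.

```python
# Assumed behavior illustrating "treated as string internally": Spark replaces
# char/varchar with string in a cast (logging a warning), so Python only ever
# sees ordinary str values.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.sql("SELECT CAST('abc' AS VARCHAR(10)) AS v")
df.printSchema()              # v: string
print(type(df.first()["v"]))  # <class 'str'>
```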
What changes were proposed in this pull request?
Support Varchar in PySpark.

Why are the changes needed?
Function parity.

Does this PR introduce any user-facing change?
Yes, a new data type is supported.

How was this patch tested?
1. Added UT.
2. Manually checked against the Scala side:
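The snippet for the manual check is elided above; one plausible way to exercise the new type from the Python side (an illustrative sketch, not the author's actual commands):

```python
# Illustrative sketch (not the elided snippet): build the new type and compare
# its DDL rendering with what the Scala side prints for varchar(10).
from pyspark.sql.types import StructField, StructType, VarcharType

v = VarcharType(10)
print(v.simpleString())  # varchar(10)

schema = StructType([StructField("v", VarcharType(10))])
print(schema.simpleString())  # struct<v:varchar(10)>
```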