[SPARK-23290][SQL][PYTHON] Use datetime.date for date type when converting Spark DataFrame to Pandas DataFrame. #20506
Conversation
Test build #87062 has finished for PR 20506 at commit
python/pyspark/sql/types.py
Outdated
""" Correct date type value to use datetime.date. | ||
|
||
Pandas DataFrame created from PyArrow uses datetime64[ns] for date type values, but we should | ||
use datetime.date to keep backward compatibility. |
Shall we say something like "to match the behavior when Arrow optimization is disabled"?
Maybe we don't need to mention backward compatibility here. I'll update it.
Test build #87071 has finished for PR 20506 at commit
Thanks for looking into this @ueshin. I thought it was a bug that dates were being interpreted as objects? I would like to check what the pdf dtype is when it is created directly from a datetime.date object. Wouldn't most users prefer datetime64 types anyway?
@@ -2020,8 +2021,6 @@ def _to_corrected_pandas_type(dt):
         return np.int32
     elif type(dt) == FloatType:
         return np.float32
-    elif type(dt) == DateType:
-        return 'datetime64[ns]'
I thought we were considering the interpretation of DateType as object as a bug, similar to how FloatType was being interpreted as float64?
+1, I feel it was a bug. Maybe we can merge this to branch-2.3 only and update the migration guide in the master branch?
I originally thought similarly, but after another look into this, it seems better to keep it consistent with what Pandas does for now. FYI, it seems it needs an explicit conversion:
Given the related discussions I have been reading since yesterday within the Pandas dev community, it seems we should go with this approach. Maybe I missed something here. What do you guys think?
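The pandas behavior being referenced can be demonstrated directly: a Series built from `datetime.date` objects keeps `object` dtype, and an explicit `pd.to_datetime` call is needed to reach `datetime64[ns]`.

```python
import datetime
import pandas as pd

# pandas does not silently promote datetime.date values to datetime64[ns]:
# a Series built from them keeps dtype 'object'.
s = pd.Series([datetime.date(2017, 1, 1), datetime.date(2017, 1, 2)])

# An explicit conversion is required to get datetime64[ns].
converted = pd.to_datetime(s)
```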
python/pyspark/sql/types.py
Outdated
@@ -1694,6 +1694,21 @@ def from_arrow_schema(arrow_schema):
                        for field in arrow_schema])


def _correct_date_of_dataframe_from_arrow(pdf, schema):
To be consistent with other methods in this file, how about `_check_dataframe_convert_date`?
Sure. I'll update it.
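A minimal sketch of what such a helper might do. This is hypothetical: the real helper reads the Spark schema object; here a plain dict mapping column names to type-name strings marks which columns are DateType.

```python
import datetime
import pandas as pd

def _check_dataframe_convert_date(pdf, schema):
    """Convert datetime64[ns] date columns back to datetime.date objects.

    Sketch only: 'schema' is assumed to be a dict of column name to
    type-name string, with 'date' marking DateType columns.
    """
    for name, type_name in schema.items():
        if type_name == 'date':
            # Series.dt.date yields datetime.date objects; dtype becomes object
            pdf[name] = pdf[name].dt.date
    return pdf

pdf = pd.DataFrame({'d': pd.to_datetime(['2017-01-01', '2017-01-02'])})
pdf = _check_dataframe_convert_date(pdf, {'d': 'date'})
all_dates = all(type(v) is datetime.date for v in pdf['d'])
```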
@@ -4062,18 +4062,42 @@ def test_vectorized_udf_unsupported_types(self):
         with self.assertRaisesRegexp(Exception, 'Unsupported data type'):
             df.select(f(col('map'))).collect()

-    def test_vectorized_udf_null_date(self):
+    def test_vectorized_udf_dates(self):
Shall we have a new test to directly verify that toPandas works?
Maybe ArrowTests.test_toPandas_arrow_toggle?

spark/python/pyspark/sql/tests.py, lines 3461 to 3464 in ebdbd8c:

    def test_toPandas_arrow_toggle(self):
        df = self.spark.createDataFrame(self.data, schema=self.schema)
        pdf, pdf_arrow = self._toPandas_arrow_toggle(df)
        self.assertPandasEqual(pdf_arrow, pdf)
In addition, I'll modify it to check against the expected Pandas DataFrame.
@HyukjinKwon SGTM!
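For illustration, the dtype-sensitive comparison this test relies on can be sketched as follows. `assert_pandas_equal` here is a stand-in for the suite's `assertPandasEqual` helper, not the actual implementation.

```python
import datetime
import pandas as pd

def assert_pandas_equal(expected, result):
    # Stand-in for the suite's assertPandasEqual: both dtypes and values
    # must match, so an object column of datetime.date values is not equal
    # to a datetime64[ns] column even if the dates agree.
    assert list(expected.dtypes) == list(result.dtypes), "dtype mismatch"
    assert expected.equals(result), "value mismatch"

pdf = pd.DataFrame({'d': [datetime.date(2017, 1, 1)]})
pdf_arrow = pd.DataFrame({'d': [datetime.date(2017, 1, 1)]})
assert_pandas_equal(pdf, pdf_arrow)  # passes: same dtypes, same values
matched = True
```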
Test build #87092 has finished for PR 20506 at commit
LGTM, merging to master!
@ueshin can you send a new PR for 2.3? It conflicts, thanks!
…rting Spark DataFrame to Pandas DataFrame.

## What changes were proposed in this pull request?

In apache#18664, there was a change in how `DateType` is being returned to users ([line 1968 in dataframe.py](https://github.com/apache/spark/pull/18664/files#diff-6fc344560230bf0ef711bb9b5573f1faR1968)). This can cause client code which works in Spark 2.2 to fail. See [SPARK-23290](https://issues.apache.org/jira/browse/SPARK-23290?focusedCommentId=16350917&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16350917) for an example.

This PR modifies the conversion to use `datetime.date` for date type, as Spark 2.2 does.

## How was this patch tested?

Tests modified to fit the new behavior, and existing tests.

Author: Takuya UESHIN <[email protected]>

Closes apache#20506 from ueshin/issues/SPARK-23290.
A late +1 from me, since it seems like Pandas needs an explicit conversion to get to datetime64 and doesn't directly support it.
…ype when converting Spark DataFrame to Pandas DataFrame.

## What changes were proposed in this pull request?

This is a backport of #20506.

In #18664, there was a change in how `DateType` is being returned to users ([line 1968 in dataframe.py](https://github.com/apache/spark/pull/18664/files#diff-6fc344560230bf0ef711bb9b5573f1faR1968)). This can cause client code which works in Spark 2.2 to fail. See [SPARK-23290](https://issues.apache.org/jira/browse/SPARK-23290?focusedCommentId=16350917&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16350917) for an example.

This PR modifies the conversion to use `datetime.date` for date type, as Spark 2.2 does.

## How was this patch tested?

Tests modified to fit the new behavior, and existing tests.

Author: Takuya UESHIN <[email protected]>

Closes #20515 from ueshin/issues/SPARK-23290_2.3.
Thanks! @HyukjinKwon @BryanCutler @cloud-fan |
What changes were proposed in this pull request?

In #18664, there was a change in how `DateType` is being returned to users (line 1968 in dataframe.py). This can cause client code which works in Spark 2.2 to fail. See SPARK-23290 for an example.

This PR modifies the conversion to use `datetime.date` for date type, as Spark 2.2 does.

How was this patch tested?

Tests modified to fit the new behavior, and existing tests.
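The compatibility break can be illustrated without Spark. Iterating a `datetime64[ns]` column yields `pd.Timestamp` values, so client code that expects plain `datetime.date` objects (as Spark 2.2 returned) stops matching; the fix restores object columns of `datetime.date`.

```python
import datetime
import pandas as pd

# Before this fix, toPandas() with Arrow enabled produced datetime64[ns]
# date columns; their elements come back as pd.Timestamp, not datetime.date.
arrow_style = pd.Series(pd.to_datetime(['2017-01-01']))
is_plain_date_before = type(arrow_style[0]) is datetime.date  # Timestamp, so False

# After the fix, the column holds plain datetime.date objects, as in 2.2.
fixed_style = pd.Series([datetime.date(2017, 1, 1)])
is_plain_date_after = type(fixed_style[0]) is datetime.date
```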