
[SPARK-23290][SQL][PYTHON] Use datetime.date for date type when converting Spark DataFrame to Pandas DataFrame. #20506

Closed · wants to merge 5 commits

Conversation

@ueshin (Member) commented Feb 5, 2018

What changes were proposed in this pull request?

In #18664, there was a change in how DateType is returned to users (line 1968 in dataframe.py). This can cause client code that works in Spark 2.2 to fail.
See SPARK-23290 for an example.

This PR changes the conversion to use datetime.date for date type values, as Spark 2.2 does.
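
For illustration, a minimal Pandas-level sketch (not the actual Spark helper) of the kind of correction this change applies: a date column that PyArrow materializes as datetime64[ns] is converted back to datetime.date values, matching the non-Arrow (Spark 2.2) behavior.

import pandas as pd

pdf = pd.DataFrame({"d": pd.to_datetime(["2012-01-01", "2012-01-02"])})
print(pdf["d"].dtype)        # datetime64[ns], like what the PyArrow-based path produced
pdf["d"] = pdf["d"].dt.date  # convert back to datetime.date objects
print(type(pdf["d"][0]))     # <class 'datetime.date'>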

How was this patch tested?

Modified tests to match the new behavior, plus the existing tests.


@SparkQA commented Feb 5, 2018

Test build #87062 has finished for PR 20506 at commit 57ab41b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

""" Correct date type value to use datetime.date.

Pandas DataFrame created from PyArrow uses datetime64[ns] for date type values, but we should
use datetime.date to keep backward compatibility.
Member

Shall we say something like "to match the behavior when Arrow optimization is disabled"?

Member Author

Maybe we don't need to mention backward compatibility here. I'll update it.

@SparkQA commented Feb 5, 2018

Test build #87071 has finished for PR 20506 at commit ebdbd8c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler (Member) left a comment

Thanks for looking into this @ueshin. I thought it was a bug that dates were being interpreted as objects? I would like to check what the pdf dtype is when it is created directly from a datetime.date object. Wouldn't most users prefer datetime64 types anyway?

@@ -2020,8 +2021,6 @@ def _to_corrected_pandas_type(dt):
         return np.int32
     elif type(dt) == FloatType:
         return np.float32
-    elif type(dt) == DateType:
-        return 'datetime64[ns]'
Member

I thought we were treating the interpretation of DateType as object as a bug, similar to how FloatType was being interpreted as float64?

Contributor

+1, I feel it was a bug. Maybe we can merge this to branch-2.3 only and update the migration guide in the master branch?

@HyukjinKwon (Member) commented Feb 6, 2018

I originally thought similarly, but after taking another look at this, it seems better to keep it consistent with what Pandas does for now. FYI, datetime.date maps to object in Pandas:

>>> pd.Series([datetime.date(2012,1,1)])
0    2012-01-01
dtype: object

and it looks like an explicit conversion is needed:

>>> pd.Series([pd.Timestamp(datetime.date(2012,1,1))])
0   2012-01-01
dtype: datetime64[ns]

Given that datetime.datetime and datetime.date are not directly comparable, it seems to make sense to use a different type, at least for now. I think we can even take this into master and then research the past discussions within Pandas after 2.3.0.
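
For a quick illustration of the comparability point:

>>> import datetime
>>> datetime.date(2012, 1, 1) == datetime.datetime(2012, 1, 1)
False
>>> # and ordering comparisons such as `<` between the two raise TypeError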

I have been reading related discussions within Pandas dev since yesterday, and it seems we should go with object. For example, see https://github.com/pandas-dev/pandas/issues/6932#issuecomment-41084598 and https://github.com/pandas-dev/pandas/issues/4338 (I left the links in code blocks to avoid messing up links to other repos).

Maybe I missed something here. What do you guys think?

@@ -1694,6 +1694,21 @@ def from_arrow_schema(arrow_schema):
         for field in arrow_schema])


+def _correct_date_of_dataframe_from_arrow(pdf, schema):
Contributor

To be consistent with other methods in this file, how about _check_dataframe_convert_date?

Member Author

Sure. I'll update it.
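
For context, a rough sketch (an assumption, not the merged implementation) of what such a date-correcting helper could look like: walk the schema and convert DateType columns from datetime64[ns] back to datetime.date objects.

from pyspark.sql.types import DateType

def _check_dataframe_convert_date_sketch(pdf, schema):
    """Sketch only: convert DateType columns of a Pandas DataFrame created
    via PyArrow from datetime64[ns] back to datetime.date objects."""
    for field in schema:
        if type(field.dataType) == DateType:
            pdf[field.name] = pdf[field.name].dt.date
    return pdf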

@@ -4062,18 +4062,42 @@ def test_vectorized_udf_unsupported_types(self):
         with self.assertRaisesRegexp(Exception, 'Unsupported data type'):
             df.select(f(col('map'))).collect()

-    def test_vectorized_udf_null_date(self):
+    def test_vectorized_udf_dates(self):
Contributor

Shall we have a new test to directly verify that toPandas works?

Member Author

Maybe ArrowTests.test_toPandas_arrow_toggle:

def test_toPandas_arrow_toggle(self):
    df = self.spark.createDataFrame(self.data, schema=self.schema)
    pdf, pdf_arrow = self._toPandas_arrow_toggle(df)
    self.assertPandasEqual(pdf_arrow, pdf)

?

In addition, I'll modify it to check against the expected Pandas DataFrame.
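
A hypothetical sketch of that extra check (the create_expected_pandas_df helper below is illustrative, not the merged test code): build the expected Pandas DataFrame by hand so the date column explicitly holds datetime.date objects, then compare both code paths against it.

def test_toPandas_arrow_toggle(self):
    df = self.spark.createDataFrame(self.data, schema=self.schema)
    pdf, pdf_arrow = self._toPandas_arrow_toggle(df)
    expected = self.create_expected_pandas_df()  # hypothetical helper building the expected DataFrame
    self.assertPandasEqual(expected, pdf)
    self.assertPandasEqual(expected, pdf_arrow)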

@cloud-fan (Contributor)

@HyukjinKwon SGTM!

@SparkQA commented Feb 6, 2018

Test build #87092 has finished for PR 20506 at commit f151cdf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor) commented Feb 6, 2018

LGTM, merging to master!

@cloud-fan (Contributor)

@ueshin can you send a new PR for 2.3? It conflicts, thanks!

@asfgit closed this in a24c031 on Feb 6, 2018
ueshin added a commit to ueshin/apache-spark that referenced this pull request Feb 6, 2018
…rting Spark DataFrame to Pandas DataFrame.


Author: Takuya UESHIN <[email protected]>

Closes apache#20506 from ueshin/issues/SPARK-23290.
@BryanCutler (Member)

A late +1 from me, since it seems like Pandas needs an explicit conversion to get to datetime64 and doesn't directly support datetime.date.
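
For reference, a quick sketch of that explicit conversion: a Series built from datetime.date values keeps object dtype until it is converted with pd.to_datetime.

>>> import datetime
>>> import pandas as pd
>>> s = pd.Series([datetime.date(2012, 1, 1)])
>>> s.dtype
dtype('O')
>>> pd.to_datetime(s).dtype
dtype('<M8[ns]')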

asfgit pushed a commit that referenced this pull request Feb 6, 2018
…ype when converting Spark DataFrame to Pandas DataFrame.

This is a backport of #20506.

Author: Takuya UESHIN <[email protected]>

Closes #20515 from ueshin/issues/SPARK-23290_2.3.
@ueshin (Member, Author) commented Feb 6, 2018

Thanks! @HyukjinKwon @BryanCutler @cloud-fan

dansanduleac pushed a commit to palantir/spark that referenced this pull request Feb 6, 2018
…rting Spark DataFrame to Pandas DataFrame.
