
[SPARK-23290][SQL][PYTHON] Use datetime.date for date type when converting Spark DataFrame to Pandas DataFrame. #20506

Closed · wants to merge 5 commits

Conversation

@ueshin (Member) commented Feb 5, 2018

What changes were proposed in this pull request?

In #18664, there was a change in how DateType is returned to users (line 1968 in dataframe.py). This can cause client code that works in Spark 2.2 to fail.
See SPARK-23290 for an example.

This PR changes the conversion to use datetime.date for date type values, as Spark 2.2 does.
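
For illustration, a minimal Pandas-level sketch (not the actual Spark helper) of the kind of correction this change applies: a date column that PyArrow materializes as datetime64[ns] is converted back to datetime.date values, matching the non-Arrow (Spark 2.2) behavior.

import pandas as pd

pdf = pd.DataFrame({"d": pd.to_datetime(["2012-01-01", "2012-01-02"])})
print(pdf["d"].dtype)        # datetime64[ns], like what the PyArrow-based path produced
pdf["d"] = pdf["d"].dt.date  # convert back to datetime.date objects
print(type(pdf["d"][0]))     # <class 'datetime.date'>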

How was this patch tested?

Modified tests to match the new behavior, plus the existing tests.


@SparkQA commented Feb 5, 2018

Test build #87062 has finished for PR 20506 at commit 57ab41b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

""" Correct date type value to use datetime.date.

Pandas DataFrame created from PyArrow uses datetime64[ns] for date type values, but we should
use datetime.date to keep backward compatibility.
Member

Shall we say something like "to match the behavior when Arrow optimization is disabled"?

Member Author

Maybe we don't need to mention backward compatibility here. I'll update it.

@SparkQA commented Feb 5, 2018

Test build #87071 has finished for PR 20506 at commit ebdbd8c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler (Member) left a comment

Thanks for looking into this @ueshin. I thought it was a bug that dates were being interpreted as objects? I would like to check what the pdf dtype is when it is created directly from a datetime.date object. Wouldn't most users prefer datetime64 types anyway?

@@ -2020,8 +2021,6 @@ def _to_corrected_pandas_type(dt):
         return np.int32
     elif type(dt) == FloatType:
         return np.float32
-    elif type(dt) == DateType:
-        return 'datetime64[ns]'
Member

I thought we were treating the interpretation of DateType as object as a bug, similar to how FloatType was being interpreted as float64?

Contributor

+1, I feel it was a bug. Maybe we can merge this to branch-2.3 only and update the migration guide in the master branch?

@HyukjinKwon (Member) commented Feb 6, 2018

I originally thought similarly, but after taking another look at this, it seems better to keep it consistent with what Pandas does for now. FYI, datetime.date maps to object in Pandas:

>>> pd.Series([datetime.date(2012,1,1)])
0    2012-01-01
dtype: object

and it looks like an explicit conversion is needed:

>>> pd.Series([pd.Timestamp(datetime.date(2012,1,1))])
0   2012-01-01
dtype: datetime64[ns]

Given that datetime.datetime and datetime.date are not directly comparable, it seems to make sense to use a different type, at least for now. I think we can even take this into master and then research the past discussions within Pandas after 2.3.0.
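
For a quick illustration of the comparability point:

>>> import datetime
>>> datetime.date(2012, 1, 1) == datetime.datetime(2012, 1, 1)
False
>>> # and ordering comparisons such as `<` between the two raise TypeError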

I have been reading related discussions within Pandas dev since yesterday, and it seems we should go with object. For example, see https://github.com/pandas-dev/pandas/issues/6932#issuecomment-41084598 and https://github.com/pandas-dev/pandas/issues/4338 (I left the links in code blocks to avoid messing up links to other repos).

Maybe I missed something here. What do you guys think?

@@ -1694,6 +1694,21 @@ def from_arrow_schema(arrow_schema):
         for field in arrow_schema])


+def _correct_date_of_dataframe_from_arrow(pdf, schema):
Contributor

To be consistent with other methods in this file, how about _check_dataframe_convert_date?

Member Author

Sure. I'll update it.
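
For context, a rough sketch (an assumption, not the merged implementation) of what such a date-correcting helper could look like: walk the schema and convert DateType columns from datetime64[ns] back to datetime.date objects.

from pyspark.sql.types import DateType

def _check_dataframe_convert_date_sketch(pdf, schema):
    """Sketch only: convert DateType columns of a Pandas DataFrame created
    via PyArrow from datetime64[ns] back to datetime.date objects."""
    for field in schema:
        if type(field.dataType) == DateType:
            pdf[field.name] = pdf[field.name].dt.date
    return pdf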

@@ -4062,18 +4062,42 @@ def test_vectorized_udf_unsupported_types(self):
         with self.assertRaisesRegexp(Exception, 'Unsupported data type'):
             df.select(f(col('map'))).collect()

-    def test_vectorized_udf_null_date(self):
+    def test_vectorized_udf_dates(self):
Contributor

Shall we have a new test to directly verify that toPandas works?

Member Author

Maybe ArrowTests.test_toPandas_arrow_toggle:

def test_toPandas_arrow_toggle(self):
    df = self.spark.createDataFrame(self.data, schema=self.schema)
    pdf, pdf_arrow = self._toPandas_arrow_toggle(df)
    self.assertPandasEqual(pdf_arrow, pdf)

?

In addition, I'll modify it to check against the expected Pandas DataFrame.
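
A hypothetical sketch of that extra check (the create_expected_pandas_df helper below is illustrative, not the merged test code): build the expected Pandas DataFrame by hand so the date column explicitly holds datetime.date objects, then compare both code paths against it.

def test_toPandas_arrow_toggle(self):
    df = self.spark.createDataFrame(self.data, schema=self.schema)
    pdf, pdf_arrow = self._toPandas_arrow_toggle(df)
    expected = self.create_expected_pandas_df()  # hypothetical helper building the expected DataFrame
    self.assertPandasEqual(expected, pdf)
    self.assertPandasEqual(expected, pdf_arrow)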

@cloud-fan (Contributor)

@HyukjinKwon SGTM!

@SparkQA commented Feb 6, 2018

Test build #87092 has finished for PR 20506 at commit f151cdf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor) commented Feb 6, 2018

LGTM, merging to master!

@cloud-fan (Contributor)

@ueshin can you send a new PR for 2.3? It conflicts, thanks!

@asfgit closed this in a24c031 on Feb 6, 2018
ueshin added a commit to ueshin/apache-spark that referenced this pull request Feb 6, 2018
…rting Spark DataFrame to Pandas DataFrame.


Author: Takuya UESHIN <[email protected]>

Closes apache#20506 from ueshin/issues/SPARK-23290.
@BryanCutler (Member)

A late +1 from me, since it seems like Pandas needs an explicit conversion to get to datetime64 and doesn't directly support datetime.date.
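
For reference, a quick sketch of that explicit conversion: a Series built from datetime.date values keeps object dtype until it is converted with pd.to_datetime.

>>> import datetime
>>> import pandas as pd
>>> s = pd.Series([datetime.date(2012, 1, 1)])
>>> s.dtype
dtype('O')
>>> pd.to_datetime(s).dtype
dtype('<M8[ns]')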

asfgit pushed a commit that referenced this pull request Feb 6, 2018
…ype when converting Spark DataFrame to Pandas DataFrame.

This is a backport of #20506.

Author: Takuya UESHIN <[email protected]>

Closes #20515 from ueshin/issues/SPARK-23290_2.3.
@ueshin (Member, Author) commented Feb 6, 2018

Thanks! @HyukjinKwon @BryanCutler @cloud-fan

dansanduleac pushed a commit to palantir/spark that referenced this pull request Feb 6, 2018
…rting Spark DataFrame to Pandas DataFrame.
