[SPARK-21193][PYTHON] Specify Pandas version in setup.py #18403
Conversation
**python/pyspark/sql/dataframe.py** (Outdated)

```diff
@@ -1746,7 +1746,7 @@ def toPandas(self):
         pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns)

         for f, t in dtype.items():
             pdf[f] = pdf[f].astype(t, copy=False)
```
@cloud-fan, @viirya, @BryanCutler, @ueshin and @holdenk, while I was testing this, I realised that `copy` is actually exposed from 0.13.0. I was confused because it was added in 0.11.0 - here. However, it seems it was hidden from 0.11.0 to 0.12.0.
What do you think about this? It sounds safer not to use it for now (I found the doc says we should be careful with it), and then we can keep supporting Pandas 0.11.0 and 0.12.0.
It is probably not a big deal either way; 0.13.0 was released 3.5 years ago. Please let me know - I can just fix the minimum version to 0.13.0.
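For context, the compatibility fallback being weighed here could be sketched as follows (`astype_compat` is a hypothetical helper for illustration, not code from this PR):

```python
import pandas as pd

def astype_compat(series, dtype):
    # Hypothetical helper: Series.astype accepts the `copy` keyword only
    # from pandas 0.13.0 onwards, so fall back when an older pandas
    # rejects it.
    try:
        return series.astype(dtype, copy=False)
    except TypeError:
        # pandas 0.11.0 / 0.12.0: astype() has no `copy` keyword
        return series.astype(dtype)

s = pd.Series([1, 2, 3])
print(astype_compat(s, "int32").dtype)  # int32
```

This is roughly the shape a "support both" approach would take; the PR instead opted to pin the minimum version.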
Hmm, I guess it says we should be careful when using it because, with `copy=False`, you may unintentionally change the data of a previous DataFrame that shares the same underlying data. But in our case, I think it is safe.
I have no strong opinion on using `copy=False`. But as you said, I think it's fine to set the minimum version to 0.13.0. Let's see others' opinions.
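A small illustration of the aliasing concern described above (illustrative only; exact buffer-sharing behaviour differs across pandas versions):

```python
import pandas as pd

# With copy=False, astype may return a Series that shares its buffer
# with the source when no actual conversion is needed, so writes could
# become visible through both. copy=True always yields a fresh buffer.
s = pd.Series([1.0, 2.0, 3.0])
maybe_shared = s.astype("float64", copy=False)  # dtype unchanged: may alias s
independent = s.astype("float64", copy=True)    # always an independent copy

independent.iloc[0] = 99.0  # never visible through s
print(s.iloc[0])  # 1.0
```

In `toPandas` the intermediate frame is discarded immediately, which is why the sharing is considered safe there.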
Btw, it seems to me we call `astype` on `Series` instead of `DataFrame`? `Series.astype` has the `copy` parameter since 0.13.0:
http://pandas.pydata.org/pandas-docs/version/0.13.0/generated/pandas.Series.astype.html
http://pandas.pydata.org/pandas-docs/version/0.12.0/generated/pandas.Series.astype.html
Yea, in any event I was mistaken ...
Ah, you mean fixing the description? Let me check and update. Thanks!
Test build #78522 has finished for PR 18403 at commit
**python/setup.py** (Outdated)

```diff
@@ -199,7 +199,7 @@ def _supports_symlinks():
     extras_require={
         'ml': ['numpy>=1.7'],
         'mllib': ['numpy>=1.7'],
-        'sql': ['pandas']
+        'sql': ['pandas>=0.11.0']
```
I think it's OK to require 0.13.0; we may need other new Pandas APIs inside PySpark in the future that only exist from 0.13.0 onwards.
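The mechanism being edited here is setuptools' `extras_require`. A minimal sketch of how the pinned extra behaves (the `'sql'` entry mirrors the final hunk; the surrounding comments are illustrative):

```python
# Mirror of the extras_require mapping from python/setup.py after this
# change; passing it to setuptools.setup() makes each key an optional
# "extra" whose requirements are resolved only on demand.
extras_require = {
    'ml': ['numpy>=1.7'],
    'mllib': ['numpy>=1.7'],
    'sql': ['pandas>=0.13.0'],
}

# e.g. `pip install pyspark[sql]` would then also install pandas>=0.13.0,
# while a plain `pip install pyspark` would not pull in pandas at all.
print(extras_require['sql'][0])
```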
Sure, either way is fine to me.
Thank you @srowen, @viirya and @cloud-fan.
LGTM
Test build #78527 has finished for PR 18403 at commit
thanks, merging to master!
```diff
@@ -199,7 +199,7 @@ def _supports_symlinks():
     extras_require={
         'ml': ['numpy>=1.7'],
         'mllib': ['numpy>=1.7'],
-        'sql': ['pandas']
+        'sql': ['pandas>=0.13.0']
```
BTW, to add the Arrow dependency, can we just add one more entry for arrow?
AFAIK, yes. I guess `pyarrow` is not meant to be a hard dependency requirement.
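One common shape for such a "soft dependency" is a lazy import guard; a sketch under that assumption (`require_pyarrow` is a hypothetical helper name, not a PySpark API):

```python
def require_pyarrow():
    # pyarrow is imported lazily so it never becomes a hard install-time
    # requirement; callers on the Arrow code path get a clear error if
    # it is missing.
    try:
        import pyarrow
        return pyarrow
    except ImportError:
        raise ImportError(
            "pyarrow is required for Arrow-based optimizations; "
            "install it, e.g. via a hypothetical pyspark[arrow] extra.")

# Calling it either returns the module or fails with a clear message.
```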
Actually, @BryanCutler, I just wonder if you are going to open a small follow up for Arrow and related minor doc changes?
Thanks for researching this @HyukjinKwon! I opened a follow-up to add more type support. I can do related docs there and we could also discuss whether or not to add pyarrow to the setup.py file once that's complete.
## What changes were proposed in this pull request?

It looks like we missed specifying the Pandas version. This PR proposes to fix it. Given my tests, the current code requires Pandas 0.13.0, so this PR pins the minimum to 0.13.0.

Running the code below:

```python
from pyspark.sql.types import *

schema = StructType().add("a", IntegerType()).add("b", StringType())\
    .add("c", BooleanType()).add("d", FloatType())
data = [
    (1, "foo", True, 3.0,),
    (2, "foo", True, 5.0),
    (3, "bar", False, -1.0),
    (4, "bar", False, 6.0),
]
spark.createDataFrame(data, schema).toPandas().dtypes
```

prints ...

**With Pandas 0.13.0** - released 2014-01

```
a      int32
b     object
c       bool
d    float32
dtype: object
```

**With Pandas 0.12.0** - released 2013-06

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/sql/dataframe.py", line 1734, in toPandas
    pdf[f] = pdf[f].astype(t, copy=False)
TypeError: astype() got an unexpected keyword argument 'copy'
```

without `copy`:

```
a      int32
b     object
c       bool
d    float32
dtype: object
```

**With Pandas 0.11.0** - released 2013-03

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/sql/dataframe.py", line 1734, in toPandas
    pdf[f] = pdf[f].astype(t, copy=False)
TypeError: astype() got an unexpected keyword argument 'copy'
```

without `copy`:

```
a      int32
b     object
c       bool
d    float32
dtype: object
```

**With Pandas 0.10.0** - released 2012-12

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/sql/dataframe.py", line 1734, in toPandas
    pdf[f] = pdf[f].astype(t, copy=False)
TypeError: astype() got an unexpected keyword argument 'copy'
```

without `copy`:

```
a     int64   # <- this should be 'int32'
b    object
c      bool
d   float64   # <- this should be 'float32'
```

## How was this patch tested?

Manually tested with Pandas from 0.10.0 to 0.13.0.

Author: hyukjinkwon <[email protected]>

Closes apache#18403 from HyukjinKwon/SPARK-21193.
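For context, a package can also verify the installed pandas meets a pinned minimum at import time; a minimal sketch (the helper and tuple parsing are illustrative assumptions, not code from this PR; real code would likely use `packaging.version`):

```python
import pandas as pd

def parse_version(v):
    # Illustrative minimal parser: keep only the leading numeric
    # components, e.g. "0.13.0" -> (0, 13, 0).
    parts = []
    for p in v.split("."):
        if not p.isdigit():
            break
        parts.append(int(p))
    return tuple(parts)

MINIMUM_PANDAS = "0.13.0"

if parse_version(pd.__version__) < parse_version(MINIMUM_PANDAS):
    raise ImportError("pandas >= %s is required; found %s"
                      % (MINIMUM_PANDAS, pd.__version__))
print("pandas %s satisfies the minimum" % pd.__version__)
```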