[SPARK-21193][PYTHON] Specify Pandas version in setup.py #18403
Conversation
**python/pyspark/sql/dataframe.py** (Outdated)

```diff
@@ -1746,7 +1746,7 @@ def toPandas(self):
         pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns)

         for f, t in dtype.items():
             pdf[f] = pdf[f].astype(t, copy=False)
```
@cloud-fan, @viirya, @BryanCutler, @ueshin and @holdenk, while I was testing this, I realised that `copy` is actually exposed from 0.13.0. I was confused because it was added in 0.11.0 - here. However, it seems it was hidden from 0.11.0 to 0.12.0.
What do you think about this? It sounds safer not to use it for now (I found the doc says we should be careful with it), and then we can keep supporting Pandas 0.11.0 and 0.12.0.
It is probably not a big deal either way; 0.13.0 was released 3.5 years ago. Please let me know - I can just fix the minimum version to 0.13.0.
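For context, the compatibility fallback being weighed here could be sketched as follows (`astype_compat` is a hypothetical helper for illustration, not code from this PR):

```python
import pandas as pd

def astype_compat(series, dtype):
    # Hypothetical helper: Series.astype accepts the `copy` keyword only
    # from pandas 0.13.0 onwards, so fall back when an older pandas
    # rejects it.
    try:
        return series.astype(dtype, copy=False)
    except TypeError:
        # pandas 0.11.0 / 0.12.0: astype() has no `copy` keyword
        return series.astype(dtype)

s = pd.Series([1, 2, 3])
print(astype_compat(s, "int32").dtype)  # int32
```

This is roughly the shape a "support both" approach would take; the PR instead opted to pin the minimum version.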
Hmm, I guess it says we should be careful when using it because, with `copy=False`, you may unintentionally change the data of a previous DataFrame that shares the same underlying data. But in our case, I think it is safe.
I have no strong opinion on using `copy=False`. But as you said, I think it's fine to set the minimum version to 0.13.0. Let's see others' opinions.
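A small illustration of the aliasing concern described above (illustrative only; exact buffer-sharing behaviour differs across pandas versions):

```python
import pandas as pd

# With copy=False, astype may return a Series that shares its buffer
# with the source when no actual conversion is needed, so writes could
# become visible through both. copy=True always yields a fresh buffer.
s = pd.Series([1.0, 2.0, 3.0])
maybe_shared = s.astype("float64", copy=False)  # dtype unchanged: may alias s
independent = s.astype("float64", copy=True)    # always an independent copy

independent.iloc[0] = 99.0  # never visible through s
print(s.iloc[0])  # 1.0
```

In `toPandas` the intermediate frame is discarded immediately, which is why the sharing is considered safe there.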
Btw, it seems to me we call `astype` on `Series` instead of `DataFrame`? `Series.astype` has the `copy` parameter since 0.13.0:
http://pandas.pydata.org/pandas-docs/version/0.13.0/generated/pandas.Series.astype.html
http://pandas.pydata.org/pandas-docs/version/0.12.0/generated/pandas.Series.astype.html
Yea, in any event I was mistaken ...
Ah, you mean fixing the description? Let me check and update. Thanks!
Test build #78522 has finished for PR 18403 at commit
**python/setup.py** (Outdated)

```diff
@@ -199,7 +199,7 @@ def _supports_symlinks():
     extras_require={
         'ml': ['numpy>=1.7'],
         'mllib': ['numpy>=1.7'],
-        'sql': ['pandas']
+        'sql': ['pandas>=0.11.0']
```
I think it's OK to require 0.13.0; we may need other new Pandas APIs inside PySpark in the future that only exist from 0.13.0 onwards.
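The mechanism being edited here is setuptools' `extras_require`. A minimal sketch of how the pinned extra behaves (the `'sql'` entry mirrors the final hunk; the surrounding comments are illustrative):

```python
# Mirror of the extras_require mapping from python/setup.py after this
# change; passing it to setuptools.setup() makes each key an optional
# "extra" whose requirements are resolved only on demand.
extras_require = {
    'ml': ['numpy>=1.7'],
    'mllib': ['numpy>=1.7'],
    'sql': ['pandas>=0.13.0'],
}

# e.g. `pip install pyspark[sql]` would then also install pandas>=0.13.0,
# while a plain `pip install pyspark` would not pull in pandas at all.
print(extras_require['sql'][0])
```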
Sure, either way is fine to me.
Thank you @srowen, @viirya and @cloud-fan.
LGTM
Test build #78527 has finished for PR 18403 at commit
thanks, merging to master!
```diff
@@ -199,7 +199,7 @@ def _supports_symlinks():
     extras_require={
         'ml': ['numpy>=1.7'],
         'mllib': ['numpy>=1.7'],
-        'sql': ['pandas']
+        'sql': ['pandas>=0.13.0']
```
BTW, to add the Arrow dependency, can we just add one more entry for arrow?
AFAIK, yes. I guess `pyarrow` is not meant to be a hard dependency requirement.
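One common shape for such a "soft dependency" is a lazy import guard; a sketch under that assumption (`require_pyarrow` is a hypothetical helper name, not a PySpark API):

```python
def require_pyarrow():
    # pyarrow is imported lazily so it never becomes a hard install-time
    # requirement; callers on the Arrow code path get a clear error if
    # it is missing.
    try:
        import pyarrow
        return pyarrow
    except ImportError:
        raise ImportError(
            "pyarrow is required for Arrow-based optimizations; "
            "install it, e.g. via a hypothetical pyspark[arrow] extra.")

# Calling it either returns the module or fails with a clear message.
```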
Actually, @BryanCutler, I just wonder if you are going to open a small follow up for Arrow and related minor doc changes?
Thanks for researching this @HyukjinKwon! I opened a follow-up to add more type support. I can do related docs there and we could also discuss whether or not to add pyarrow to the setup.py file once that's complete.
## What changes were proposed in this pull request?

It looks like we missed specifying the Pandas version. This PR proposes to fix it. Given my tests, the current code requires Pandas 0.13.0, so this PR pins the minimum to 0.13.0.

Running the code below:

```python
from pyspark.sql.types import *

schema = StructType().add("a", IntegerType()).add("b", StringType())\
    .add("c", BooleanType()).add("d", FloatType())
data = [
    (1, "foo", True, 3.0,),
    (2, "foo", True, 5.0),
    (3, "bar", False, -1.0),
    (4, "bar", False, 6.0),
]
spark.createDataFrame(data, schema).toPandas().dtypes
```

prints ...

**With Pandas 0.13.0** - released 2014-01

```
a      int32
b     object
c       bool
d    float32
dtype: object
```

**With Pandas 0.12.0** - released 2013-06

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/sql/dataframe.py", line 1734, in toPandas
    pdf[f] = pdf[f].astype(t, copy=False)
TypeError: astype() got an unexpected keyword argument 'copy'
```

without `copy`:

```
a      int32
b     object
c       bool
d    float32
dtype: object
```

**With Pandas 0.11.0** - released 2013-03

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/sql/dataframe.py", line 1734, in toPandas
    pdf[f] = pdf[f].astype(t, copy=False)
TypeError: astype() got an unexpected keyword argument 'copy'
```

without `copy`:

```
a      int32
b     object
c       bool
d    float32
dtype: object
```

**With Pandas 0.10.0** - released 2012-12

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/sql/dataframe.py", line 1734, in toPandas
    pdf[f] = pdf[f].astype(t, copy=False)
TypeError: astype() got an unexpected keyword argument 'copy'
```

without `copy`:

```
a     int64   # <- this should be 'int32'
b    object
c      bool
d   float64   # <- this should be 'float32'
```

## How was this patch tested?

Manually tested with Pandas from 0.10.0 to 0.13.0.

Author: hyukjinkwon <[email protected]>

Closes apache#18403 from HyukjinKwon/SPARK-21193.
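For context, a package can also verify the installed pandas meets a pinned minimum at import time; a minimal sketch (the helper and tuple parsing are illustrative assumptions, not code from this PR; real code would likely use `packaging.version`):

```python
import pandas as pd

def parse_version(v):
    # Illustrative minimal parser: keep only the leading numeric
    # components, e.g. "0.13.0" -> (0, 13, 0).
    parts = []
    for p in v.split("."):
        if not p.isdigit():
            break
        parts.append(int(p))
    return tuple(parts)

MINIMUM_PANDAS = "0.13.0"

if parse_version(pd.__version__) < parse_version(MINIMUM_PANDAS):
    raise ImportError("pandas >= %s is required; found %s"
                      % (MINIMUM_PANDAS, pd.__version__))
print("pandas %s satisfies the minimum" % pd.__version__)
```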