
[SPARK-21193][PYTHON] Specify Pandas version in setup.py #18403

Closed
wants to merge 1 commit

Conversation

@HyukjinKwon (Member) commented Jun 23, 2017

## What changes were proposed in this pull request?

It looks like we missed specifying the minimum Pandas version in setup.py. This PR proposes to fix it. Given my tests, the minimum should be Pandas 0.13.0, so this PR sets it to 0.13.0.

Running the code below:

```python
from pyspark.sql.types import *

schema = StructType().add("a", IntegerType()).add("b", StringType())\
                     .add("c", BooleanType()).add("d", FloatType())
data = [
    (1, "foo", True, 3.0), (2, "foo", True, 5.0),
    (3, "bar", False, -1.0), (4, "bar", False, 6.0),
]
spark.createDataFrame(data, schema).toPandas().dtypes
```

prints ...

**With Pandas 0.13.0** - released 2014-01

```
a      int32
b     object
c       bool
d    float32
dtype: object
```

**With Pandas 0.12.0** - released 2013-06

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/sql/dataframe.py", line 1734, in toPandas
    pdf[f] = pdf[f].astype(t, copy=False)
TypeError: astype() got an unexpected keyword argument 'copy'
```

without `copy`:

```
a      int32
b     object
c       bool
d    float32
dtype: object
```

**With Pandas 0.11.0** - released 2013-03

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/sql/dataframe.py", line 1734, in toPandas
    pdf[f] = pdf[f].astype(t, copy=False)
TypeError: astype() got an unexpected keyword argument 'copy'
```

without `copy`:

```
a      int32
b     object
c       bool
d    float32
dtype: object
```

**With Pandas 0.10.0** - released 2012-12

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/sql/dataframe.py", line 1734, in toPandas
    pdf[f] = pdf[f].astype(t, copy=False)
TypeError: astype() got an unexpected keyword argument 'copy'
```

without `copy`:

```
a      int64  # <- this should be 'int32'
b     object
c       bool
d    float64  # <- this should be 'float32'
```

## How was this patch tested?

Manually tested with Pandas from 0.10.0 to 0.13.0.

The relevant code in python/pyspark/sql/dataframe.py:

```python
@@ -1746,7 +1746,7 @@ def toPandas(self):
        pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns)

        for f, t in dtype.items():
            pdf[f] = pdf[f].astype(t, copy=False)
```
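The `copy` keyword is the compatibility pinch point here. A minimal sketch of a fallback that would tolerate older Pandas (a hypothetical helper, not part of this PR; dummy classes stand in for a pandas Series so the sketch is self-contained):

```python
def astype_compat(series, dtype):
    """Cast `series` to `dtype`, using copy=False where supported.

    Pandas < 0.13.0 does not accept the `copy` keyword on astype(),
    so fall back to a plain astype() call there.
    """
    try:
        return series.astype(dtype, copy=False)
    except TypeError:
        return series.astype(dtype)


# Stand-ins for old/new pandas Series, just to exercise both paths.
class OldSeries:
    def astype(self, dtype):
        return ("cast", dtype)

class NewSeries:
    def astype(self, dtype, copy=True):
        return ("cast", dtype, copy)

print(astype_compat(OldSeries(), "int32"))  # falls back: ('cast', 'int32')
print(astype_compat(NewSeries(), "int32"))  # ('cast', 'int32', False)
```

The PR instead chose to require a Pandas version where the keyword exists, which keeps the hot path free of exception handling.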
@HyukjinKwon (Member, Author) Jun 23, 2017

@cloud-fan, @viirya, @BryanCutler, @ueshin and @holdenk, while I was testing this, I realised that `copy` actually looks to be exposed only from 0.13.0. I was confused because I thought it was added from 0.11.0 - here. However, it seems it was hidden from 0.11.0 to 0.12.0.

What do you think? It seems safer not to use it for now (the doc says we should be careful with it), and then we can support Pandas 0.11.0 and 0.12.0 as well.

Either way it is probably not a big deal; 0.13.0 was released 3.5 years ago. Please let me know - I can just fix the minimum version to 0.13.0.

@viirya (Member) Jun 23, 2017
Hmm, I guess it says we should be careful when using it because, with `copy=False`, you may unintentionally change the data of a previous DataFrame that shares the same underlying data. But in our case, I think it is safe.

I have no strong opinion on using `copy=False`. But as you said, I think it's fine to set the version to 0.13.0. Let's see others' opinions.
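To illustrate the aliasing hazard being discussed, here is a generic NumPy sketch (not code from this PR): `astype(..., copy=False)` may return the very same underlying buffer when no conversion is needed, so writes propagate back to the original.

```python
import numpy as np

a = np.array([1, 2, 3], dtype=np.int64)

# No conversion is needed, so with copy=False NumPy returns the input itself.
b = a.astype(np.int64, copy=False)
print(b is a)        # True: b aliases a

b[0] = 99
print(a[0])          # 99: the write is visible through a

# A real dtype change still allocates a fresh array, even with copy=False.
c = a.astype(np.float64, copy=False)
c[0] = -1.0
print(a[0])          # still 99: c does not alias a
```

In `toPandas` the intermediate DataFrame is freshly built from `collect()` and never reused, which is why the in-place cast is considered safe here.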

@HyukjinKwon (Member, Author)
Yea, in any event I was mistaken ...

@HyukjinKwon (Member, Author)

Ah, you mean fixing the description? Let me check and update. Thanks!

@SparkQA

SparkQA commented Jun 23, 2017

Test build #78522 has finished for PR 18403 at commit 4ddc54b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

python/setup.py (outdated)

```diff
@@ -199,7 +199,7 @@ def _supports_symlinks():
     extras_require={
         'ml': ['numpy>=1.7'],
         'mllib': ['numpy>=1.7'],
-        'sql': ['pandas']
+        'sql': ['pandas>=0.11.0']
```
Contributor
I think it's ok to require 0.13.0, we may need other new Pandas APIs inside pyspark in the future, that only exist after 0.13.0.

Member Author

Sure, either way is fine to me.

@HyukjinKwon (Member, Author)

Thank you @srowen, @viirya and @cloud-fan.

@viirya (Member) commented Jun 23, 2017

LGTM

@SparkQA

SparkQA commented Jun 23, 2017

Test build #78527 has finished for PR 18403 at commit e04fb89.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

thanks, merging to master!

```diff
@@ -199,7 +199,7 @@ def _supports_symlinks():
     extras_require={
         'ml': ['numpy>=1.7'],
         'mllib': ['numpy>=1.7'],
-        'sql': ['pandas']
+        'sql': ['pandas>=0.13.0']
```
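As an aside on why pins like `pandas>=0.13.0` must be compared numerically rather than lexically, here is a minimal sketch (a hypothetical helper; this is not how setuptools actually parses requirement strings):

```python
def meets_minimum(installed, minimum):
    """Compare dotted version strings component-by-component as integers."""
    as_tuple = lambda v: tuple(int(part) for part in v.split("."))
    return as_tuple(installed) >= as_tuple(minimum)

print(meets_minimum("0.13.0", "0.13.0"))  # True
print(meets_minimum("0.12.0", "0.13.0"))  # False
# A plain string comparison would get this wrong: "0.9.0" > "0.13.0" lexically.
print(meets_minimum("0.9.0", "0.13.0"))   # False, correctly
```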
Contributor
BTW, to add Arrow dependency, can we just add one more entry for arrow?

Member Author

AFAIK, yes. I guess pyarrow is not meant to be a hard dependency requirement.
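A sketch of what that might look like in setup.py (the `arrow` extra name and its unpinned requirement are hypothetical; this was not part of this PR):

```python
extras_require={
    'ml': ['numpy>=1.7'],
    'mllib': ['numpy>=1.7'],
    'sql': ['pandas>=0.13.0'],
    # Hypothetical: keep pyarrow optional rather than a hard dependency.
    'arrow': ['pyarrow'],
},
```

Users would then opt in with something like `pip install pyspark[arrow]`, while plain installs stay lightweight.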

Member Author

Actually, @BryanCutler, I just wonder if you are going to open a small follow up for Arrow and related minor doc changes?

Member

Thanks for researching this @HyukjinKwon! I opened a follow-up to add more type support. I can do related docs there and we could also discuss whether or not to add pyarrow to the setup.py file once that's complete.

@asfgit asfgit closed this in 5dca10b Jun 23, 2017
robert3005 pushed a commit to palantir/spark that referenced this pull request Jun 29, 2017
Author: hyukjinkwon <[email protected]>

Closes apache#18403 from HyukjinKwon/SPARK-21193.
@HyukjinKwon HyukjinKwon deleted the SPARK-21193 branch January 2, 2018 03:38