Skip to content
This repository has been archived by the owner on Nov 22, 2022. It is now read-only.

SPARK-30434 Move pandas related functionalities into 'pandas' sub-package #327

Merged
merged 3 commits into from
Jan 16, 2020

Conversation

zero323
Copy link
Owner

@zero323 zero323 commented Jan 15, 2020

@HyukjinKwon If it is not to much to ask, I'd really appreciate if you could take a quick glance and let me know if you see any obvious issues.

@@ -150,6 +150,22 @@ API Coverage
+------------------------------------------------+---------------------+--------------------+------------+
| `pyspark.sql.group`_ | ✘ | ✔ | |
+------------------------------------------------+---------------------+--------------------+------------+
| `pyspark.sql.pandas`_ | ✔ | ✘ | |
Copy link

@HyukjinKwon HyukjinKwon Jan 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I intended pandas to be Internal too; however, I am maybe ignorant about the meaning of Internal and Mixed here.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That crossed my mind, by at least _ops affect public API.

I think I should drop the idea of internal anyway - these days stub generators are good to keep these in sync.

@@ -75,7 +75,7 @@ pandas_udf(lambda *xs: 42, "str", PandasUDFType.GROUPED_AGG)


[case mapIterUdf]
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.pandas.functions import pandas_udf, PandasUDFType

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am also not so much aware of this code base; however, just wanted to note:

  • original import will work as was.
  • MAP ITER became mapInPandas

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

original import will work as was.

That's primarily to check both sides. In general MyPy is pretty strict about these things so

from pyspark.sql.pandas.functions import pandas_udf, PandasUDFType

would type check fine in pyspark.sql.functions but, not as transitive import. For that I had to switch to aliases.

from pyspark.sql.pandas.functions import pandas_udf as pandas_udf, PandasUDFType as PandasUDFType

Copy link

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good from a cursory look.

@zero323
Copy link
Owner Author

zero323 commented Jan 16, 2020

Looks good from a cursory look.

I couldn't ask for more. Thank you so much @HyukjinKwon!

@zero323 zero323 merged commit ae50c65 into master Jan 16, 2020
@zero323 zero323 deleted the SPARK-30434 branch January 16, 2020 13:53
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants