
[SPARK-23122][PYTHON][SQL] Deprecate register* for UDFs in SQLContext and Catalog in PySpark #20288

Closed
wants to merge 11 commits

Conversation

HyukjinKwon
Member

@HyukjinKwon HyukjinKwon commented Jan 17, 2018

What changes were proposed in this pull request?

This PR proposes to deprecate register* for UDFs in SQLContext and Catalog in Spark 2.3.0.

These are inconsistent with the Scala / Java APIs, and they basically do the same thing as spark.udf.register*.

Also, this PR moves the logic from [sqlContext|spark.catalog].register* to spark.udf.register* and reuses the docstrings.

This PR also handles minor doc corrections. It also includes #20158

How was this patch tested?

Manually tested; also manually checked the API documentation, and added tests to check that the deprecated APIs call their aliases correctly.
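For context, the deprecation pattern under review — a thin alias that warns and delegates to `spark.udf.register` — can be sketched in plain Python. This is a simplified standalone sketch, not the actual PySpark code; the class and method names just mirror the ones in the PR:

```python
import warnings


class UDFRegistration(object):
    """Canonical home for UDF registration (exposed as spark.udf)."""

    def register(self, name, f, returnType=None):
        # The real implementation wraps `f` and registers it with the JVM;
        # here we just return the arguments to make the delegation observable.
        return (name, f, returnType)


class SQLContext(object):
    """The deprecated entry point keeps working but warns and delegates."""

    def __init__(self):
        self.udf = UDFRegistration()

    def registerFunction(self, name, f, returnType=None):
        warnings.warn(
            "Deprecated in 2.3.0. Use spark.udf.register instead.",
            DeprecationWarning)
        return self.udf.register(name, f, returnType)
```

The tests added in the PR check exactly this kind of behavior: the deprecated entry points emit a warning and call through to the canonical method.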

@HyukjinKwon
Member Author

@ueshin and @icexelloss (docstring reassignment) and @cloud-fan (deprecation), could you guys take a look and see if I understood your suggestions correctly?

@SparkQA

SparkQA commented Jan 17, 2018

Test build #86234 has finished for PR 20288 at commit f63105c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class UDFRegistration(object):

@SparkQA

SparkQA commented Jan 17, 2018

Test build #86247 has finished for PR 20288 at commit 08438ee.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -778,6 +778,146 @@ def __exit__(self, exc_type, exc_val, exc_tb):
self.stop()


class UDFRegistration(object):
Contributor

shall we put it in udf.py?

@@ -624,6 +536,9 @@ def _test():
globs['os'] = os
globs['sc'] = sc
globs['sqlContext'] = SQLContext(sc)
# 'spark' alias is a small hack for reusing doctests. Please see the reassignment
# of docstrings above.
globs['spark'] = globs['sqlContext']
Contributor

shall we do globs['spark'] = globs['sqlContext'].sparkSession?

self.sparkSession = sparkSession

@ignore_unicode_prefix
def register(self, name, f, returnType=None):
Contributor

shall we add since 2.3?

registerFunction.__doc__ = """%s
.. note:: :func:`spark.catalog.registerFunction` is an alias
for :func:`spark.udf.register`.
.. note:: Deprecated in 2.3.0. Use :func:`spark.udf.register` instead.
Member

Do we have any plan (e.g. 3.0.0) to remove this alias?

Member Author

I think we should remove it then. WDYT @cloud-fan?

@@ -147,7 +147,8 @@ def udf(self):

:return: :class:`UDFRegistration`
"""
return UDFRegistration(self)
from pyspark.sql.session import UDFRegistration
Member

Why we import UDFRegistration here again? Isn't it imported at the top?

[Row(stringLengthInt(test)=4)]

2. When `f` is a user-defined function, Spark uses the return type of the given a
user-defined function as the return type of the registered a user-defined function.
Member

the registered a user-defined function -> the registered user-defined function

>>> spark.sql("SELECT stringLengthInt('test')").collect()
[Row(stringLengthInt(test)=4)]

2. When `f` is a user-defined function, Spark uses the return type of the given a
Member

of the given a user-defined function -> of the given user-defined function

@@ -147,7 +147,8 @@ def udf(self):

:return: :class:`UDFRegistration`
"""
return UDFRegistration(self)
from pyspark.sql.session import UDFRegistration
return UDFRegistration(self.sparkSession)
Member

How about return self.sparkSession.udf?

@HyukjinKwon
Member Author

Will try to double-check and clean up soon. The comments above all look valid.

@@ -181,3 +183,180 @@ def asNondeterministic(self):
"""
self.deterministic = False
return self


class UDFRegistration(object):
Member Author

This seems to have been introduced in 1.3.1 - https://issues.apache.org/jira/browse/SPARK-6603


@ignore_unicode_prefix
@since(2.3)
def registerJavaFunction(self, name, javaClassName, returnType=None):
Member Author

registerJavaFunction and registerJavaUDAF look like they were introduced in 2.3.0 - https://issues.apache.org/jira/browse/SPARK-19439

:func:`spark.udf.registerJavaFunction`
.. note:: Deprecated in 2.3.0. Use :func:`spark.udf.registerJavaFunction` instead.
.. versionadded:: 2.1
""" % _registerJavaFunction_doc[:_registerJavaFunction_doc.rfind('versionadded::')]
Member Author

We are fine using this rfind approach because the since decorator adds versionadded:: at the end.
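To make the rfind trick concrete: because `since` always appends the `versionadded::` directive at the very end of the docstring, slicing at its last occurrence drops only that directive, and the alias can then append a deprecation note plus its own version. A simplified standalone sketch (the donor docstring here is invented for illustration):

```python
# Donor docstring, as the `since` decorator would leave it: the
# `versionadded::` directive sits at the very end.
_register_doc = """Register a Python function as a UDF.

    .. versionadded:: 2.3
    """.strip()

# rfind locates the trailing directive so we can cut it off and append
# the alias's own deprecation note and version.
registerFunction_doc = """%s
    .. note:: Deprecated in 2.3.0. Use :func:`spark.udf.register` instead.
    .. versionadded:: 1.2
    """ % _register_doc[:_register_doc.rfind('versionadded::')]
```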


@ignore_unicode_prefix
@since(2.3)
def registerJavaUDAF(self, name, javaClassName):
Member Author

We are fine removing this one because it was added within the 2.3.0 timeline - https://issues.apache.org/jira/browse/SPARK-19439

for :func:`spark.udf.register`.
.. note:: Deprecated in 2.3.0. Use :func:`spark.udf.register` instead.
.. versionadded:: 2.0
""" % _register_doc[:_register_doc.rfind('versionadded::')]
Member Author

[screenshot of the rendered documentation]

:func:`spark.udf.register`.
.. note:: Deprecated in 2.3.0. Use :func:`spark.udf.register` instead.
.. versionadded:: 1.2
""" % _register_doc[:_register_doc.rfind('versionadded::')]
Member Author

[screenshot of the rendered documentation]

:func:`spark.udf.registerJavaFunction`
.. note:: Deprecated in 2.3.0. Use :func:`spark.udf.registerJavaFunction` instead.
.. versionadded:: 2.1
""" % _registerJavaFunction_doc[:_registerJavaFunction_doc.rfind('versionadded::')]
Member Author

[screenshot of the rendered documentation]


.. note:: Registration for a user-defined function (case 2.) was added from
Spark 2.3.0.
"""
Member Author

[screenshot of the rendered documentation]

... "test.org.apache.spark.sql.JavaStringLength")
>>> spark.sql("SELECT javaStringLength2('test')").collect()
[Row(UDF:javaStringLength2(test)=4)]
"""
Member Author

[screenshot of the rendered documentation]

@ueshin
Member

ueshin commented Jan 17, 2018

LGTM pending Jenkins.

@cloud-fan
Contributor

LGTM

@HyukjinKwon
Member Author

Just fixed minor doc nits and double checked the built API documentation.

@viirya
Member

viirya commented Jan 17, 2018

LGTM

@SparkQA

SparkQA commented Jan 17, 2018

Test build #86266 has finished for PR 20288 at commit c6ed44a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

DeprecationWarning)
return self._sparkSession.udf.register(name, f, returnType)
# Reuse the docstring from UDFRegistration but with few notes.
_register_doc = UDFRegistration.register.__doc__.strip()
Contributor

An alternative is to do something like:

An alias for :func:`spark.udf.register`

.. note:: Deprecated in 2.3.0. Use :func:`spark.udf.register` instead.
.. versionadded:: 2.0

So we don't need to copy the docstring.

But I am fine either way. @HyukjinKwon you can decide.

Member Author

Oh, wait... I think this is another good alternative. Will double-check and get back today.

@icexelloss
Contributor

LGTM

@SparkQA

SparkQA commented Jan 17, 2018

Test build #86264 has finished for PR 20288 at commit 6b9b9c4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class UDFRegistration(object):

@SparkQA

SparkQA commented Jan 17, 2018

Test build #86265 has finished for PR 20288 at commit f1fe40a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 17, 2018

Test build #86273 has finished for PR 20288 at commit 08ffa1c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 17, 2018

Test build #86274 has finished for PR 20288 at commit 4367beb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member Author

@HyukjinKwon HyukjinKwon left a comment

Hey @viirya, @ueshin and @cloud-fan, it looks like using links is cleaner and simpler. I think we can do things this way in the future too, and it also avoids running the doctests twice.

Let me know if you guys prefer the previous way. I can simply revert it. Last change made is basically - 3e0147b.
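The link-based approach can be sketched as follows: instead of copying the canonical docstring (doctests included), the deprecated alias carries only a short docstring pointing at the canonical method. A simplified sketch, not the actual PySpark source:

```python
class UDFRegistration(object):
    def register(self, name, f, returnType=None):
        """Register a Python function (including lambda) as a UDF.

        >>> spark.udf.register("strlen", lambda s: len(s))  # doctest: +SKIP
        """


class SQLContext(object):
    def registerFunction(self, name, f, returnType=None):
        """An alias for :func:`spark.udf.register`.
        See :meth:`pyspark.sql.UDFRegistration.register`.

        .. note:: Deprecated in 2.3.0. Use :func:`spark.udf.register` instead.
        """
```

Because the alias docstring contains no examples, the doctests live (and run) only on the canonical method, which is how the duplicated doctest runs are avoided.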

@@ -29,9 +29,10 @@
from pyspark.sql.readwriter import DataFrameReader
from pyspark.sql.streaming import DataStreamReader
from pyspark.sql.types import IntegerType, Row, StringType
from pyspark.sql.udf import UDFRegistration
Member Author

I intentionally kept this to retain the import path pyspark.sql.context.UDFRegistration, just in case.

"""An alias for :func:`spark.udf.register`.
See :meth:`pyspark.sql.UDFRegistration.register`.

.. note:: Deprecated in 2.3.0. Use :func:`spark.udf.register` instead.
Member Author

It shows the doc as below:

[screenshot of the rendered documentation]

I checked the link, pyspark.sql.UDFRegistration.register is correct.

@cloud-fan
Contributor

LGTM

@ueshin
Member

ueshin commented Jan 18, 2018

LGTM pending Jenkins.

Member

@viirya viirya left a comment

LGTM

@SparkQA

SparkQA commented Jan 18, 2018

Test build #86308 has finished for PR 20288 at commit c9512a6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 18, 2018

Test build #86306 has finished for PR 20288 at commit 3e0147b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 18, 2018

Test build #86309 has finished for PR 20288 at commit e121273.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ueshin
Member

ueshin commented Jan 18, 2018

Thanks! merging to master/2.3.

asfgit pushed a commit that referenced this pull request Jan 18, 2018
… and Catalog in PySpark

## What changes were proposed in this pull request?

This PR proposes to deprecate `register*` for UDFs in `SQLContext` and `Catalog` in Spark 2.3.0.

These are inconsistent with the Scala / Java APIs, and they basically do the same thing as `spark.udf.register*`.

Also, this PR moves the logic from `[sqlContext|spark.catalog].register*` to `spark.udf.register*` and reuses the docstrings.

This PR also handles minor doc corrections. It also includes #20158

## How was this patch tested?

Manually tested, manually checked the API documentation and tests added to check if deprecated APIs call the aliases correctly.

Author: hyukjinkwon <[email protected]>

Closes #20288 from HyukjinKwon/deprecate-udf.

(cherry picked from commit 39d244d)
Signed-off-by: Takuya UESHIN <[email protected]>
@asfgit asfgit closed this in 39d244d Jan 18, 2018
:return: a user-defined function.

`returnType` can be optionally specified when `f` is a Python function but not
when `f` is a user-defined function. Please see below.
Member

Could you add another paragraph explaining how to register a non-deterministic Python function? This sounds like a common question from end users.
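For reference, the answer in PySpark 2.3 is to call `asNondeterministic()` on the UDF before registering it, along the lines of `spark.udf.register("random_udf", udf(lambda: random.randint(0, 100), IntegerType()).asNondeterministic())`. The flag itself can be sketched in plain Python (a simplified standalone sketch, not the actual PySpark `UserDefinedFunction`):

```python
import random


class UserDefinedFunction(object):
    """Sketch of the non-deterministic flag carried by a UDF."""

    def __init__(self, func):
        self.func = func
        # UDFs are assumed deterministic by default, which lets the
        # optimizer deduplicate or reorder their invocations.
        self.deterministic = True

    def asNondeterministic(self):
        # Flip the flag and return self so the call chains fluently.
        self.deterministic = False
        return self


random_udf = UserDefinedFunction(lambda: random.randint(0, 100))
random_udf = random_udf.asNondeterministic()
```

Marking the UDF non-deterministic tells the optimizer it must not collapse or duplicate calls, so each row gets a fresh evaluation.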

@gatorsmile
Member

LGTM except one minor comment.

Could you submit a follow-up PR?

@HyukjinKwon
Member Author

@gatorsmile, I am sorry, I don't know why I missed this comment.

@HyukjinKwon HyukjinKwon deleted the deprecate-udf branch October 16, 2018 12:44