[SPARK-23122][PYTHON][SQL] Deprecate register* for UDFs in SQLContext and Catalog in PySpark #20288
Conversation
@ueshin and @icexelloss (docstring reassignment) and @cloud-fan (deprecation), could you guys take a look and see if I understood your suggestions correctly?
Test build #86234 has finished for PR 20288 at commit
Test build #86247 has finished for PR 20288 at commit
python/pyspark/sql/session.py (outdated)

    @@ -778,6 +778,146 @@ def __exit__(self, exc_type, exc_val, exc_tb):
            self.stop()


    class UDFRegistration(object):
shall we put it in `udf.py`?
python/pyspark/sql/context.py (outdated)

    @@ -624,6 +536,9 @@ def _test():
        globs['os'] = os
        globs['sc'] = sc
        globs['sqlContext'] = SQLContext(sc)
        # 'spark' alias is a small hack for reusing doctests. Please see the reassignment
        # of docstrings above.
        globs['spark'] = globs['sqlContext']
shall we do `globs['spark'] = globs['sqlContext'].sparkSession`?
python/pyspark/sql/session.py (outdated)

        self.sparkSession = sparkSession

    @ignore_unicode_prefix
    def register(self, name, f, returnType=None):
shall we add `since 2.3`?
python/pyspark/sql/catalog.py (outdated)

    registerFunction.__doc__ = """%s

        .. note:: :func:`spark.catalog.registerFunction` is an alias
            for :func:`spark.udf.register`.

        .. note:: Deprecated in 2.3.0. Use :func:`spark.udf.register` instead.
Do we have any plan (e.g. 3.0.0) to remove this alias?
I think we should remove this. WDYT @cloud-fan?
python/pyspark/sql/context.py (outdated)

    @@ -147,7 +147,8 @@ def udf(self):

            :return: :class:`UDFRegistration`
            """
    -        return UDFRegistration(self)
    +        from pyspark.sql.session import UDFRegistration
Why do we import `UDFRegistration` here again? Isn't it imported at the top?
python/pyspark/sql/session.py (outdated)

        [Row(stringLengthInt(test)=4)]

        2. When `f` is a user-defined function, Spark uses the return type of the given a
            user-defined function as the return type of the registered a user-defined function.
the registered a user-defined function -> the registered user-defined function
python/pyspark/sql/session.py (outdated)

        >>> spark.sql("SELECT stringLengthInt('test')").collect()
        [Row(stringLengthInt(test)=4)]

        2. When `f` is a user-defined function, Spark uses the return type of the given a
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
of the given a user-defined function -> of the given user-defined function
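For readers skimming the thread: case 2 above refers to registering an existing user-defined function, in which case Spark takes the return type from the UDF itself. A minimal sketch of both cases (assuming an active `SparkSession` named `spark`):

```python
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# Case 1: `f` is a plain Python function, so `returnType` may be specified.
spark.udf.register("stringLengthInt", lambda s: len(s), IntegerType())

# Case 2: `f` is already a user-defined function; Spark reuses its return
# type, so `returnType` must not be passed again.
slen = udf(lambda s: len(s), IntegerType())
spark.udf.register("slen", slen)
spark.sql("SELECT slen('test')").collect()  # e.g. [Row(slen(test)=4)]
```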
python/pyspark/sql/context.py (outdated)

    @@ -147,7 +147,8 @@ def udf(self):

            :return: :class:`UDFRegistration`
            """
    -        return UDFRegistration(self)
    +        from pyspark.sql.session import UDFRegistration
    +        return UDFRegistration(self.sparkSession)
How about `return self.sparkSession.udf`?
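A sketch of the reviewer's suggestion, under the assumption that `SparkSession.udf` already returns a `UDFRegistration` (the property shape follows the diff above):

```python
@property
def udf(self):
    """Returns a :class:`UDFRegistration` for UDF registration.

    :return: :class:`UDFRegistration`
    """
    # Delegate to the wrapped SparkSession's registration object instead
    # of constructing a fresh UDFRegistration in SQLContext.
    return self.sparkSession.udf
```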
Will try to double-check and clean up soon. The comments so far above all look valid.
    @@ -181,3 +183,180 @@ def asNondeterministic(self):
            """
            self.deterministic = False
            return self


    class UDFRegistration(object):
This seems to have been introduced in 1.3.1 - https://issues.apache.org/jira/browse/SPARK-6603
    @ignore_unicode_prefix
    @since(2.3)
    def registerJavaFunction(self, name, javaClassName, returnType=None):
`registerJavaFunction` and `registerJavaUDAF` look like they were introduced in 2.3.0 - https://issues.apache.org/jira/browse/SPARK-19439
python/pyspark/sql/context.py (outdated)

                :func:`spark.udf.registerJavaFunction`

        .. note:: Deprecated in 2.3.0. Use :func:`spark.udf.registerJavaFunction` instead.

        .. versionadded:: 2.1
        """ % _registerJavaFunction_doc[:_registerJavaFunction_doc.rfind('versionadded::')]
We are fine to use this `rfind` approach because the `since` decorator adds `versionadded::` at the end.
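A standalone sketch of the `rfind` trick under discussion (the docstrings here are illustrative, not the actual PySpark ones):

```python
# The `since` decorator appends a ".. versionadded::" directive at the end
# of the wrapped function's docstring, so slicing at the last occurrence of
# 'versionadded::' drops only that trailing directive and keeps the rest.
_register_doc = """Registers a Python function as a UDF.

.. versionadded:: 1.3
"""

registerFunction_doc = """%s
.. note:: Deprecated in 2.3.0. Use :func:`spark.udf.register` instead.

.. versionadded:: 1.2
""" % _register_doc[:_register_doc.rfind('versionadded::')]
```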
python/pyspark/sql/context.py (outdated)

    @ignore_unicode_prefix
    @since(2.3)
    def registerJavaUDAF(self, name, javaClassName):
We are fine to remove this one because it was added within the 2.3.0 timeline - https://issues.apache.org/jira/browse/SPARK-19439
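For reference, the surviving API lives on `spark.udf`; a hedged usage sketch (the Java class and table names are illustrative, and the class must be on the classpath):

```python
# Register a Java UDAF by its fully qualified class name so it can be
# called from SQL; the class and table below are illustrative.
spark.udf.registerJavaUDAF("javaUDAF", "test.org.apache.spark.sql.MyDoubleAvg")
spark.sql("SELECT javaUDAF(v) FROM tbl").collect()
```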
python/pyspark/sql/catalog.py (outdated)

                for :func:`spark.udf.register`.

        .. note:: Deprecated in 2.3.0. Use :func:`spark.udf.register` instead.

        .. versionadded:: 2.0
        """ % _register_doc[:_register_doc.rfind('versionadded::')]
python/pyspark/sql/context.py (outdated)

                :func:`spark.udf.register`.

        .. note:: Deprecated in 2.3.0. Use :func:`spark.udf.register` instead.

        .. versionadded:: 1.2
        """ % _register_doc[:_register_doc.rfind('versionadded::')]
python/pyspark/sql/context.py (outdated)

                :func:`spark.udf.registerJavaFunction`

        .. note:: Deprecated in 2.3.0. Use :func:`spark.udf.registerJavaFunction` instead.

        .. versionadded:: 2.1
        """ % _registerJavaFunction_doc[:_registerJavaFunction_doc.rfind('versionadded::')]
        .. note:: Registration for a user-defined function (case 2.) was added from
            Spark 2.3.0.
        """
python/pyspark/sql/udf.py (outdated)

        ...     "test.org.apache.spark.sql.JavaStringLength")
        >>> spark.sql("SELECT javaStringLength2('test')").collect()
        [Row(UDF:javaStringLength2(test)=4)]
        """
LGTM pending Jenkins.
LGTM
Just fixed minor doc nits and double-checked the built API documentation.
LGTM
Test build #86266 has finished for PR 20288 at commit
python/pyspark/sql/catalog.py (outdated)

                DeprecationWarning)
            return self._sparkSession.udf.register(name, f, returnType)

        # Reuse the docstring from UDFRegistration but with few notes.
        _register_doc = UDFRegistration.register.__doc__.strip()
An alternative is to do something like:

    An alias for :func:`spark.udf.register`

    .. note:: Deprecated in 2.3.0. Use :func:`spark.udf.register` instead.

    .. versionadded:: 2.0

So we don't need to copy the docstring. But I am fine either way. @HyukjinKwon you can decide.
Oh, wait ... I think this is another good alternative. I will double-check and get back today.
LGTM
Test build #86264 has finished for PR 20288 at commit
Test build #86265 has finished for PR 20288 at commit
Test build #86273 has finished for PR 20288 at commit
Test build #86274 has finished for PR 20288 at commit
Hey @viirya, @ueshin and @cloud-fan, it looks like using links is cleaner and simpler. I think we can do things this way in the future too, and we can also avoid running duplicate doctests.
Let me know if you guys prefer the previous way; I can simply revert it. The last change made is basically 3e0147b.
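A condensed sketch of the link-style approach described here, combining the pieces visible in the diffs above (a deprecation warning plus delegation; the class skeleton is simplified, not the exact merged code):

```python
import warnings


class Catalog(object):
    """Simplified skeleton of pyspark.sql.catalog.Catalog."""

    def __init__(self, sparkSession):
        self._sparkSession = sparkSession

    def registerFunction(self, name, f, returnType=None):
        """An alias for :func:`spark.udf.register`.
        See :meth:`pyspark.sql.UDFRegistration.register`.

        .. note:: Deprecated in 2.3.0. Use :func:`spark.udf.register` instead.
        """
        # Warn, then delegate to the single implementation on spark.udf so
        # the deprecated alias cannot drift out of sync with it.
        warnings.warn(
            "Deprecated in 2.3.0. Use spark.udf.register instead.",
            DeprecationWarning)
        return self._sparkSession.udf.register(name, f, returnType)
```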
    @@ -29,9 +29,10 @@
    from pyspark.sql.readwriter import DataFrameReader
    from pyspark.sql.streaming import DataStreamReader
    from pyspark.sql.types import IntegerType, Row, StringType
    from pyspark.sql.udf import UDFRegistration
I intentionally kept this to retain the import path `pyspark.sql.context.UDFRegistration`, just in case.
"""An alias for :func:`spark.udf.register`. | ||
See :meth:`pyspark.sql.UDFRegistration.register`. | ||
|
||
.. note:: Deprecated in 2.3.0. Use :func:`spark.udf.register` instead. |
LGTM
LGTM pending Jenkins.
LGTM
Test build #86308 has finished for PR 20288 at commit
Test build #86306 has finished for PR 20288 at commit
Test build #86309 has finished for PR 20288 at commit
Thanks! Merging to master/2.3.
… and Catalog in PySpark

## What changes were proposed in this pull request?

This PR proposes to deprecate `register*` for UDFs in `SQLContext` and `Catalog` in Spark 2.3.0. These are inconsistent with the Scala / Java APIs and also basically do the same things as `spark.udf.register*`.

Also, this PR moves the logic from `[sqlContext|spark.catalog].register*` to `spark.udf.register*` and reuses the docstring.

This PR also handles minor doc corrections. It also includes #20158

## How was this patch tested?

Manually tested, manually checked the API documentation, and added tests to check that deprecated APIs call the aliases correctly.

Author: hyukjinkwon <[email protected]>

Closes #20288 from HyukjinKwon/deprecate-udf.

(cherry picked from commit 39d244d)
Signed-off-by: Takuya UESHIN <[email protected]>
        :return: a user-defined function.

        `returnType` can be optionally specified when `f` is a Python function but not
        when `f` is a user-defined function. Please see below.
Could you add another paragraph explaining how to register a non-deterministic Python function? This sounds like a common question from end users.
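A hedged sketch of what such a paragraph could show, using the `asNondeterministic` API from the diff above (assumes an active `SparkSession` named `spark`):

```python
import random

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# Mark the UDF as non-deterministic so the optimizer does not assume equal
# inputs yield equal outputs (which could otherwise permit re-evaluation
# or duplicate elimination).
random_udf = udf(lambda: random.randint(0, 100), IntegerType()).asNondeterministic()

# Registration preserves the non-deterministic flag.
spark.udf.register("random_udf", random_udf)
spark.sql("SELECT random_udf()").collect()  # e.g. [Row(random_udf()=82)]
```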
LGTM except one minor comment. Could you submit a follow-up PR?
@gatorsmile, I am sorry, I don't know why I missed this comment ...
What changes were proposed in this pull request?

This PR proposes to deprecate `register*` for UDFs in `SQLContext` and `Catalog` in Spark 2.3.0. These are inconsistent with the Scala / Java APIs and also basically do the same things as `spark.udf.register*`.

Also, this PR moves the logic from `[sqlContext|spark.catalog].register*` to `spark.udf.register*` and reuses the docstring.

This PR also handles minor doc corrections. It also includes #20158

How was this patch tested?

Manually tested, manually checked the API documentation, and added tests to check that deprecated APIs call the aliases correctly.
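To make the migration concrete, a hedged before/after sketch (assumes an active `SparkSession` named `spark` and its legacy `SQLContext` named `sqlContext`):

```python
from pyspark.sql.types import IntegerType

# Deprecated as of 2.3.0: the SQLContext / Catalog aliases now emit a
# DeprecationWarning and delegate to spark.udf.register under the hood.
sqlContext.registerFunction("strlen", lambda s: len(s), IntegerType())
spark.catalog.registerFunction("strlen", lambda s: len(s), IntegerType())

# Preferred from 2.3.0 onward: the single entry point on spark.udf,
# consistent with the Scala / Java APIs.
spark.udf.register("strlen", lambda s: len(s), IntegerType())
spark.sql("SELECT strlen('test')").collect()  # e.g. [Row(strlen(test)=4)]
```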