[SPARK-20442][PYTHON][DOCS] Fill up documentations for functions in Column API in PySpark #17737
Conversation
I left some images and comments to make the review easier.
python/pyspark/sql/column.py
Outdated
>>> df3 = spark.createDataFrame([Row(a=170, b=75)])
>>> df3.select(df3.a.bitwiseOR(df3.b)).collect()
[Row((a | b)=235)]
"""
This is matched with the Scala one:

    Compute bitwise OR of this expression with another expression
python/pyspark/sql/column.py
Outdated
>>> df3 = spark.createDataFrame([Row(a=170, b=75)])
>>> df3.select(df3.a.bitwiseAND(df3.b)).collect()
[Row((a & b)=10)]
"""
This is matched with the Scala one:

    Compute bitwise AND of this expression with another expression
python/pyspark/sql/column.py
Outdated
>>> df3 = spark.createDataFrame([Row(a=170, b=75)])
>>> df3.select(df3.a.bitwiseXOR(df3.b)).collect()
[Row((a ^ b)=225)]
"""
This is matched with the Scala one:

    Compute bitwise XOR of this expression with another expression
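As a quick sanity check of the three doctest outputs above, the same arithmetic in plain Python (no Spark needed):

    >>> a, b = 170, 75   # 0b10101010 and 0b01001011
    >>> a | b            # matches [Row((a | b)=235)]
    235
    >>> a & b            # matches [Row((a & b)=10)]
    10
    >>> a ^ b            # matches [Row((a ^ b)=225)]
    225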
@@ -251,15 +286,16 @@ def __iter__(self):

     # string methods
     _rlike_doc = """
-    Return a Boolean :class:`Column` based on a regex match.
+    SQL RLIKE expression (LIKE with Regex). Returns a boolean :class:`Column` based on a regex
+    match.

     :param other: an extended regex expression

     >>> df.filter(df.name.rlike('ice$')).collect()
     [Row(age=2, name=u'Alice')]
     """
@@ -269,17 +305,17 @@ def __iter__(self):
     [Row(age=2, name=u'Alice')]
     """
python/pyspark/sql/column.py
Outdated
>>> df2 = spark.createDataFrame([Row(name=u'Tom', height=80), Row(name=u'Alice', height=None)])
>>> df2.select(df2.name).orderBy(df2.name.asc()).collect()
[Row(name=u'Alice'), Row(name=u'Tom')]
"""
python/pyspark/sql/column.py
Outdated
>>> df2 = spark.createDataFrame([Row(name=u'Tom', height=80), Row(name=u'Alice', height=None)])
>>> df2.select(df2.name).orderBy(df2.name.desc()).collect()
[Row(name=u'Tom'), Row(name=u'Alice')]
"""
@@ -527,7 +584,7 @@ def _test():
         .appName("sql.column tests")\
         .getOrCreate()
     sc = spark.sparkContext
-    globs['sc'] = sc
+    globs['spark'] = spark
I removed `sc` and replaced it with `spark`, as I think that is the way we promote, to my knowledge.
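For context, a minimal sketch of the `spark`-first setup this promotes (the app name follows the snippet above; the `age`/`name` column names are an assumption for illustration):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sql.column tests").getOrCreate()
    sc = spark.sparkContext  # still reachable when an RDD-based API is needed
    df = spark.createDataFrame([(2, 'Alice'), (5, 'Bob')], ['age', 'name'])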
@@ -86,7 +86,7 @@ case class BitwiseOr(left: Expression, right: Expression) extends BinaryArithmetic {
 }

 /**
- * A function that calculates bitwise xor of two numbers.
+ * A function that calculates bitwise xor({@literal ^}) of two numbers.
Matching it up with `BitwiseAnd` and `BitwiseOr`, where:

    A function that calculates bitwise and(&) of two numbers.
    A function that calculates bitwise or(|) of two numbers.
@@ -1008,7 +1009,7 @@ class Column(val expr: Expression) extends Logging {
   def cast(to: String): Column = cast(CatalystSqlParser.parseDataType(to))

   /**
-   * Returns an ordering used in sorting.
+   * Returns a sort expression based on the descending order of the column.
This and the similar instances below are matched with `functions.scala`; they look like they call the same ones:

    Returns a sort expression based on the descending order of the column.

    Returns a sort expression based on the descending order of the column,
    and null values appear before non-null values.

    Returns a sort expression based on the descending order of the column,
    and null values appear after non-null values.
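For illustration only, a hedged PySpark sketch of the three descending variants those docs describe, assuming a Spark version that exposes the `desc_nulls_*` methods on `Column` and the `df2` from the doctests above:

    df2.orderBy(df2.height.desc()).collect()              # default null placement
    df2.orderBy(df2.height.desc_nulls_first()).collect()  # nulls before non-nulls
    df2.orderBy(df2.height.desc_nulls_last()).collect()   # nulls after non-nulls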
Do you want to include example usages, as in the Python documentation? E.g. for rlike:

    * {{{
    *   // find names that start with "Al"
    *   scala> df.filter( $"name".like("Al%") ).collect()
    *   Array([Alice,1])
    * }}}

I made four examples for `rlike`, `like`, `startsWith`, and `endsWith` here: map222@28f97d3

It would also be helpful for the `startsWith` and `endsWith` functions, where there are two versions, e.g. `startsWith(other: Column)` and `startsWith(literal: String)`.
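For reference, a hedged PySpark-side sketch of that two-versions point (PySpark's `startswith` accepts either a literal string or a `Column`; a `df` with a `name` column is assumed):

    from pyspark.sql.functions import lit

    df.filter(df.name.startswith('Al')).collect()       # literal string argument
    df.filter(df.name.startswith(lit('Al'))).collect()  # Column argument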
Yea, that sounds good in a way, but the downside of adding examples is having to maintain them and keep them up to date. Let's leave them out here, as this PR targets fixing the Python documentation.
    _isNull_doc = """
    True if the current expression is null. Often combined with
    :func:`DataFrame.filter` to select rows with null values.
"Often combined with :func:`DataFrame.filter` to select rows with null values." was removed because it looks like it applies to many other APIs and seems too much. It just follows the Scala one now.

cc @srowen, @holdenk, @felixcheung, @map222 and @zero323, who were in related PRs.
Test build #76092 has finished for PR 17737 at commit

Test build #76093 has finished for PR 17737 at commit

Test build #76095 has finished for PR 17737 at commit
LGTM @holdenk
python/pyspark/sql/column.py
Outdated
@@ -251,15 +285,16 @@ def __iter__(self):

     # string methods
     _rlike_doc = """
-    Return a Boolean :class:`Column` based on a regex match.
+    SQL RLIKE expression (LIKE with Regex). Returns a boolean :class:`Column` based on a regex
Could you clarify that `rlike` uses not Python, but Java regular expressions?
Let's leave it so that it indicates the regular expression is in SQL syntax. I would like to keep them identical in most cases to reduce the overhead when someone needs to fix the documentation across APIs in other languages.

It looks like there are a few more places that need the clarification (if needed). If this is something that has to be done, then let's do it in another PR.
The problem is that in Scala or Java, users get the regular expression dialect they expect. In Python they don't (for example, with referencing groups).

But fair enough. Let's leave it for another time.
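A small hedged illustration of the dialect difference: `rlike` patterns are evaluated on the JVM with Java regular expressions, so Java-only constructs work where Python's `re` would reject them (the pattern below is purely illustrative):

    # Java regex supports POSIX-style classes such as \p{Upper}:
    df.filter(df.name.rlike('\\p{Upper}l')).collect()  # matches u'Alice'
    # Python's re module rejects the same construct in modern versions:
    # import re; re.compile('\\p{Upper}l')  # re.error: bad escape \p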
Thank you.
Test build #76129 has started for PR 17737 at commit
Thank you for your review and approval @felixcheung, @zero323 and @map222.
@@ -527,7 +583,7 @@ def _test():
         .appName("sql.column tests")\
         .getOrCreate()
     sc = spark.sparkContext
-    globs['sc'] = sc
+    globs['spark'] = spark
     globs['df'] = sc.parallelize([(2, 'Alice'), (5, 'Bob')]) \
Do you want to update the `globs['df']` definition to `spark.createDataFrame`?
Maybe we could. I think this is not related to the Python documentation fix, BTW.

The point here is to add missing Python documentation and match it with Scala's. I hope other extra changes do not hold off this PR.
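Should that follow-up happen, a minimal sketch of what the suggested change might look like (assuming the same two rows and an `age`/`name` schema):

    globs['df'] = spark.createDataFrame([(2, 'Alice'), (5, 'Bob')], ['age', 'name'])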
retest this please
Test build #76130 has finished for PR 17737 at commit
@holdenk do you have bandwidth to review this, or are you ok with me pushing this to master?

gentle ping @holdenk
python/pyspark/sql/column.py
Outdated
    this :class:`Column`.

    >>> from pyspark.sql import Row
    >>> df3 = spark.createDataFrame([Row(a=170, b=75)])
Why `df3` instead of `df`?
I think there is a global `df` variable when running doctests, and I guess `df3` was used to avoid shadowing the same name from the outer scope in some doctests, whereas other doctests just shadow it. I get your point. In the documentation we will only see the code block, and I guess using `df` might be slightly better.

AFAIK, Python documentation usually has self-contained doctests, so I don't know which case is better and correct. If you could confirm this, I can sweep it.
Let me correct this, as I guess you probably prefer `df` and I don't have a preference.
@holdenk, @felixcheung and @gatorsmile, could this get merged?
Test build #76302 has finished for PR 17737 at commit
I can take a review look today; sorry this week has been so busy.
Thank you for improving the documentation @HyukjinKwon, looks good to me.
And thanks to everyone for reviewing, I'll merge this to master :)
What changes were proposed in this pull request?

This PR proposes to fill up the documentation with examples for `bitwiseOR`, `bitwiseAND`, `bitwiseXOR`, `contains`, `asc` and `desc` in the `Column` API.

Also, this PR fixes minor typos in the documentation and matches some of the contents between the Scala doc and the Python doc.

Lastly, this PR suggests using `spark` rather than `sc` in doctests in `Column` for the Python documentation.

How was this patch tested?
Doc tests were added and manually tested with the commands below:
./python/run-tests.py --module pyspark-sql
./python/run-tests.py --module pyspark-sql --python-executable python3
./dev/lint-python
Output was checked via `make html` under `./python/docs`. The snapshots will be left as comments on the code.