
[SPARK-20442][PYTHON][DOCS] Fill up documentations for functions in Column API in PySpark #17737

Closed · wants to merge 6 commits

Conversation

@HyukjinKwon (Member) commented Apr 24, 2017

What changes were proposed in this pull request?

This PR proposes to fill up the documentation with examples for bitwiseOR, bitwiseAND, bitwiseXOR, contains, asc and desc in the Column API.

Also, this PR fixes minor typos in the documentation and matches some of the contents between Scala doc and Python doc.

Lastly, this PR suggests using spark rather than sc in the doctests in Column for the Python documentation.
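
For context, a minimal sketch of the style of doctest added for contains (hedged: the df built here is a hypothetical stand-in for the doctest DataFrame; the exact examples are in the diff below):

>>> df = spark.createDataFrame([(2, 'Alice'), (5, 'Bob')], ['age', 'name'])
>>> df.filter(df.name.contains('o')).collect()
[Row(age=5, name=u'Bob')]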

How was this patch tested?

Doc tests were added and manually tested with the commands below:

./python/run-tests.py --module pyspark-sql
./python/run-tests.py --module pyspark-sql --python-executable python3
./dev/lint-python

Output was checked via make html under ./python/docs. Screenshots are attached to the inline review comments below.

@HyukjinKwon (Member, author) left a comment:

I left some images and comments to make review easier.

>>> df3 = spark.createDataFrame([Row(a=170, b=75)])
>>> df3.select(df3.a.bitwiseOR(df3.b)).collect()
[Row((a | b)=235)]
"""
[Screenshot: rendered bitwiseOR documentation]

This matches the Scala one:

Compute bitwise OR of this expression with another expression

>>> df3 = spark.createDataFrame([Row(a=170, b=75)])
>>> df3.select(df3.a.bitwiseAND(df3.b)).collect()
[Row((a & b)=10)]
"""
[Screenshot: rendered bitwiseAND documentation]

This matches the Scala one:

Compute bitwise AND of this expression with another expression

>>> df3 = spark.createDataFrame([Row(a=170, b=75)])
>>> df3.select(df3.a.bitwiseXOR(df3.b)).collect()
[Row((a ^ b)=225)]
"""
[Screenshot: rendered bitwiseXOR documentation]

This matches the Scala one:

Compute bitwise XOR of this expression with another expression
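
As a quick sanity check on the expected values in the three doctests above, the bit patterns of 170 and 75 work out as follows (plain Python, no Spark required):

>>> a, b = 170, 75  # 0b10101010 and 0b01001011
>>> (a | b, a & b, a ^ b)
(235, 10, 225)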

@@ -251,15 +286,16 @@ def __iter__(self):

# string methods
_rlike_doc = """
- Return a Boolean :class:`Column` based on a regex match.
+ SQL RLIKE expression (LIKE with Regex). Returns a boolean :class:`Column` based on a regex
+ match.

:param other: an extended regex expression

>>> df.filter(df.name.rlike('ice$')).collect()
[Row(age=2, name=u'Alice')]
"""
[Screenshot: rendered rlike documentation]

@@ -269,17 +305,17 @@ def __iter__(self):
[Row(age=2, name=u'Alice')]
"""
[Screenshot: rendered documentation for this hunk]

>>> df2 = spark.createDataFrame([Row(name=u'Tom', height=80), Row(name=u'Alice', height=None)])
>>> df2.select(df2.name).orderBy(df2.name.asc()).collect()
[Row(name=u'Alice'), Row(name=u'Tom')]
"""
[Screenshot: rendered asc documentation]

>>> df2 = spark.createDataFrame([Row(name=u'Tom', height=80), Row(name=u'Alice', height=None)])
>>> df2.select(df2.name).orderBy(df2.name.desc()).collect()
[Row(name=u'Tom'), Row(name=u'Alice')]
"""
[Screenshot: rendered desc documentation]

@@ -527,7 +584,7 @@ def _test():
.appName("sql.column tests")\
.getOrCreate()
sc = spark.sparkContext
- globs['sc'] = sc
+ globs['spark'] = spark
I removed sc and replaced it with spark since, to my knowledge, that is the way we promote now.
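
A minimal sketch of the resulting doctest setup, assuming a simplified globals dict (the real _test() builds its globals from the module namespace):

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("sql.column tests") \
    .getOrCreate()

globs = {}                  # simplified; stands in for the module's doctest globals
globs['spark'] = spark      # doctests now reference `spark` directly instead of `sc`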

@@ -86,7 +86,7 @@ case class BitwiseOr(left: Expression, right: Expression) extends BinaryArithmet
}

/**
- * A function that calculates bitwise xor of two numbers.
+ * A function that calculates bitwise xor({@literal ^}) of two numbers.
Matching it up with BitwiseAnd and BitwiseOr, where:

A function that calculates bitwise and(&) of two numbers.

A function that calculates bitwise or(|) of two numbers.

@@ -1008,7 +1009,7 @@ class Column(val expr: Expression) extends Logging {
def cast(to: String): Column = cast(CatalystSqlParser.parseDataType(to))

/**
- * Returns an ordering used in sorting.
+ * Returns a sort expression based on the descending order of the column.
This and the similar instances below are matched with functions.scala; they appear to call the same functions:

Returns a sort expression based on the descending order of the column.

Returns a sort expression based on the descending order of the column,
and null values appear before non-null values.

Returns a sort expression based on the descending order of the column,
and null values appear after non-null values.
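
A hedged PySpark sketch of the default being documented, assuming Spark's defaults (nulls sort last under descending order; the df4 here is hypothetical):

>>> from pyspark.sql import Row
>>> df4 = spark.createDataFrame([Row(name=u'Tom'), Row(name=u'Alice'), Row(name=None)])
>>> df4.orderBy(df4.name.desc()).collect()
[Row(name=u'Tom'), Row(name=u'Alice'), Row(name=None)]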

@map222 (Contributor) commented Apr 24, 2017
Do you want to include example usages, as in the Python documentation? E.g. for rlike:

   * {{{
   *   // find names that start with "Al"
   *   scala> df.filter( $"name".like("Al%") ).collect()
   *   Array([Alice,1])
   * }}}

I made four examples for rlike, like, startsWith, and endsWith here: map222@28f97d3

It would also be helpful for the startsWith and endsWith functions, where there are two versions, e.g. startsWith(other: Column) and startsWith(literal: String).
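
For what it's worth, the PySpark counterpart (lowercase startswith) already accepts both forms; a hedged sketch against the doctest df:

>>> df.filter(df.name.startswith('Al')).collect()
[Row(age=2, name=u'Alice')]
>>> from pyspark.sql.functions import lit
>>> df.filter(df.name.startswith(lit('Al'))).collect()
[Row(age=2, name=u'Alice')]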

@HyukjinKwon (author):

Yeah, that sounds good in a way, but the downside of adding examples is having to maintain them and keep them up to date. Let's leave them out here, as this PR targets fixing the Python documentation.


_isNull_doc = """
True if the current expression is null. Often combined with
:func:`DataFrame.filter` to select rows with null values.
@HyukjinKwon (author) commented Apr 24, 2017

"Often combined with :func:`DataFrame.filter` to select rows with null values." was removed because it applies to many other APIs as well and felt like too much. It now just follows the Scala one.
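
For reference, a hedged sketch of the usage the removed sentence described, reusing the df2 shape from the asc/desc doctests above:

>>> from pyspark.sql import Row
>>> df2 = spark.createDataFrame([Row(name=u'Tom', height=80), Row(name=u'Alice', height=None)])
>>> df2.filter(df2.height.isNull()).select(df2.name).collect()
[Row(name=u'Alice')]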

@HyukjinKwon (author):

cc @srowen, @holdenk, @felixcheung, @map222 and @zero323 who were in related PRs.

@SparkQA commented Apr 24, 2017

Test build #76092 has finished for PR 17737 at commit bb5de1f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Apr 24, 2017

Test build #76093 has finished for PR 17737 at commit af8ac74.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Apr 24, 2017

Test build #76095 has finished for PR 17737 at commit 2815ff1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung (Member) left a comment:

LGTM @holdenk

@@ -251,15 +285,16 @@ def __iter__(self):

# string methods
_rlike_doc = """
- Return a Boolean :class:`Column` based on a regex match.
+ SQL RLIKE expression (LIKE with Regex). Returns a boolean :class:`Column` based on a regex
Member:

Could you clarify that rlike uses Java regular expressions, not Python ones?

@HyukjinKwon (author) commented Apr 25, 2017

Let's leave it as it is, so that it indicates the regular expression is in SQL syntax. I would like to keep the docs identical in most cases, to reduce the overhead when someone needs to fix the documentation across the APIs in other languages.

It looks like there are a few more places that would need this clarification (if it is needed). If this is something that has to be done, let's do it in another PR.

@zero323 (Member) commented Apr 25, 2017

The problem is that in Scala or Java, users get the regular-expression dialect they expect. In Python they don't (for example, when referencing groups).

But fair enough. Let's leave it for another time.
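
To illustrate the dialect difference: the pattern string is compiled by the JVM's java.util.regex, so Java's named-group syntax applies rather than Python's (a hedged sketch against the doctest df):

>>> # Java accepts (?<name>...) named groups; Python-style (?P<name>...)
>>> # would fail here, since Python's re never sees the pattern.
>>> df.filter(df.name.rlike('(?<prefix>Al)')).collect()
[Row(age=2, name=u'Alice')]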

@HyukjinKwon (author):

Thank you.

@SparkQA commented Apr 25, 2017

Test build #76129 has started for PR 17737 at commit eaeb456.

@HyukjinKwon (author):

Thank you for your review and approval @felixcheung, @zero323 and @map222.

@@ -527,7 +583,7 @@ def _test():
.appName("sql.column tests")\
.getOrCreate()
sc = spark.sparkContext
- globs['sc'] = sc
+ globs['spark'] = spark
globs['df'] = sc.parallelize([(2, 'Alice'), (5, 'Bob')]) \
Contributor:
Do you want to update the globs['df'] definition to spark.createDataFrame?

@HyukjinKwon (author):
Maybe we could. I think that is not related to the Python documentation fix, though.
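
For the record, the suggested change would look roughly like this (a hedged sketch; the explicit column names are an assumption, and the PR kept the sc.parallelize version):

globs['df'] = spark.createDataFrame([(2, 'Alice'), (5, 'Bob')], ['age', 'name'])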

@HyukjinKwon (author) commented Apr 25, 2017

The point here is to add the missing Python documentation, and to match the docs for bitwiseOR, bitwiseAND, bitwiseXOR, contains, asc and desc where they diverge among functions.py, column.py, functions.scala and Column.scala.

I hope the other, extra changes do not hold up this PR.

@HyukjinKwon (author):

retest this please

@SparkQA commented Apr 25, 2017

Test build #76130 has finished for PR 17737 at commit eaeb456.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung (Member):

@holdenk do you have bandwidth to review this or ok with me pushing this to master?

@HyukjinKwon (author):

gentle ping @holdenk

this :class:`Column`.

>>> from pyspark.sql import Row
>>> df3 = spark.createDataFrame([Row(a=170, b=75)])
Member:
Why df3 instead of df?

@HyukjinKwon (author) commented Apr 29, 2017

I think there is a global df variable when running the doctests, and I guess df3 was used to avoid shadowing that name from the outer scope in some doctests, whereas other doctests just shadow it. I get your point: in the documentation we will only see the code block, and using df might be slightly better.

AFAIK, Python documentation usually has self-contained doctests, so I don't know which way is better and correct. If you can confirm, I can sweep it.

@HyukjinKwon (author):

Let me correct this, as I guess you probably prefer df and I don't have a preference.

@HyukjinKwon (author):

@holdenk, @felixcheung and @gatorsmile, could this get merged?

@SparkQA commented Apr 29, 2017

Test build #76302 has finished for PR 17737 at commit 0fd9e37.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk (Contributor) commented Apr 29, 2017

I can take a review look today, sorry this week has been so busy.

@holdenk (Contributor) commented Apr 29, 2017

Thank you for improving the documentation, @HyukjinKwon; looks good to me.

@holdenk (Contributor) commented Apr 29, 2017

And thanks to everyone for reviewing, I'll merge this to master :)

@asfgit asfgit closed this in d228cd0 Apr 29, 2017
@HyukjinKwon HyukjinKwon deleted the SPARK-20442 branch January 2, 2018 03:43