
[SPARK-20442][PYTHON][DOCS] Fill up documentations for functions in Column API in PySpark #17737

Closed · wants to merge 6 commits

Conversation

@HyukjinKwon (Member) commented Apr 24, 2017

What changes were proposed in this pull request?

This PR proposes to fill up the documentation with examples for bitwiseOR, bitwiseAND, bitwiseXOR, contains, asc and desc in the Column API.

Also, this PR fixes minor typos in the documentation and matches some of the contents between Scala doc and Python doc.

Lastly, this PR suggests using spark rather than sc in the doctests in Column for the Python documentation.
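
For context, a minimal sketch of the style of doctest added for contains (hedged: the df built here is a hypothetical stand-in for the doctest DataFrame; the exact examples are in the diff below):

>>> df = spark.createDataFrame([(2, 'Alice'), (5, 'Bob')], ['age', 'name'])
>>> df.filter(df.name.contains('o')).collect()
[Row(age=5, name=u'Bob')]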

How was this patch tested?

Doc tests were added and manually tested with the commands below:

./python/run-tests.py --module pyspark-sql
./python/run-tests.py --module pyspark-sql --python-executable python3
./dev/lint-python

Output was checked via make html under ./python/docs. Screenshots are attached to the inline review comments below.

@HyukjinKwon (Member, author) left a comment:

I left some images and comments to make review easier.

>>> df3 = spark.createDataFrame([Row(a=170, b=75)])
>>> df3.select(df3.a.bitwiseOR(df3.b)).collect()
[Row((a | b)=235)]
"""
[Screenshot: rendered bitwiseOR documentation]

This matches the Scala one:

Compute bitwise OR of this expression with another expression

>>> df3 = spark.createDataFrame([Row(a=170, b=75)])
>>> df3.select(df3.a.bitwiseAND(df3.b)).collect()
[Row((a & b)=10)]
"""
[Screenshot: rendered bitwiseAND documentation]

This matches the Scala one:

Compute bitwise AND of this expression with another expression

>>> df3 = spark.createDataFrame([Row(a=170, b=75)])
>>> df3.select(df3.a.bitwiseXOR(df3.b)).collect()
[Row((a ^ b)=225)]
"""
[Screenshot: rendered bitwiseXOR documentation]

This matches the Scala one:

Compute bitwise XOR of this expression with another expression
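
As a quick sanity check on the expected values in the three doctests above, the bit patterns of 170 and 75 work out as follows (plain Python, no Spark required):

>>> a, b = 170, 75  # 0b10101010 and 0b01001011
>>> (a | b, a & b, a ^ b)
(235, 10, 225)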

@@ -251,15 +286,16 @@ def __iter__(self):

# string methods
_rlike_doc = """
- Return a Boolean :class:`Column` based on a regex match.
+ SQL RLIKE expression (LIKE with Regex). Returns a boolean :class:`Column` based on a regex
+ match.

:param other: an extended regex expression

>>> df.filter(df.name.rlike('ice$')).collect()
[Row(age=2, name=u'Alice')]
"""
[Screenshot: rendered rlike documentation]

@@ -269,17 +305,17 @@ def __iter__(self):
[Row(age=2, name=u'Alice')]
"""
[Screenshot: rendered documentation for this hunk]

>>> df2 = spark.createDataFrame([Row(name=u'Tom', height=80), Row(name=u'Alice', height=None)])
>>> df2.select(df2.name).orderBy(df2.name.asc()).collect()
[Row(name=u'Alice'), Row(name=u'Tom')]
"""
[Screenshot: rendered asc documentation]

>>> df2 = spark.createDataFrame([Row(name=u'Tom', height=80), Row(name=u'Alice', height=None)])
>>> df2.select(df2.name).orderBy(df2.name.desc()).collect()
[Row(name=u'Tom'), Row(name=u'Alice')]
"""
[Screenshot: rendered desc documentation]

@@ -527,7 +584,7 @@ def _test():
.appName("sql.column tests")\
.getOrCreate()
sc = spark.sparkContext
- globs['sc'] = sc
+ globs['spark'] = spark
I removed sc and replaced it with spark since, to my knowledge, that is the way we promote now.
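
A minimal sketch of the resulting doctest setup, assuming a simplified globals dict (the real _test() builds its globals from the module namespace):

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("sql.column tests") \
    .getOrCreate()

globs = {}                  # simplified; stands in for the module's doctest globals
globs['spark'] = spark      # doctests now reference `spark` directly instead of `sc`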

@@ -86,7 +86,7 @@ case class BitwiseOr(left: Expression, right: Expression) extends BinaryArithmet
}

/**
- * A function that calculates bitwise xor of two numbers.
+ * A function that calculates bitwise xor({@literal ^}) of two numbers.
Matching it up with BitwiseAnd and BitwiseOr, where:

A function that calculates bitwise and(&) of two numbers.

A function that calculates bitwise or(|) of two numbers.

@@ -1008,7 +1009,7 @@ class Column(val expr: Expression) extends Logging {
def cast(to: String): Column = cast(CatalystSqlParser.parseDataType(to))

/**
- * Returns an ordering used in sorting.
+ * Returns a sort expression based on the descending order of the column.
This and the similar instances below are matched with functions.scala; they appear to call the same functions:

Returns a sort expression based on the descending order of the column.

Returns a sort expression based on the descending order of the column,
and null values appear before non-null values.

Returns a sort expression based on the descending order of the column,
and null values appear after non-null values.
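
A hedged PySpark sketch of the default being documented, assuming Spark's defaults (nulls sort last under descending order; the df4 here is hypothetical):

>>> from pyspark.sql import Row
>>> df4 = spark.createDataFrame([Row(name=u'Tom'), Row(name=u'Alice'), Row(name=None)])
>>> df4.orderBy(df4.name.desc()).collect()
[Row(name=u'Tom'), Row(name=u'Alice'), Row(name=None)]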

@map222 (Contributor) commented Apr 24, 2017
Do you want to include example usages, as in the Python documentation? E.g. for rlike:

   * {{{
   *   // find names that start with "Al"
   *   scala> df.filter( $"name".like("Al%") ).collect()
   *   Array([Alice,1])
   * }}}

I made four examples for rlike, like, startsWith, and endsWith here: map222@28f97d3

It would also be helpful for the startsWith and endsWith functions, where there are two versions, e.g. startsWith(other: Column) and startsWith(literal: String).
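
For what it's worth, the PySpark counterpart (lowercase startswith) already accepts both forms; a hedged sketch against the doctest df:

>>> df.filter(df.name.startswith('Al')).collect()
[Row(age=2, name=u'Alice')]
>>> from pyspark.sql.functions import lit
>>> df.filter(df.name.startswith(lit('Al'))).collect()
[Row(age=2, name=u'Alice')]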

@HyukjinKwon (author):

Yeah, that sounds good in a way, but the downside of adding examples is having to maintain them and keep them up to date. Let's leave them out here, as this PR targets fixing the Python documentation.


_isNull_doc = """
True if the current expression is null. Often combined with
:func:`DataFrame.filter` to select rows with null values.
@HyukjinKwon (author) commented Apr 24, 2017

"Often combined with :func:`DataFrame.filter` to select rows with null values." was removed because it applies to many other APIs as well and felt like too much. It now just follows the Scala one.
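
For reference, a hedged sketch of the usage the removed sentence described, reusing the df2 shape from the asc/desc doctests above:

>>> from pyspark.sql import Row
>>> df2 = spark.createDataFrame([Row(name=u'Tom', height=80), Row(name=u'Alice', height=None)])
>>> df2.filter(df2.height.isNull()).select(df2.name).collect()
[Row(name=u'Alice')]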

@HyukjinKwon (author):

cc @srowen, @holdenk, @felixcheung, @map222 and @zero323 who were in related PRs.

@SparkQA commented Apr 24, 2017

Test build #76092 has finished for PR 17737 at commit bb5de1f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Apr 24, 2017

Test build #76093 has finished for PR 17737 at commit af8ac74.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Apr 24, 2017

Test build #76095 has finished for PR 17737 at commit 2815ff1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung (Member) left a comment:

LGTM @holdenk

@@ -251,15 +285,16 @@ def __iter__(self):

# string methods
_rlike_doc = """
- Return a Boolean :class:`Column` based on a regex match.
+ SQL RLIKE expression (LIKE with Regex). Returns a boolean :class:`Column` based on a regex
Member:

Could you clarify that rlike uses Java regular expressions, not Python ones?

@HyukjinKwon (author) commented Apr 25, 2017

Let's leave it as it is, so that it indicates the regular expression is in SQL syntax. I would like to keep the docs identical in most cases, to reduce the overhead when someone needs to fix the documentation across the APIs in other languages.

It looks like there are a few more places that would need this clarification (if it is needed). If this is something that has to be done, let's do it in another PR.

@zero323 (Member) commented Apr 25, 2017

The problem is that in Scala or Java, users get the regular-expression dialect they expect. In Python they don't (for example, when referencing groups).

But fair enough. Let's leave it for another time.
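
To illustrate the dialect difference: the pattern string is compiled by the JVM's java.util.regex, so Java's named-group syntax applies rather than Python's (a hedged sketch against the doctest df):

>>> # Java accepts (?<name>...) named groups; Python-style (?P<name>...)
>>> # would fail here, since Python's re never sees the pattern.
>>> df.filter(df.name.rlike('(?<prefix>Al)')).collect()
[Row(age=2, name=u'Alice')]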

@HyukjinKwon (author):

Thank you.

@SparkQA commented Apr 25, 2017

Test build #76129 has started for PR 17737 at commit eaeb456.

@HyukjinKwon (author):

Thank you for your review and approval @felixcheung, @zero323 and @map222.

@@ -527,7 +583,7 @@ def _test():
.appName("sql.column tests")\
.getOrCreate()
sc = spark.sparkContext
- globs['sc'] = sc
+ globs['spark'] = spark
globs['df'] = sc.parallelize([(2, 'Alice'), (5, 'Bob')]) \
Contributor:
Do you want to update the globs['df'] definition to spark.createDataFrame?

@HyukjinKwon (author):
Maybe we could. I think that is not related to the Python documentation fix, though.
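
For the record, the suggested change would look roughly like this (a hedged sketch; the explicit column names are an assumption, and the PR kept the sc.parallelize version):

globs['df'] = spark.createDataFrame([(2, 'Alice'), (5, 'Bob')], ['age', 'name'])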

@HyukjinKwon (author) commented Apr 25, 2017

The point here is to add the missing Python documentation, and to match the docs for bitwiseOR, bitwiseAND, bitwiseXOR, contains, asc and desc where they diverge among functions.py, column.py, functions.scala and Column.scala.

I hope the other, extra changes do not hold up this PR.

@HyukjinKwon (author):

retest this please

@SparkQA commented Apr 25, 2017

Test build #76130 has finished for PR 17737 at commit eaeb456.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung (Member):

@holdenk do you have bandwidth to review this or ok with me pushing this to master?

@HyukjinKwon (author):

gentle ping @holdenk

this :class:`Column`.

>>> from pyspark.sql import Row
>>> df3 = spark.createDataFrame([Row(a=170, b=75)])
Member:
Why df3 instead of df?

@HyukjinKwon (author) commented Apr 29, 2017

I think there is a global df variable when running the doctests, and I guess df3 was used to avoid shadowing that name from the outer scope in some doctests, whereas other doctests just shadow it. I get your point: in the documentation we will only see the code block, and using df might be slightly better.

AFAIK, Python documentation usually has self-contained doctests, so I don't know which way is better and correct. If you can confirm, I can sweep it.

@HyukjinKwon (author):

Let me correct this, as I guess you probably prefer df and I don't have a preference.

@HyukjinKwon (author):

@holdenk, @felixcheung and @gatorsmile, could this get merged?

@SparkQA commented Apr 29, 2017

Test build #76302 has finished for PR 17737 at commit 0fd9e37.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk (Contributor) commented Apr 29, 2017

I can take a review look today, sorry this week has been so busy.

@holdenk (Contributor) commented Apr 29, 2017

Thank you for improving the documentation, @HyukjinKwon; looks good to me.

@holdenk (Contributor) commented Apr 29, 2017

And thanks to everyone for reviewing, I'll merge this to master :)

@asfgit asfgit closed this in d228cd0 Apr 29, 2017
@HyukjinKwon HyukjinKwon deleted the SPARK-20442 branch January 2, 2018 03:43