[SPARK-20442][PYTHON][DOCS] Fill up documentations for functions in Column API in PySpark #17737

Closed (wants to merge 6 commits)

Changes from 1 commit

101 changes: 79 additions & 22 deletions python/pyspark/sql/column.py
@@ -185,17 +185,52 @@ def __contains__(self, item):
"in a string column or 'array_contains' function for an array column.")

# bitwise operators
bitwiseOR = _bin_op("bitwiseOR")
bitwiseAND = _bin_op("bitwiseAND")
bitwiseXOR = _bin_op("bitwiseXOR")
_bitwiseOR_doc = """
Compute bitwise OR of this expression with another expression.

:param other: a value or :class:`Column` to calculate bitwise or(|) against
this :class:`Column`.

>>> from pyspark.sql import Row
>>> df3 = spark.createDataFrame([Row(a=170, b=75)])
Member:
Why df3 instead of df?

Member Author (@HyukjinKwon, Apr 29, 2017):
I think there is a global df variable when running doctests, and I guess it was avoided in some doctests so as not to shadow the same name from the outer scope, whereas other doctests just shadow it. I get your point. In the documentation we will only see the code block, and I guess using df might be slightly better.

AFAIK, Python documentation usually has self-contained doctests, so I don't know which case is better and correct. If you could confirm this, I can sweep it.

Member Author:

Let me correct this, as I guess you probably prefer df; I don't have a preference.

>>> df3.select(df3.a.bitwiseOR(df3.b)).collect()
[Row((a | b)=235)]
"""
Member Author:
This matches the Scala one:

Compute bitwise OR of this expression with another expression


_bitwiseAND_doc = """
Compute bitwise AND of this expression with another expression.

:param other: a value or :class:`Column` to calculate bitwise and(&) against
this :class:`Column`.

>>> from pyspark.sql import Row
>>> df3 = spark.createDataFrame([Row(a=170, b=75)])
>>> df3.select(df3.a.bitwiseAND(df3.b)).collect()
[Row((a & b)=10)]
"""
Member Author:
This matches the Scala one:

Compute bitwise AND of this expression with another expression


_bitwiseXOR_doc = """
Compute bitwise XOR of this expression with another expression.

:param other: a value or :class:`Column` to calculate bitwise xor(^) against
this :class:`Column`.

>>> from pyspark.sql import Row
>>> df3 = spark.createDataFrame([Row(a=170, b=75)])
>>> df3.select(df3.a.bitwiseXOR(df3.b)).collect()
[Row((a ^ b)=225)]
"""
Member Author:
This matches the Scala one:

Compute bitwise XOR of this expression with another expression

bitwiseOR = _bin_op("bitwiseOR", _bitwiseOR_doc)
bitwiseAND = _bin_op("bitwiseAND", _bitwiseAND_doc)
bitwiseXOR = _bin_op("bitwiseXOR", _bitwiseXOR_doc)
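For context, these _bin_op calls generate the actual methods and attach the docstrings defined above to them. A minimal sketch of that helper, which lives earlier in column.py and is not shown in this diff (approximate, for illustration only):

def _bin_op(name, doc="binary operator"):
    """Create a Column method that delegates to the named JVM Column method."""
    def _(self, other):
        # Unwrap a Column argument to its underlying Java column;
        # plain Python values are passed through as-is.
        jc = other._jc if isinstance(other, Column) else other
        njc = getattr(self._jc, name)(jc)
        return Column(njc)
    _.__doc__ = doc
    return _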

@since(1.3)
def getItem(self, key):
"""
An expression that gets an item at position ``ordinal`` out of a list,
or gets an item by key out of a dict.

>>> df = sc.parallelize([([1, 2], {"key": "value"})]).toDF(["l", "d"])
>>> df = spark.createDataFrame([([1, 2], {"key": "value"})], ["l", "d"])
>>> df.select(df.l.getItem(0), df.d.getItem("key")).show()
+----+------+
|l[0]|d[key]|
@@ -217,7 +252,7 @@ def getField(self, name):
An expression that gets a field by name in a StructField.

>>> from pyspark.sql import Row
>>> df = sc.parallelize([Row(r=Row(a=1, b="b"))]).toDF()
>>> df = spark.createDataFrame([Row(r=Row(a=1, b="b"))])
>>> df.select(df.r.getField("b")).show()
+---+
|r.b|
@@ -251,15 +286,16 @@ def __iter__(self):

# string methods
_rlike_doc = """
Return a Boolean :class:`Column` based on a regex match.
SQL RLIKE expression (LIKE with Regex). Returns a boolean :class:`Column` based on a regex
Member:
Could you clarify that rlike uses Java regular expressions, not Python ones?

Member Author (@HyukjinKwon, Apr 25, 2017):
Let's leave it so that it indicates the regular expression is in SQL syntax. I would like to keep them identical in most cases to reduce the overhead when someone needs to fix the documentation across APIs in other languages.

It looks like there are a few more places that need this clarification (if needed). If this is something that has to be done, let's do it in another PR.

Member (@zero323, Apr 25, 2017):
The problem is that in Scala or Java, users get the regular expression dialect they expect. In Python they don't (for example, when referencing groups).

But fair enough. Let's leave it for another time.

Member Author:
Thank you.

match.

:param other: an extended regex expression

>>> df.filter(df.name.rlike('ice$')).collect()
[Row(age=2, name=u'Alice')]
"""
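As the thread above notes, the pattern passed to rlike is compiled on the JVM (java.util.regex), not by Python's re module, so Python-only constructs such as (?P<name>...) named groups may not be accepted. A minimal usage sketch, assuming a SparkSession named spark, would look roughly like:

>>> df = spark.createDataFrame([(2, 'Alice'), (5, 'Bob')], ['age', 'name'])
>>> df.filter(df.name.rlike('^Al.*e$')).collect()
[Row(age=2, name=u'Alice')]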

_like_doc = """
Return a Boolean :class:`Column` based on a SQL LIKE match.
SQL like expression. Returns a boolean :class:`Column` based on a SQL LIKE match.

:param other: a SQL LIKE pattern

@@ -269,17 +305,17 @@ def __iter__(self):
[Row(age=2, name=u'Alice')]
"""

_startswith_doc = """
Return a Boolean :class:`Column` based on a string match.
String starts with. Returns a boolean :class:`Column` based on a string match.

:param other: string at end of line (do not use a regex `^`)
:param other: string at start of line (do not use a regex `^`)

>>> df.filter(df.name.startswith('Al')).collect()
[Row(age=2, name=u'Alice')]
>>> df.filter(df.name.startswith('^Al')).collect()
[]
"""
_endswith_doc = """
Return a Boolean :class:`Column` based on matching end of string.
String ends with. Returns a boolean :class:`Column` based on a string match.
Contributor:
This seems to be from the function above, but it's correct in the code so no worries.


:param other: string at end of line (do not use a regex `$`)

Expand All @@ -288,8 +324,16 @@ def __iter__(self):
>>> df.filter(df.name.endswith('ice$')).collect()
[]
"""
_contains_doc = """
Contains the other element. Returns a boolean :class:`Column` based on a string match.

:param other: string in line

>>> df.filter(df.name.contains('o')).collect()
[Row(age=5, name=u'Bob')]
"""
Contributor:
Should the _contains_doc be before the other docs, to match the order of the function declarations?

Member Author:
Sure.


contains = _bin_op("contains")
contains = ignore_unicode_prefix(_bin_op("contains", _contains_doc))
rlike = ignore_unicode_prefix(_bin_op("rlike", _rlike_doc))
like = ignore_unicode_prefix(_bin_op("like", _like_doc))
startswith = ignore_unicode_prefix(_bin_op("startsWith", _startswith_doc))
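Since contains, like, rlike, startswith, and endswith all return boolean Columns, they compose with the usual & / | / ~ operators. Against the doctest DataFrame of (2, 'Alice') and (5, 'Bob'), one would expect roughly:

>>> df.filter(df.name.startswith('A') & df.name.contains('li') & df.name.endswith('e')).collect()
[Row(age=2, name=u'Alice')]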
@@ -337,26 +381,39 @@ def isin(self, *cols):
return Column(jc)

# order
asc = _unary_op("asc", "Returns a sort expression based on the"
" ascending order of the given column name.")
desc = _unary_op("desc", "Returns a sort expression based on the"
" descending order of the given column name.")
_asc_doc = """
Returns an ascending ordering used in sorting.

>>> from pyspark.sql import Row
>>> df2 = spark.createDataFrame([Row(name=u'Tom', height=80), Row(name=u'Alice', height=None)])
>>> df2.select(df2.name).orderBy(df2.name.asc()).collect()
[Row(name=u'Alice'), Row(name=u'Tom')]
"""

_desc_doc = """
Returns a descending ordering used in sorting.

>>> from pyspark.sql import Row
>>> df2 = spark.createDataFrame([Row(name=u'Tom', height=80), Row(name=u'Alice', height=None)])
>>> df2.select(df2.name).orderBy(df2.name.desc()).collect()
[Row(name=u'Tom'), Row(name=u'Alice')]
"""


asc = ignore_unicode_prefix(_unary_op("asc", _asc_doc))
desc = ignore_unicode_prefix(_unary_op("desc", _desc_doc))
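asc and desc only construct sort expressions; the ordering happens when they are passed to orderBy or sort, and several sort expressions can be combined. Against the doctest DataFrame of (2, 'Alice') and (5, 'Bob'), roughly:

>>> df.orderBy(df.age.desc(), df.name.asc()).collect()
[Row(age=5, name=u'Bob'), Row(age=2, name=u'Alice')]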

_isNull_doc = """
True if the current expression is null. Often combined with
:func:`DataFrame.filter` to select rows with null values.
Member Author (@HyukjinKwon, Apr 24, 2017):
"Often combined with :func:`DataFrame.filter` to select rows with null values." was removed because it looks like it applies to many other APIs and reads as too much. It just follows the Scala one now.

True if the current expression is null.

>>> from pyspark.sql import Row
>>> df2 = sc.parallelize([Row(name=u'Tom', height=80), Row(name=u'Alice', height=None)]).toDF()
>>> df2 = spark.createDataFrame([Row(name=u'Tom', height=80), Row(name=u'Alice', height=None)])
>>> df2.filter(df2.height.isNull()).collect()
[Row(height=None, name=u'Alice')]
"""
_isNotNull_doc = """
True if the current expression is null. Often combined with
:func:`DataFrame.filter` to select rows with non-null values.
True if the current expression is NOT null.

>>> from pyspark.sql import Row
>>> df2 = sc.parallelize([Row(name=u'Tom', height=80), Row(name=u'Alice', height=None)]).toDF()
>>> df2 = spark.createDataFrame([Row(name=u'Tom', height=80), Row(name=u'Alice', height=None)])
>>> df2.filter(df2.height.isNotNull()).collect()
[Row(height=80, name=u'Tom')]
"""
@@ -527,7 +584,7 @@ def _test():
.appName("sql.column tests")\
.getOrCreate()
sc = spark.sparkContext
globs['sc'] = sc
globs['spark'] = spark
Member Author:
I removed sc and replaced it with spark, as I think that is the way we promote, to my knowledge.

globs['df'] = sc.parallelize([(2, 'Alice'), (5, 'Bob')]) \
Contributor:
Do you want to update the globs['df'] definition to spark.createDataFrame?

Member Author:
Maybe we could. I think this is not related to the Python documentation fix, BTW.

.toDF(StructType([StructField('age', IntegerType()),
StructField('name', StringType())]))
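For reference, the change the reviewer suggests above would presumably look something like this (a hypothetical sketch, not part of this commit):

# hypothetical: the same doctest DataFrame, built with spark.createDataFrame
globs['df'] = spark.createDataFrame(
    [(2, 'Alice'), (5, 'Bob')],
    StructType([StructField('age', IntegerType()),
                StructField('name', StringType())]))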
@@ -86,7 +86,7 @@ case class BitwiseOr(left: Expression, right: Expression) extends BinaryArithmet
}

/**
* A function that calculates bitwise xor of two numbers.
* A function that calculates bitwise xor({@literal ^}) of two numbers.
Member Author:
Matching it up with BitwiseAnd and BitwiseOr, which say:

A function that calculates bitwise and(&) of two numbers.

A function that calculates bitwise or(|) of two numbers.

*
* Code generation inherited from BinaryArithmetic.
*/
19 changes: 10 additions & 9 deletions sql/core/src/main/scala/org/apache/spark/sql/Column.scala
@@ -779,15 +779,16 @@ class Column(val expr: Expression) extends Logging {
def isin(list: Any*): Column = withExpr { In(expr, list.map(lit(_).expr)) }

/**
* SQL like expression.
* SQL like expression. Returns a boolean column based on a SQL LIKE match.
*
* @group expr_ops
* @since 1.3.0
*/
def like(literal: String): Column = withExpr { Like(expr, lit(literal).expr) }

/**
* SQL RLIKE expression (LIKE with Regex).
* SQL RLIKE expression (LIKE with Regex). Returns a boolean column based on a regex
* match.
*
* @group expr_ops
* @since 1.3.0
@@ -838,39 +839,39 @@
}

/**
* Contains the other element.
* Contains the other element. Returns a boolean column based on a string match.
*
* @group expr_ops
* @since 1.3.0
*/
def contains(other: Any): Column = withExpr { Contains(expr, lit(other).expr) }

/**
* String starts with.
* String starts with. Returns a boolean column based on a string match.
*
* @group expr_ops
* @since 1.3.0
*/
def startsWith(other: Column): Column = withExpr { StartsWith(expr, lit(other).expr) }

/**
* String starts with another string literal.
* String starts with another string literal. Returns a boolean column based on a string match.
*
* @group expr_ops
* @since 1.3.0
*/
def startsWith(literal: String): Column = this.startsWith(lit(literal))

/**
* String ends with.
* String ends with. Returns a boolean column based on a string match.
*
* @group expr_ops
* @since 1.3.0
*/
def endsWith(other: Column): Column = withExpr { EndsWith(expr, lit(other).expr) }

/**
* String ends with another string literal.
* String ends with another string literal. Returns a boolean column based on a string match.
*
* @group expr_ops
* @since 1.3.0
@@ -1008,7 +1009,7 @@
def cast(to: String): Column = cast(CatalystSqlParser.parseDataType(to))

/**
* Returns an ordering used in sorting.
* Returns a descending ordering used in sorting.
* {{{
* // Scala
* df.sort(df("age").desc)
@@ -1083,7 +1084,7 @@
def asc_nulls_first: Column = withExpr { SortOrder(expr, Ascending, NullsFirst, Set.empty) }

/**
* Returns an ordering used in sorting, where null values appear after non-null values.
* Returns an ascending ordering used in sorting, where null values appear after non-null values.
* {{{
* // Scala: sort a DataFrame by age column in ascending order and null values appearing last.
* df.sort(df("age").asc_nulls_last)