[SPARK-20442][PYTHON][DOCS] Fill up documentations for functions in Column API in PySpark #17737
@@ -185,17 +185,52 @@ def __contains__(self, item):
"in a string column or 'array_contains' function for an array column.")

# bitwise operators
bitwiseOR = _bin_op("bitwiseOR")
bitwiseAND = _bin_op("bitwiseAND")
bitwiseXOR = _bin_op("bitwiseXOR")
_bitwiseOR_doc = """
Compute bitwise OR of this expression with another expression.

:param other: a value or :class:`Column` to calculate bitwise or(|) against
    this :class:`Column`.

>>> from pyspark.sql import Row
>>> df3 = spark.createDataFrame([Row(a=170, b=75)])
>>> df3.select(df3.a.bitwiseOR(df3.b)).collect()
[Row((a | b)=235)]
"""
Review comment: This is matched with the Scala one.
_bitwiseAND_doc = """
Compute bitwise AND of this expression with another expression.

:param other: a value or :class:`Column` to calculate bitwise and(&) against
    this :class:`Column`.

>>> from pyspark.sql import Row
>>> df3 = spark.createDataFrame([Row(a=170, b=75)])
>>> df3.select(df3.a.bitwiseAND(df3.b)).collect()
[Row((a & b)=10)]
"""
Review comment: This is matched with the Scala one.
_bitwiseXOR_doc = """
Compute bitwise XOR of this expression with another expression.

:param other: a value or :class:`Column` to calculate bitwise xor(^) against
    this :class:`Column`.

>>> from pyspark.sql import Row
>>> df3 = spark.createDataFrame([Row(a=170, b=75)])
>>> df3.select(df3.a.bitwiseXOR(df3.b)).collect()
[Row((a ^ b)=225)]
"""
Review comment: This is matched with the Scala one.
bitwiseOR = _bin_op("bitwiseOR", _bitwiseOR_doc)
bitwiseAND = _bin_op("bitwiseAND", _bitwiseAND_doc)
bitwiseXOR = _bin_op("bitwiseXOR", _bitwiseXOR_doc)
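For readers following the diff, a minimal sketch of how a helper like `_bin_op` presumably builds these methods: it wraps the same-named method of the underlying JVM column and attaches the given docstring. This is an assumption for illustration, not the exact implementation in column.py.

```python
def _bin_op(name, doc="binary operator"):
    # Sketch: create a Column method that delegates to the JVM column method
    # named `name` and carries the supplied docstring.
    def _(self, other):
        # Unwrap a Column argument to its JVM counterpart; pass literals through.
        jc = other._jc if isinstance(other, Column) else other
        njc = getattr(self._jc, name)(jc)
        return Column(njc)
    _.__doc__ = doc
    return _
```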
@since(1.3)
def getItem(self, key):
"""
An expression that gets an item at position ``ordinal`` out of a list,
or gets an item by key out of a dict.

>>> df = sc.parallelize([([1, 2], {"key": "value"})]).toDF(["l", "d"])
>>> df = spark.createDataFrame([([1, 2], {"key": "value"})], ["l", "d"])
>>> df.select(df.l.getItem(0), df.d.getItem("key")).show()
+----+------+
|l[0]|d[key]|
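As a side note (not part of this diff), bracket indexing on a Column gives, as far as I know, the same result as `getItem`, so the example above could also be written as:

```python
# Assumed equivalent to the getItem calls in the doctest above.
df.select(df.l[0], df.d["key"]).show()
```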
@@ -217,7 +252,7 @@ def getField(self, name):
An expression that gets a field by name in a StructField.

>>> from pyspark.sql import Row
>>> df = sc.parallelize([Row(r=Row(a=1, b="b"))]).toDF()
>>> df = spark.createDataFrame([Row(r=Row(a=1, b="b"))])
>>> df.select(df.r.getField("b")).show()
+---+
|r.b|
@@ -251,15 +286,16 @@ def __iter__(self):

# string methods
_rlike_doc = """
Return a Boolean :class:`Column` based on a regex match.
SQL RLIKE expression (LIKE with Regex). Returns a boolean :class:`Column` based on a regex
Review comment: Could you clarify that …
Reply: Let's leave it so that it indicates the regular expression is in SQL syntax. I would like to keep them identical in most cases to reduce the overhead when someone needs to fix the documentation across APIs in other languages. It looks like there are a few more places that need the clarification (if needed). If this is something that has to be done, then let's do this in another PR.
Reply: The problem is that in Scala or Java users get the regular expression dialect they expect. In Python they don't (for example with referencing groups). But fair enough. Let's leave it for another time.
Reply: Thank you.
match.

:param other: an extended regex expression

>>> df.filter(df.name.rlike('ice$')).collect()
[Row(age=2, name=u'Alice')]
"""
_like_doc = """
Return a Boolean :class:`Column` based on a SQL LIKE match.
SQL like expression. Returns a boolean :class:`Column` based on a SQL LIKE match.

:param other: a SQL LIKE pattern

@@ -269,17 +305,17 @@ def __iter__(self):
[Row(age=2, name=u'Alice')]
"""
_startswith_doc = """
Return a Boolean :class:`Column` based on a string match.
String starts with. Returns a boolean :class:`Column` based on a string match.

:param other: string at end of line (do not use a regex `^`)
:param other: string at start of line (do not use a regex `^`)

>>> df.filter(df.name.startswith('Al')).collect()
[Row(age=2, name=u'Alice')]
>>> df.filter(df.name.startswith('^Al')).collect()
[]
"""
_endswith_doc = """
Return a Boolean :class:`Column` based on matching end of string.
String ends with. Returns a boolean :class:`Column` based on a string match.
Review comment: This seems to be from the function above, but it's correct in the code, so no worries.
:param other: string at end of line (do not use a regex `$`)

@@ -288,8 +324,16 @@ def __iter__(self):
>>> df.filter(df.name.endswith('ice$')).collect()
[]
"""
_contains_doc = """
Contains the other element. Returns a boolean :class:`Column` based on a string match.

:param other: string in line

>>> df.filter(df.name.contains('o')).collect()
[Row(age=5, name=u'Bob')]
"""
Review comment: Should the …
Reply: Sure.
contains = _bin_op("contains")
contains = ignore_unicode_prefix(_bin_op("contains", _contains_doc))
rlike = ignore_unicode_prefix(_bin_op("rlike", _rlike_doc))
like = ignore_unicode_prefix(_bin_op("like", _like_doc))
startswith = ignore_unicode_prefix(_bin_op("startsWith", _startswith_doc))
@@ -337,26 +381,39 @@ def isin(self, *cols):
return Column(jc)
# order
asc = _unary_op("asc", "Returns a sort expression based on the"
" ascending order of the given column name.")
desc = _unary_op("desc", "Returns a sort expression based on the"
" descending order of the given column name.")
_asc_doc = """
Returns an ascending ordering used in sorting.

>>> from pyspark.sql import Row
>>> df2 = spark.createDataFrame([Row(name=u'Tom', height=80), Row(name=u'Alice', height=None)])
>>> df2.select(df2.name).orderBy(df2.name.asc()).collect()
[Row(name=u'Alice'), Row(name=u'Tom')]
"""
_desc_doc = """
Returns a descending ordering used in sorting.

>>> from pyspark.sql import Row
>>> df2 = spark.createDataFrame([Row(name=u'Tom', height=80), Row(name=u'Alice', height=None)])
>>> df2.select(df2.name).orderBy(df2.name.desc()).collect()
[Row(name=u'Tom'), Row(name=u'Alice')]
"""
asc = ignore_unicode_prefix(_unary_op("asc", _asc_doc))
desc = ignore_unicode_prefix(_unary_op("desc", _desc_doc))
_isNull_doc = """
True if the current expression is null. Often combined with
:func:`DataFrame.filter` to select rows with null values.
True if the current expression is null.

>>> from pyspark.sql import Row
>>> df2 = sc.parallelize([Row(name=u'Tom', height=80), Row(name=u'Alice', height=None)]).toDF()
>>> df2 = spark.createDataFrame([Row(name=u'Tom', height=80), Row(name=u'Alice', height=None)])
>>> df2.filter(df2.height.isNull()).collect()
[Row(height=None, name=u'Alice')]
"""
_isNotNull_doc = """
True if the current expression is null. Often combined with
:func:`DataFrame.filter` to select rows with non-null values.
True if the current expression is NOT null.

>>> from pyspark.sql import Row
>>> df2 = sc.parallelize([Row(name=u'Tom', height=80), Row(name=u'Alice', height=None)]).toDF()
>>> df2 = spark.createDataFrame([Row(name=u'Tom', height=80), Row(name=u'Alice', height=None)])
>>> df2.filter(df2.height.isNotNull()).collect()
[Row(height=80, name=u'Tom')]
"""
@@ -527,7 +584,7 @@ def _test():
.appName("sql.column tests")\
.getOrCreate()
sc = spark.sparkContext
globs['sc'] = sc
globs['spark'] = spark
Review comment: I removed …
globs['df'] = sc.parallelize([(2, 'Alice'), (5, 'Bob')]) \
Review comment: Do you want to update the …
Reply: Maybe we could. I think this is not related to the Python documentation fix, BTW.
.toDF(StructType([StructField('age', IntegerType()),
StructField('name', StringType())]))
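For context on what this hunk feeds into: a condensed sketch of how a doctest harness like `_test()` presumably ties these globals to the docstring examples. The option flags and exact wiring here are assumptions for illustration, not the upstream code.

```python
import doctest
import pyspark.sql.column
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").appName("sql.column tests").getOrCreate()
# Build the globals the doctests expect, then run every doctest in the module.
globs = pyspark.sql.column.__dict__.copy()
globs['spark'] = spark
globs['df'] = spark.createDataFrame([(2, 'Alice'), (5, 'Bob')], ['age', 'name'])
failures, _ = doctest.testmod(
    pyspark.sql.column, globs=globs,
    optionflags=doctest.ELLIPSIS | doctest.NORMALIZE_WHITESPACE)
spark.stop()
if failures:
    raise SystemExit(-1)
```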
@@ -86,7 +86,7 @@ case class BitwiseOr(left: Expression, right: Expression) extends BinaryArithmet
}

/**
* A function that calculates bitwise xor of two numbers.
* A function that calculates bitwise xor({@literal ^}) of two numbers.
Review comment: Matching it up with …
*
* Code generation inherited from BinaryArithmetic.
*/
Review comment: Why `df3` instead of `df`?

Reply: I think there is a global `df` variable when running doctests, and I guess it was avoided to shadow the same name from the outer scope in some doctests, whereas other doctests just shadow it. I get your point. In the documentation we will only see the code block, and I guess using `df` might be slightly better. AFAIK, Python documentation usually has self-contained doctests in general, so I don't know which case is better and correct. If you could confirm this, I can sweep it.

Reply: Let me correct this, as I guess you probably prefer `df` and I don't have a preference.
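To make the trade-off being discussed concrete, a hypothetical docstring contrasting the two doctest styles; neither example is taken from the diff itself.

```python
def _example_doc_styles():
    """Hypothetical docstring contrasting the two styles under discussion.

    Style 1: reuse the shared global ``df`` provided by the test harness.

    >>> df.filter(df.name.startswith('Al')).collect()
    [Row(age=2, name=u'Alice')]

    Style 2: self-contained doctest that builds its own DataFrame; a distinct
    name like ``df3`` avoids shadowing the shared global.

    >>> df3 = spark.createDataFrame([Row(a=170, b=75)])
    >>> df3.select(df3.a.bitwiseOR(df3.b)).collect()
    [Row((a | b)=235)]
    """
```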