[SPARK-30231][SQL][PYTHON] Support explain mode in PySpark df.explain #26861
Conversation
I think the explain modes look useful for debugging, but I'm not sure that this fix to add an optional param in explain is the right approach. Could you check this, @HyukjinKwon @viirya?
```scala
   * @group basic
   * @since 3.0.0
   */
  def explain(mode: ExplainMode): Unit = {
```
Yes, I think there's a similar case at `mode` in `DataFrameWriter`. How about we have `explain(mode: String)` only instead of `explain(mode: ExplainMode)`? For the enum one, I am not sure actually yet (e.g., `joinType`).
Aha, I see. So, you mean `ExplainMode` is internally used only?
Yes, that was my thinking. WDYT?
Yea, to me, the string argument looks more useful than the enum (since we don't import anything for it, and that interface is easy to use from Python). But we might need more comments about this. cc: @cloud-fan @dongjoon-hyun
But `SaveMode` is public. We can have both `explain(String)` and `explain(ExplainMode)`.
But `joinType` doesn't expose enums, as an example. `ExplainMode` was added in Spark 3.0, so we don't necessarily need to expose another API. Actually, isn't using a string easier, given that explain will more often be used for debugging purposes?
If you guys don't feel strongly, can we just have `explain(String)` alone for now? I feel somewhat strongly that we should start from fewer APIs.
Yea, I like fewer-API designs, and I think enum arguments don't matter for Python/R users.
+1
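To illustrate the `explain(mode: String)` design settled on above, here is a minimal, hypothetical sketch of a string-keyed dispatch — the mode names match Spark's documented explain modes, but the function and plan strings are placeholders, not Spark's actual implementation:

```python
def to_explain_string(mode,
                      simple_plan="physical plan",
                      extended_plan="logical + physical plans"):
    """Hypothetical sketch of a string-based explain-mode dispatch.

    Mirrors the explain(mode: String) design discussed above; the
    returned strings are placeholders for real plan output.
    """
    # Normalize so callers can pass "EXTENDED", "Extended", etc.
    normalized = mode.strip().lower()
    dispatch = {
        "simple": simple_plan,
        "extended": extended_plan,
        "codegen": "generated code",
        "cost": "plan with cost statistics",
        "formatted": "split explain output",
    }
    if normalized not in dispatch:
        raise ValueError(
            "Unknown explain mode: %r; expected one of %s"
            % (mode, sorted(dispatch)))
    return dispatch[normalized]
```

A string interface like this needs no imports on the caller's side, which is the main argument made in the thread for preferring it over an enum from Python.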
Test build #115223 has finished for PR 26861 at commit
Force-pushed from 0def203 to 60071fe.
Test build #115241 has finished for PR 26861 at commit
Test build #115243 has finished for PR 26861 at commit
Force-pushed from 60071fe to 56509fc.
```diff
@@ -253,10 +253,18 @@ def printSchema(self):
         print(self._jdf.schema().treeString())

     @since(1.3)
-    def explain(self, extended=False):
+    def explain(self, extended=None, mode=None):
```
Here LGTM
Test build #115257 has finished for PR 26861 at commit
Let me merge and address my own comment #26861 (comment) separately, since this PR targets adding a corresponding Python API.
Merged to master.
Thanks, @HyukjinKwon!
Yea, we need more time to discuss that API design.
I filed a JIRA for the explain mode of SparkR: https://issues.apache.org/jira/browse/SPARK-30255
```diff
@@ -253,10 +253,18 @@ def printSchema(self):
         print(self._jdf.schema().treeString())

     @since(1.3)
-    def explain(self, extended=False):
+    def explain(self, extended=None, mode=None):
         """Prints the (logical and physical) plans to the console for debugging purpose.

         :param extended: boolean, default ``False``. If ``False``, prints only the physical plan.
```
default None?
Yea, I did the same thing with `sample`.
oh, I meant we should update this param doc. :)
Oh, I see.
But the description of the default is the same as for `withReplacement` in `sample`, even though `withReplacement=None`? https://github.com/apache/spark/blob/master/python/pyspark/sql/dataframe.py#L838
Maybe change it together? @HyukjinKwon
Yeah, I'll submit a PR to fix up tomorrow together.
Oh, I noticed your comment now, @HyukjinKwon. If we still need minor fixes on `dataframe.py`, can you have more follow-ups? Thanks!
Wait, this one, the default value is fine. Although it's `None`, it works like `false`. It's just to support different combinations of arguments.
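The "different combinations of arguments" behavior can be sketched in plain Python. This is an illustrative reimplementation of the semantics described in this thread (where `extended=None` acts like `False` when no `mode` is given), with a hypothetical function name — it is not the actual `dataframe.py` code:

```python
def resolve_explain_mode(extended=None, mode=None):
    """Pick an explain mode from the extended/mode combination.

    Illustrative sketch only: extended=None behaves like False,
    so the default is the simple (physical-plan-only) mode.
    """
    # Assumption for this sketch: passing both at once is rejected.
    if extended is not None and mode is not None:
        raise ValueError("extended and mode should not be set together")
    if mode is not None:
        return mode
    # None works like False here: fall back to the simple mode.
    return "extended" if extended else "simple"
```

This is why "default ``False``" in the param doc is arguably still accurate in effect, even though the declared default is `None` — which is exactly the documentation tension discussed above.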
Yea, that's the "it works like false" part... so I felt it's difficult to explain that in the param docs...
```python
argtypes = [
    str(type(arg)) for arg in [extended, mode] if arg is not None]
raise TypeError(
    "extended (optional) and mode (optional) should be a bool and str; "
    "however, got [%s]." % ", ".join(argtypes))
```
If only a wrong mode is given, this will print something like: `extended (optional) and mode (optional) should be a bool and str; however, got wrong_type_for_mode.` Not a big deal, but maybe only print the corresponding arg name if it is given a wrong type.
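The suggestion above — naming only the argument that actually has the wrong type — could look something like this. A hypothetical sketch with an illustrative helper name, not the follow-up that was actually merged:

```python
def format_arg_error(extended=None, mode=None):
    """Build an error message naming only the mistyped arguments.

    Sketch of the reviewer's suggestion: report `extended` or `mode`
    individually instead of one combined message for both.
    """
    problems = []
    if extended is not None and not isinstance(extended, bool):
        problems.append("extended (optional) should be a bool, got %s"
                        % type(extended).__name__)
    if mode is not None and not isinstance(mode, str):
        problems.append("mode (optional) should be a str, got %s"
                        % type(mode).__name__)
    # Empty string means every supplied argument had the right type.
    return "; ".join(problems)
```

With this shape, `explain(mode=1)` would complain only about `mode`, which addresses the confusion the combined message could cause.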
```scala
@@ -550,8 +545,29 @@ class Dataset[T] private[sql](
      case ExplainMode.Formatted =>
        qe.simpleString(formatted = true)
    }
  }

  private[sql] def toExplainString(mode: String): String = {
```
Is this only for Python to call? I think it is better to add a short comment.
Ah, yes. ok, I'll follow up, thanks for the comment.
LGTM too. Thanks @maropu for this work!
…park df.explain

### What changes were proposed in this pull request?
This pr is a followup of #26861 to address minor comments from viirya.

### Why are the changes needed?
For better error messages.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Manually tested.

Closes #26886 from maropu/SPARK-30231-FOLLOWUP.
Authored-by: Takeshi Yamamuro <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
What changes were proposed in this pull request?
This pr intends to support the explain modes implemented in #26829 for PySpark.
Why are the changes needed?
For better debugging info in PySpark DataFrames.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Added UTs.