-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-30231][SQL][PYTHON] Support explain mode in PySpark df.explain #26861
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -253,10 +253,18 @@ def printSchema(self): | |
print(self._jdf.schema().treeString()) | ||
|
||
@since(1.3) | ||
def explain(self, extended=False): | ||
def explain(self, extended=None, mode=None): | ||
"""Prints the (logical and physical) plans to the console for debugging purpose. | ||
|
||
:param extended: boolean, default ``False``. If ``False``, prints only the physical plan. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. default None? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yea, I did the same thing with There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. oh, I meant we should update this param doc. :) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Oh, I see. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. But, the description about the default is the same with There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Maybe change it together? @HyukjinKwon There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah, I'll submit a PR to fix up tomorrow together. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Oh, I noticed your comment now, @HyukjinKwon. If we still need minor fixes on There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Wait, this one, default value is fine. Although it's There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yea, that's "it works like false"...., so I felt its difficult to explain about that in the param docs... |
||
:param mode: specifies the expected output format of plans. | ||
|
||
* ``simple``: Print only a physical plan. | ||
* ``extended``: Print both logical and physical plans. | ||
* ``codegen``: Print a physical plan and generated codes if they are available. | ||
* ``cost``: Print a logical plan and statistics if they are available. | ||
* ``formatted``: Split explain output into two sections: a physical plan outline \ | ||
and node details. | ||
|
||
>>> df.explain() | ||
== Physical Plan == | ||
|
@@ -271,11 +279,47 @@ def explain(self, extended=False): | |
... | ||
== Physical Plan == | ||
... | ||
|
||
>>> df.explain(mode="formatted") | ||
== Physical Plan == | ||
* Scan ExistingRDD (1) | ||
(1) Scan ExistingRDD [codegen id : 1] | ||
Output: [age#0, name#1] | ||
|
||
.. versionchanged:: 3.0.0 | ||
Added optional argument `mode` to specify the expected output format of plans. | ||
""" | ||
if extended: | ||
print(self._jdf.queryExecution().toString()) | ||
else: | ||
print(self._jdf.queryExecution().simpleString()) | ||
|
||
if extended is not None and mode is not None: | ||
raise Exception("extended and mode can not be specified simultaneously") | ||
|
||
# For the no argument case: df.explain() | ||
is_no_argument = extended is None and mode is None | ||
|
||
# For the cases below: | ||
# explain(True) | ||
# explain(extended=False) | ||
is_extended_case = extended is not None and isinstance(extended, bool) | ||
|
||
# For the mode specified: df.explain(mode="formatted") | ||
is_mode_case = mode is not None and isinstance(mode, basestring) | ||
|
||
if not is_no_argument and not (is_extended_case or is_mode_case): | ||
argtypes = [ | ||
str(type(arg)) for arg in [extended, mode] if arg is not None] | ||
raise TypeError( | ||
"extended (optional) and mode (optional) should be a bool and str; " | ||
"however, got [%s]." % ", ".join(argtypes)) | ||
Comment on lines
+308
to
+312
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If only a wrong mode is given, this will print like:
Not big deal, but maybe only print out corresponding arg name if it is given wrong type. |
||
|
||
# Sets an explain mode depending on a given argument | ||
if is_no_argument: | ||
explainMode = "simple" | ||
elif is_extended_case: | ||
explainMode = "extended" if extended else "simple" | ||
elif is_mode_case: | ||
explainMode = mode | ||
|
||
print(self._jdf.toExplainString(explainMode)) | ||
|
||
@since(2.4) | ||
def exceptAll(self, other): | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -18,6 +18,7 @@ | |
package org.apache.spark.sql | ||
|
||
import java.io.{ByteArrayOutputStream, CharArrayWriter, DataOutputStream} | ||
import java.util.Locale | ||
|
||
import scala.collection.JavaConverters._ | ||
import scala.collection.mutable.ArrayBuffer | ||
|
@@ -521,20 +522,14 @@ class Dataset[T] private[sql]( | |
def printSchema(level: Int): Unit = println(schema.treeString(level)) | ||
// scalastyle:on println | ||
|
||
/** | ||
* Prints the plans (logical and physical) with a format specified by a given explain mode. | ||
* | ||
* @group basic | ||
* @since 3.0.0 | ||
*/ | ||
def explain(mode: ExplainMode): Unit = { | ||
private def toExplainString(mode: ExplainMode): String = { | ||
// Because temporary views are resolved during analysis when we create a Dataset, and | ||
// `ExplainCommand` analyzes input query plan and resolves temporary views again. Using | ||
// `ExplainCommand` here will probably output different query plans, compared to the results | ||
// of evaluation of the Dataset. So just output QueryExecution's query plans here. | ||
val qe = ExplainCommandUtil.explainedQueryExecution(sparkSession, logicalPlan, queryExecution) | ||
|
||
val outputString = mode match { | ||
mode match { | ||
case ExplainMode.Simple => | ||
qe.simpleString | ||
case ExplainMode.Extended => | ||
|
@@ -550,8 +545,29 @@ class Dataset[T] private[sql]( | |
case ExplainMode.Formatted => | ||
qe.simpleString(formatted = true) | ||
} | ||
} | ||
|
||
private[sql] def toExplainString(mode: String): String = { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is this only for Python to call? I think it is better to add short comment. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ah, yes. ok, I'll follow up, thanks for the comment. |
||
mode.toLowerCase(Locale.ROOT) match { | ||
case "simple" => toExplainString(ExplainMode.Simple) | ||
case "extended" => toExplainString(ExplainMode.Extended) | ||
case "codegen" => toExplainString(ExplainMode.Codegen) | ||
case "cost" => toExplainString(ExplainMode.Cost) | ||
case "formatted" => toExplainString(ExplainMode.Formatted) | ||
case _ => throw new IllegalArgumentException(s"Unknown explain mode: $mode. Accepted " + | ||
"explain modes are 'simple', 'extended', 'codegen', 'cost', 'formatted'.") | ||
} | ||
} | ||
|
||
/** | ||
* Prints the plans (logical and physical) with a format specified by a given explain mode. | ||
* | ||
* @group basic | ||
* @since 3.0.0 | ||
*/ | ||
def explain(mode: ExplainMode): Unit = { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, I think there's a similar case at There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Aha, I see. So, you mean There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, that was my thinking. WDYT? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yea, to me, the string argument looks more useful than enum (cuz we don't import anything for that and that interface is easy-to-use from python). But, we might need more comments about this. cc: @cloud-fan @dongjoon-hyun There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. But There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. But There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If you guys don't feel strongly, can we just have There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. yea, I like fewer API designs and I think enum arguments doesn't matter for python/R users. |
||
// scalastyle:off println | ||
println(outputString) | ||
println(toExplainString(mode)) | ||
// scalastyle:on println | ||
} | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here LGTM