
[SPARK-30231][SQL][PYTHON] Support explain mode in PySpark df.explain #26861

Closed · wants to merge 2 commits

Changes from all commits
54 changes: 49 additions & 5 deletions python/pyspark/sql/dataframe.py
@@ -253,10 +253,18 @@ def printSchema(self):
print(self._jdf.schema().treeString())

@since(1.3)
def explain(self, extended=False):
def explain(self, extended=None, mode=None):
Member: Here LGTM

"""Prints the (logical and physical) plans to the console for debugging purpose.

:param extended: boolean, default ``False``. If ``False``, prints only the physical plan.
Member: default None?

Member (author): Yea, I did the same thing with sample.

Member: Oh, I meant we should update this param doc. :)

Member (author): Oh, I see.

maropu (Member, author), Dec 14, 2019: But the description of the default is the same as withReplacement in sample, even though withReplacement=None? https://github.com/apache/spark/blob/master/python/pyspark/sql/dataframe.py#L838

Member: Maybe change it together? @HyukjinKwon

Member: Yeah, I'll submit a PR tomorrow to fix them up together.

Member (author): Oh, I noticed your comment just now, @HyukjinKwon. If we still need minor fixes on dataframe.py, can you do more follow-ups? Thanks!

Member: Wait, for this one the default value is fine. Although it's None, it works like False; it's just to support different combinations of arguments.

Member (author): Yea, exactly because "it works like False", I felt it was difficult to explain that in the param docs...

:param mode: specifies the expected output format of plans.

* ``simple``: Print only a physical plan.
* ``extended``: Print both logical and physical plans.
* ``codegen``: Print a physical plan and generated code if it is available.
* ``cost``: Print a logical plan and statistics if they are available.
* ``formatted``: Split explain output into two sections: a physical plan outline \
and node details.

>>> df.explain()
== Physical Plan ==
@@ -271,11 +279,47 @@ def explain(self, extended=False):
...
== Physical Plan ==
...

>>> df.explain(mode="formatted")
== Physical Plan ==
* Scan ExistingRDD (1)
(1) Scan ExistingRDD [codegen id : 1]
Output: [age#0, name#1]

.. versionchanged:: 3.0.0
Added optional argument `mode` to specify the expected output format of plans.
"""
if extended:
print(self._jdf.queryExecution().toString())
else:
print(self._jdf.queryExecution().simpleString())

if extended is not None and mode is not None:
raise Exception("extended and mode can not be specified simultaneously")

# For the no argument case: df.explain()
is_no_argument = extended is None and mode is None

# For the cases below:
# explain(True)
# explain(extended=False)
is_extended_case = extended is not None and isinstance(extended, bool)

# For the mode specified: df.explain(mode="formatted")
is_mode_case = mode is not None and isinstance(mode, basestring)

if not is_no_argument and not (is_extended_case or is_mode_case):
argtypes = [
str(type(arg)) for arg in [extended, mode] if arg is not None]
raise TypeError(
"extended (optional) and mode (optional) should be a bool and str; "
"however, got [%s]." % ", ".join(argtypes))
Comment on lines +308 to +312
Member: If only a wrong mode is given, this will print something like:

extended (optional) and mode (optional) should be a bool and str; however, got wrong_type_for_mode.

Not a big deal, but maybe only print out the corresponding arg name if it is given the wrong type.
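A hedged sketch of that suggestion: validate each optional argument separately so the TypeError names only the offending one. `_check_explain_args` is a hypothetical helper, not part of this PR, and it checks `str` where the PR checks `basestring` for Python 2 compatibility:

```python
def _check_explain_args(extended, mode):
    # Validate each optional argument on its own, so the error message
    # mentions only the argument that actually has the wrong type.
    if extended is not None and not isinstance(extended, bool):
        raise TypeError(
            "extended (optional) should be a bool; however, "
            "got [%s]." % type(extended))
    if mode is not None and not isinstance(mode, str):
        raise TypeError(
            "mode (optional) should be a str; however, "
            "got [%s]." % type(mode))
```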


# Sets an explain mode depending on a given argument
if is_no_argument:
explainMode = "simple"
elif is_extended_case:
explainMode = "extended" if extended else "simple"
elif is_mode_case:
explainMode = mode

print(self._jdf.toExplainString(explainMode))
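To make the branching above concrete, here is an illustrative sketch of how each public call form resolves (a hypothetical `df`; the resolved explainMode is shown in the comments):

```python
df.explain()                    # no-argument case   -> "simple"
df.explain(True)                # extended case      -> "extended"
df.explain(extended=False)      # extended case      -> "simple"
df.explain(mode="formatted")    # mode case          -> "formatted"

df.explain(True, mode="cost")   # raises Exception: both arguments set
df.explain(extended="foo")      # raises TypeError: wrong argument type
```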

@since(2.4)
def exceptAll(self, other):
34 changes: 25 additions & 9 deletions sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
@@ -18,6 +18,7 @@
package org.apache.spark.sql

import java.io.{ByteArrayOutputStream, CharArrayWriter, DataOutputStream}
import java.util.Locale

import scala.collection.JavaConverters._
import scala.collection.mutable.ArrayBuffer
@@ -521,20 +522,14 @@ class Dataset[T] private[sql](
def printSchema(level: Int): Unit = println(schema.treeString(level))
// scalastyle:on println

/**
* Prints the plans (logical and physical) with a format specified by a given explain mode.
*
* @group basic
* @since 3.0.0
*/
def explain(mode: ExplainMode): Unit = {
private def toExplainString(mode: ExplainMode): String = {
// Because temporary views are resolved during analysis when we create a Dataset, and
// `ExplainCommand` analyzes input query plan and resolves temporary views again. Using
// `ExplainCommand` here will probably output different query plans, compared to the results
// of evaluation of the Dataset. So just output QueryExecution's query plans here.
val qe = ExplainCommandUtil.explainedQueryExecution(sparkSession, logicalPlan, queryExecution)

val outputString = mode match {
mode match {
case ExplainMode.Simple =>
qe.simpleString
case ExplainMode.Extended =>
@@ -550,8 +545,29 @@
case ExplainMode.Formatted =>
qe.simpleString(formatted = true)
}
}

private[sql] def toExplainString(mode: String): String = {
Member: Is this only for Python to call? I think it is better to add a short comment.

Member (author): Ah, yes. OK, I'll follow up; thanks for the comment.

mode.toLowerCase(Locale.ROOT) match {
case "simple" => toExplainString(ExplainMode.Simple)
case "extended" => toExplainString(ExplainMode.Extended)
case "codegen" => toExplainString(ExplainMode.Codegen)
case "cost" => toExplainString(ExplainMode.Cost)
case "formatted" => toExplainString(ExplainMode.Formatted)
case _ => throw new IllegalArgumentException(s"Unknown explain mode: $mode. Accepted " +
"explain modes are 'simple', 'extended', 'codegen', 'cost', 'formatted'.")
}
}
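One consequence of the `toLowerCase(Locale.ROOT)` normalization above: the mode string is matched case-insensitively, so a PySpark call in any casing should presumably work, e.g.:

```python
df.explain(mode="FORMATTED")  # normalized to "formatted" on the JVM side
```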

/**
* Prints the plans (logical and physical) with a format specified by a given explain mode.
*
* @group basic
* @since 3.0.0
*/
def explain(mode: ExplainMode): Unit = {
HyukjinKwon (Member), Dec 12, 2019: Yes, I think there's a similar case with mode at DataFrameWriter. How about we have explain(mode: String) only, instead of explain(mode: ExplainMode)? For the enum one, I am actually not sure yet (e.g., joinType).

Member (author): Aha, I see. So you mean ExplainMode is only used internally?

Member: Yes, that was my thinking. WDYT?

maropu (Member, author), Dec 12, 2019: Yea, to me the string argument looks more useful than an enum (because we don't need to import anything for it, and that interface is easy to use from Python). But we might need more comments about this. cc: @cloud-fan @dongjoon-hyun

Contributor: But SaveMode is public. We can have both explain(String) and explain(ExplainMode).

HyukjinKwon (Member), Dec 12, 2019: But joinType doesn't expose enums, as an example. ExplainMode was added in Spark 3.0, so we don't necessarily have to expose another API. Actually, isn't using a string easier, given that explain will more often be used for debugging purposes?

Member: If you guys don't feel strongly, can we just have explain(String) alone for now? I feel somewhat strongly that we had better start from fewer APIs.

Member (author): Yea, I like fewer-API designs, and I think enum arguments don't matter for Python/R users.

// scalastyle:off println
println(outputString)
println(toExplainString(mode))
// scalastyle:on println
}

18 changes: 18 additions & 0 deletions sql/core/src/test/scala/org/apache/spark/sql/ExplainSuite.scala
@@ -309,6 +309,24 @@ class ExplainSuite extends QueryTest with SharedSparkSession {
"(1) LocalTableScan [codegen id :" ::
Nil: _*)
}

test("Dataset.toExplainString has mode as string") {
val df = spark.range(10).toDF
def assertExplainOutput(mode: ExplainMode): Unit = {
assert(df.toExplainString(mode.toString).replaceAll("#\\d+", "#x").trim ===
getNormalizedExplain(df, mode).trim)
}
assertExplainOutput(ExplainMode.Simple)
assertExplainOutput(ExplainMode.Extended)
assertExplainOutput(ExplainMode.Codegen)
assertExplainOutput(ExplainMode.Cost)
assertExplainOutput(ExplainMode.Formatted)

val errMsg = intercept[IllegalArgumentException] {
df.toExplainString("unknown")
}.getMessage
assert(errMsg.contains("Unknown explain mode: unknown"))
}
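As a rough cross-check of the same error path from Python: an unknown mode reaches Dataset.toExplainString on the JVM, and its IllegalArgumentException should surface through Py4J in PySpark. A hedged sketch, assuming an active `spark` session:

```python
df = spark.range(10)
try:
    df.explain(mode="unknown")
except Exception as e:  # typically a Py4J-wrapped IllegalArgumentException
    assert "Unknown explain mode: unknown" in str(e)
```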
}

case class ExplainSingleData(id: Int)