
[SPARK-30231][SQL][PYTHON] Support explain mode in PySpark df.explain #26861


maropu (Member) commented Dec 12, 2019

What changes were proposed in this pull request?

This PR intends to support the explain modes implemented in #26829 for PySpark.
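
As a rough usage sketch (assuming a SparkSession named spark; the mode strings follow the modes added in #26829, e.g. "simple", "extended", "codegen", "cost", and "formatted", though the exact accepted set here is illustrative rather than quoted from the patch):

    # hypothetical usage of the new optional argument
    df = spark.range(10).groupBy("id").count()
    df.explain()                  # as before: physical plan only
    df.explain(extended=True)     # as before: logical and physical plans
    df.explain(mode="formatted")  # new: select an explain mode by name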

Why are the changes needed?

For better debugging info in PySpark DataFrames.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added unit tests.

maropu (Member, Author) commented Dec 12, 2019

I think the explain modes look useful for debugging, but I'm not sure that adding an optional param to explain is the right approach for this fix. Could you check this, @HyukjinKwon @viirya?

* @group basic
* @since 3.0.0
*/
def explain(mode: ExplainMode): Unit = {
HyukjinKwon (Member) commented Dec 12, 2019

Yes, I think there's a similar case with mode in DataFrameWriter. How about having explain(mode: String) only, instead of explain(mode: ExplainMode)? For the enum one, I'm actually not sure yet (e.g., joinType).

maropu (Member, Author):

Aha, I see. So you mean ExplainMode is only used internally?

HyukjinKwon (Member):

Yes, that was my thinking. WDYT?

maropu (Member, Author) commented Dec 12, 2019

Yea, to me the string argument looks more useful than an enum (because we don't need to import anything for it, and that interface is easy to use from Python). But we might need more input on this. cc: @cloud-fan @dongjoon-hyun

Contributor:

But SaveMode is public. We can have both explain(String) and explain(ExplainMode).

HyukjinKwon (Member) commented Dec 12, 2019

But joinType doesn't expose enums, as an example. ExplainMode was added in Spark 3.0, so we don't necessarily need to expose another API. Actually, isn't using a string easier, given that explain will more often be used for debugging purposes?

HyukjinKwon (Member):

If you guys don't feel strongly, can we just have explain(String) alone for now? I feel somewhat strongly that it's better to start with fewer APIs.

maropu (Member, Author):

Yea, I like fewer-API designs, and I think enum arguments don't matter for Python/R users.

HyukjinKwon (Member):

+1

SparkQA commented Dec 12, 2019

Test build #115223 has finished for PR 26861 at commit 295e840.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

maropu force-pushed the ExplainModeInPython branch from 0def203 to 60071fe on December 12, 2019 at 14:40
SparkQA commented Dec 12, 2019

Test build #115241 has finished for PR 26861 at commit 0def203.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Dec 12, 2019

Test build #115243 has finished for PR 26861 at commit 60071fe.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -253,10 +253,18 @@ def printSchema(self):
         print(self._jdf.schema().treeString())

     @since(1.3)
-    def explain(self, extended=False):
+    def explain(self, extended=None, mode=None):
Member:
Here LGTM
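
For context, a minimal sketch of how the two new optional arguments could be folded into a single explain-mode string before calling into the JVM (the helper name and exact checks below are illustrative assumptions, not the merged code):

    # illustrative only: resolve (extended, mode) into one explain-mode string
    def _resolve_explain_mode(extended=None, mode=None):
        # setting both at once is ambiguous, so reject it
        if extended is not None and mode is not None:
            raise ValueError("extended and mode should not be set at the same time")
        if mode is not None:
            return mode  # e.g. "simple", "extended", "codegen", "cost", "formatted"
        # None and False both fall back to the old default (physical plan only)
        return "extended" if extended else "simple"

Treating None and False alike for extended keeps the existing explain() and explain(True) calls working unchanged.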

SparkQA commented Dec 13, 2019

Test build #115257 has finished for PR 26861 at commit 56509fc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

HyukjinKwon (Member):

Let me merge and address my own comment #26861 (comment) separately, since this PR targets adding the corresponding Python API.

HyukjinKwon (Member):

Merged to master.

maropu (Member, Author) commented Dec 13, 2019

Thanks, @HyukjinKwon!

> Let me merge and address my own comment #26861 (comment) separately, since this PR targets adding the corresponding Python API.

Yea, we need more time to discuss that API design.

maropu (Member, Author) commented Dec 13, 2019

I filed a JIRA for the explain mode in SparkR: https://issues.apache.org/jira/browse/SPARK-30255
But since I'm not familiar with the R code, I won't work on that.

@@ -253,10 +253,18 @@ def printSchema(self):
         print(self._jdf.schema().treeString())

     @since(1.3)
-    def explain(self, extended=False):
+    def explain(self, extended=None, mode=None):
         """Prints the (logical and physical) plans to the console for debugging purpose.

         :param extended: boolean, default ``False``. If ``False``, prints only the physical plan.
Member:

default None?

maropu (Member, Author):

Yea, I did the same thing with sample.

Member:

oh, I meant we should update this param doc. :)

maropu (Member, Author):

Oh, I see.

maropu (Member, Author) commented Dec 14, 2019

But the description of the default is the same as for withReplacement in sample, even though withReplacement=None? https://github.com/apache/spark/blob/master/python/pyspark/sql/dataframe.py#L838

Member:

Maybe change it together? @HyukjinKwon

HyukjinKwon (Member):

Yeah, I'll submit a PR tomorrow to fix them up together.

maropu (Member, Author):

Oh, I only noticed your comment now, @HyukjinKwon. If we still need minor fixes on dataframe.py, could you handle them in follow-ups? Thanks!

Member:

Wait, for this one the default value is fine. Although it's None, it works like False; it's just there to support different combinations of arguments.

maropu (Member, Author):

Yea, that's the thing: "it works like False"... so I felt it's difficult to explain that in the param docs...

Comment on lines +308 to +312:

    argtypes = [
        str(type(arg)) for arg in [extended, mode] if arg is not None]
    raise TypeError(
        "extended (optional) and mode (optional) should be a bool and str; "
        "however, got [%s]." % ", ".join(argtypes))
viirya (Member):

If only a wrong mode is given, this will print something like:

extended (optional) and mode (optional) should be a bool and str; however, got wrong_type_for_mode.

Not a big deal, but maybe only print the corresponding arg name if that argument is given the wrong type.
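
A minimal sketch of that suggestion, checking each argument separately so the message names only the offending one (illustrative, not the follow-up PR's actual code):

    # illustrative only: report just the argument that has the wrong type
    def _check_explain_args(extended=None, mode=None):
        if extended is not None and not isinstance(extended, bool):
            raise TypeError("extended should be a bool; however, got %s." % type(extended))
        if mode is not None and not isinstance(mode, str):
            raise TypeError("mode should be a str; however, got %s." % type(mode))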

@@ -550,8 +545,29 @@ class Dataset[T] private[sql](
       case ExplainMode.Formatted =>
         qe.simpleString(formatted = true)
     }
   }

+  private[sql] def toExplainString(mode: String): String = {
viirya (Member):

Is this only for Python to call? I think it would be better to add a short comment.

maropu (Member, Author):

Ah, yes. OK, I'll follow up; thanks for the comment.
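
For illustration, a guess at how the Python side would reach this JVM-only helper through py4j (the call site is an assumption based on the diff above, not a confirmed excerpt):

    # hypothetical caller inside python/pyspark/sql/dataframe.py;
    # self._jdf is the py4j handle to the JVM Dataset, and Scala's
    # private[sql] compiles to a public method in bytecode, so py4j
    # can still invoke it even though user Scala code cannot
    print(self._jdf.toExplainString(explain_mode))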

viirya (Member) commented Dec 13, 2019

LGTM too. Thanks @maropu for this work!

dongjoon-hyun pushed a commit that referenced this pull request Dec 14, 2019
…park df.explain

### What changes were proposed in this pull request?

This PR is a follow-up to #26861 to address minor comments from viirya.

### Why are the changes needed?

For better error messages.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Manually tested.

Closes #26886 from maropu/SPARK-30231-FOLLOWUP.

Authored-by: Takeshi Yamamuro <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>