
[SPARK-30231][SQL][PYTHON] Support explain mode in PySpark df.explain #26861


maropu (Member) commented Dec 12, 2019

What changes were proposed in this pull request?

This PR intends to support the explain modes implemented in #26829 for PySpark.
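
As a rough usage sketch (assuming a SparkSession named spark; the mode strings follow the modes added in #26829, e.g. "simple", "extended", "codegen", "cost", and "formatted", though the exact accepted set here is illustrative rather than quoted from the patch):

    # hypothetical usage of the new optional argument
    df = spark.range(10).groupBy("id").count()
    df.explain()                  # as before: physical plan only
    df.explain(extended=True)     # as before: logical and physical plans
    df.explain(mode="formatted")  # new: select an explain mode by name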

Why are the changes needed?

For better debugging info in PySpark DataFrames.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added unit tests.

maropu (Member, Author) commented Dec 12, 2019

I think the explain modes look useful for debugging, but I'm not sure that adding an optional param to explain is the right approach for this fix. Could you check this, @HyukjinKwon @viirya?

* @group basic
* @since 3.0.0
*/
def explain(mode: ExplainMode): Unit = {
HyukjinKwon (Member) commented Dec 12, 2019

Yes, I think there's a similar case with mode in DataFrameWriter. How about having explain(mode: String) only, instead of explain(mode: ExplainMode)? For the enum one, I'm actually not sure yet (e.g., joinType).

maropu (Member, Author):

Aha, I see. So you mean ExplainMode is only used internally?

HyukjinKwon (Member):

Yes, that was my thinking. WDYT?

maropu (Member, Author) commented Dec 12, 2019

Yea, to me the string argument looks more useful than an enum (because we don't need to import anything for it, and that interface is easy to use from Python). But we might need more input on this. cc: @cloud-fan @dongjoon-hyun

Contributor:

But SaveMode is public. We can have both explain(String) and explain(ExplainMode).

HyukjinKwon (Member) commented Dec 12, 2019

But joinType doesn't expose enums, as an example. ExplainMode was added in Spark 3.0, so we don't necessarily need to expose another API. Actually, isn't using a string easier, given that explain will more often be used for debugging purposes?

HyukjinKwon (Member):

If you guys don't feel strongly, can we just have explain(String) alone for now? I feel somewhat strongly that it's better to start with fewer APIs.

maropu (Member, Author):

Yea, I like fewer-API designs, and I think enum arguments don't matter for Python/R users.

HyukjinKwon (Member):

+1

SparkQA commented Dec 12, 2019

Test build #115223 has finished for PR 26861 at commit 295e840.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

maropu force-pushed the ExplainModeInPython branch from 0def203 to 60071fe on December 12, 2019 at 14:40
SparkQA commented Dec 12, 2019

Test build #115241 has finished for PR 26861 at commit 0def203.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Dec 12, 2019

Test build #115243 has finished for PR 26861 at commit 60071fe.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -253,10 +253,18 @@ def printSchema(self):
         print(self._jdf.schema().treeString())

     @since(1.3)
-    def explain(self, extended=False):
+    def explain(self, extended=None, mode=None):
Member:
Here LGTM
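
For context, a minimal sketch of how the two new optional arguments could be folded into a single explain-mode string before calling into the JVM (the helper name and exact checks below are illustrative assumptions, not the merged code):

    # illustrative only: resolve (extended, mode) into one explain-mode string
    def _resolve_explain_mode(extended=None, mode=None):
        # setting both at once is ambiguous, so reject it
        if extended is not None and mode is not None:
            raise ValueError("extended and mode should not be set at the same time")
        if mode is not None:
            return mode  # e.g. "simple", "extended", "codegen", "cost", "formatted"
        # None and False both fall back to the old default (physical plan only)
        return "extended" if extended else "simple"

Treating None and False alike for extended keeps the existing explain() and explain(True) calls working unchanged.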

SparkQA commented Dec 13, 2019

Test build #115257 has finished for PR 26861 at commit 56509fc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

HyukjinKwon (Member):

Let me merge and address my own comment #26861 (comment) separately, since this PR targets adding the corresponding Python API.

HyukjinKwon (Member):

Merged to master.

maropu (Member, Author) commented Dec 13, 2019

Thanks, @HyukjinKwon!

> Let me merge and address my own comment #26861 (comment) separately, since this PR targets adding the corresponding Python API.

Yea, we need more time to discuss that API design.

maropu (Member, Author) commented Dec 13, 2019

I filed a JIRA for the explain mode in SparkR: https://issues.apache.org/jira/browse/SPARK-30255
But since I'm not familiar with the R code, I won't work on that.

@@ -253,10 +253,18 @@ def printSchema(self):
         print(self._jdf.schema().treeString())

     @since(1.3)
-    def explain(self, extended=False):
+    def explain(self, extended=None, mode=None):
         """Prints the (logical and physical) plans to the console for debugging purpose.

         :param extended: boolean, default ``False``. If ``False``, prints only the physical plan.
Member:

default None?

maropu (Member, Author):

Yea, I did the same thing with sample.

Member:

oh, I meant we should update this param doc. :)

maropu (Member, Author):

Oh, I see.

maropu (Member, Author) commented Dec 14, 2019

But the description of the default is the same as for withReplacement in sample, even though withReplacement=None? https://github.com/apache/spark/blob/master/python/pyspark/sql/dataframe.py#L838

Member:

Maybe change it together? @HyukjinKwon

HyukjinKwon (Member):

Yeah, I'll submit a PR tomorrow to fix them up together.

maropu (Member, Author):

Oh, I only noticed your comment now, @HyukjinKwon. If we still need minor fixes on dataframe.py, could you handle them in follow-ups? Thanks!

Member:

Wait, for this one the default value is fine. Although it's None, it works like False; it's just there to support different combinations of arguments.

maropu (Member, Author):

Yea, that's the thing: "it works like False"... so I felt it's difficult to explain that in the param docs...

Comment on lines +308 to +312:

    argtypes = [
        str(type(arg)) for arg in [extended, mode] if arg is not None]
    raise TypeError(
        "extended (optional) and mode (optional) should be a bool and str; "
        "however, got [%s]." % ", ".join(argtypes))
viirya (Member):

If only a wrong mode is given, this will print something like:

extended (optional) and mode (optional) should be a bool and str; however, got wrong_type_for_mode.

Not a big deal, but maybe only print the corresponding arg name if that argument is given the wrong type.
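
A minimal sketch of that suggestion, checking each argument separately so the message names only the offending one (illustrative, not the follow-up PR's actual code):

    # illustrative only: report just the argument that has the wrong type
    def _check_explain_args(extended=None, mode=None):
        if extended is not None and not isinstance(extended, bool):
            raise TypeError("extended should be a bool; however, got %s." % type(extended))
        if mode is not None and not isinstance(mode, str):
            raise TypeError("mode should be a str; however, got %s." % type(mode))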

@@ -550,8 +545,29 @@ class Dataset[T] private[sql](
       case ExplainMode.Formatted =>
         qe.simpleString(formatted = true)
     }
   }

+  private[sql] def toExplainString(mode: String): String = {
viirya (Member):

Is this only for Python to call? I think it would be better to add a short comment.

maropu (Member, Author):

Ah, yes. OK, I'll follow up; thanks for the comment.
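
For illustration, a guess at how the Python side would reach this JVM-only helper through py4j (the call site is an assumption based on the diff above, not a confirmed excerpt):

    # hypothetical caller inside python/pyspark/sql/dataframe.py;
    # self._jdf is the py4j handle to the JVM Dataset, and Scala's
    # private[sql] compiles to a public method in bytecode, so py4j
    # can still invoke it even though user Scala code cannot
    print(self._jdf.toExplainString(explain_mode))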

viirya (Member) commented Dec 13, 2019

LGTM too. Thanks @maropu for this work!

dongjoon-hyun pushed a commit that referenced this pull request Dec 14, 2019
…park df.explain

### What changes were proposed in this pull request?

This PR is a follow-up to #26861 to address minor comments from viirya.

### Why are the changes needed?

For better error messages.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Manually tested.

Closes #26886 from maropu/SPARK-30231-FOLLOWUP.

Authored-by: Takeshi Yamamuro <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>