[SPARK-23303][SQL] improve the explain result for data source v2 relations #20477

cloud-fan · 2018-02-01T15:24:56Z

What changes were proposed in this pull request?

The current explain result for data source v2 relation is unreadable:

== Parsed Logical Plan ==
'Filter ('i > 6)
+- AnalysisBarrier
      +- Project [j#1]
         +- DataSourceV2Relation [i#0, j#1], org.apache.spark.sql.sources.v2.AdvancedDataSourceV2$Reader@3b415940

== Analyzed Logical Plan ==
j: int
Project [j#1]
+- Filter (i#0 > 6)
   +- Project [j#1, i#0]
      +- DataSourceV2Relation [i#0, j#1], org.apache.spark.sql.sources.v2.AdvancedDataSourceV2$Reader@3b415940

== Optimized Logical Plan ==
Project [j#1]
+- Filter isnotnull(i#0)
   +- DataSourceV2Relation [i#0, j#1], org.apache.spark.sql.sources.v2.AdvancedDataSourceV2$Reader@3b415940

== Physical Plan ==
*(1) Project [j#1]
+- *(1) Filter isnotnull(i#0)
   +- *(1) DataSourceV2Scan [i#0, j#1], org.apache.spark.sql.sources.v2.AdvancedDataSourceV2$Reader@3b415940

After this PR:

== Parsed Logical Plan ==
'Project [unresolvedalias('j, None)]
+- AnalysisBarrier
      +- Relation AdvancedDataSourceV2[i#0, j#1]

== Analyzed Logical Plan ==
j: int
Project [j#1]
+- Relation AdvancedDataSourceV2[i#0, j#1]

== Optimized Logical Plan ==
Relation AdvancedDataSourceV2[j#1]

== Physical Plan ==
*(1) Scan AdvancedDataSourceV2[j#1]

== Analyzed Logical Plan ==
i: int, j: int
Filter (i#88 > 3)
+- Relation JavaAdvancedDataSourceV2[i#88, j#89]

== Optimized Logical Plan ==
Filter isnotnull(i#88)
+- Relation JavaAdvancedDataSourceV2[i#88, j#89] (PushedFilter: [GreaterThan(i,3)])

== Physical Plan ==
*(1) Filter isnotnull(i#88)
+- *(1) Scan JavaAdvancedDataSourceV2[i#88, j#89] (PushedFilter: [GreaterThan(i,3)])

An example for a streaming query:

== Parsed Logical Plan ==
Aggregate [value#6], [value#6, count(1) AS count(1)#11L]
+- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true, false) AS value#6]
   +- MapElements <function1>, class java.lang.String, [StructField(value,StringType,true)], obj#5: java.lang.String
      +- DeserializeToObject cast(value#25 as string).toString, obj#4: java.lang.String
         +- Streaming Relation FakeDataSourceV2$[value#25]

== Analyzed Logical Plan ==
value: string, count(1): bigint
Aggregate [value#6], [value#6, count(1) AS count(1)#11L]
+- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true, false) AS value#6]
   +- MapElements <function1>, class java.lang.String, [StructField(value,StringType,true)], obj#5: java.lang.String
      +- DeserializeToObject cast(value#25 as string).toString, obj#4: java.lang.String
         +- Streaming Relation FakeDataSourceV2$[value#25]

== Optimized Logical Plan ==
Aggregate [value#6], [value#6, count(1) AS count(1)#11L]
+- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true, false) AS value#6]
   +- MapElements <function1>, class java.lang.String, [StructField(value,StringType,true)], obj#5: java.lang.String
      +- DeserializeToObject value#25.toString, obj#4: java.lang.String
         +- Streaming Relation FakeDataSourceV2$[value#25]

== Physical Plan ==
*(4) HashAggregate(keys=[value#6], functions=[count(1)], output=[value#6, count(1)#11L])
+- StateStoreSave [value#6], state info [ checkpoint = *********(redacted)/cloud/dev/spark/target/tmp/temporary-549f264b-2531-4fcb-a52f-433c77347c12/state, runId = f84d9da9-2f8c-45c1-9ea1-70791be684de, opId = 0, ver = 0, numPartitions = 5], Complete, 0
   +- *(3) HashAggregate(keys=[value#6], functions=[merge_count(1)], output=[value#6, count#16L])
      +- StateStoreRestore [value#6], state info [ checkpoint = *********(redacted)/cloud/dev/spark/target/tmp/temporary-549f264b-2531-4fcb-a52f-433c77347c12/state, runId = f84d9da9-2f8c-45c1-9ea1-70791be684de, opId = 0, ver = 0, numPartitions = 5]
         +- *(2) HashAggregate(keys=[value#6], functions=[merge_count(1)], output=[value#6, count#16L])
            +- Exchange hashpartitioning(value#6, 5)
               +- *(1) HashAggregate(keys=[value#6], functions=[partial_count(1)], output=[value#6, count#16L])
                  +- *(1) SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true, false) AS value#6]
                     +- *(1) MapElements <function1>, obj#5: java.lang.String
                        +- *(1) DeserializeToObject value#25.toString, obj#4: java.lang.String
                           +- *(1) Scan FakeDataSourceV2$[value#25]

How was this patch tested?

N/A

cloud-fan · 2018-02-01T15:25:31Z

cc @rxin @gatorsmile @rdblue @jose-torres

SparkQA · 2018-02-01T15:36:43Z

Test build #86937 has finished for PR 20477 at commit d7cf774.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-02-02T03:49:49Z

Test build #86957 has finished for PR 20477 at commit 1f61965.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2018-02-02T03:52:24Z

retest this please

SparkQA · 2018-02-02T06:46:12Z

Test build #86962 has finished for PR 20477 at commit 1f61965.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-02-02T06:57:36Z

Test build #86963 has finished for PR 20477 at commit 1f61965.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-02-02T08:05:01Z

Test build #86974 has finished for PR 20477 at commit 4ca2c40.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2018-02-02T10:31:29Z

retest this please

SparkQA · 2018-02-02T13:35:50Z

Test build #86986 has finished for PR 20477 at commit 4ca2c40.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2018-02-02T18:47:17Z

...core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExec.scala

  extends LeafExecNode with DataSourceReaderHolder with ColumnarBatchScan {

  override def canEqual(other: Any): Boolean = other.isInstanceOf[DataSourceV2ScanExec]

+  override def simpleString: String = s"Scan $metadataString"


For your info, https://github.com/apache/spark/pull/20226/files#diff-3e1258979e16f72a829abb8a1cd88bda is also updating the explain output. Overriding nodeName looks better for the UI.

rdblue · 2018-02-02

+1 for overriding nodeName.

cloud-fan · 2018-02-06

I've replied on that PR. I don't think overriding nodeName is the right way to fix the UI issue, as we would need to override more methods. We can discuss this problem further on that PR, but it should not block this one.
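To make the nodeName versus simpleString trade-off in this thread concrete, here is a minimal sketch in plain Scala. It is a toy model, not Spark's actual TreeNode or SparkPlan: the trait `ToyNode` and both scan classes are hypothetical names. The only behavior assumed from Spark is that the default one-line description is built from the node name plus an argument string.

```scala
object ExplainNamingSketch {
  // Hypothetical stand-in for a plan node; not a Spark class.
  trait ToyNode {
    def nodeName: String
    def argString: String = ""
    // Default one-line description: node name followed by its arguments.
    def simpleString: String = s"$nodeName $argString".trim
  }

  // Option A (the approach in the diff above): override simpleString only.
  // The explain line changes, but nodeName keeps its original value.
  class ScanOverridingSimpleString(metadata: String) extends ToyNode {
    val nodeName: String = "DataSourceV2Scan"
    override def simpleString: String = s"Scan $metadata"
  }

  // Option B (the alternative raised in review): override nodeName and let
  // the default simpleString pick it up, so both values stay consistent.
  class ScanOverridingNodeName(metadata: String) extends ToyNode {
    def nodeName: String = "Scan"
    override def argString: String = metadata
  }
}
```

In this toy model both options print the same explain line, and they differ only in what a consumer reading `nodeName` directly would see, which is the UI concern discussed above.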

gatorsmile · 2018-02-02T18:48:17Z

...re/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceReaderHolder.scala

+      Utils.truncatedString(entries.map {
+        case (key, value) => key + ": " + StringUtils.abbreviate(redact(value), 100)
+      }, " (", ", ", ")")
+    } else ""
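The hunk above formats metadata entries as a parenthesized suffix such as ` (PushedFilter: [GreaterThan(i,3)])`. The following is a minimal, dependency-free sketch of that formatting. It is not the PR's code: `MetadataStringSketch`, its `redact` rule, and its `abbreviate` helper are hypothetical stand-ins for Spark's redaction utility and `StringUtils.abbreviate`, and `mkString` replaces `Utils.truncatedString`, so the list of entries is not truncated here.

```scala
object MetadataStringSketch {
  // Hypothetical redaction rule (an assumption, not Spark's): hide values
  // whose key looks sensitive.
  def redact(key: String, value: String): String =
    if (key.toLowerCase.contains("password")) "*********(redacted)" else value

  // Shorten long values to at most maxWidth characters, ending in "...".
  def abbreviate(s: String, maxWidth: Int): String =
    if (s.length <= maxWidth) s else s.take(maxWidth - 3) + "..."

  // Build "<source>[<output attrs>] (<key>: <value>, ...)"; the suffix is
  // omitted when there are no metadata entries.
  def metadataString(
      sourceName: String,
      output: Seq[String],
      entries: Seq[(String, String)]): String = {
    val outputStr = output.mkString("[", ", ", "]")
    val entriesStr =
      if (entries.nonEmpty) {
        entries.map { case (key, value) =>
          key + ": " + abbreviate(redact(key, value), 100)
        }.mkString(" (", ", ", ")")
      } else ""
    s"$sourceName$outputStr$entriesStr"
  }
}
```

Prefixing the result with `Relation` or `Scan` gives lines of the same shape as those in the PR description, which suggests how the logical relation and the physical scan can share one metadata string.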


SparkQA · 2018-02-05T13:57:11Z

Test build #87064 has finished for PR 20477 at commit a40d18e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-02-06T06:18:11Z

Test build #87087 has finished for PR 20477 at commit 1556a9f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

huaxingao · 2018-02-07T04:41:41Z

@cloud-fan
I have a question about the Optimized Logical Plan. The "What changes were proposed" section says that after this PR, the Optimized Logical Plan will be as follows:

== Optimized Logical Plan ==
Relation AdvancedDataSourceV2[i#0, j#1]

== Physical Plan ==
*(1) Scan AdvancedDataSourceV2[i#0, j#1] (PushedFilter: [IsNotNull(i), GreaterThan(i,3)])

It seems to me that the pushdown happens during optimization. Should the optimized logical plan also contain the pushed filter, like this?

== Optimized Logical Plan ==
Relation AdvancedDataSourceV2[i#0, j#1] (PushedFilter: [IsNotNull(i), GreaterThan(i,3)])

cloud-fan · 2018-02-07T05:48:45Z

The result was outdated. I've updated the PR description; please check again, thanks!

SparkQA · 2018-02-07T08:05:01Z

Test build #87145 has finished for PR 20477 at commit c4bfbf4.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-02-07T08:05:02Z

Test build #87146 has finished for PR 20477 at commit c0c5895.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2018-02-07T08:57:22Z

retest this please

SparkQA · 2018-02-07T11:20:17Z

Test build #87152 has finished for PR 20477 at commit c0c5895.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2018-02-07T13:11:21Z

retest this please

SparkQA · 2018-02-07T16:16:18Z

Test build #87158 has finished for PR 20477 at commit c0c5895.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-02-08T06:50:14Z

Test build #87189 has finished for PR 20477 at commit 2b4a095.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2018-02-08T07:20:10Z

also cc @tdas @jose-torres @zsxwing

SparkQA · 2018-02-08T08:05:01Z

Test build #87197 has finished for PR 20477 at commit 0efd5d3.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2018-02-08T08:56:03Z

retest this please

SparkQA · 2018-02-08T12:01:12Z

Test build #87208 has finished for PR 20477 at commit 0efd5d3.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-02-08T15:54:50Z

Test build #87220 has finished for PR 20477 at commit 4bff16d.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-02-09T05:09:28Z

Test build #87242 has finished for PR 20477 at commit 0cc0600.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2018-02-09T05:34:59Z

retest this please

SparkQA · 2018-02-09T07:14:55Z

Test build #87247 has finished for PR 20477 at commit 0cc0600.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2018-02-12T22:05:46Z

retest this please

SparkQA · 2018-02-13T00:23:25Z

Test build #87350 has finished for PR 20477 at commit 0cc0600.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

kiszk · 2018-02-13T01:11:34Z

retest this please

SparkQA · 2018-02-13T04:12:04Z

Test build #87358 has finished for PR 20477 at commit 0cc0600.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2018-02-13T05:11:46Z

LGTM. Merged to master.

gatorsmile · 2018-02-14T00:13:44Z

As pointed out by @tdas, since this PR impacts streaming, I am reverting it from master. Thanks!

tdas · 2018-02-14T00:19:00Z

To be clear, the MicroBatchReader -> DataSourceV2 map added to MicroBatchExecution has potential implications for self-joins (which I am trying to debug in #20598).

gatorsmile · 2018-02-14T00:22:53Z

Thanks! The PR has been reverted.

tdas · 2018-02-14T00:32:58Z

Thank you very much, @gatorsmile. I promise I will do a proper review of the streaming side when you reopen this PR.

cloud-fan force-pushed the explain branch from d7cf774 to 1f61965 Compare February 2, 2018 01:30

cloud-fan force-pushed the explain branch from 1f61965 to 4ca2c40 Compare February 2, 2018 07:09

gatorsmile reviewed Feb 2, 2018

cloud-fan force-pushed the explain branch from 4ca2c40 to a40d18e Compare February 5, 2018 10:35

cloud-fan force-pushed the explain branch from a40d18e to 1556a9f Compare February 6, 2018 03:08

cloud-fan force-pushed the explain branch 2 times, most recently from c4bfbf4 to c0c5895 Compare February 7, 2018 05:47

cloud-fan force-pushed the explain branch from c0c5895 to 2b4a095 Compare February 8, 2018 04:39

improve the explain result for data source v2 relations

a3acb97

cloud-fan force-pushed the explain branch from 0efd5d3 to 4bff16d Compare February 8, 2018 15:50

fix streaming

0cc0600

cloud-fan force-pushed the explain branch from 4bff16d to 0cc0600 Compare February 9, 2018 03:34

asfgit closed this in f17b936 Feb 13, 2018


