Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-23303][SQL] improve the explain result for data source v2 relations #20477

Closed
wants to merge 2 commits into from

Conversation

cloud-fan
Copy link
Contributor

@cloud-fan cloud-fan commented Feb 1, 2018

What changes were proposed in this pull request?

The current explain result for data source v2 relation is unreadable:

== Parsed Logical Plan ==
'Filter ('i > 6)
+- AnalysisBarrier
      +- Project [j#1]
         +- DataSourceV2Relation [i#0, j#1], org.apache.spark.sql.sources.v2.AdvancedDataSourceV2$Reader@3b415940

== Analyzed Logical Plan ==
j: int
Project [j#1]
+- Filter (i#0 > 6)
   +- Project [j#1, i#0]
      +- DataSourceV2Relation [i#0, j#1], org.apache.spark.sql.sources.v2.AdvancedDataSourceV2$Reader@3b415940

== Optimized Logical Plan ==
Project [j#1]
+- Filter isnotnull(i#0)
   +- DataSourceV2Relation [i#0, j#1], org.apache.spark.sql.sources.v2.AdvancedDataSourceV2$Reader@3b415940

== Physical Plan ==
*(1) Project [j#1]
+- *(1) Filter isnotnull(i#0)
   +- *(1) DataSourceV2Scan [i#0, j#1], org.apache.spark.sql.sources.v2.AdvancedDataSourceV2$Reader@3b415940

after this PR

== Parsed Logical Plan ==
'Project [unresolvedalias('j, None)]
+- AnalysisBarrier
      +- Relation AdvancedDataSourceV2[i#0, j#1]

== Analyzed Logical Plan ==
j: int
Project [j#1]
+- Relation AdvancedDataSourceV2[i#0, j#1]

== Optimized Logical Plan ==
Relation AdvancedDataSourceV2[j#1]

== Physical Plan ==
*(1) Scan AdvancedDataSourceV2[j#1]

== Analyzed Logical Plan ==
i: int, j: int
Filter (i#88 > 3)
+- Relation JavaAdvancedDataSourceV2[i#88, j#89]

== Optimized Logical Plan ==
Filter isnotnull(i#88)
+- Relation JavaAdvancedDataSourceV2[i#88, j#89] (PushedFilter: [GreaterThan(i,3)])

== Physical Plan ==
*(1) Filter isnotnull(i#88)
+- *(1) Scan JavaAdvancedDataSourceV2[i#88, j#89] (PushedFilter: [GreaterThan(i,3)])

an example for streaming query

== Parsed Logical Plan ==
Aggregate [value#6], [value#6, count(1) AS count(1)#11L]
+- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true, false) AS value#6]
   +- MapElements <function1>, class java.lang.String, [StructField(value,StringType,true)], obj#5: java.lang.String
      +- DeserializeToObject cast(value#25 as string).toString, obj#4: java.lang.String
         +- Streaming Relation FakeDataSourceV2$[value#25]

== Analyzed Logical Plan ==
value: string, count(1): bigint
Aggregate [value#6], [value#6, count(1) AS count(1)#11L]
+- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true, false) AS value#6]
   +- MapElements <function1>, class java.lang.String, [StructField(value,StringType,true)], obj#5: java.lang.String
      +- DeserializeToObject cast(value#25 as string).toString, obj#4: java.lang.String
         +- Streaming Relation FakeDataSourceV2$[value#25]

== Optimized Logical Plan ==
Aggregate [value#6], [value#6, count(1) AS count(1)#11L]
+- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true, false) AS value#6]
   +- MapElements <function1>, class java.lang.String, [StructField(value,StringType,true)], obj#5: java.lang.String
      +- DeserializeToObject value#25.toString, obj#4: java.lang.String
         +- Streaming Relation FakeDataSourceV2$[value#25]

== Physical Plan ==
*(4) HashAggregate(keys=[value#6], functions=[count(1)], output=[value#6, count(1)#11L])
+- StateStoreSave [value#6], state info [ checkpoint = *********(redacted)/cloud/dev/spark/target/tmp/temporary-549f264b-2531-4fcb-a52f-433c77347c12/state, runId = f84d9da9-2f8c-45c1-9ea1-70791be684de, opId = 0, ver = 0, numPartitions = 5], Complete, 0
   +- *(3) HashAggregate(keys=[value#6], functions=[merge_count(1)], output=[value#6, count#16L])
      +- StateStoreRestore [value#6], state info [ checkpoint = *********(redacted)/cloud/dev/spark/target/tmp/temporary-549f264b-2531-4fcb-a52f-433c77347c12/state, runId = f84d9da9-2f8c-45c1-9ea1-70791be684de, opId = 0, ver = 0, numPartitions = 5]
         +- *(2) HashAggregate(keys=[value#6], functions=[merge_count(1)], output=[value#6, count#16L])
            +- Exchange hashpartitioning(value#6, 5)
               +- *(1) HashAggregate(keys=[value#6], functions=[partial_count(1)], output=[value#6, count#16L])
                  +- *(1) SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true, false) AS value#6]
                     +- *(1) MapElements <function1>, obj#5: java.lang.String
                        +- *(1) DeserializeToObject value#25.toString, obj#4: java.lang.String
                           +- *(1) Scan FakeDataSourceV2$[value#25]

How was this patch tested?

N/A

@cloud-fan
Copy link
Contributor Author

@SparkQA
Copy link

SparkQA commented Feb 1, 2018

Test build #86937 has finished for PR 20477 at commit d7cf774.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 2, 2018

Test build #86957 has finished for PR 20477 at commit 1f61965.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Feb 2, 2018

Test build #86962 has finished for PR 20477 at commit 1f61965.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 2, 2018

Test build #86963 has finished for PR 20477 at commit 1f61965.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 2, 2018

Test build #86974 has finished for PR 20477 at commit 4ca2c40.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Feb 2, 2018

Test build #86986 has finished for PR 20477 at commit 4ca2c40.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

extends LeafExecNode with DataSourceReaderHolder with ColumnarBatchScan {

override def canEqual(other: Any): Boolean = other.isInstanceOf[DataSourceV2ScanExec]

override def simpleString: String = s"Scan $metadataString"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For your info, https://github.com/apache/spark/pull/20226/files#diff-3e1258979e16f72a829abb8a1cd88bda is also updating the output of the explain. Overriding the nodeName looks better for UI.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for overriding nodeName.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've replied on that PR. I don't think overwriting nodeName is the right way to fix the UI issue, as we need to overwrite more methods. We can discuss more on that PR about this problem, but it should not block this PR.

Utils.truncatedString(entries.map {
case (key, value) => key + ": " + StringUtils.abbreviate(redact(value), 100)
}, " (", ", ", ")")
} else ""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit. style

@SparkQA
Copy link

SparkQA commented Feb 5, 2018

Test build #87064 has finished for PR 20477 at commit a40d18e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 6, 2018

Test build #87087 has finished for PR 20477 at commit 1556a9f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@huaxingao
Copy link
Contributor

@cloud-fan
I have a question about the Optimized Logical Plan. In the "What changed were proposed" section, it is said that after this PR, the Optimized Logical Plan will be as following

== Optimized Logical Plan ==
Relation AdvancedDataSourceV2[i#0, j#1]

== Physical Plan ==
*(1) Scan AdvancedDataSourceV2[i#0, j#1] (PushedFilter: [IsNotNull(i), GreaterThan(i,3)])

It seems to me that push down is happened at optimization. Should the optimized logical plan also contain the pushed filter like this?

== Optimized Logical Plan ==
Relation AdvancedDataSourceV2[i#0, j#1] (PushedFilter: [IsNotNull(i), GreaterThan(i,3)])

@cloud-fan cloud-fan force-pushed the explain branch 2 times, most recently from c4bfbf4 to c0c5895 Compare February 7, 2018 05:47
@cloud-fan
Copy link
Contributor Author

cloud-fan commented Feb 7, 2018

The result was out-dated, I've updated the PR description, please check again, thanks!

@SparkQA
Copy link

SparkQA commented Feb 7, 2018

Test build #87145 has finished for PR 20477 at commit c4bfbf4.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 7, 2018

Test build #87146 has finished for PR 20477 at commit c0c5895.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Feb 7, 2018

Test build #87152 has finished for PR 20477 at commit c0c5895.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Feb 7, 2018

Test build #87158 has finished for PR 20477 at commit c0c5895.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 8, 2018

Test build #87189 has finished for PR 20477 at commit 2b4a095.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor Author

also cc @tdas @jose-torres @zsxwing

@SparkQA
Copy link

SparkQA commented Feb 8, 2018

Test build #87197 has finished for PR 20477 at commit 0efd5d3.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Feb 8, 2018

Test build #87208 has finished for PR 20477 at commit 0efd5d3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 8, 2018

Test build #87220 has finished for PR 20477 at commit 4bff16d.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 9, 2018

Test build #87242 has finished for PR 20477 at commit 0cc0600.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Feb 9, 2018

Test build #87247 has finished for PR 20477 at commit 0cc0600.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

retest this please

@SparkQA
Copy link

SparkQA commented Feb 13, 2018

Test build #87350 has finished for PR 20477 at commit 0cc0600.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Copy link
Member

kiszk commented Feb 13, 2018

retest this please

@SparkQA
Copy link

SparkQA commented Feb 13, 2018

Test build #87358 has finished for PR 20477 at commit 0cc0600.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

LGTM Merged to master.

@asfgit asfgit closed this in f17b936 Feb 13, 2018
@gatorsmile
Copy link
Member

As pointed out by @tdas , since this PR impacts the streaming, I am reverting this PR from master. Thanks!

@tdas
Copy link
Contributor

tdas commented Feb 14, 2018

To be clear, the MicrobatchReader -> DataSourceV2 map added to MicroBatchExecution has potential implications in the scenario of self-joins (that I am trying to debug in #20598).

@gatorsmile
Copy link
Member

Thanks! The PR has been reverted.

@tdas
Copy link
Contributor

tdas commented Feb 14, 2018

Thank you very much @gatorsmile, I promise I will do a proper review of the streaming side when you reopen this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants