Roll forward "[SPARK-23096][SS] Migrate rate source to V2" #20922
Conversation
## What changes were proposed in this pull request?
This PR migrates the micro-batch rate source to the V2 API and rewrites the UTs to suit V2 testing.
## How was this patch tested?
UTs.
Author: jerryshao <[email protected]>
Closes apache#20688 from jerryshao/SPARK-23096.
@jerryshao - sorry to hijack the roll-forward - I'm excited about this PR and really want it in :)
@@ -64,6 +64,7 @@ private[sql] trait SQLTestUtils extends SparkFunSuite with SQLTestUtilsBase with
    if (loadTestDataBeforeTests) {
      loadTestData()
    }
    SparkSession.setActiveSession(spark)
The active session should be set before we execute the plan, right? For example, in QueryExecution for each query. What is the reason we need to do it here?
The active session is required for instantiating the DataSourceReader, which is done at planning time (spark.readStream.{...}.load()) in order to determine the schema.
Discussed offline. What we should do instead is set the default session when the test spark session is initialized, since that initialization doesn't invoke SparkSession.getOrCreate(), which normally sets it.
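The active-vs-default session distinction behind this discussion can be illustrated with a self-contained sketch. FakeSession below is a stand-in invented for illustration, not Spark's actual SparkSession; it only mimics the thread-local "active" slot, the global "default" slot, and the fact that getOrCreate() is what normally registers the default:

```scala
// Stand-in for SparkSession's active/default bookkeeping (illustration only).
class FakeSession(val name: String)

object FakeSession {
  private val active = new ThreadLocal[Option[FakeSession]] {
    override def initialValue(): Option[FakeSession] = None
  }
  private var defaultSession: Option[FakeSession] = None

  def setActiveSession(s: FakeSession): Unit = active.set(Some(s))
  def setDefaultSession(s: FakeSession): Unit = defaultSession = Some(s)
  def getActiveSession: Option[FakeSession] = active.get()
  def getDefaultSession: Option[FakeSession] = defaultSession

  // getOrCreate normally registers the default session; a test harness that
  // constructs the session directly skips this step entirely.
  def getOrCreate(name: String): FakeSession = defaultSession.getOrElse {
    val s = new FakeSession(name)
    setDefaultSession(s)
    s
  }
}

// A test harness that builds the session directly leaves the default unset:
val testSession = new FakeSession("test")
assert(FakeSession.getDefaultSession.isEmpty)

// The fix discussed here: register the default at initialization time.
FakeSession.setDefaultSession(testSession)
```

Any source that later falls back to the default session then finds it, without each test having to set the active session by hand.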
(This issue was spun off into #20926.)
Test build #88669 has finished for PR 20922 at commit
Test build #88674 has finished for PR 20922 at commit
Test build #88676 has finished for PR 20922 at commit
Thanks for the help @jose-torres.
override def createMicroBatchReader(
    schema: Optional[StructType],
    checkpointLocation: String,
    options: DataSourceOptions): MicroBatchReader = {
Here, if MicroBatchReadSupport could take a SparkSession parameter like StreamSourceProvider#createSource takes a sqlContext, then the specific source would not need to get the session from a thread-local or default variable, and the UT wouldn't need setDefaultSession. That's what I thought when I did this refactoring work.
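The suggested change could be sketched as below. The types here are local stand-ins invented for illustration (the real Optional, StructType, DataSourceOptions, and MicroBatchReader live in Spark's packages), and the "proposed" trait is hypothetical, not an actual Spark interface:

```scala
// Local stand-ins for the real Spark types (illustration only).
class SparkSessionStub
class StructTypeStub
class DataSourceOptionsStub(val entries: Map[String, String])
trait MicroBatchReaderStub

// Today's V2 shape (as in the diff above): no session parameter, so a source
// needing one must reach into SparkSession.get{Active/Default}Session.
trait MicroBatchReadSupportToday {
  def createMicroBatchReader(
      schema: Option[StructTypeStub],
      checkpointLocation: String,
      options: DataSourceOptionsStub): MicroBatchReaderStub
}

// The suggestion in this comment: mirror StreamSourceProvider#createSource
// and pass the session explicitly. Hypothetical signature.
trait MicroBatchReadSupportProposed {
  def createMicroBatchReader(
      session: SparkSessionStub,
      schema: Option[StructTypeStub],
      checkpointLocation: String,
      options: DataSourceOptionsStub): MicroBatchReaderStub
}

val impl = new MicroBatchReadSupportProposed {
  def createMicroBatchReader(
      session: SparkSessionStub,
      schema: Option[StructTypeStub],
      checkpointLocation: String,
      options: DataSourceOptionsStub): MicroBatchReaderStub =
    new MicroBatchReaderStub {}
}
val reader = impl.createMicroBatchReader(
  new SparkSessionStub, None, "/tmp/cp", new DataSourceOptionsStub(Map.empty))
```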
What do you think @jose-torres @tdas @gatorsmile ?
I agree that there's a mismatch here.
The reason it doesn't currently have this parameter is that one of the DataSourceV2 design goals (https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit#heading=h.mi1fbff5f8f9) was to avoid API dependencies on upper level APIs like SparkSession. (IIRC Wenchen and I discussed SparkSession specifically in the design stage.) In this story, SparkSession.get{Active/Default}Session is just a way to keep our existing sources working rather than an encouraged development practice.
I agree that there's a mismatch which could be worth some discussion, but I think it's out of scope for this PR.
Thanks for the explanation @jose-torres. This seems like a quite common usage scenario: I also see that the socket/Kafka sources and the console sink require SparkSession, as does my customized Hive streaming sink (https://github.com/jerryshao/spark-hive-streaming-sink/blob/7b3afcee280d2e70ffb12dde24184726b618829d/core/src/main/scala/com/hortonworks/spark/hive/HiveSourceProvider.scala#L47). If we added that parameter back, things might be much easier.
What's your opinion @cloud-fan?
Can you give some concrete examples of why we need SparkSession in data source implementations? If it's only for config, we should use DataSourceOptions. If you want to access some internal state of Spark, we should improve the interface to expose that state explicitly.
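The config-over-session point can be sketched with a simplified, self-contained version of the pattern DataSourceOptions provides: a case-insensitive key-value bag handed to the source. SimpleOptions below mimics that pattern for illustration and is not the real Spark class:

```scala
// Simplified stand-in for DataSourceOptions: case-insensitive lookup over
// the options the user supplied, e.g. .option("rowsPerSecond", "10").
class SimpleOptions(raw: Map[String, String]) {
  private val m = raw.map { case (k, v) => k.toLowerCase -> v }

  def get(key: String): Option[String] = m.get(key.toLowerCase)

  def getInt(key: String, default: Int): Int =
    get(key).map(_.toInt).getOrElse(default)
}

// A source reads its config from the options it was handed, rather than
// pulling a SparkSession out of thread-local state to read SQL confs:
val opts = new SimpleOptions(Map("rowsPerSecond" -> "10"))
val rowsPerSecond = opts.getInt("ROWSPERSECOND", 1) // case-insensitive hit
```

This keeps the source decoupled from upper-level APIs: everything it needs arrives through its own entry point.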
For example, if a specific source requires HDFSMetadataLog for recovery, then it requires SparkSession. Also, in my case I need to get the table description from the catalog, which also requires SparkSession. This may not be a typical use case, but I think it would be easier for users if we exposed SparkSession.
Why does HDFSMetadataLog need SparkSession?
For the table description, I'd like to explicitly pass it via the interface, not SparkSession.
One of the constructor parameters of HDFSMetadataLog is SparkSession.
We can change it. It only needs a Hadoop conf.
We should not pass SparkSession just for getting hadoopConf, right? cc @zsxwing
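The dependency-narrowing being proposed (accept only the Hadoop configuration the log actually uses, rather than a whole SparkSession) can be sketched with stub types. HadoopConf, SessionStub, and the two MetadataLog classes below are illustrative stand-ins, not the real Spark or Hadoop classes:

```scala
// Stand-ins for Hadoop's Configuration and Spark's SparkSession.
class HadoopConf(val entries: Map[String, String])
class SessionStub(val hadoopConf: HadoopConf)

// Before: the log takes the whole session but only ever touches hadoopConf,
// dragging an upper-level API into every source that uses the log.
class MetadataLogBefore(session: SessionStub, path: String) {
  def fs: String = session.hadoopConf.entries.getOrElse("fs.defaultFS", "file:///")
}

// After: the constructor asks only for what it actually needs.
class MetadataLogAfter(conf: HadoopConf, path: String) {
  def fs: String = conf.entries.getOrElse("fs.defaultFS", "file:///")
}

val conf = new HadoopConf(Map("fs.defaultFS" -> "hdfs://nn:8020"))
val log = new MetadataLogAfter(conf, "/checkpoints/meta")
```

With the narrower constructor, a V2 source can build its metadata log from options it already has, without reaching for a session at all.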
Test build #88734 has finished for PR 20922 at commit
retest this please
Test build #88743 has finished for PR 20922 at commit
retest this please
Test build #88746 has finished for PR 20922 at commit
thanks, merging to master!
## What changes were proposed in this pull request?
Roll forward c68ec4e (apache#20688). There are two minor test changes required:
* An error which used to be TreeNodeException[ArithmeticException] is no longer wrapped and is now just ArithmeticException.
* The test framework simply does not set the active Spark session. (Or rather, it doesn't do so early enough - I think it only happens when a query is analyzed.) I've added the required logic to SQLTestUtils.
## How was this patch tested?
existing tests
Author: Jose Torres <[email protected]>
Author: jerryshao <[email protected]>
Closes apache#20922 from jose-torres/ratefix.