
[SPARK-12616] [SQL] Making Logical Operator Union Support Arbitrary Number of Children #10577

Closed
wants to merge 58 commits into from

Conversation

gatorsmile
Member

The existing Union logical operator only supports two children. Thus, this PR adds a new logical operator, Unions, which can have an arbitrary number of children, to replace the existing one.

The Union logical plan is a binary node. However, a typical use case for union is to union a very large number of input sources (DataFrames, RDDs, or files). It is not uncommon to union hundreds of thousands of files. In this case, our optimizer can become very slow due to the large number of logical unions. We should change the Union logical plan to support an arbitrary number of children, and add a single rule in the optimizer to collapse all adjacent Unions into a single Unions. Note that this problem doesn't exist in the physical plan, because the physical Union already supports an arbitrary number of children.
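A minimal sketch of the idea, using toy plan types (assumed for illustration, not Spark's actual classes): the old binary operator builds a deep tree when many inputs are unioned, while the new n-ary operator holds all children directly after collapsing adjacent unions.

```scala
// Toy model: a binary Union versus an n-ary Union, and the flattening step.
sealed trait Plan
case class Leaf(name: String) extends Plan
case class BinaryUnion(left: Plan, right: Plan) extends Plan // old: two children
case class Union(children: Seq[Plan]) extends Plan           // new: n children

object CombineUnionsDemo {
  // Collect the leaves of a tree of adjacent binary unions, left to right.
  def flatten(plan: Plan): Seq[Plan] = plan match {
    case BinaryUnion(l, r) => flatten(l) ++ flatten(r)
    case other             => Seq(other)
  }

  def main(args: Array[String]): Unit = {
    // Chained unionAll calls build a left-deep tree: ((t1 U t2) U t3) U t4
    val deep = (1 to 4).map(i => Leaf("t" + i)).reduceLeft[Plan](BinaryUnion(_, _))
    val collapsed = Union(flatten(deep))
    println(collapsed.children.length) // 4
  }
}
```

With thousands of inputs, the optimizer then sees one shallow node instead of a tree whose depth grows with the number of inputs.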

@gatorsmile
Member Author

@rxin Could you check if this implementation is what you expect? Thanks!

@rxin
Contributor

rxin commented Jan 4, 2016

Maybe we should just remove the old Union and call the new one Union?

@gatorsmile
Member Author

Yeah, it will be better. Will do the change tonight. Thanks!

@marmbrus
Contributor

marmbrus commented Jan 4, 2016

+1, I'd prefer it if there were only one operator that performs unions

object CombineUnions extends Rule[LogicalPlan] {
  private def collectUnionChildren(plan: LogicalPlan): Seq[LogicalPlan] = plan match {
    case Union(l, r) => collectUnionChildren(l) ++ collectUnionChildren(r)
Contributor

you should write this without using recursion to avoid stack overflow.
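A hedged sketch of what the reviewer is asking for, again with toy plan types (assumed, not Spark's): an explicit worklist replaces the call stack, so a union tree with hundreds of thousands of nodes cannot overflow the JVM stack.

```scala
// Toy model of a binary union tree.
sealed trait Plan
case class Leaf(name: String) extends Plan
case class Union(left: Plan, right: Plan) extends Plan

object IterativeFlatten {
  // Iterative flattening: a List used as a stack keeps memory on the heap.
  def collectUnionChildren(root: Plan): Seq[Plan] = {
    var stack: List[Plan] = root :: Nil
    val out = scala.collection.mutable.ArrayBuffer[Plan]()
    while (stack.nonEmpty) {
      val head = stack.head
      stack = stack.tail
      head match {
        case Union(l, r) => stack = l :: r :: stack // left first, preserving order
        case leaf        => out += leaf
      }
    }
    out.toSeq
  }

  def main(args: Array[String]): Unit = {
    // A left-deep tree far deeper than recursion on the default JVM stack allows.
    val deep = (1 to 200000).map(i => Leaf("t" + i)).reduceLeft[Plan](Union(_, _))
    println(collectUnionChildren(deep).length) // 200000
  }
}
```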

Member Author

I see. Removing Union introduces a lot of work, but almost done. Will submit a commit tomorrow. Thanks!

Contributor

Another option would just be to do this at construction time, that way we can avoid paying the cost in the analyzer. This would still limit the cases we could cache (i.e. we'd miss cached data unioned with other data), but that doesn't seem like a huge deal.

I'd leave this rule here either way.

Contributor

+1

Member Author

To do this at construction time, do we need to introduce a new DataFrame API unionAll that can combine more than two DataFrames? @marmbrus @rxin

Is my understanding correct? Thank you!

Contributor

Hi @marmbrus, could I ask you a question regarding your comment here? I don't understand
the following sentence. Could you give me an example? Thanks!

i.e. we'd miss cached data unioned with other data

@gatorsmile
Member Author

Major changes in this commit:

  • Remove the old binary logical operator node Union.
  • Replace the previous recursion-based combineUnions solution with a solution based on foldLeft.

Todo:

  • Will add the new DataFrame and Dataset APIs for unionAll, if my understanding is correct.
  • Will change the optimizer rule for pushing Filter and Project through Unions.
  • Will rename Unions to Union.

Thanks!
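The foldLeft-based approach mentioned above can be sketched roughly as follows, with toy plan types assumed for illustration (not the actual Spark code): each child of a freshly built Union is inspected once, and nested unions contribute their children directly.

```scala
// Toy model: an n-ary Union over named leaf plans.
sealed trait Plan
case class Leaf(name: String) extends Plan
case class Union(children: Seq[Plan]) extends Plan

object FoldLeftCombine {
  // Flatten one level with foldLeft; applied bottom-up by the optimizer,
  // this leaves no Union directly under another Union.
  def combineUnions(u: Union): Union =
    Union(u.children.foldLeft(Seq.empty[Plan]) {
      case (acc, Union(grandchildren)) => acc ++ grandchildren
      case (acc, child)                => acc :+ child
    })

  def main(args: Array[String]): Unit = {
    val nested = Union(Seq(Union(Seq(Leaf("a"), Leaf("b"))), Leaf("c")))
    println(combineUnions(nested)) // Union(List(Leaf(a), Leaf(b), Leaf(c)))
  }
}
```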

@SparkQA

SparkQA commented Jan 5, 2016

Test build #48756 has finished for PR 10577 at commit c1f66f7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

/**
 * A pattern that collects all adjacent unions and returns their children as a Seq.
 */
object Unions {
Contributor

I'm not sure I would get rid of this, just use it in your optimization rule.

Member Author

Sure, will reimplement it that way.

@marmbrus
Contributor

marmbrus commented Jan 5, 2016

Will add the new Dataframe and Dataset APIs for unionAll, if my understanding is correct.

You don't need to add any new APIs, just call the optimizer rule directly on any existing API that adds a Union.

@gatorsmile
Member Author

Understood. Thank you! Will not introduce new APIs.

require(children.forall(_.output.length == children.head.output.length))

val castedTypes: Seq[Option[DataType]] =
  children.tail.foldLeft(children.head.output.map(a => Option(a.dataType))) {
Member Author

There is a bug in this function. Will fix it tonight. Thanks!
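The shape of the castedTypes computation above can be illustrated with a toy example (string "types" and an assumed widening table, standing in for Spark's DataType and coercion rules, which are more involved): fold over the remaining children, widening each column's type pairwise, with None marking columns that have no common type.

```scala
object WidenDemo {
  // Assumed toy widening order: int -> long -> double; equal types always
  // widen to themselves; anything else is incompatible.
  private val order = Seq("int", "long", "double")
  def widen(a: String, b: String): Option[String] =
    if (order.contains(a) && order.contains(b))
      Some(order(math.max(order.indexOf(a), order.indexOf(b))))
    else if (a == b) Some(a)
    else None

  // Per-column common types across all children, in the foldLeft style
  // of the snippet above.
  def castedTypes(childOutputs: Seq[Seq[String]]): Seq[Option[String]] =
    childOutputs.tail.foldLeft(childOutputs.head.map(Option(_))) { (acc, types) =>
      acc.zip(types).map {
        case (Some(t1), t2) => widen(t1, t2)
        case (None, _)      => None
      }
    }

  def main(args: Array[String]): Unit = {
    println(castedTypes(Seq(Seq("int", "string"), Seq("long", "string"))))
    // List(Some(long), Some(string))
  }
}
```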

@gatorsmile gatorsmile changed the title [SPARK-12616] [SQL] Adding a New Logical Operator Unions [SPARK-12616] [SQL] Making Logical Operator Union Support Arbitrary Number of Children Jan 6, 2016
@SparkQA

SparkQA commented Jan 19, 2016

Test build #49634 has finished for PR 10577 at commit f112026.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Union(children: Seq[LogicalPlan]) extends LogicalPlan

@SparkQA

SparkQA commented Jan 19, 2016

Test build #49644 has finished for PR 10577 at commit 4f71741.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

case (e, _) => e
}
Project(casted, plan)
if (casted.exists(_.isInstanceOf[Alias])) Project(casted, plan) else plan
Contributor

No need to do this optimization; the Optimizer is smart enough to remove this unnecessary Project.

Member Author

Sure, let me remove it. : )

@cloud-fan
Contributor

LGTM except one minor comment

@SparkQA

SparkQA commented Jan 20, 2016

Test build #49722 has finished for PR 10577 at commit c63f237.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member Author

The test failure was caused by another PR, which has been reverted.

@gatorsmile
Member Author

retest this please.

@SparkQA

SparkQA commented Jan 20, 2016

Test build #49760 has finished for PR 10577 at commit c63f237.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member Author

The latest merge is for resolving the conflicts.

@SparkQA

SparkQA commented Jan 20, 2016

Test build #49792 has finished for PR 10577 at commit c18381e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Union(children: Seq[LogicalPlan]) extends LogicalPlan

@gatorsmile
Member Author

@rxin @marmbrus , could you please review the latest changes? Thank you!

@cloud-fan
Contributor

LGTM

@rxin
Contributor

rxin commented Jan 20, 2016

Thanks - I'm going to merge this.

@asfgit asfgit closed this in 8f90c15 Jan 20, 2016
@gatorsmile
Member Author

Really appreciate your reviews!!! : )

@Huang-yi-3456
Contributor

Hi @gatorsmile, could you please explain the comment in the union method?

def union(other: Dataset[T]): Dataset[T] = withSetOperator {
  // This breaks caching, but it's usually ok because it addresses a very specific use case:
  // using union to union many files or partitions.
  CombineUnions(Union(logicalPlan, other.logicalPlan)).mapChildren(AnalysisBarrier)
}

What does it mean?

This breaks caching

It would be really helpful if you could give me an example.
Thanks very much in advance.

@cloud-fan
Contributor

The cache key is the logical plan. If a is cached, ideally a.union(b) should leverage the cache of a, but we can't do that; it's a tradeoff to keep the logical plan simple.

@Huang-yi-3456
Contributor

Huang-yi-3456 commented Apr 8, 2020

@cloud-fan thanks for your quick response. I ran a simple test in which a is cached and b is not; here is the output of the explain method:

== Parsed Logical Plan ==
Union
:- AnalysisBarrier
:  +- LogicalRDD [number#2, word#3], false
+- AnalysisBarrier
   +- LogicalRDD [number#8, word#9], false

== Analyzed Logical Plan ==
number: int, word: string
Union
:- LogicalRDD [number#2, word#3], false
+- LogicalRDD [number#8, word#9], false

== Optimized Logical Plan ==
Union
:- InMemoryRelation [number#2, word#3], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
:  +- Scan ExistingRDD[number#2,word#3]
+- LogicalRDD [number#8, word#9], false

== Physical Plan ==
Union
:- InMemoryTableScan [number#2, word#3]
:  +- InMemoryRelation [number#2, word#3], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
:     +- Scan ExistingRDD[number#2,word#3]
+- Scan ExistingRDD[number#8,word#9]

It seems the cached a is used. Please bear with my ignorance and correct me on what's wrong here.
BTW, the Spark version I use is 2.3.0.
Thanks.

@cloud-fan
Contributor

It works because a is not a union, so this change is a no-op for it. To see the cache miss, cache a.union(b) and then run a.union(b).union(c).
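A minimal sketch of the trade-off being described, with a toy plan type assumed for illustration (Spark's real cache matching works on analyzed logical plans): caching matches on plan equality, and once unions are flattened at construction time, the flattened three-way union contains no subplan equal to the cached two-way union.

```scala
// Toy n-ary union over named inputs.
case class Union(children: Seq[String])

object CacheMissDemo {
  def main(args: Array[String]): Unit = {
    val cached = Union(Seq("a", "b"))       // plan of a.union(b), put in the cache
    val query  = Union(Seq("a", "b", "c"))  // a.union(b).union(c), flattened eagerly
    // The only subplans of `query` are its three leaves, so the cached
    // two-way union never matches and the cache is missed:
    println(query == cached) // false
  }
}
```

Without eager flattening, a.union(b).union(c) would contain the subtree Union(a, b), which would match the cached plan.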

@Huang-yi-3456
Contributor

Thanks @cloud-fan !!
