
[SPARK-14399] Remove unnecessary excludes from POMs and simplify Hive POM #12171

Closed
wants to merge 35 commits

Conversation

JoshRosen
Contributor

This patch aims to simplify our build by removing a number of unnecessary excludes. The individual commit messages describe the changes here in more detail, but there are a few key themes:

  • Remove many excludes in the root POM: According to the Maven documentation, "dependency management takes precedence over dependency mediation". In many cases, we were excluding transitive dependencies while also using those dependencies directly in spark-core and pinning their versions in the dependencyManagement section of our POM.

    As a rough guideline, I think we should only exclude dependencies under the following scenarios:

    • We don't want the dependency at all, in which case we should define a Maven Enforcer rule to ban it (see the POM sketch after this list).
    • We want to include some version of the dependency but don't want to explicitly specify a fixed version. For instance, imagine that libraries A and B depend on some dependency C. We might exclude C from library B so that library A's version of C becomes the final effective version.
    • We want to include the dependency only when certain profiles are enabled, so we ban it from all of our transitive dependencies and explicitly re-add it in a profile.
  • Remove explicitly-promoted transitive dependencies from Hive POM: in an early revision of #7191 ([SPARK-8064] build against Hive 1.2.1), it looks like we tried to use the core-classified version of hive-exec, which is published as a thin JAR that does not bundle third-party dependencies' classes. However, both the regular and classified versions of the dependency share the same effective POM, which, in this case, happens to be the dependency-reduced POM produced by the Maven shade plugin. As a result, users of the core-classified JAR must add direct dependencies on what are logically transitive dependencies of hive-exec, so #7191 added several such dependencies in hive/pom.xml.

    However, by the end of #7191 we abandoned the idea of using an existing Hive JAR and chose to republish our own JAR, so this promotion of transitive dependencies is no longer necessary. As a result, this patch was able to significantly cut down the size of the Hive POM by removing those direct dependencies. This also exposed a problem where one of our Janino JARs was being pulled down to a lower version as a consequence of the unnecessary dependency promotion.

  • Use Maven's new wildcard exclude support: Maven now supports wildcard exclusion patterns, so I used that to simplify the exclusion of org.ow2.asm artifacts (also shown in the sketch below).

  • Improve dev/test-dependencies: a bad regex caused the org.spark-project:hive JARs to not be listed under dev/deps; this is now fixed.

  • Remove explicit io.netty dependency: we don't seem to need this anymore. I did leave the dependencyManagement entry for now in order to prevent an unexpected downgrade.
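
To make the enforcer-ban and wildcard-exclusion mechanisms above concrete, here is a minimal POM sketch. The com.example coordinates are hypothetical placeholders, not artifacts from Spark's actual build, and versions are assumed to be managed in pluginManagement/dependencyManagement:

<!-- Ban an unwanted dependency outright via the Maven Enforcer plugin
     (com.example:unwanted-lib is a hypothetical artifact; plugin version
     assumed to come from pluginManagement). -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-enforcer-plugin</artifactId>
  <executions>
    <execution>
      <id>enforce-banned-dependencies</id>
      <goals>
        <goal>enforce</goal>
      </goals>
      <configuration>
        <rules>
          <bannedDependencies>
            <excludes>
              <exclude>com.example:unwanted-lib</exclude>
            </excludes>
          </bannedDependencies>
        </rules>
      </configuration>
    </execution>
  </executions>
</plugin>

<!-- Wildcard exclusion: drop every org.ow2.asm artifact pulled in transitively by a
     hypothetical com.example:library-b dependency (version managed elsewhere). -->
<dependency>
  <groupId>com.example</groupId>
  <artifactId>library-b</artifactId>
  <exclusions>
    <exclusion>
      <groupId>org.ow2.asm</groupId>
      <artifactId>*</artifactId>
    </exclusion>
  </exclusions>
</dependency>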

/cc @srowen @steveloughran @vanzin @pwendell @rxin for review.

JoshRosen added 29 commits April 4, 2016 17:05
It seems that the explicit dep. was pulling our version of Janino's commons-compiler
a bit lower than it should have been.
@JoshRosen
Contributor Author

Oh, /cc @vanzin as well.

@JoshRosen
Contributor Author

[warn]  module not found: org.pentaho#pentaho-aggdesigner-algorithm;5.1.5-jhyde
[warn] ==== public: tried
[warn]   https://repo1.maven.org/maven2/org/pentaho/pentaho-aggdesigner-algorithm/5.1.5-jhyde/pentaho-aggdesigner-algorithm-5.1.5-jhyde.pom
[warn] ==== Maven2 Local: tried
[warn]   file:/home/sparkivy/per-executor-caches/6/.m2/repository/org/pentaho/pentaho-aggdesigner-algorithm/5.1.5-jhyde/pentaho-aggdesigner-algorithm-5.1.5-jhyde.pom
[warn] ==== local: tried
[warn]   /home/sparkivy/per-executor-caches/6/.ivy2/local/org.pentaho/pentaho-aggdesigner-algorithm/5.1.5-jhyde/ivys/ivy.xml
[warn]  ::::::::::::::::::::::::::::::::::::::::::::::
[warn]  ::          UNRESOLVED DEPENDENCIES         ::
[warn]  ::::::::::::::::::::::::::::::::::::::::::::::
[warn]  :: org.pentaho#pentaho-aggdesigner-algorithm;5.1.5-jhyde: not found
[warn]  ::::::::::::::::::::::::::::::::::::::::::::::
[warn] 
[warn]  Note: Unresolved dependencies path:
[warn]      org.pentaho:pentaho-aggdesigner-algorithm:5.1.5-jhyde
[warn]        +- org.apache.calcite:calcite-core:1.2.0-incubating
[warn]        +- org.spark-project.hive:hive-exec:1.2.1.spark ((com.typesafe.sbt.pom.MavenHelper) MavenHelper.scala#L76)
[warn]        +- org.apache.spark:spark-yarn_2.11:2.0.0-SNAPSHOT

It looks like compilation failed because the pentaho-aggdesigner-algorithm transitive dependency isn't present in Maven Central; see also https://mail-archives.apache.org/mod_mbox/hive-dev/201412.mbox/%3C571034233.2432062.1419887855285.JavaMail.yahoo@jws10685.mail.bf1.yahoo.com%3E

This is a great illustration of why Spark's build bans the use of dependencies which aren't present in Maven Central: non-central repos can go offline and break a previously-good build.

I did keep the exclusion that's supposed to prevent this dep. from being pulled in, so I wonder whether this is a problem specific to the SBT build. Let me try re-running with Maven to see if our POM reader plugin could be to blame.

@JoshRosen JoshRosen changed the title [SPARK-14399] Remove unnecessary excludes from POMs and simplify Hive POM [SPARK-14399][test-maven] Remove unnecessary excludes from POMs and simplify Hive POM Apr 5, 2016
@JoshRosen
Contributor Author

Jenkins, retest this please

<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-enforcer-plugin</artifactId>
<version>1.4.1</version>
Member

Nit: you don't need to repeat the version here

@srowen
Member

srowen commented Apr 5, 2016

Great cleanup!

@JoshRosen
Contributor Author

From the Jenkins Maven output:

[INFO] --- maven-remote-resources-plugin:1.5:process (default) @ spark-hive_2.11 ---
Downloading: https://repo1.maven.org/maven2/org/pentaho/pentaho-aggdesigner-algorithm/5.1.5-jhyde/pentaho-aggdesigner-algorithm-5.1.5-jhyde.pom

Downloading: http://www.datanucleus.org/downloads/maven2/org/pentaho/pentaho-aggdesigner-algorithm/5.1.5-jhyde/pentaho-aggdesigner-algorithm-5.1.5-jhyde.pom

Downloading: http://conjars.org/repo/org/pentaho/pentaho-aggdesigner-algorithm/5.1.5-jhyde/pentaho-aggdesigner-algorithm-5.1.5-jhyde.pom

Downloaded: http://conjars.org/repo/org/pentaho/pentaho-aggdesigner-algorithm/5.1.5-jhyde/pentaho-aggdesigner-algorithm-5.1.5-jhyde.pom (2 KB at 4.0 KB/sec)
Downloading: https://repo1.maven.org/maven2/org/pentaho/pentaho-aggdesigner/5.1.5-jhyde/pentaho-aggdesigner-5.1.5-jhyde.pom

Downloading: http://www.datanucleus.org/downloads/maven2/org/pentaho/pentaho-aggdesigner/5.1.5-jhyde/pentaho-aggdesigner-5.1.5-jhyde.pom

Downloading: http://conjars.org/repo/org/pentaho/pentaho-aggdesigner/5.1.5-jhyde/pentaho-aggdesigner-5.1.5-jhyde.pom

Downloaded: http://conjars.org/repo/org/pentaho/pentaho-aggdesigner/5.1.5-jhyde/pentaho-aggdesigner-5.1.5-jhyde.pom (9 KB at 26.3 KB/sec)
Downloading: https://repo1.maven.org/maven2/org/pentaho/pentaho-aggdesigner-algorithm/5.1.5-jhyde/pentaho-aggdesigner-algorithm-5.1.5-jhyde.jar

Downloading: http://www.datanucleus.org/downloads/maven2/org/pentaho/pentaho-aggdesigner-algorithm/5.1.5-jhyde/pentaho-aggdesigner-algorithm-5.1.5-jhyde.jar

Downloading: http://conjars.org/repo/org/pentaho/pentaho-aggdesigner-algorithm/5.1.5-jhyde/pentaho-aggdesigner-algorithm-5.1.5-jhyde.jar

Downloaded: http://conjars.org/repo/org/pentaho/pentaho-aggdesigner-algorithm/5.1.5-jhyde/pentaho-aggdesigner-algorithm-5.1.5-jhyde.jar (48 KB at 99.6 KB/sec)

It looks like our Maven build isn't as strict in its enforcement of only using Maven Central.

I guess maybe Maven will still try to resolve excluded transitive dependencies unless you add an explicit direct dependency? The extra download time by itself isn't a huge deal, and that's not necessarily what we want to optimize for (otherwise we'd just ban all transitive dependencies and explicitly promote everything to top-level). However, I don't want to rely on conjars, so I'll look into restoring that direct dep. on Calcite.
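
For reference, restoring that direct dependency in hive/pom.xml would look roughly like the sketch below. This is an illustration of the approach, not the exact change, and it assumes the Calcite version is pinned in dependencyManagement:

<!-- Sketch: promote calcite-core to a direct dependency so that the org.pentaho
     artifact (only available outside Maven Central) is never resolved. The version
     is assumed to be managed in dependencyManagement. -->
<dependency>
  <groupId>org.apache.calcite</groupId>
  <artifactId>calcite-core</artifactId>
  <exclusions>
    <exclusion>
      <groupId>org.pentaho</groupId>
      <artifactId>pentaho-aggdesigner-algorithm</artifactId>
    </exclusion>
  </exclusions>
</dependency>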

@JoshRosen JoshRosen changed the title [SPARK-14399][test-maven] Remove unnecessary excludes from POMs and simplify Hive POM [SPARK-14399] Remove unnecessary excludes from POMs and simplify Hive POM Apr 5, 2016
@JoshRosen
Contributor Author

org.apache.maven.model.building.ModelBuildingException: 10 problems were encountered while building the effective model for org.apache.spark:spark-hive_2.11:2.0.0-SNAPSHOT
[WARNING] 'dependencies.dependency.exclusions.exclusion.artifactId' for org.spark-project.hive:hive-exec:jar with value '*' does not match a valid id pattern. @ org.apache.spark:spark-parent_2.11:2.0.0-SNAPSHOT, /Users/joshrosen/Documents/spark2/pom.xml, line 1184, column 25

It looks like our custom sbt-pom-reader fork is using an ancient version of Maven which doesn't support this lovely wildcard exclusion syntax. Bummer! Yet another motivation to move back to a stock version; see https://issues.apache.org/jira/browse/SPARK-14401
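
Until that is sorted out, the fallback that older Maven versions do understand is enumerating each excluded artifact explicitly instead of using '*'. A rough sketch, where the exact org.ow2.asm artifact list is assumed for illustration:

<!-- Non-wildcard fallback for older Maven: list each org.ow2.asm artifact explicitly.
     The artifact names here are illustrative, not the exhaustive set used by Spark. -->
<exclusions>
  <exclusion>
    <groupId>org.ow2.asm</groupId>
    <artifactId>asm</artifactId>
  </exclusion>
  <exclusion>
    <groupId>org.ow2.asm</groupId>
    <artifactId>asm-commons</artifactId>
  </exclusion>
</exclusions>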

@JoshRosen
Contributor Author

In fact, being able to use the nice wildcard syntax may be a ways off, because the official sbt-pom-reader doesn't support Maven 3.3.x due to API incompatibilities: https://github.com/sbt/sbt-pom-reader/blob/master/project/dependencies.scala#L5

@JoshRosen
Contributor Author

Actually, we might be able to use sbt-pom-reader 2.10-RC2, since we only require Maven 3.2.1+ and that artifact uses 3.2.2: sbt/sbt-pom-reader#20 (comment)

@SparkQA

SparkQA commented Apr 5, 2016

Test build #54971 has finished for PR 12171 at commit fc938a0.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 5, 2016

Test build #54967 has finished for PR 12171 at commit 51853c7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@steveloughran
Contributor

Without looking at the details, I appreciate the ambition, and I like that wildcard artifact exclusion.

What is the dependency graph that SBT generates from this?

Joda-Time is an interesting point ... we've seen some problems related to Spark + AWS having Joda-Time mismatches. The Amazon AWS/S3/Kinesis libs do require Joda-Time, and there was some conflict creeping in there between the Hadoop installation and Kinesis imports in the Spark job.

@SparkQA

SparkQA commented Apr 5, 2016

Test build #54972 has finished for PR 12171 at commit 36430de.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor Author

It looks like Hive might actually need Joda time:

- analyze MetastoreRelations *** FAILED ***
  org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. org/joda/time/ReadWritableInstant
  at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:455)
  at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:440)
  at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:226)
  at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:173)
  at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:172)
  at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:215)
  at org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:440)
  at org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:430)
  at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:351)
  at org.apache.spark.sql.hive.test.TestHiveContext.runSqlHive(TestHive.scala:183)
  ...
01:36:40.566 ERROR hive.ql.exec.DDLTask: java.lang.NoClassDefFoundError: Could not initialize class org.apache.hadoop.hive.serde2.lazy.objectinspector.primitive.LazyPrimitiveObjectInspectorFactory

https://stackoverflow.com/questions/26259717/facing-java-lang-noclassdeffounderror-org-joda-time-readableinstant-error-even

@JoshRosen
Contributor Author

I re-added joda-time; let's see if that fixes Hive.
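
For context, re-adding it in hive/pom.xml amounts to a small dependency entry along these lines; the ${joda.version} property name is an assumption for illustration:

<!-- Sketch of the re-added Joda-Time dependency; ${joda.version} is a hypothetical property. -->
<dependency>
  <groupId>joda-time</groupId>
  <artifactId>joda-time</artifactId>
  <version>${joda.version}</version>
</dependency>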

@@ -840,14 +827,13 @@
<groupId>org.apache.avro</groupId>
<artifactId>avro-mapred</artifactId>
<version>${avro.version}</version>
<!-- use the build matching the hadoop api of avro-mapred (i.e. no classifier for hadoop 1
Contributor

Do we even need this anymore? We don't support hadoop1 in Spark 2.

Contributor Author

Good point; we can just hardcode hadoop2 here for now.
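
A sketch of what hardcoding the hadoop2 build might look like, assuming the classifier is the only change needed (illustrative only):

<!-- Sketch: pin the hadoop2 build of avro-mapred directly instead of switching on a profile. -->
<dependency>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro-mapred</artifactId>
  <version>${avro.version}</version>
  <classifier>hadoop2</classifier>
</dependency>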

@vanzin
Copy link
Contributor

vanzin commented Apr 5, 2016

LGTM if sbt and maven tests pass.

@SparkQA

SparkQA commented Apr 5, 2016

Test build #55002 has finished for PR 12171 at commit 0bb11be.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor Author

Hmm, the FlumePollingStreamSuite test looks broken: it has hung twice now. I'll try to reproduce offline.

@ash211
Contributor

ash211 commented Apr 8, 2016

@JoshRosen Guava and Jetty excludes might have been done when they were shaded:
https://issues.apache.org/jira/browse/SPARK-2848 - shade Guava
https://issues.apache.org/jira/browse/SPARK-3996 - shade Jetty

@JoshRosen
Contributor Author

Now that we no longer publish assembly JARs for Spark, I don't think shading will require exclusions: the only reason to add them would be to completely prevent Jetty and Guava from appearing in our transitive dependency classpath, which currently isn't possible unless we republish all of our dependencies that depend on Guava and Jetty.

@JoshRosen JoshRosen closed this Apr 26, 2016