
[SPARK-14399] Remove unnecessary excludes from POMs and simplify Hive POM #12171

Closed
wants to merge 35 commits

Conversation

JoshRosen
Contributor

This patch aims to simplify our build by removing a number of unnecessary excludes. The individual commit messages describe the changes here in more detail, but there are a few key themes:

  • Remove many excludes in the root POM: According to the Maven documentation, "dependency management takes precedence over dependency mediation". In many cases, we were excluding transitive dependencies while also using those dependencies directly in spark-core and pinning their versions in the dependencyManagement section of our POM.

    As a rough guideline, I think we should only exclude dependencies under the following scenarios:

    • We don't want the dependency at all, in which case we should define a Maven Enforcer rule to ban it (see the POM sketch after this list).
    • We want to include some version of the dependency but don't want to explicitly specify a fixed version. For instance, imagine that libraries A and B depend on some dependency C. We might exclude C from library B so that library A's version of C becomes the final effective version.
    • We want to include the dependency only when certain profiles are enabled, so we ban it from all of our transitive dependencies and explicitly re-add it in a profile.
  • Remove explicitly-promoted transitive dependencies from Hive POM: in an early revision of #7191 ([SPARK-8064] build against Hive 1.2.1), it looks like we tried to use the core-classified version of hive-exec, which is published as a thin JAR that does not bundle third-party dependencies' classes. However, both the regular and classified versions of the dependency share the same effective POM, which, in this case, happens to be the dependency-reduced POM produced by the Maven shade plugin. As a result, users of the core-classified JAR must add direct dependencies on what are logically transitive dependencies of hive-exec, so #7191 added several such dependencies in hive/pom.xml.

    However, by the end of #7191 we abandoned the idea of using an existing Hive JAR and chose to republish our own JAR, so this promotion of transitive dependencies is no longer necessary. As a result, this patch was able to significantly cut down the size of the Hive POM by removing those direct dependencies. This also exposed a problem where one of our Janino JARs was being pulled down to a lower version as a consequence of the unnecessary dependency promotion.

  • Use Maven's new wildcard exclude support: Maven now supports wildcard exclusion patterns, so I used that to simplify the exclusion of org.ow2.asm artifacts (also shown in the sketch below).

  • Improve dev/test-dependencies: a bad regex caused the org.spark-project:hive JARs to not be listed under dev/deps; this is now fixed.

  • Remove explicit io.netty dependency: we don't seem to need this anymore. I did leave the dependencyManagement entry for now in order to prevent an unexpected downgrade.
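
To make the enforcer-ban and wildcard-exclusion mechanisms above concrete, here is a minimal POM sketch. The com.example coordinates are hypothetical placeholders, not artifacts from Spark's actual build, and versions are assumed to be managed in pluginManagement/dependencyManagement:

<!-- Ban an unwanted dependency outright via the Maven Enforcer plugin
     (com.example:unwanted-lib is a hypothetical artifact; plugin version
     assumed to come from pluginManagement). -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-enforcer-plugin</artifactId>
  <executions>
    <execution>
      <id>enforce-banned-dependencies</id>
      <goals>
        <goal>enforce</goal>
      </goals>
      <configuration>
        <rules>
          <bannedDependencies>
            <excludes>
              <exclude>com.example:unwanted-lib</exclude>
            </excludes>
          </bannedDependencies>
        </rules>
      </configuration>
    </execution>
  </executions>
</plugin>

<!-- Wildcard exclusion: drop every org.ow2.asm artifact pulled in transitively by a
     hypothetical com.example:library-b dependency (version managed elsewhere). -->
<dependency>
  <groupId>com.example</groupId>
  <artifactId>library-b</artifactId>
  <exclusions>
    <exclusion>
      <groupId>org.ow2.asm</groupId>
      <artifactId>*</artifactId>
    </exclusion>
  </exclusions>
</dependency>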

/cc @srowen @steveloughran @vanzin @pwendell @rxin for review.

JoshRosen added 29 commits April 4, 2016 17:05
It seems that the explicit dep. was pulling our version of Janino's commons-compiler
a bit lower than it should have been.
@JoshRosen
Contributor Author

Oh, /cc @vanzin as well.

@JoshRosen
Contributor Author

[warn]  module not found: org.pentaho#pentaho-aggdesigner-algorithm;5.1.5-jhyde
[warn] ==== public: tried
[warn]   https://repo1.maven.org/maven2/org/pentaho/pentaho-aggdesigner-algorithm/5.1.5-jhyde/pentaho-aggdesigner-algorithm-5.1.5-jhyde.pom
[warn] ==== Maven2 Local: tried
[warn]   file:/home/sparkivy/per-executor-caches/6/.m2/repository/org/pentaho/pentaho-aggdesigner-algorithm/5.1.5-jhyde/pentaho-aggdesigner-algorithm-5.1.5-jhyde.pom
[warn] ==== local: tried
[warn]   /home/sparkivy/per-executor-caches/6/.ivy2/local/org.pentaho/pentaho-aggdesigner-algorithm/5.1.5-jhyde/ivys/ivy.xml
[warn]  ::::::::::::::::::::::::::::::::::::::::::::::
[warn]  ::          UNRESOLVED DEPENDENCIES         ::
[warn]  ::::::::::::::::::::::::::::::::::::::::::::::
[warn]  :: org.pentaho#pentaho-aggdesigner-algorithm;5.1.5-jhyde: not found
[warn]  ::::::::::::::::::::::::::::::::::::::::::::::
[warn] 
[warn]  Note: Unresolved dependencies path:
[warn]      org.pentaho:pentaho-aggdesigner-algorithm:5.1.5-jhyde
[warn]        +- org.apache.calcite:calcite-core:1.2.0-incubating
[warn]        +- org.spark-project.hive:hive-exec:1.2.1.spark ((com.typesafe.sbt.pom.MavenHelper) MavenHelper.scala#L76)
[warn]        +- org.apache.spark:spark-yarn_2.11:2.0.0-SNAPSHOT

It looks like compilation failed because the pentaho-aggdesigner-algorithm transitive dependency isn't present in Maven Central; see also https://mail-archives.apache.org/mod_mbox/hive-dev/201412.mbox/%3C571034233.2432062.1419887855285.JavaMail.yahoo@jws10685.mail.bf1.yahoo.com%3E

This is a great illustration of why Spark's build bans the use of dependencies which aren't present in Maven Central: non-central repos can go offline and break a previously-good build.

I did keep the exclusion that's supposed to prevent this dep. from being pulled in, so I wonder whether this is a problem specific to the SBT build. Let me try re-running with Maven to see if our POM reader plugin could be to blame.

@JoshRosen JoshRosen changed the title [SPARK-14399] Remove unnecessary excludes from POMs and simplify Hive POM [SPARK-14399][test-maven] Remove unnecessary excludes from POMs and simplify Hive POM Apr 5, 2016
@JoshRosen
Contributor Author

Jenkins, retest this please

<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-enforcer-plugin</artifactId>
<version>1.4.1</version>
Member

Nit: you don't need to repeat the version here

@srowen
Member

srowen commented Apr 5, 2016

Great cleanup!

@JoshRosen
Contributor Author

From the Jenkins Maven output:

[INFO] --- maven-remote-resources-plugin:1.5:process (default) @ spark-hive_2.11 ---
Downloading: https://repo1.maven.org/maven2/org/pentaho/pentaho-aggdesigner-algorithm/5.1.5-jhyde/pentaho-aggdesigner-algorithm-5.1.5-jhyde.pom

Downloading: http://www.datanucleus.org/downloads/maven2/org/pentaho/pentaho-aggdesigner-algorithm/5.1.5-jhyde/pentaho-aggdesigner-algorithm-5.1.5-jhyde.pom

Downloading: http://conjars.org/repo/org/pentaho/pentaho-aggdesigner-algorithm/5.1.5-jhyde/pentaho-aggdesigner-algorithm-5.1.5-jhyde.pom

Downloaded: http://conjars.org/repo/org/pentaho/pentaho-aggdesigner-algorithm/5.1.5-jhyde/pentaho-aggdesigner-algorithm-5.1.5-jhyde.pom (2 KB at 4.0 KB/sec)
Downloading: https://repo1.maven.org/maven2/org/pentaho/pentaho-aggdesigner/5.1.5-jhyde/pentaho-aggdesigner-5.1.5-jhyde.pom

Downloading: http://www.datanucleus.org/downloads/maven2/org/pentaho/pentaho-aggdesigner/5.1.5-jhyde/pentaho-aggdesigner-5.1.5-jhyde.pom

Downloading: http://conjars.org/repo/org/pentaho/pentaho-aggdesigner/5.1.5-jhyde/pentaho-aggdesigner-5.1.5-jhyde.pom

Downloaded: http://conjars.org/repo/org/pentaho/pentaho-aggdesigner/5.1.5-jhyde/pentaho-aggdesigner-5.1.5-jhyde.pom (9 KB at 26.3 KB/sec)
Downloading: https://repo1.maven.org/maven2/org/pentaho/pentaho-aggdesigner-algorithm/5.1.5-jhyde/pentaho-aggdesigner-algorithm-5.1.5-jhyde.jar

Downloading: http://www.datanucleus.org/downloads/maven2/org/pentaho/pentaho-aggdesigner-algorithm/5.1.5-jhyde/pentaho-aggdesigner-algorithm-5.1.5-jhyde.jar

Downloading: http://conjars.org/repo/org/pentaho/pentaho-aggdesigner-algorithm/5.1.5-jhyde/pentaho-aggdesigner-algorithm-5.1.5-jhyde.jar

Downloaded: http://conjars.org/repo/org/pentaho/pentaho-aggdesigner-algorithm/5.1.5-jhyde/pentaho-aggdesigner-algorithm-5.1.5-jhyde.jar (48 KB at 99.6 KB/sec)

It looks like our Maven build isn't as strict in its enforcement of only using Maven Central.

I guess maybe Maven will still try to resolve excluded transitive dependencies unless you add an explicit direct dependency? The extra download time by itself isn't a huge deal, and that's not necessarily what we want to optimize for (otherwise we'd just ban all transitive dependencies and explicitly promote everything to top-level). However, I don't want to rely on conjars, so I'll look into restoring that direct dep. on Calcite.
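
For reference, restoring that direct dependency in hive/pom.xml would look roughly like the sketch below. This is an illustration of the approach, not the exact change, and it assumes the Calcite version is pinned in dependencyManagement:

<!-- Sketch: promote calcite-core to a direct dependency so that the org.pentaho
     artifact (only available outside Maven Central) is never resolved. The version
     is assumed to be managed in dependencyManagement. -->
<dependency>
  <groupId>org.apache.calcite</groupId>
  <artifactId>calcite-core</artifactId>
  <exclusions>
    <exclusion>
      <groupId>org.pentaho</groupId>
      <artifactId>pentaho-aggdesigner-algorithm</artifactId>
    </exclusion>
  </exclusions>
</dependency>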

@JoshRosen JoshRosen changed the title [SPARK-14399][test-maven] Remove unnecessary excludes from POMs and simplify Hive POM [SPARK-14399] Remove unnecessary excludes from POMs and simplify Hive POM Apr 5, 2016
@JoshRosen
Contributor Author

org.apache.maven.model.building.ModelBuildingException: 10 problems were encountered while building the effective model for org.apache.spark:spark-hive_2.11:2.0.0-SNAPSHOT
[WARNING] 'dependencies.dependency.exclusions.exclusion.artifactId' for org.spark-project.hive:hive-exec:jar with value '*' does not match a valid id pattern. @ org.apache.spark:spark-parent_2.11:2.0.0-SNAPSHOT, /Users/joshrosen/Documents/spark2/pom.xml, line 1184, column 25

It looks like our custom sbt-pom-reader fork is using an ancient version of Maven which doesn't support this lovely wildcard exclusion syntax. Bummer! Yet another motivation to move back to a stock version; see https://issues.apache.org/jira/browse/SPARK-14401
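
Until that is sorted out, the fallback that older Maven versions do understand is enumerating each excluded artifact explicitly instead of using '*'. A rough sketch, where the exact org.ow2.asm artifact list is assumed for illustration:

<!-- Non-wildcard fallback for older Maven: list each org.ow2.asm artifact explicitly.
     The artifact names here are illustrative, not the exhaustive set used by Spark. -->
<exclusions>
  <exclusion>
    <groupId>org.ow2.asm</groupId>
    <artifactId>asm</artifactId>
  </exclusion>
  <exclusion>
    <groupId>org.ow2.asm</groupId>
    <artifactId>asm-commons</artifactId>
  </exclusion>
</exclusions>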

@JoshRosen
Contributor Author

In fact, being able to use the nice wildcard syntax may be a ways off, because the official sbt-pom-reader doesn't support Maven 3.3.x due to API incompatibilities: https://github.com/sbt/sbt-pom-reader/blob/master/project/dependencies.scala#L5

@JoshRosen
Contributor Author

Actually, we might be able to use sbt-pom-reader 2.10-RC2, since we only require Maven 3.2.1+ and that artifact uses 3.2.2: sbt/sbt-pom-reader#20 (comment)

@SparkQA

SparkQA commented Apr 5, 2016

Test build #54971 has finished for PR 12171 at commit fc938a0.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 5, 2016

Test build #54967 has finished for PR 12171 at commit 51853c7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@steveloughran
Contributor

Without looking at the details, I appreciate the ambition, and I like that wildcard artifact exclusion.

What is the dependency graph that SBT generates from this?

Joda-Time is an interesting point ... we've seen some problems related to Spark + AWS having Joda-Time mismatches. The Amazon AWS/S3/Kinesis libs do require Joda-Time, and there was some conflict creeping in there between the Hadoop installation and Kinesis imports in the Spark job.

@SparkQA

SparkQA commented Apr 5, 2016

Test build #54972 has finished for PR 12171 at commit 36430de.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor Author

It looks like Hive might actually need Joda time:

- analyze MetastoreRelations *** FAILED ***
  org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. org/joda/time/ReadWritableInstant
  at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:455)
  at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:440)
  at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:226)
  at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:173)
  at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:172)
  at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:215)
  at org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:440)
  at org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:430)
  at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:351)
  at org.apache.spark.sql.hive.test.TestHiveContext.runSqlHive(TestHive.scala:183)
  ...
01:36:40.566 ERROR hive.ql.exec.DDLTask: java.lang.NoClassDefFoundError: Could not initialize class org.apache.hadoop.hive.serde2.lazy.objectinspector.primitive.LazyPrimitiveObjectInspectorFactory

https://stackoverflow.com/questions/26259717/facing-java-lang-noclassdeffounderror-org-joda-time-readableinstant-error-even

@JoshRosen
Contributor Author

I re-added joda-time; let's see if that fixes Hive.
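
For context, re-adding it in hive/pom.xml amounts to a small dependency entry along these lines; the ${joda.version} property name is an assumption for illustration:

<!-- Sketch of the re-added Joda-Time dependency; ${joda.version} is a hypothetical property. -->
<dependency>
  <groupId>joda-time</groupId>
  <artifactId>joda-time</artifactId>
  <version>${joda.version}</version>
</dependency>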

@@ -840,14 +827,13 @@
<groupId>org.apache.avro</groupId>
<artifactId>avro-mapred</artifactId>
<version>${avro.version}</version>
<!-- use the build matching the hadoop api of avro-mapred (i.e. no classifier for hadoop 1
Contributor

Do we even need this anymore? We don't support hadoop1 in Spark 2.

Contributor Author

Good point; we can just hardcode hadoop2 here for now.
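
A sketch of what hardcoding the hadoop2 build might look like, assuming the classifier is the only change needed (illustrative only):

<!-- Sketch: pin the hadoop2 build of avro-mapred directly instead of switching on a profile. -->
<dependency>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro-mapred</artifactId>
  <version>${avro.version}</version>
  <classifier>hadoop2</classifier>
</dependency>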

@vanzin
Copy link
Contributor

vanzin commented Apr 5, 2016

LGTM if sbt and maven tests pass.

@SparkQA

SparkQA commented Apr 5, 2016

Test build #55002 has finished for PR 12171 at commit 0bb11be.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor Author

Hmm, the FlumePollingStreamSuite test looks broken: it has hung twice now. I'll try to reproduce offline.

@ash211
Contributor

ash211 commented Apr 8, 2016

@JoshRosen Guava and Jetty excludes might have been done when they were shaded:
https://issues.apache.org/jira/browse/SPARK-2848 - shade Guava
https://issues.apache.org/jira/browse/SPARK-3996 - shade Jetty

@JoshRosen
Contributor Author

Now that we no longer publish assembly JARs for Spark, I don't think shading will require exclusions: the only reason to add them would be to completely prevent Jetty and Guava from appearing in our transitive dependency classpath, which currently isn't possible unless we republish all of our dependencies that depend on Guava and Jetty.

@JoshRosen JoshRosen closed this Apr 26, 2016