[SPARK-23807][BUILD] Add Hadoop 3.1 profile with relevant POM fix ups #20923
Conversation
…rage artifacts and binding Change-Id: Ia4526f184ced9eef5b67aee9e91eced0dd38d723
Test build #88670 has finished for PR 20923 at commit
I only looked at the build stuff so far...
hadoop-cloud/pom.xml
<build>
<plugins>
<!-- Include a source dir depending on the Scala version -->
Not really based on the Scala version right?
my bad. Cut and paste error. Will make explicit what it's really doing.
hadoop-cloud/pom.xml
</executions>
</plugin>
</plugins>
nit: remove
done
hadoop-cloud/pom.xml
<!--
There's now a hadoop-cloud-storage which transitively pulls in the store JARs,
but it still needs some selective exclusion across versions, especially 3.0.x.
Can you expand a little on why the exclusions are needed? Some look a bit suspicious.
e.g. hadoop-common is already pulled transitively by other parts of Spark (including this very module, which does so via hadoop-client), so I'm not sure why the explicit exclusion is needed.
Excluding hadoop-client means there's no need to worry about any of the stuff explicitly excluded from hadoop-client in the spark root pom (asm/asm, jackson, etc).
Hadoop 3.0.1 declares hadoop-client as a compile-time dependency of hadoop-cloud-storage.
From 3.0.2+ it's been cut down to provided, and azure-datalake was added as a dependency in commit 3c03672e, so it's complete w.r.t. the ASF connectors.
There's also a fix for the aws shaded SDK to exclude netty HADOOP-15264, because of aws-sdk-java/issues/1488.
The individual hadoop cloud modules (hadoop-aws, hadoop-azure, ...) have also downgraded hadoop-client to being provided, so if you pull in any of those, you will only get the extra artifacts needed to connect to the relevant cloud endpoint, and are expected to pull in the same hadoop-client version elsewhere for things to work.
Here's the dependency list for spark-hadoop-cloud and 3.0.2-SNAPSHOT; 3.1 will be the same unless there's a last minute update to one of the external SDKs or jetty.
[INFO] +- org.apache.hadoop:hadoop-cloud-storage:jar:3.0.2-SNAPSHOT:compile
[INFO] | +- org.apache.hadoop:hadoop-aliyun:jar:3.0.2-SNAPSHOT:compile
[INFO] | | \- com.aliyun.oss:aliyun-sdk-oss:jar:2.8.3:compile
[INFO] | | \- org.jdom:jdom:jar:1.1:compile
[INFO] | +- org.apache.hadoop:hadoop-aws:jar:3.0.2-SNAPSHOT:compile
[INFO] | | \- com.amazonaws:aws-java-sdk-bundle:jar:1.11.271:compile
[INFO] | +- org.apache.hadoop:hadoop-azure:jar:3.0.2-SNAPSHOT:compile
[INFO] | | +- com.microsoft.azure:azure-storage:jar:5.4.0:compile
[INFO] | | | \- com.microsoft.azure:azure-keyvault-core:jar:0.8.0:compile
[INFO] | | \- org.eclipse.jetty:jetty-util-ajax:jar:9.3.19.v20170502:compile
[INFO] | +- org.apache.hadoop:hadoop-azure-datalake:jar:3.0.2-SNAPSHOT:compile
[INFO] | | \- com.microsoft.azure:azure-data-lake-store-sdk:jar:2.2.5:compile
[INFO] | \- org.apache.hadoop:hadoop-openstack:jar:3.0.2-SNAPSHOT:compile
Given that Hadoop 3.0.2+ downgrades hadoop-client to provided, and that's the minimum version this patch will build against, the exclusion is mostly superfluous: it's there more to block regressions than to actually keep anything out.
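For illustration, the exclusion being debated would look roughly like this in hadoop-cloud/pom.xml. This is a sketch of the pattern, not the exact patch text, and the version property is assumed to come from the parent POM:

```xml
<!-- Sketch only: exclude hadoop-client from hadoop-cloud-storage so none of
     the transitives banned in the Spark root POM (asm, jackson, ...) can
     leak back in through the 3.0.1 cloud-storage POM. -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-cloud-storage</artifactId>
  <version>${hadoop.version}</version>
  <exclusions>
    <exclusion>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```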
Update: tried taking it out. Turns out that the hadoop-aliyun module still declares hadoop-common at compile scope, so it gets in. The exclusion is needed, and I've filed HADOOP-15354. I don't know if that'll get into Hadoop 3.1.0 though.
hadoop-cloud/pom.xml
<profile>
<id>hadoop-2.6</id>
<activation>
<activeByDefault>true</activeByDefault>
activeByDefault is a little misleading: it only enables the profile if you don't explicitly activate any other profiles.
So if you enable any other profile in the build, this won't be enabled automatically. And since the cloud module itself is already under a profile, I don't think you can ever trigger this.
Probably will need to be documented in the build docs, or maybe you can think of a different solution like enabling the cloud profile via a property instead.
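The property-based alternative mentioned above could be sketched like this. The property name is invented for the example, and whether it composes cleanly with the existing hadoop-cloud profile would need checking:

```xml
<!-- Hypothetical sketch: activate the profile when -Dhadoop.profile=2.6 is
     passed, instead of relying on activeByDefault (which switches off as
     soon as any other profile is activated). -->
<profile>
  <id>hadoop-2.6</id>
  <activation>
    <property>
      <name>hadoop.profile</name>
      <value>2.6</value>
    </property>
  </activation>
</profile>
```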
Hmmm. There's another option which is to leave all those in the standard list, and you get a few extra dependencies which aren't needed for the 3.x line:
[INFO] +- com.fasterxml.jackson.core:jackson-databind:jar:2.6.7.1:compile *
[INFO] | \- com.fasterxml.jackson.core:jackson-core:jar:2.6.7:compile *
[INFO] +- com.fasterxml.jackson.core:jackson-annotations:jar:2.6.7:compile *
[INFO] +- com.fasterxml.jackson.dataformat:jackson-dataformat-cbor:jar:2.6.7:compile *
[INFO] +- org.apache.httpcomponents:httpclient:jar:4.5.4:compile
[INFO] | +- commons-logging:commons-logging:jar:1.2:compile
[INFO] | \- commons-codec:commons-codec:jar:1.10:compile
[INFO] +- org.apache.httpcomponents:httpcore:jar:4.4.8:compile
[INFO] +- org.apache.hadoop:hadoop-aws:jar:3.0.2-SNAPSHOT:compile
[INFO] | \- com.amazonaws:aws-java-sdk-bundle:jar:1.11.271:compile
[INFO] +- org.apache.hadoop:hadoop-openstack:jar:3.0.2-SNAPSHOT:compile
[INFO] +- joda-time:joda-time:jar:2.9.3:compile *
[INFO] +- org.apache.hadoop:hadoop-cloud-storage:jar:3.0.2-SNAPSHOT:compile
[INFO] | +- org.apache.hadoop:hadoop-aliyun:jar:3.0.2-SNAPSHOT:compile
[INFO] | | \- com.aliyun.oss:aliyun-sdk-oss:jar:2.8.3:compile
[INFO] | | \- org.jdom:jdom:jar:1.1:compile
[INFO] | +- org.apache.hadoop:hadoop-azure:jar:3.0.2-SNAPSHOT:compile
[INFO] | | +- com.microsoft.azure:azure-storage:jar:5.4.0:compile
[INFO] | | | \- com.microsoft.azure:azure-keyvault-core:jar:0.8.0:compile
[INFO] | | \- org.eclipse.jetty:jetty-util-ajax:jar:9.3.19.v20170502:compile
[INFO] | \- org.apache.hadoop:hadoop-azure-datalake:jar:3.0.2-SNAPSHOT:compile
[INFO] | \- com.microsoft.azure:azure-data-lake-store-sdk:jar:2.2.5:compile
jackson-dataformat-cbor is the funny one; this is the sole declaration within spark. With the shaded aws JAR it's not needed at all.
The rest all make their way to the spark assembly through other routes.
What do you think? Leave them as the default and not worry about it? It would remove the duplication in the 2.7 profile and, apart from being extraneous on hadoop-3 builds, be harmless.
I think that's ok as an initial step. It would be better if you could, in profiles, customize individual dependencies (e.g. in the hadoop-3 profile exclude some transitive deps), but I'm not sure whether maven would complain about something like that.
jackson-dataformat-cbor can become interesting if Spark decides to upgrade jackson, since the github for that project says it's been removed in 2.8.
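As a hedged sketch of the per-profile customization idea: Maven does accept a dependencies block inside a profile, including exclusions, though (as noted above) how that merges with the default declarations is exactly the open question. The artifact and exclusion chosen here are purely illustrative:

```xml
<!-- Untested sketch: redeclare a dependency inside the hadoop-3 profile with
     an exclusion, to drop a transitive the 3.x line doesn't need. -->
<profile>
  <id>hadoop-3</id>
  <dependencies>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-aws</artifactId>
      <version>${hadoop.version}</version>
      <exclusions>
        <exclusion>
          <groupId>com.fasterxml.jackson.dataformat</groupId>
          <artifactId>jackson-dataformat-cbor</artifactId>
        </exclusion>
      </exclusions>
    </dependency>
  </dependencies>
</profile>
```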
* hadoop branch-2 dependencies always declared
* minor nits in POM addressed
* added log4j.properties for tests

Change-Id: Ibb64b20a0be8624d1709e592b9fe85bdc4dd1af7
Test build #88713 has finished for PR 20923 at commit
pom.xml
<profile>
<id>hadoop-3</id>
<properties>
<hadoop.version>3.1.0-SNAPSHOT</hadoop.version>
Hey @steveloughran what is the possible release date for Hadoop 3.1.0?
RC0 is up for testing right now! @leftnoteasy is managing the release
I think we could separate the cloud-related stuff into another PR, and fix only build-related stuff in this PR.
Also, I think we need to create a related spark-deps-hadoop-3.x under dev/deps and make the dependency check work for Hadoop 3.
OK
… the build with all the POM changes other than those adding the optional hadoop-3.0.2+ source tree to the spark-hadoop-cloud build Change-Id: Iccc2b66602db05db132ce5cf5c8546fe9a13a3fa
Change-Id: Ic13caf5fcf96d617085051579ede8380b2106119
@jerryshao the latest revision only has the POM changes, and it also excludes the build profile option to compile the hadoop-3 source trees. It also switches the hadoop 3.1 version to 3.1.0, which matches that of the current RC, and will be downloaded from the ASF staging repo if the snapshots-and-staging profile is enabled.
Remember that if you do download the staged artifacts, then if new RCs are cut you will need to purge the old RCs from your local Maven repository.
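As a sketch of that purge step (the repository path and layout here are assumptions about the standard local Maven repository; adjust to taste):

```shell
# purge_rc: delete every cached copy of a given Hadoop artifact version from a
# local Maven repository, so the next build re-resolves it from staging.
# Assumes the standard ~/.m2/repository directory layout.
purge_rc() {
  local repo="$1"
  local version="$2"
  rm -rf "${repo}/org/apache/hadoop"/*/"${version}"
}

# e.g. purge_rc "${HOME}/.m2/repository" 3.1.0
```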
Test build #88845 has finished for PR 20923 at commit
sbt isn't going to test this profile, obviously. Ran both the mvn and sbt package targets with the profiles hadoop-3,hadoop-cloud,yarn,snapshots-and-staging
Hi @steveloughran , I think you missed this comment. You need to create a deps file under dev/deps and change the related script.
I saw that, but given there isn't much in the way of a 2.8 profile, I thought it was more of a wish list than a requirement. How do I go about creating it?
This includes the profile in test-dependencies.sh, so this part of the build will work: hive doesn't need to be working to build that dependency graph. Change-Id: I1ecfd4b1a8bea26600765b1de59f2425c42f6b03
Test build #88941 has finished for PR 20923 at commit
Failure is the test dependencies check failing, as the checker is trying to pull in hadoop-3.1.0 and it's still in ASF staging.
I'll turn off that check now, leaving the dependency list as is.
….1 is still in staging Change-Id: Id2d5655088b2a8c2bdec43f7d17110a513be3f7c
Test build #88953 has finished for PR 20923 at commit
Test build #88955 has finished for PR 20923 at commit
Test failures are all in
Sorry @steveloughran for the late response. Most of the deps in the file are similar to "spark-deps-hadoop-2.7", so copying/renaming it and running "test-dependencies.sh" will show you the diffs; updating the deps file based on those diffs will give you a new hadoop3 deps file.
I think you should also update "test-dependencies.sh" to make the new deps file work.
Jenkins, retest this please.
Test build #89038 has finished for PR 20923 at commit
I did, but then things failed because the artifacts were only visible if you did a run with the -Psnapshots-and-staging profile, which jenkins doesn't do here; it's why test run 88953 failed. The 3.1.0 artifacts are now up on the mvnrepo, so I can turn that on. Because it's out, I think it'd also be good to call the profile hadoop-3.1, to make clear that's the target release, and to avoid confusion if anyone ever creates a 3.2, 3.3, ... profile in future.
…hadoop 3.1 is still in staging" This reverts commit 7c93d98.
…t-dependencies.sh knows about it Change-Id: Ie4906e2f41e9992e803674dce283f03b4dbab67e
Test build #89058 has finished for PR 20923 at commit
Test failures are org.apache.spark.sql.sources.BucketedWriteWithoutHiveSupportSuite; SPARK-23894
Ping @vanzin @gatorsmile , would like to hear your comments. Thanks!
I should add that the spark-shell doesn't bring up the Azure client, though it's happy with the rest, because of jetty-util not making it into dist/jars... I fear this is shading related. Here's the JAR I'm using to do diagnostics of CP setup there; it tries to bootstrap access, from searching for & loading the JARs to actually reading and writing the data. Significantly more informative when things don't work. https://github.com/steveloughran/cloudstore/releases/tag/tag_2018_04_11_release
jetty-util and jetty-util-ajax are forced into the dist/jars directory by explicit identification in the relevant POMs as in the hadoop-dist-scope. Without this they weren't coming in as spark-assembly was seeing jetty-util marked as provided. It's not needed for the spark-* JARs, which all use the shaded reference, but it is needed indirectly via hadoop-azure. This change to the poms reinstates it. Maven has proven surprisingly "fussy" here; the implication being its "closest declaration wins" resolution policy doesn't just control versions, it has influence over scoping. Change-Id: I081023cae84236c925fad4e94168f1dac5a8026a
The jetty problem has been dealt with; because of the shading declaration of jetty-util as provided (it isn't needed in spark any more), it wasn't getting into dist/jars even for those dependencies which did need it. Fix: explicitly declaring it in the hadoop-cloud module and the hadoop cloud profile in the assembly module. The jetty-util-ajax dependency, which is the direct dependency on it from hadoop-azure, is also declared, so that the spark build can keep control of jetty versions everywhere. That one was getting through fine, because it wasn't being tagged as provided.
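The fix described would amount to something like the following in the hadoop-cloud POM. This is a sketch; the scope property name is an assumption about how Spark's parent POM manages Hadoop-related scopes, and versions are managed in the parent:

```xml
<!-- Sketch: redeclare jetty-util and jetty-util-ajax (the hadoop-azure
     dependency) at a non-provided scope so the assembly copies them into
     dist/jars, and so Spark keeps control of the jetty version. -->
<dependency>
  <groupId>org.eclipse.jetty</groupId>
  <artifactId>jetty-util</artifactId>
  <scope>${hadoop.deps.scope}</scope>
</dependency>
<dependency>
  <groupId>org.eclipse.jetty</groupId>
  <artifactId>jetty-util-ajax</artifactId>
  <scope>${hadoop.deps.scope}</scope>
</dependency>
```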
Test build #89295 has finished for PR 20923 at commit
@jerryshao comments? I know without the patched hive or mutant hadoop build Spark doesn't work with Hadoop 3, but this sets everything up to build consistently, which is a prerequisite to fixing up the semi-official spark hive JAR. |
Just a few minor things.
Redeclare this dependency to force it into the distribution.
-->
<dependency>
<groupId>org.eclipse.jetty</groupId>
This kinda sucks. Doesn't this also end up pulling up a bunch of other jetty stuff into the packaging?
I guess there's no way around it until Hadoop itself shades jetty in some way...
> Doesn't this also end up pulling up a bunch of other jetty stuff into the packaging?
It doesn't pull in anything else. There's already one of the jetty- JARs in the dist/jars directory BTW.
> I guess there's no way around it until Hadoop itself shades jetty in some way...
Or when @aajisaka & colleagues implement the Java 9 support and everyone runs to it. This is one of those examples of why, from a packaging and deployment perspective, Java 9 is the good one.
Created HADOOP-15387 for the shading task; put my name to it, as Bikas has already been expressing a desire for it.
<dependencies>
<!--used during compilation but not exported as transitive dependencies-->
Is this still needed after you removed the committer code?
see below
@@ -38,7 +38,32 @@
<sbt.project.name>hadoop-cloud</sbt.project.name>
</properties>

<build>
Is this still needed after you removed the committer code?
it's in an adjacent PR, I've just pulled in all the POM dependency changes to keep everything related to the dependency digraph in this one so it can be audited in one go.
@vanzin : The followup to this is #21066; I could move the compile-time changes there, but if you are going to have POMs playing with dependencies, it seems best to have it all in one place... the other one just sets up the compile and tests. @jerryshao what do you suggest? It was your proposal to split things into pom and source for ease of review, after all?
@@ -2671,6 +2671,15 @@
</properties>
</profile>

<profile>
<id>hadoop-3.1</id>
+1 for skipping Hadoop 3.0 and supporting Hadoop 3.1+ only.
retest this please
I asked for the tests to run once more just to be sure. I'd have preferred to keep the code-related build changes in the other PR where the code is actually added. But since the PR is already up, that would just slow things down, so this is fine as long as tests pass.
I would guess the tests here don't actually run with the Hadoop 3 profile, so we don't actually test anything. Also, we still cannot use Hadoop 3 even if we merge this, because of the Hive issue, unless we use some of the tricks mentioned above. So I'm not sure if we should address the Hive issue first.
+1 for @jerryshao 's comment. Some of the Hive UTs will fail with the Hadoop 3 profile.
Test build #89748 has finished for PR 20923 at commit
I can and do build Hadoop with this local version enabled, so it's easy enough to set things up locally. Indeed, the ability to change the Hadoop version (HADOOP-13852) came about precisely because I was the first person to try and do that hadoop-3+spark test, with a precursor profile for this. Get this in and things are set up for the hive work, as everything else is ready for it. We've decoupled the work, and for those people who do have a compatible hadoop/hive setup, this provides a standard profile for them to use, instead of having to write their own, determine zookeeper and curator versions, etc, etc.
I've also added a comment to SPARK-18673 offering to fix the org.spark-project.hive JAR, but only once this patch is in. This bit is ready; that may take time, and it will need this setup so that the actual tests can be run.
Merging to master.
thank you! I guess that means I'm down for the hive JAR, doesn't it :) Better make a list of patches which should go in; I think internally we have 1+ kerberos-related (pwendell/hive#2) as well as the Hadoop version case statement.
jersey-server-2.22.2.jar
jets3t-0.9.4.jar
jetty-webapp-9.3.20.v20170531.jar
jetty-xml-9.3.20.v20170531.jar
Sorry, but what do you mean? Apache Spark 2.4.5 Hadoop 2.7 binary has jetty jars while Apache Spark 3.0.0 Hadoop 3.2 binary does not.
$ tar tvf spark-2.4.5-bin-hadoop2.7.tgz | grep jetty
-rw-r--r-- spark-rm/spark-rm 177131 2020-01-13 02:30 spark-2.4.5-bin-hadoop2.7/jars/jetty-util-6.1.26.jar
-rw-r--r-- spark-rm/spark-rm 539912 2020-01-13 02:30 spark-2.4.5-bin-hadoop2.7/jars/jetty-6.1.26.jar
$ tar tvf spark-3.0.0-bin-hadoop3.2.tgz | grep jetty
$
Please see [SPARK-30051][BUILD] Clean up hadoop-3.2 dependency
cc @dbtsai
The other required jetty jars are still shaded correctly. Please let me know if there is something missed.
$ jar tvf spark-core_2.12-3.0.0.jar | grep jetty | wc -l
1308
Great! Thank you for answering the question.
What changes were proposed in this pull request?
A hadoop-3.1 profile, building against the hadoop-3.1 artifacts.
How was this patch tested?
The spark hive-1.2.1 JAR has problems here, as its version check logic fails for Hadoop versions > 2.
This can be avoided with either of
mvn install -DskipTests -DskipShade -Ddeclared.hadoop.version=2.11
This is safe for local test runs, not for deployment (HDFS is very strict about cross-version deployment).
I've done both, with maven and SBT.
Three issues surfaced