
[SPARK-23807][BUILD] Add Hadoop 3.1 profile with relevant POM fix ups #20923

Closed

Conversation

steveloughran
Contributor

@steveloughran steveloughran commented Mar 28, 2018

What changes were proposed in this pull request?

  1. Adds a hadoop-3.1 profile build depending on the hadoop-3.1 artifacts.
  2. In the hadoop-cloud module, adds an explicit hadoop-3.1 profile which switches from explicitly pulling in the cloud connectors (hadoop-openstack, hadoop-aws, hadoop-azure) to depending on the hadoop-cloudstorage POM artifact, which pulls these in, has pre-excluded things like hadoop-common, and stays up to date with new connectors (hadoop-azure-datalake, hadoop-aliyun). Goal: it becomes the Hadoop project's homework to keep this clean, and the Spark project doesn't need to handle new Hadoop releases adding more dependencies.
  3. The hadoop-cloud/hadoop-3.1 profile also declares support for jetty-util-ajax and jetty-util to ensure that these JARs get into the distribution's jar directory when needed by unshaded libraries.
  4. Increases the curator and zookeeper versions to match those in hadoop-3, fixing spark core to build in sbt with the hadoop-3 dependencies.
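The profile described in items 1 and 4 could be sketched as a POM fragment like the following. This is illustrative only, not the exact patch text; the property names and the curator/ZooKeeper versions are taken from the discussion below, and any detail not stated there is an assumption.

```xml
<!-- Illustrative sketch of the hadoop-3.1 profile; not the exact patch. -->
<profile>
  <id>hadoop-3.1</id>
  <properties>
    <hadoop.version>3.1.0</hadoop.version>
    <!-- Versions aligned with hadoop-3, per the discussion in this PR. -->
    <curator.version>2.12.0</curator.version>
    <zookeeper.version>3.4.9</zookeeper.version>
  </properties>
</profile>
```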

How was this patch tested?

  • Everything here has been built and tested against both ASF Hadoop branch-3.1 and Hadoop trunk.
  • spark-shell was used to create connectors to all the stores and verify that file IO could take place.

The Spark hive-1.2.1 JAR has problems here, as its version-check logic fails for Hadoop versions > 2.

This can be avoided with either of

  • Building the Hadoop JARs to declare their version as Hadoop 2.11: mvn install -DskipTests -DskipShade -Ddeclared.hadoop.version=2.11. This is safe for local test runs, not for deployment (HDFS is very strict about cross-version deployment).
  • A modified version of Spark's hive JAR whose version-check switch statement is happy with Hadoop 3.

I've done both, with maven and SBT.

Three issues surfaced

  1. A spark-core test failure, fixed in SPARK-23787.
  2. SBT only: ZooKeeper not being found in spark-core. Somehow curator 2.12.0 triggers some slightly different dependency-resolution logic from previous versions, and Ivy was missing zookeeper.jar entirely. This patch adds the explicit declaration for all Spark profiles, setting the ZK version to 3.4.9 for hadoop-3.1.
  3. Marking jetty-util as provided in Spark was stopping hadoop-azure from being able to instantiate the Azure wasb:// client; it was using jetty-util-ajax, which could then not find a class in jetty-util.
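A minimal sketch of the explicit ZooKeeper declaration described in item 2 (coordinates assumed from Maven Central; the real patch sets the version per Hadoop profile):

```xml
<!-- Illustrative: declare zookeeper explicitly so Ivy/SBT resolution
     cannot drop it when curator's own metadata changes. -->
<dependency>
  <groupId>org.apache.zookeeper</groupId>
  <artifactId>zookeeper</artifactId>
  <version>${zookeeper.version}</version> <!-- 3.4.9 under hadoop-3.1 -->
</dependency>
```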

…rage artifacts and binding

Change-Id: Ia4526f184ced9eef5b67aee9e91eced0dd38d723
@SparkQA

SparkQA commented Mar 28, 2018

Test build #88670 has finished for PR 20923 at commit 29e7324.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class BindingParquetOutputCommitter(
  • class PathOutputCommitProtocol(

Contributor

@vanzin vanzin left a comment


I only looked at the build stuff so far...


<build>
<plugins>
<!-- Include a source dir depending on the Scala version -->
Contributor


Not really based on the Scala version right?

Contributor Author


my bad. Cut and paste error. Will make explicit what it's really doing.

</executions>
</plugin>
</plugins>

Contributor


nit: remove

Contributor Author


done


<!--
There's now a hadoop-cloud-storage which transitively pulls in the store JARs,
but it still needs some selective exclusion across versions, especially 3.0.x.
Contributor


Can you expand a little on why the exclusions are needed? Some look a bit suspicious.

e.g. hadoop-common is already pulled transitively by other parts of Spark (including this very module, which does so via hadoop-client), so I'm not sure why the explicit exclusion is needed.

Contributor Author


Excluding hadoop-client means there's no need to worry about any of the stuff explicitly excluded from hadoop-client in the spark root pom (asm/asm, jackson, etc).

Hadoop 3.0.1 declares hadoop-client as a compile-time dependency of hadoop-cloud-storage.

From 3.0.2+ it's been cut down to provided, and azure-datalake was added as a dependency (commit 3c03672e), so it's complete w.r.t. ASF connectors.
There's also a fix for the aws shaded SDK to exclude netty HADOOP-15264, because of aws-sdk-java/issues/1488.

The individual hadoop cloud modules (hadoop-aws, hadoop-azure, ...) have also downgraded hadoop-client to being provided, so if you pull in any of those, you will only get the extra artifacts needed to connect to the relevant cloud endpoint, and are expected to pull in the same hadoop-client version elsewhere for things to work.
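The arrangement discussed here could be sketched like this (an illustrative fragment; the exact exclusion list in the patch may differ):

```xml
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-cloud-storage</artifactId>
  <version>${hadoop.version}</version>
  <exclusions>
    <!-- hadoop-client already comes in via the rest of Spark; keeping it
         excluded here mostly blocks regressions in the upstream POMs. -->
    <exclusion>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```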

Here's the dependency list for spark-hadoop-cloud and 3.0.2-SNAPSHOT; 3.1 will be the same unless there's a last minute update to one of the external SDKs or jetty.

[INFO] +- org.apache.hadoop:hadoop-cloud-storage:jar:3.0.2-SNAPSHOT:compile
[INFO] |  +- org.apache.hadoop:hadoop-aliyun:jar:3.0.2-SNAPSHOT:compile
[INFO] |  |  \- com.aliyun.oss:aliyun-sdk-oss:jar:2.8.3:compile
[INFO] |  |     \- org.jdom:jdom:jar:1.1:compile
[INFO] |  +- org.apache.hadoop:hadoop-aws:jar:3.0.2-SNAPSHOT:compile
[INFO] |  |  \- com.amazonaws:aws-java-sdk-bundle:jar:1.11.271:compile
[INFO] |  +- org.apache.hadoop:hadoop-azure:jar:3.0.2-SNAPSHOT:compile
[INFO] |  |  +- com.microsoft.azure:azure-storage:jar:5.4.0:compile
[INFO] |  |  |  \- com.microsoft.azure:azure-keyvault-core:jar:0.8.0:compile
[INFO] |  |  \- org.eclipse.jetty:jetty-util-ajax:jar:9.3.19.v20170502:compile
[INFO] |  +- org.apache.hadoop:hadoop-azure-datalake:jar:3.0.2-SNAPSHOT:compile
[INFO] |  |  \- com.microsoft.azure:azure-data-lake-store-sdk:jar:2.2.5:compile
[INFO] |  \- org.apache.hadoop:hadoop-openstack:jar:3.0.2-SNAPSHOT:compile

Given that Hadoop 3.0.2+ downgrades hadoop-client to provided, and that's the minimum version this patch will build against, the exclusion is mostly superfluous: it's there more to block regressions than to actually keep anything out.

Contributor Author


Update: tried taking it out. Turns out that the hadoop-aliyun module still declares hadoop-common as provided, so it gets in. The exclusion is needed, and I've filed HADOOP-15354. I don't know if that'll get into Hadoop 3.1.0 though.

<profile>
<id>hadoop-2.6</id>
<activation>
<activeByDefault>true</activeByDefault>
Contributor


activeByDefault is a little misleading. It only enables the profile if you don't explicitly activate any other profiles.

So if you enable any other profile in the build, this won't be enabled automatically. And since the cloud module itself is already under a profile, I don't think you can ever trigger this.

Probably will need to be documented in the build docs, or maybe you can think of a different solution like enabling the cloud profile via a property instead.
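One shape the property-based activation suggested here could take (the marker property name is hypothetical): the hadoop-2.6 profile activates whenever the property is absent, so it keeps firing even when unrelated profiles are enabled on the command line.

```xml
<!-- Hypothetical sketch: activate hadoop-2.6 whenever the made-up marker
     property "use-hadoop-3" is NOT set, instead of activeByDefault, which
     is disabled as soon as any other profile is explicitly enabled. -->
<profile>
  <id>hadoop-2.6</id>
  <activation>
    <property>
      <name>!use-hadoop-3</name>
    </property>
  </activation>
</profile>
```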

Contributor Author


Hmmm. There's another option which is to leave all those in the standard list, and you get a few extra dependencies which aren't needed for the 3.x line:

[INFO] +- com.fasterxml.jackson.core:jackson-databind:jar:2.6.7.1:compile             *
[INFO] |  \- com.fasterxml.jackson.core:jackson-core:jar:2.6.7:compile                *
[INFO] +- com.fasterxml.jackson.core:jackson-annotations:jar:2.6.7:compile            *
[INFO] +- com.fasterxml.jackson.dataformat:jackson-dataformat-cbor:jar:2.6.7:compile  *
[INFO] +- org.apache.httpcomponents:httpclient:jar:4.5.4:compile
[INFO] |  +- commons-logging:commons-logging:jar:1.2:compile
[INFO] |  \- commons-codec:commons-codec:jar:1.10:compile
[INFO] +- org.apache.httpcomponents:httpcore:jar:4.4.8:compile
[INFO] +- org.apache.hadoop:hadoop-aws:jar:3.0.2-SNAPSHOT:compile
[INFO] |  \- com.amazonaws:aws-java-sdk-bundle:jar:1.11.271:compile
[INFO] +- org.apache.hadoop:hadoop-openstack:jar:3.0.2-SNAPSHOT:compile
[INFO] +- joda-time:joda-time:jar:2.9.3:compile                                       *
[INFO] +- org.apache.hadoop:hadoop-cloud-storage:jar:3.0.2-SNAPSHOT:compile
[INFO] |  +- org.apache.hadoop:hadoop-aliyun:jar:3.0.2-SNAPSHOT:compile
[INFO] |  |  \- com.aliyun.oss:aliyun-sdk-oss:jar:2.8.3:compile
[INFO] |  |     \- org.jdom:jdom:jar:1.1:compile
[INFO] |  +- org.apache.hadoop:hadoop-azure:jar:3.0.2-SNAPSHOT:compile
[INFO] |  |  +- com.microsoft.azure:azure-storage:jar:5.4.0:compile
[INFO] |  |  |  \- com.microsoft.azure:azure-keyvault-core:jar:0.8.0:compile
[INFO] |  |  \- org.eclipse.jetty:jetty-util-ajax:jar:9.3.19.v20170502:compile
[INFO] |  \- org.apache.hadoop:hadoop-azure-datalake:jar:3.0.2-SNAPSHOT:compile
[INFO] |     \- com.microsoft.azure:azure-data-lake-store-sdk:jar:2.2.5:compile

jackson-dataformat-cbor is the funny one; this is the sole declaration within Spark. With the shaded AWS JAR it's not needed at all.
The rest all make their way into the Spark assembly through other routes.

What do you think? Leave them as the default and not worry about it? It would remove the duplication in the 2.7 profile, and apart from being extraneous on hadoop-3 builds, it's harmless.

Contributor

@vanzin vanzin Mar 29, 2018


I think that's ok as an initial step. It would be better if you could, in profiles, customize individual dependencies (e.g. in the hadoop-3 profile exclude some transitive deps), but I'm not sure whether maven would complain about something like that.

jackson-dataformat-cbor can become interesting if Spark decides to upgrade jackson, since the github for that project says it's been removed in 2.8.

* hadoop branch-2 dependencies always declared
* minor nits in POM addressed
* added log4j.properties for tests

Change-Id: Ibb64b20a0be8624d1709e592b9fe85bdc4dd1af7
@SparkQA

SparkQA commented Mar 29, 2018

Test build #88713 has finished for PR 20923 at commit 016d690.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

pom.xml Outdated
<profile>
<id>hadoop-3</id>
<properties>
<hadoop.version>3.1.0-SNAPSHOT</hadoop.version>
Contributor


Hey @steveloughran what is the possible release date for Hadoop 3.1.0?

Contributor Author


RC0 is up for testing right now! @leftnoteasy is managing the release

@jerryshao
Contributor

I think we could separate the cloud-related stuff into another PR, and fix only build-related stuff in this PR.

@jerryshao
Contributor

Also I think we need to create a related spark-deps-hadoop-3.x under dev/deps and make dependency check work for Hadoop 3.

@steveloughran
Contributor Author

steveloughran commented Apr 3, 2018

I think we could separate the cloud-related stuff into another PR, and fix only build-related stuff in this PR

OK

… the build with all the POM changes other than those adding the optional hadoop-3.0.2+ source tree to the spark-hadoop-cloud build

Change-Id: Iccc2b66602db05db132ce5cf5c8546fe9a13a3fa
Change-Id: Ic13caf5fcf96d617085051579ede8380b2106119
@steveloughran
Contributor Author

@jerryshao the latest revision only has the POM changes, and that also excludes the build profile option to compile the hadoop-3 source trees

It also switches the Hadoop 3.1 version to 3.1.0, which matches the current RC; it will be downloaded from the ASF staging repo if the snapshots-and-staging profile is enabled

For example

mvn install -Phadoop-3,hadoop-cloud,yarn,snapshots-and-staging

Remember that if you do download the staged artifacts and new RCs are then cut, you will need to purge the old RCs from ~/.m2/repository
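For reference, a snapshots-and-staging profile of this kind typically adds the ASF staging and snapshot repositories along these lines (a sketch; Spark's actual profile definition may differ):

```xml
<profile>
  <id>snapshots-and-staging</id>
  <repositories>
    <!-- Artifacts staged for ASF release votes, e.g. Hadoop 3.1.0 RCs. -->
    <repository>
      <id>asf-staging</id>
      <url>https://repository.apache.org/content/groups/staging/</url>
    </repository>
    <!-- Nightly ASF snapshot builds. -->
    <repository>
      <id>asf-snapshots</id>
      <url>https://repository.apache.org/content/repositories/snapshots/</url>
      <snapshots><enabled>true</enabled></snapshots>
    </repository>
  </repositories>
</profile>
```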

@SparkQA

SparkQA commented Apr 3, 2018

Test build #88845 has finished for PR 20923 at commit 58c04e9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@steveloughran
Contributor Author

sbt isn't going to test this profile, obviously. Ran both the mvn and sbt package targets with profiles hadoop-3,hadoop-cloud,yarn,snapshots-and-staging

@steveloughran steveloughran changed the title [SPARK-23807][BUILD][WIP] Add Hadoop 3 profile with relevant POM fix ups, cloud-storage artifacts and binding [SPARK-23807][BUILD][WIP] Add Hadoop 3 profile with relevant POM fix ups Apr 3, 2018
@jerryshao
Contributor

Hi @steveloughran , I think you missed this comment. You need to create a deps file under dev/deps and change the related script.

Also I think we need to create a related spark-deps-hadoop-3.x under dev/deps and make dependency check work for Hadoop 3.

@steveloughran
Contributor Author

I saw that, but given there isn't much in the way of a 2.8 profile, I thought it was more of a wish list than a requirement. How do I go about creating it?

This includes the profile in test-dependencies.sh, so this part of the build will work: Hive doesn't need to be working to build that dependency graph.

Change-Id: I1ecfd4b1a8bea26600765b1de59f2425c42f6b03
@SparkQA

SparkQA commented Apr 5, 2018

Test build #88941 has finished for PR 20923 at commit 4184526.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@steveloughran
Contributor Author

Failure is the test-dependencies check failing, as the checker is trying to pull in hadoop-3.1.0 and it's still in ASF staging

Performing Maven install for hadoop-3
Using `mvn` from path: /home/jenkins/tools/hudson.tasks.Maven_MavenInstallation/Maven_3.3.9/bin/mvn
[ERROR] Failed to execute goal on project spark-launcher_2.11: Could not resolve dependencies for project org.apache.spark:spark-launcher_2.11:jar:spark-440623: Could not find artifact org.apache.hadoop:hadoop-client:jar:3.1.0 in central (https://repo.maven.apache.org/maven2) -> 

I'll turn off that check now, leaving the dependency list as is

….1 is still in staging

Change-Id: Id2d5655088b2a8c2bdec43f7d17110a513be3f7c
@steveloughran steveloughran force-pushed the cloud/SPARK-23807-hadoop-31 branch from 43e3f31 to 7c93d98 Compare April 5, 2018 17:58
@SparkQA

SparkQA commented Apr 5, 2018

Test build #88953 has finished for PR 20923 at commit 43e3f31.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 5, 2018

Test build #88955 has finished for PR 20923 at commit 7c93d98.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@steveloughran
Contributor Author

Test failures are all in org.apache.spark.sql.sources.BucketedWriteWithoutHiveSupportSuite. I don't see how these POM changes could have affected that.

@jerryshao
Contributor

Sorry @steveloughran for the late response. Most of the deps in the file are similar to those in "spark-deps-hadoop-2.7", so copying/renaming it and running "test-dependencies.sh" will show you the diffs; updating the deps file based on those diffs will give you a new hadoop-3 deps file.

@jerryshao
Contributor

I think you should also update "test-dependencies.sh" to make the new deps file work.

@jerryshao
Contributor

Jenkins, retest this please.

@SparkQA

SparkQA commented Apr 9, 2018

Test build #89038 has finished for PR 20923 at commit 7c93d98.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@steveloughran
Contributor Author

steveloughran commented Apr 9, 2018

I think you should also update "test-dependencies.sh" to make the new deps file work.

I did, but then things failed because the artifacts were only visible if you did a run with the -Psnapshots-and-staging profile, which Jenkins doesn't do here; it's why test run 88953 failed.

The 3.1.0 artifacts are now up on the Maven repository, so I can turn that on.

Because it's out, I think it'd also be good to call the profile hadoop-3.1, to make clear that's the target release, and to avoid confusion if anyone ever creates a 3.2, 3.3, ... profile in future.

…hadoop 3.1 is still in staging"

This reverts commit 7c93d98.
…t-dependencies.sh knows about it

Change-Id: Ie4906e2f41e9992e803674dce283f03b4dbab67e
@SparkQA

SparkQA commented Apr 9, 2018

Test build #89058 has finished for PR 20923 at commit 52a8c28.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@steveloughran
Contributor Author

Test failures are org.apache.spark.sql.sources.BucketedWriteWithoutHiveSupportSuite.; SPARK-23894

@jerryshao
Contributor

Ping @vanzin @gatorsmile , would like to hear your comments. Thanks!

@steveloughran
Contributor Author

I should add that spark-shell doesn't bring up the Azure client, though it's happy with the rest, because jetty-util isn't making it into dist/jars...I fear this is shading related.

Here's the JAR I'm using to diagnose classpath setup there; it tries to bootstrap access, from searching for and loading the JARs to actually reading and writing the data. Significantly more informative when things don't work.

https://github.com/steveloughran/cloudstore/releases/tag/tag_2018_04_11_release

jetty-util and jetty-util-ajax are forced into the dist/jars directory by
explicit identification in the relevant POMs as in the hadoop-dist-scope.

Without this they weren't coming in, as spark-assembly was seeing jetty-util marked
as provided. It's not needed for the spark-* JARs, which all use the shaded reference,
but it is needed indirectly via hadoop-azure. This change to the POMs reinstates it.

Maven has proven surprisingly "fussy" here; the implication being that its "closest declaration wins"
resolution policy doesn't just control versions, it also has influence over scoping.

Change-Id: I081023cae84236c925fad4e94168f1dac5a8026a
@steveloughran
Contributor Author

The jetty problem has been dealt with; because jetty-util is declared as provided (thanks to shading, it isn't needed in Spark any more), it wasn't getting into dist/jars even for those dependencies which did need it.

Fix: explicitly declaring it in the hadoop-cloud module and the hadoop cloud profile in the assembly module.

The jetty-util-ajax dependency, which hadoop-azure depends on directly, is also declared so that the Spark build can keep control of jetty versions everywhere. That one was getting through fine, because it wasn't being tagged as provided.
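A sketch of the fix described above: redeclaring the jetty artifacts in the hadoop-cloud module at the Hadoop dependency scope so they reach dist/jars. This is illustrative; the scope property name and the omitted version (managed elsewhere in the root POM) are assumptions.

```xml
<!-- Illustrative: force jetty-util and jetty-util-ajax into the distribution.
     Elsewhere in Spark jetty-util is marked provided because the spark-* JARs
     use the shaded copy, but hadoop-azure needs the unshaded JARs at runtime. -->
<dependency>
  <groupId>org.eclipse.jetty</groupId>
  <artifactId>jetty-util</artifactId>
  <scope>${hadoop.deps.scope}</scope>
</dependency>
<dependency>
  <groupId>org.eclipse.jetty</groupId>
  <artifactId>jetty-util-ajax</artifactId>
  <scope>${hadoop.deps.scope}</scope>
</dependency>
```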

@steveloughran steveloughran changed the title [SPARK-23807][BUILD][WIP] Add Hadoop 3 profile with relevant POM fix ups [SPARK-23807][BUILD] Add Hadoop 3.1 profile with relevant POM fix ups Apr 12, 2018
@SparkQA

SparkQA commented Apr 12, 2018

Test build #89295 has finished for PR 20923 at commit f6b9dc8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@steveloughran
Contributor Author

@jerryshao comments? I know that without the patched Hive or a mutant Hadoop build Spark doesn't work with Hadoop 3, but this sets everything up to build consistently, which is a prerequisite to fixing up the semi-official Spark Hive JAR.

Contributor

@vanzin vanzin left a comment


Just a few minor things.

Redeclare this dependency to force it into the distribution.
-->
<dependency>
<groupId>org.eclipse.jetty</groupId>
Contributor


This kinda sucks. Doesn't this also end up pulling up a bunch of other jetty stuff into the packaging?

I guess there's no way around it until Hadoop itself shades jetty in some way...

Contributor Author

@steveloughran steveloughran Apr 16, 2018


Doesn't this also end up pulling up a bunch of other jetty stuff into the packaging?

It doesn't pull in anything else. There's already one of the jetty- JARs in the dist/jars directory, BTW.

I guess there's no way around it until Hadoop itself shades jetty in some way...

Or when @aajisaka and colleagues implement the Java 9 support and everyone runs to it. This is one of those examples of why, from a packaging and deployment perspective, Java 9 is the good one.

Created HADOOP-15387 for the shading task; put my name to it, as Bikas has already been expressing a desire for it.

<dependencies>
<!--used during compilation but not exported as transitive dependencies-->
Contributor


Is this still needed after you removed the committer code?

Contributor Author


see below

@@ -38,7 +38,32 @@
<sbt.project.name>hadoop-cloud</sbt.project.name>
</properties>

<build>
Contributor


Is this still needed after you removed the committer code?

Contributor Author


it's in an adjacent PR; I've just pulled all the POM dependency changes into this one, to keep everything related to the dependency digraph together so it can be audited in one go.

@steveloughran
Contributor Author

@vanzin: The follow-up to this is #21066; I could move the compile-time changes there, but if you are going to have POMs playing with dependencies, it seems best to have it all in one place...the other one just sets up the compile and tests.

@jerryshao what do you suggest? It was your proposal to split things into POM and source for ease of review, after all.

@@ -2671,6 +2671,15 @@
</properties>
</profile>

<profile>
<id>hadoop-3.1</id>
Member


+1 for skipping Hadoop 3.0 and starting to support Hadoop 3.1+ only.

@vanzin
Contributor

vanzin commented Apr 23, 2018

retest this please

@vanzin
Contributor

vanzin commented Apr 23, 2018

I asked for the tests to run once more just to be sure.

I'd have preferred to keep the code-related build changes in the other PR where the code is actually added. But since the PR is already up, that would just slow things down, so this is fine as long as tests pass.

@jerryshao
Contributor

I would guess the tests here don't actually run with the Hadoop 3 profile, so we don't actually test anything.

Also, we still cannot use Hadoop 3 even if we merge this, because of the Hive issue, unless we use some of the tricks mentioned above. So I'm not sure if we should address the Hive issue first.

@dongjoon-hyun
Member

+1 for @jerryshao's comment. Some of the Hive UTs will fail with the Hadoop 3 profile.

@SparkQA

SparkQA commented Apr 24, 2018

Test build #89748 has finished for PR 20923 at commit f6b9dc8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@steveloughran
Contributor Author

I can and do build Hadoop with this local version enabled, so it's easy enough to set things up locally. Indeed, the ability to change the Hadoop version (HADOOP-13852) came about precisely because I was the first person to try to do that hadoop-3 + spark test, with a precursor profile for this.

Get this in and things are set up for the Hive work, as everything else is ready for it. We've decoupled the work, and for those people who do have a compatible Hadoop/Hive setup, this provides a standard profile for them to use, instead of having to write their own, determine zookeeper and curator versions, etc.

@steveloughran
Contributor Author

steveloughran commented Apr 24, 2018

I've also added a comment to SPARK-18673 offering to fix the org.spark-project.hive JAR, but only once this patch is in. This bit is ready, that may take time, and it will need this setup so that the actual tests can be run.

@vanzin
Contributor

vanzin commented Apr 24, 2018

Merging to master.

@asfgit asfgit closed this in ce7ba2e Apr 24, 2018
@steveloughran
Contributor Author

thank you! I guess that means I'm down for the hive JAR, doesn't it :)

Better make a list of patches which should go in; I think internally we have 1+ kerberos-related patch (pwendell/hive#2) as well as the Hadoop version case statement.

jersey-server-2.22.2.jar
jets3t-0.9.4.jar
jetty-webapp-9.3.20.v20170531.jar
jetty-xml-9.3.20.v20170531.jar
Member


Hadoop 3.x profile does not shade Jetty any more.

This is different from Hadoop 2.x profile. See #4285.

cc @wangyum @yhuai


Member


Sorry, but what do you mean? Apache Spark 2.4.5 Hadoop 2.7 binary has jetty jars while Apache Spark 3.0.0 Hadoop 3.2 binary does not.

$ tar tvf spark-2.4.5-bin-hadoop2.7.tgz | grep jetty
-rw-r--r-- spark-rm/spark-rm   177131 2020-01-13 02:30 spark-2.4.5-bin-hadoop2.7/jars/jetty-util-6.1.26.jar
-rw-r--r-- spark-rm/spark-rm   539912 2020-01-13 02:30 spark-2.4.5-bin-hadoop2.7/jars/jetty-6.1.26.jar

$ tar tvf spark-3.0.0-bin-hadoop3.2.tgz | grep jetty
$

Member


Please see [SPARK-30051][BUILD] Clean up hadoop-3.2 dependency

Member


cc @dbtsai

Member


The other required jetty jars are still shaded correctly. Please let me know if there is something missed.

$ jar tvf spark-core_2.12-3.0.0.jar | grep jetty | wc -l
    1308

Member


Great! Thank you for answering the question.

@steveloughran steveloughran deleted the cloud/SPARK-23807-hadoop-31 branch March 9, 2023 15:08