[SPARK-2889] Create Hadoop config objects consistently. #1843
Conversation
This is the basic grunt work; code doesn't fully compile yet, since I'll do some of the more questionable changes in separate commits.
Instead of using "new Configuration()" where a configuration is needed, let the caller provide a context-appropriate config object.
This is sort of hackish, since it doesn't account for any customization someone might make to SparkConf before they actually start executing spark code. Instead, this will only consider options available in the system properties when creating the hadoop conf.
BTW: I'd like to add a couple of simple tests for the YarnSparkHadoopUtil class, but #1724 adds the test suite for that class and I'll wait until that PR is merged before adding the tests.
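For reference, a minimal sketch of the caller-side shape of this change (the class and method names are the real Spark ones, but the snippet is illustrative rather than a line lifted from the patch):

import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkConf
import org.apache.spark.deploy.SparkHadoopUtil

object ConfigExample {
  // Before: each call site built its own Hadoop config, so any
  // "spark.hadoop.*" settings in SparkConf were silently ignored.
  val before: Configuration = new Configuration()

  // After: the caller passes its SparkConf, and the shared helper applies
  // the "spark.hadoop.*" options when building the Hadoop config.
  val sparkConf = new SparkConf(loadDefaults = false)
  val after: Configuration = SparkHadoopUtil.get.newConfiguration(sparkConf)
}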
QA tests have started for PR 1843. This patch merges cleanly.
QA results for PR 1843:
python errors only, unrelated?
Jenkins, retest this please.
QA tests have started for PR 1843. This patch merges cleanly.
QA results for PR 1843:
Conflicts:
core/src/main/scala/org/apache/spark/util/FileLogger.scala
yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala
Jenkins, test this please.
1 similar comment
Jenkins, test this please.
QA tests have started for PR 1843 at commit
QA tests have finished for PR 1843 at commit
@@ -68,7 +68,26 @@ class SparkHadoopUtil extends Logging {
 * Return an appropriate (subclass) of Configuration. Creating config can initializes some Hadoop
 * subsystems.
 */
def newConfiguration(): Configuration = new Configuration()
def newConfiguration(conf: SparkConf): Configuration = {
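For context, a minimal sketch of what the new overload is meant to do, assuming it simply copies "spark.hadoop.*" entries from the SparkConf into the Hadoop config with the prefix stripped (the translation SparkContext previously did on its own). This is an illustration, not the exact body of the patch:

def newConfiguration(conf: SparkConf): Configuration = {
  val hadoopConf = new Configuration()
  // Copy "spark.hadoop.foo" -> "foo" so Hadoop/YARN code sees the overrides.
  conf.getAll.foreach { case (key, value) =>
    if (key.startsWith("spark.hadoop.")) {
      hadoopConf.set(key.substring("spark.hadoop.".length), value)
    }
  }
  hadoopConf
}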
This is technically a breaking API change, we can't just do it like this. We have to add the old version.
Also, somewhat worryingly, I don't think SparkHadoopUtil was meant to be a public API, so it's weird that it gets used in our examples. We should probably mark it as @DeveloperApi and make sure that the examples don't use it.
I know the whole "deploy" package is excluded from mima checks (because I added the exclude at @pwendell's request). How is it documented that these packages are "private", if at all? Do we need explicit annotations in that case?
(http://spark.apache.org/docs/1.0.0/api/scala/#package does not list the package, so maybe that's it?)
It's the same as the rest of the codebase -- everything that is "private" should be marked private[spark]. Things that we need to make public for advanced developers are @DeveloperApi. In this case, this thing has been public so we can't remove it, but we could at least mark it to tell people not to depend on it.
BTW in this case you should mark this class and all its methods as @DeveloperApi.
ok, I added the annotation.
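Rough shape of the class after the review above -- the @DeveloperApi annotation plus the restored zero-arg overload for backwards compatibility. The delegation to the new overload is an assumption for illustration, not a quote of the merged code:

import org.apache.hadoop.conf.Configuration
import org.apache.spark.{Logging, SparkConf}
import org.apache.spark.annotation.DeveloperApi

@DeveloperApi
class SparkHadoopUtil extends Logging {

  // Old signature kept so existing callers keep compiling; assumed to
  // delegate to the new SparkConf-aware overload.
  def newConfiguration(): Configuration = newConfiguration(new SparkConf())

  def newConfiguration(conf: SparkConf): Configuration = {
    // ... "spark.hadoop.*" translation as sketched earlier ...
    new Configuration()
  }
}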
Conflicts: core/src/main/scala/org/apache/spark/scheduler/cluster/SimrSchedulerBackend.scala
QA tests have started for PR 1843 at commit
QA tests have finished for PR 1843 at commit
QA tests have started for PR 1843 at commit
QA tests have finished for PR 1843 at commit
@vanzin unfortunately this no longer merges cleanly, probably due to your YARN change. Mind rebasing it?
Conflicts:
yarn/alpha/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala
yarn/alpha/src/main/scala/org/apache/spark/deploy/yarn/ExecutorLauncher.scala
yarn/common/src/main/scala/org/apache/spark/scheduler/cluster/YarnClientClusterScheduler.scala
yarn/common/src/main/scala/org/apache/spark/scheduler/cluster/YarnClusterScheduler.scala
yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala
yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ExecutorLauncher.scala
Jenkins, test this please.
add to whitelist and test this please
QA tests have started for PR 1843 at commit
QA tests have finished for PR 1843 at commit
Conflicts: core/src/test/scala/org/apache/spark/scheduler/ReplayListenerSuite.scala
Jenkins, test this please.
QA tests have started for PR 1843 at commit
QA tests have finished for PR 1843 at commit
Thanks Marcelo! I've merged this in.
Different places in the code were instantiating Configuration / YarnConfiguration objects in different ways. This could lead to confusion for people who actually expected "spark.hadoop.*" options to end up in the configs used by Spark code, since that would only happen for the SparkContext's config.

This change modifies most places to use SparkHadoopUtil to initialize configs, and make that method do the translation that previously was only done inside SparkContext.

The places that were not changed fall in one of the following categories:
- Test code where this doesn't really matter
- Places deep in the code where plumbing SparkConf would be too difficult for very little gain
- Default values for arguments - since the caller can provide their own config in that case

Author: Marcelo Vanzin <[email protected]>

Closes apache#1843 from vanzin/SPARK-2889 and squashes the following commits:

52daf35 [Marcelo Vanzin] Merge branch 'master' into SPARK-2889
f179013 [Marcelo Vanzin] Merge branch 'master' into SPARK-2889
51e71cf [Marcelo Vanzin] Add test to ensure that overriding Yarn configs works.
53f9506 [Marcelo Vanzin] Add DeveloperApi annotation.
3d345cb [Marcelo Vanzin] Restore old method for backwards compat.
fc45067 [Marcelo Vanzin] Merge branch 'master' into SPARK-2889
0ac3fdf [Marcelo Vanzin] Merge branch 'master' into SPARK-2889
3f26760 [Marcelo Vanzin] Compilation fix.
f16cadd [Marcelo Vanzin] Initialize config in SparkHadoopUtil.
b8ab173 [Marcelo Vanzin] Update Utils API to take a Configuration argument.
1e7003f [Marcelo Vanzin] Replace explicit Configuration instantiation with SparkHadoopUtil.
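As a usage note, commit 51e71cf above adds a test that overriding Yarn configs works; a hedged sketch of that kind of check is below (the property name and value are illustrative, not taken from the actual suite):

import org.apache.spark.SparkConf
import org.apache.spark.deploy.SparkHadoopUtil

object OverrideCheck {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf(loadDefaults = false)
      .set("spark.hadoop.yarn.resourcemanager.address", "example.com:8032")
    val hadoopConf = SparkHadoopUtil.get.newConfiguration(sparkConf)
    // The "spark.hadoop." prefix is stripped, so Hadoop/YARN code sees the override.
    assert(hadoopConf.get("yarn.resourcemanager.address") == "example.com:8032")
  }
}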