
[SPARK-23640][CORE] Fix hadoop config may override spark config #20785

Closed
wants to merge 4 commits

Conversation

@wangyum (Member) commented Mar 9, 2018

What changes were proposed in this pull request?

spark.shuffle.service.port may be read from the Hadoop Configuration created by

val hadoopConf = new Configuration()

Therefore, the client configuration spark.shuffle.service.port does not work unless it is set as spark.hadoop.spark.shuffle.service.port.

  • This configuration does not take effect:
bin/spark-sql --master yarn --conf spark.shuffle.service.port=7338
  • This configuration works:
bin/spark-sql --master yarn --conf spark.hadoop.spark.shuffle.service.port=7338

This PR fixes this issue.
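
For background, this is roughly how the spark.hadoop.* prefix ends up in a Hadoop Configuration (a simplified sketch of the copy Spark performs, with an illustrative helper name; not the exact code in SparkHadoopUtil):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkConf

// Sketch only: every "spark.hadoop.foo" entry in SparkConf is copied into the
// Hadoop Configuration as "foo", so a lookup that goes through the Hadoop
// Configuration only sees the spark.hadoop.-prefixed form.
def copySparkHadoopConfigs(conf: SparkConf, hadoopConf: Configuration): Unit = {
  conf.getAll.foreach { case (key, value) =>
    if (key.startsWith("spark.hadoop.")) {
      hadoopConf.set(key.stripPrefix("spark.hadoop."), value)
    }
  }
}
```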

How was this patch tested?

It's difficult to carry out unit testing, but I've tested it manually.

@SparkQA commented Mar 9, 2018

Test build #88123 has finished for PR 20785 at commit 9745ec3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum (Member, Author) commented Mar 9, 2018

retest this please

@vanzin (Contributor) commented Mar 9, 2018

The fix is not correct. If you want to change the order of precedence of these configs, you need to change Utils.getSparkOrYarnConfig.

Just to double check, this has always worked like this, right? I checked both 2.2 and 2.3 and both choose the YARN configuration over the Spark configuration when it's set.

@SparkQA commented Mar 9, 2018

Test build #88130 has finished for PR 20785 at commit 9745ec3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 10, 2018

Test build #88152 has finished for PR 20785 at commit 0034a58.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum (Member, Author) commented Mar 10, 2018

retest this please

@SparkQA commented Mar 11, 2018

Test build #88155 has finished for PR 20785 at commit 0034a58.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jerryshao (Contributor) commented:

I think if you're running on yarn, semantically spark.shuffle.service.port is a yarn configuration specified in yarn-site.xml. So it seems correct from a semantic point of view.

@wangyum (Member, Author) commented Mar 12, 2018

You are right.
In fact, our cluster has two shuffle services, one for production and one for development. We configure spark.shuffle.service.port to decide which shuffle service to use.

@@ -2434,7 +2434,8 @@ private[spark] object Utils extends Logging {
*/
def getSparkOrYarnConfig(conf: SparkConf, key: String, default: String): String = {
val sparkValue = conf.get(key, default)
if (conf.get(SparkLauncher.SPARK_MASTER, null) == "yarn") {
if (conf.get(SparkLauncher.SPARK_MASTER, null) == "yarn"
Review comment (Contributor):

No.

The logic you want here is the equivalent of:

if conf.contains(key)
  get spark conf
elif is_running_on_yarn()
  get conf from yarn
else
  return default
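
In Scala terms, a sketch of that logic applied to Utils.getSparkOrYarnConfig might look like this (assuming the imports already present in Utils.scala; an illustration, not necessarily the exact change that was merged):

```scala
def getSparkOrYarnConfig(conf: SparkConf, key: String, default: String): String = {
  if (conf.contains(key)) {
    // The explicitly set Spark configuration wins.
    conf.get(key, default)
  } else if (conf.get(SparkLauncher.SPARK_MASTER, null) == "yarn") {
    // On YARN, fall back to the Hadoop/YARN configuration.
    new YarnConfiguration(SparkHadoopUtil.get.newConfiguration(conf)).get(key, default)
  } else {
    default
  }
}
```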

Review comment (Member, Author):

Assuming that --conf spark.shuffle.service.port=7338 is configured, 7338 is displayed on the Environment tab, but 7337 is actually used.
So my idea is to get the value from SparkConf when the key starts with spark., except for keys under the spark.hadoop. prefix.

Review comment (Member, Author):

YarnConfiguration can only configure one spark.shuffle.service.port value.
If we get the spark.shuffle.service.port value from SparkConf, we can gradually upgrade the shuffle service, because we can set different values for different applications.

Review comment (Contributor):

I'm not sure I follow what you're saying, but let me explain how the configuration is expected to work.

"spark." options are set in "SparkConf". "spark.hadoop.*" options, on top of those, should also be reflected in any Hadoop Configuration objects that are created.

So you should never need to directly reference "spark.hadoop." properties in Spark code. They are not meant to be used by Spark, they are meant to be Hadoop configs. That's why I'm saying your code should not be doing what it is doing.

From what I understand of what you're trying to do, you want "spark.shuffle.service.port" to have precedence over the YARN configuration. For that, you just do what I suggested above. Check whether it's set in the Spark configuration before you even look at any Hadoop configuration.

The current order of precedence should be:

  • spark.hadoop.spark.shuffle.service.port (since it overrides Hadoop config)
  • hadoop config (spark.shuffle.service.port set in xml files)
  • spark.shuffle.service.port

You're proposing moving the lowest one to the top. That's a simple change. If you're trying to also fix something else, then it means there's a problem in another place.
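
A hypothetical usage illustrating that proposed precedence (not from the PR; Utils is private[spark], so this only compiles inside Spark's own packages):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.util.Utils

val conf = new SparkConf()
  .setMaster("yarn")
  .set("spark.shuffle.service.port", "7338")

// With the proposed ordering, the explicit SparkConf value ("7338") is returned
// even if yarn-site.xml or spark.hadoop.spark.shuffle.service.port says 7337;
// previously the Hadoop/YARN value took precedence.
val port = Utils.getSparkOrYarnConfig(conf, "spark.shuffle.service.port", "7337")
```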

@SparkQA commented Mar 17, 2018

Test build #88331 has finished for PR 20785 at commit 06bb6f8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum (Member, Author) commented Mar 29, 2018

Ping @vanzin

val sparkValue = conf.get(key, default)
if (conf.get(SparkLauncher.SPARK_MASTER, null) == "yarn") {
new YarnConfiguration(SparkHadoopUtil.get.newConfiguration(conf)).get(key, sparkValue)
if (conf.contains(key)) {
Review comment (Contributor):

The scaladoc above is now wrong, since it still refers to the old order of precedence. Otherwise looks ok.
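
For instance, the scaladoc could be reworded along these lines (a suggestion sketch, not the exact wording that landed):

```scala
/**
 * Return the value of the given config, preferring an explicitly set SparkConf value.
 * If the key is not set in SparkConf and the application runs on YARN, look it up in
 * the Hadoop/YARN configuration; otherwise return the supplied default.
 */
```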

@SparkQA commented Mar 30, 2018

Test build #88750 has finished for PR 20785 at commit a1eb874.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin (Contributor) commented Mar 30, 2018

Merging to master.

@asfgit asfgit closed this in ae91720 Mar 30, 2018
mshtelma pushed a commit to mshtelma/spark that referenced this pull request Apr 5, 2018
## What changes were proposed in this pull request?

`spark.shuffle.service.port` may be read from the Hadoop `Configuration` created at https://github.com/apache/spark/blob/9745ec3a61c99be59ef6a9d5eebd445e8af65b7a/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L459

Therefore, the client configuration `spark.shuffle.service.port` does not work unless it is set as `spark.hadoop.spark.shuffle.service.port`.

- This configuration does not take effect:
```
bin/spark-sql --master yarn --conf spark.shuffle.service.port=7338
```
- This configuration works:
```
bin/spark-sql --master yarn --conf spark.hadoop.spark.shuffle.service.port=7338
```

This PR fixes this issue.

## How was this patch tested?

It's difficult to carry out unit testing, but I've tested it manually.

Author: Yuming Wang <[email protected]>

Closes apache#20785 from wangyum/SPARK-23640.