[SPARK-23640][CORE] Fix hadoop config may override spark config #20785
Conversation
Test build #88123 has finished for PR 20785 at commit

retest this please
The fix is not correct. If you want to change the order of precedence of these configs, you need to change

Just to double check: this has always worked like this, right? I checked both 2.2 and 2.3, and both choose the YARN configuration over the Spark configuration when it's set.
Test build #88130 has finished for PR 20785 at commit

Test build #88152 has finished for PR 20785 at commit

retest this please

Test build #88155 has finished for PR 20785 at commit
I think if you're running on yarn, semantically

You are right.
```diff
@@ -2434,7 +2434,8 @@ private[spark] object Utils extends Logging {
    */
   def getSparkOrYarnConfig(conf: SparkConf, key: String, default: String): String = {
     val sparkValue = conf.get(key, default)
-    if (conf.get(SparkLauncher.SPARK_MASTER, null) == "yarn") {
+    if (conf.get(SparkLauncher.SPARK_MASTER, null) == "yarn"
```
No.

The logic you want here is the equivalent of:

```
if conf.contains(key)
  get spark conf
elif is_running_on_yarn()
  get conf from yarn
else
  return default
```
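In Scala, that suggestion would look roughly like the following. This is only a sketch of the reviewer's pseudocode, not the final patch; it reuses the method name and imports already visible in the diff above:

```scala
import org.apache.hadoop.yarn.conf.YarnConfiguration
import org.apache.spark.SparkConf
import org.apache.spark.deploy.SparkHadoopUtil
import org.apache.spark.launcher.SparkLauncher

// Sketch: an explicit SparkConf setting wins, then the YARN configuration,
// then the caller-supplied default.
def getSparkOrYarnConfig(conf: SparkConf, key: String, default: String): String = {
  if (conf.contains(key)) {
    // Key explicitly set by the user (e.g. --conf spark.shuffle.service.port=7338).
    conf.get(key, default)
  } else if (conf.get(SparkLauncher.SPARK_MASTER, null) == "yarn") {
    // Fall back to the YARN / Hadoop configuration (XML files).
    new YarnConfiguration(SparkHadoopUtil.get.newConfiguration(conf)).get(key, default)
  } else {
    // Neither set in SparkConf nor running on YARN.
    default
  }
}
```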
Assuming `--conf spark.shuffle.service.port=7338` is configured, 7338 is displayed on the Environment tab, but 7337 is actually used.

So my idea is to get the value from `SparkConf` if the key starts with `spark.`, except for `spark.hadoop.*`.
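In other words, the filter being proposed here is roughly the following (a hypothetical helper, just to illustrate the comment above; the reviewer pushes back on this idea below):

```scala
// Hypothetical helper illustrating the idea: prefer the SparkConf value for
// "spark.*" keys, but leave "spark.hadoop.*" keys to the Hadoop configuration.
def preferSparkConfValue(key: String): Boolean =
  key.startsWith("spark.") && !key.startsWith("spark.hadoop.")
```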
`YarnConfiguration` can only configure one `spark.shuffle.service.port` value. If we get the `spark.shuffle.service.port` value from `SparkConf` instead, we can gradually upgrade the shuffle service, because we can set different values for different applications.
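For example (illustrative only, assuming the `SparkConf` value takes precedence), two applications in the same cluster could target different shuffle service instances during a rolling upgrade:

```scala
import org.apache.spark.SparkConf

// Illustrative only: each application picks its own shuffle service port,
// something a single cluster-wide YarnConfiguration value cannot express.
val appOnOldService = new SparkConf()
  .setMaster("yarn")
  .set("spark.shuffle.service.port", "7337") // existing service

val appOnNewService = new SparkConf()
  .setMaster("yarn")
  .set("spark.shuffle.service.port", "7338") // newly deployed service
```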
I'm not sure I follow what you're saying, but let me explain how the configuration is expected to work.

"spark." options are set in `SparkConf`. "spark.hadoop.*" options, on top of those, should also be reflected in any Hadoop `Configuration` objects that are created.

So you should never need to directly reference "spark.hadoop." properties in Spark code. They are not meant to be used by Spark; they are meant to be Hadoop configs. That's why I'm saying your code should not be doing what it is doing.

From what I understand of what you're trying to do, you want "spark.shuffle.service.port" to have precedence over the YARN configuration. For that, you just do what I suggested above: check whether it's set in the Spark configuration before you even look at any Hadoop configuration.

The current order of precedence (highest first) should be:
- spark.hadoop.spark.shuffle.service.port (since it overrides the Hadoop config)
- Hadoop config (spark.shuffle.service.port set in XML files)
- spark.shuffle.service.port

You're proposing moving the lowest one to the top. That's a simple change. If you're trying to also fix something else, then there's a problem in another place.
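For reference, the way "spark.hadoop.*" entries reach a Hadoop `Configuration` is roughly the following; the real logic lives in `SparkHadoopUtil.newConfiguration`, so treat this as an approximate sketch:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkConf

// Sketch: every "spark.hadoop.foo" entry becomes plain "foo" in the Hadoop
// Configuration, overriding whatever the XML files (e.g. yarn-site.xml) set.
def appendSparkHadoopConfigs(conf: SparkConf, hadoopConf: Configuration): Unit = {
  for ((key, value) <- conf.getAll if key.startsWith("spark.hadoop.")) {
    hadoopConf.set(key.substring("spark.hadoop.".length), value)
  }
}
```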
Test build #88331 has finished for PR 20785 at commit

Ping @vanzin
```diff
-    val sparkValue = conf.get(key, default)
-    if (conf.get(SparkLauncher.SPARK_MASTER, null) == "yarn") {
-      new YarnConfiguration(SparkHadoopUtil.get.newConfiguration(conf)).get(key, sparkValue)
+    if (conf.contains(key)) {
```
The scaladoc above is now wrong, since it still refers to the old order of precedence. Otherwise looks ok.
Test build #88750 has finished for PR 20785 at commit

Merging to master.
## What changes were proposed in this pull request?

It may get `spark.shuffle.service.port` from
https://github.com/apache/spark/blob/9745ec3a61c99be59ef6a9d5eebd445e8af65b7a/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L459

Therefore, the client configuration `spark.shuffle.service.port` does not work unless the configuration is `spark.hadoop.spark.shuffle.service.port`.

- This configuration does not work:
```
bin/spark-sql --master yarn --conf spark.shuffle.service.port=7338
```
- This configuration works:
```
bin/spark-sql --master yarn --conf spark.hadoop.spark.shuffle.service.port=7338
```

This PR fixes this issue.

## How was this patch tested?

It's difficult to carry out unit testing, but I've tested it manually.

Author: Yuming Wang <[email protected]>

Closes apache#20785 from wangyum/SPARK-23640.