[SPARK-6955][NETWORK] Do not let Yarn Shuffle Server retry its server port. #5537
Conversation
Test build #30391 has finished for PR 5537 at commit
CC @andrewor14 That makes some sense. I hope there aren't side effects to setting this globally in this process.
@srowen This code was used in
@SaintBacchus It's fine to set it to 0, but maybe it makes more sense to just remove the logic that retries ports in the first place, since the shuffle service is the only place that binds to ports in the whole subproject. More specifically I'm referring to
Given that the network-common library (where the retry logic is) is supposed to be generic, and not just for shuffle, I'd rather keep the retry logic there, so this change looks better to me. I would just avoid setting the value in the passed configuration, preferring to clone it instead (or some other way that does not modify the input conf). Perhaps a similar change in the non-YARN service (does that exist?) would be needed too. That being said, the javadoc for
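The clone-before-mutating approach suggested above can be sketched as follows. This is an illustrative stand-in, not Spark's actual code: the `Conf` class and `withoutPortRetry` helper are hypothetical, though `spark.port.maxRetries` is a real Spark setting and `SparkConf` does expose a `clone()`.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical minimal config class, standing in for SparkConf.
class Conf {
    private final Map<String, String> settings;

    Conf() { this.settings = new HashMap<>(); }
    private Conf(Map<String, String> source) { this.settings = new HashMap<>(source); }

    Conf set(String key, String value) { settings.put(key, value); return this; }
    String get(String key, String dflt) { return settings.getOrDefault(key, dflt); }

    // Defensive copy, analogous to SparkConf.clone().
    Conf copy() { return new Conf(settings); }
}

class TransportConfDemo {
    // Disables port retries on a copy, leaving the caller's conf untouched.
    static Conf withoutPortRetry(Conf input) {
        return input.copy().set("spark.port.maxRetries", "0");
    }
}
```

The point of the sketch is that the caller's configuration object is never mutated, so setting the retry flag for the shuffle service cannot leak into other users of the same conf.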
…nfiguration in YarnShuffleService
@andrewor14
Test build #30473 has finished for PR 5537 at commit
@@ -43,8 +43,12 @@ object SparkTransportConf {
* @param numUsableCores if nonzero, this will restrict the server and client threads to only
* use the given number of cores, rather than all of the machine's cores.
* This restriction will only occur if these properties are not already set.
* @param disablePortRetry if true, server will not retry its port. It's better for the long-run
* server to disable it since the server and client had the agreement of
super nit: alignment is off by 1
The changes as they currently stand LGTM, let's just fix up the style as @vanzin pointed out.
I see, it makes sense to remove it for services set up independently of the Spark application, but not for Netty, because the executors will still be able to find the port. This workaround seems appropriate then.
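The distinction drawn here can be sketched with plain sockets: for the Netty block transfer service, whatever port the server actually binds is reported back and registered, so clients discover it dynamically and a retry to a different port is harmless. This is an illustrative sketch, not Spark's code.

```java
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

class DynamicPortDemo {
    public static void main(String[] args) throws Exception {
        // Bind to any free port, as if the configured port conflicted
        // and the retry logic picked a different one.
        ServerSocket server = new ServerSocket(0);

        // This actual port is what gets registered with the driver,
        // so clients never depend on the originally configured value.
        int actualPort = server.getLocalPort();

        // A client reading the registered port can always connect.
        try (Socket client = new Socket(InetAddress.getLoopbackAddress(), actualPort)) {
            System.out.println(client.isConnected());
        }
        server.close();
    }
}
```

A standalone shuffle service has no such registration channel: clients only know the statically configured port, which is why retrying onto a different port breaks them.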
@@ -43,8 +43,12 @@ object SparkTransportConf {
* @param numUsableCores if nonzero, this will restrict the server and client threads to only
* use the given number of cores, rather than all of the machine's cores.
* This restriction will only occur if these properties are not already set.
* @param disablePortRetry if true, server will not retry its port. It's better for the long-run
* server to disable it since the server and client had the agreement of
* the specific port.
if true, the server will not retry to bind to a different port than the one configured.
This is better for long-running servers since they may have agreed upon specific
ports with the clients.
actually, since the only place we ever set this to true is in StandaloneWorkerShuffleService, does it make sense to just move the conf setting there along with this comment? (similar to what you did in YARN)
Ah, I realize now that the original change that brought this bug up (#4240) was perhaps not the right solution. The issue that #4240 solved was actually brought up beforehand (#3688 (comment)). I think my recommended fix then would still be good, as it would not affect the Yarn or Standalone Worker shuffle services. I created a patch which aims to resolve both #4240 and this PR's issues: #5575.
It's better to let the NodeManager go down rather than retry on another port when
spark.shuffle.service.port
conflicts while the Spark Yarn Shuffle Server is starting, because the retry mechanism makes the shuffle port inconsistent with the configured one and leaves clients unable to find the server.