[SPARK-6955][NETWORK] Do not let Yarn Shuffle Server retry its server port. #5537
Conversation
Test build #30391 has finished for PR 5537 at commit
CC @andrewor14 That makes some sense. I hope there aren't side effects to setting this globally in this process.
@srowen This code was used in
@SaintBacchus It's fine to set it to 0, but maybe it makes more sense to just remove the logic that retries ports in the first place, since the shuffle service is the only place that binds to ports in the whole subproject. More specifically I'm referring to
Given that the network-common library (where the retry logic is) is supposed to be generic, and not just for shuffle, I'd rather keep the retry logic there, so this change looks better to me. I would just avoid setting the value in the passed configuration, preferring to clone it instead (or some other way that does not modify the input conf). Perhaps a similar change in the non-YARN service (does that exist?) would be needed too. That being said, the javadoc for
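The clone-before-mutating approach suggested above can be sketched as follows. This is an illustrative stand-in, not Spark's actual code: the `Conf` class and `withoutPortRetry` helper are hypothetical, though `spark.port.maxRetries` is a real Spark setting and `SparkConf` does expose a `clone()`.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical minimal config class, standing in for SparkConf.
class Conf {
    private final Map<String, String> settings;

    Conf() { this.settings = new HashMap<>(); }
    private Conf(Map<String, String> source) { this.settings = new HashMap<>(source); }

    Conf set(String key, String value) { settings.put(key, value); return this; }
    String get(String key, String dflt) { return settings.getOrDefault(key, dflt); }

    // Defensive copy, analogous to SparkConf.clone().
    Conf copy() { return new Conf(settings); }
}

class TransportConfDemo {
    // Disables port retries on a copy, leaving the caller's conf untouched.
    static Conf withoutPortRetry(Conf input) {
        return input.copy().set("spark.port.maxRetries", "0");
    }
}
```

The point of the sketch is that the caller's configuration object is never mutated, so setting the retry flag for the shuffle service cannot leak into other users of the same conf.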
…nfiguration in YarnShuffleService
@andrewor14
Test build #30473 has finished for PR 5537 at commit
@@ -43,8 +43,12 @@ object SparkTransportConf {
* @param numUsableCores if nonzero, this will restrict the server and client threads to only
* use the given number of cores, rather than all of the machine's cores.
* This restriction will only occur if these properties are not already set.
* @param disablePortRetry if true, server will not retry its port. It's better for the long-run
* server to disable it since the server and client had the agreement of
super nit: alignment is off by 1
The changes as they currently stand LGTM, let's just fix up the style as @vanzin pointed out.
I see, it makes sense to remove it for services set up independently of the Spark application, but not for Netty, because the executors will still be able to find the port. This workaround seems appropriate then.
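The distinction drawn here can be sketched with plain sockets: for the Netty block transfer service, whatever port the server actually binds is reported back and registered, so clients discover it dynamically and a retry to a different port is harmless. This is an illustrative sketch, not Spark's code.

```java
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

class DynamicPortDemo {
    public static void main(String[] args) throws Exception {
        // Bind to any free port, as if the configured port conflicted
        // and the retry logic picked a different one.
        ServerSocket server = new ServerSocket(0);

        // This actual port is what gets registered with the driver,
        // so clients never depend on the originally configured value.
        int actualPort = server.getLocalPort();

        // A client reading the registered port can always connect.
        try (Socket client = new Socket(InetAddress.getLoopbackAddress(), actualPort)) {
            System.out.println(client.isConnected());
        }
        server.close();
    }
}
```

A standalone shuffle service has no such registration channel: clients only know the statically configured port, which is why retrying onto a different port breaks them.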
@@ -43,8 +43,12 @@ object SparkTransportConf {
* @param numUsableCores if nonzero, this will restrict the server and client threads to only
* use the given number of cores, rather than all of the machine's cores.
* This restriction will only occur if these properties are not already set.
* @param disablePortRetry if true, server will not retry its port. It's better for the long-run
* server to disable it since the server and client had the agreement of
* the specific port.
if true, the server will not retry to bind to a different port than the one configured.
This is better for long-running servers since they may have agreed upon specific
ports with the clients.
actually, since the only place we ever set this to true is in StandaloneWorkerShuffleService, does it make sense to just move the conf setting there along with this comment? (similar to what you did in YARN)
Ah, I realize now that the original change that brought this bug up (#4240) was perhaps not the right solution. The issue that #4240 solved was actually brought up beforehand (#3688 (comment)). I think my recommended fix then would still be good, as it would not affect the Yarn or Standalone Worker shuffle services. I created a patch which aims to resolve both #4240 and this PR's issues: #5575.
It's better to let the NodeManager go down rather than retry on another port when
spark.shuffle.service.port
conflicts while the Spark Yarn Shuffle Server is starting, because the retry mechanism makes the shuffle port inconsistent with the configured one and leaves clients unable to find the server.