
[SPARK-5006][Deploy]spark.port.maxRetries doesn't work #3841

Closed
wants to merge 11 commits

Conversation

WangTaoTheTonic
Contributor

https://issues.apache.org/jira/browse/SPARK-5006

I think the issue was introduced in #1777.

I haven't dug into Mesos's backend yet; maybe the same logic should be added there as well.

@SparkQA

SparkQA commented Dec 30, 2014

Test build #24890 has started for PR 3841 at commit 62ec336.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Dec 30, 2014

Test build #24890 has finished for PR 3841 at commit 62ec336.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24890/
Test FAILed.

@SparkQA

SparkQA commented Dec 30, 2014

Test build #24891 has started for PR 3841 at commit 191face.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Dec 30, 2014

Test build #24891 has finished for PR 3841 at commit 191face.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24891/
Test PASSed.

@@ -176,6 +176,10 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli
logInfo(s"Running Spark version $SPARK_VERSION")

private[spark] val conf = config.clone()
val portRetriesConf = conf.getOption("spark.port.maxRetries")
Contributor

You could use conf.getOption(...).foreach { portRetriesConf => [...] } but I'm not sure that it's a huge win.
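For illustration, a minimal sketch of the two styles being compared; since the body that consumes the option is elided above ([...]), the println calls below are only placeholders:

```scala
import org.apache.spark.SparkConf

object GetOptionStyles {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf(loadDefaults = false).set("spark.port.maxRetries", "4")

    // Style used in the diff: keep the Option in a val and act on it afterwards.
    val portRetriesConf = conf.getOption("spark.port.maxRetries")
    portRetriesConf.foreach(v => println(s"explicit val: $v"))

    // Suggested style: one expression, body runs only when the key is set.
    conf.getOption("spark.port.maxRetries").foreach { portRetriesConf =>
      println(s"foreach style: $portRetriesConf")
    }
  }
}
```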

@JoshRosen
Contributor

I'm a bit confused about this change, since it seems like changing the code to read that value from system properties instead of SparkConf breaks our ability to configure it via SparkConf.

Can you add a failing unit test which demonstrates the problem / bug that this patch addresses?

If this issue has to do with initialization ordering, I'd like to see if we can come up with a cleaner approach which doesn't involve things like unexplained lazy keywords (since I'm concerned that such approaches will inevitably break when the code is modified).
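To make the requested kind of test concrete, here is a self-contained sketch of what a regression check for retry counting could look like. It uses a plain ServerSocket and a hypothetical startOnPort helper rather than Spark's actual Utils.startServiceOnPort, so it only illustrates the intended semantics (maxRetries = 1 means one extra port is tried before giving up):

```scala
import java.net.{BindException, ServerSocket}

object MaxRetriesSketch {
  // Hypothetical stand-in for the port-retry logic: try startPort, then
  // startPort + 1, ..., allowing at most maxRetries additional attempts.
  def startOnPort(startPort: Int, maxRetries: Int): ServerSocket = {
    var attempt = 0
    while (true) {
      val port = startPort + attempt
      try {
        return new ServerSocket(port)
      } catch {
        case _: BindException =>
          if (attempt >= maxRetries) {
            throw new BindException(s"Service failed after $maxRetries retries (last port tried: $port)")
          }
          attempt += 1
      }
    }
    throw new IllegalStateException("unreachable")
  }

  def main(args: Array[String]): Unit = {
    val first  = startOnPort(23000, maxRetries = 1) // binds 23000
    val second = startOnPort(23000, maxRetries = 1) // retries once, binds 23001
    try {
      // With maxRetries = 1 a third attempt must fail instead of silently
      // falling back to the default of 16 retries (the reported bug).
      startOnPort(23000, maxRetries = 1)
      println("BUG: third bind unexpectedly succeeded")
    } catch {
      case _: BindException => println("third bind failed after 1 retry, as expected")
    } finally {
      first.close(); second.close()
    }
  }
}
```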

@WangTaoTheTonic
Contributor Author

I set spark.port.maxRetries to 1 in spark-defaults.conf and ran 3 SparkPi examples using yarn-client mode on one node. The logs on the 3 drivers' sides were:

14/12/31 10:09:26 INFO Utils: Successfully started service 'SparkUI' on port 23000.
14/12/31 10:09:26 INFO SparkUI: Started SparkUI at http://dc1-rack1-host2:23000

14/12/31 10:09:33 WARN Utils: Service 'SparkUI' could not bind on port 23000. Attempting port 23001.
14/12/31 10:09:34 INFO Utils: Successfully started service 'SparkUI' on port 23001.
14/12/31 10:09:34 INFO SparkUI: Started SparkUI at http://dc1-rack1-host2:23001

14/12/31 10:09:36 WARN Utils: Service 'SparkUI' could not bind on port 23000. Attempting port 23001.
14/12/31 10:09:36 WARN Utils: Service 'SparkUI' could not bind on port 23001. Attempting port 23002.
14/12/31 10:09:36 INFO Utils: Successfully started service 'SparkUI' on port 23002.
14/12/31 10:09:36 INFO SparkUI: Started SparkUI at http://dc1-rack1-host2:23002

We can see the third application retried 3 times before giving up, so spark.port.maxRetries didn't take effect here.
On another node, I set this config's value to 20, and it could still only launch 16 applications.

14/12/31 17:43:55 WARN Utils: Service 'SparkUI' could not bind on port 23015.
14/12/31 17:43:55 ERROR SparkUI: Failed to bind SparkUI
java.net.BindException: Address already in use: Service 'SparkUI' failed after 16 retries!

@SparkQA

SparkQA commented Dec 31, 2014

Test build #24928 has started for PR 3841 at commit 396c226.

  • This patch merges cleanly.

@WangTaoTheTonic
Contributor Author

After this patch, the logs from running 3 applications with spark.port.maxRetries=1 were:

14/12/31 11:02:06 INFO Utils: Starting service 'SparkUI' on port 23000 with maximum 1 retries.
14/12/31 11:02:06 INFO Utils: Successfully started service 'SparkUI' on port 23000.
14/12/31 11:02:06 INFO SparkUI: Started SparkUI at http://dc1-rack1-host2:23000

14/12/31 11:02:11 INFO Utils: Starting service 'SparkUI' on port 23000 with maximum 1 retries.
14/12/31 11:02:11 WARN Utils: Service 'SparkUI' could not bind on port 23000. Attempting port 23001.
14/12/31 11:02:11 INFO Utils: Successfully started service 'SparkUI' on port 23001.
14/12/31 11:02:11 INFO SparkUI: Started SparkUI at http://dc1-rack1-host2:23001

14/12/31 11:02:28 INFO Utils: Starting service 'SparkUI' on port 23000 with maximum 1 retries.
14/12/31 11:02:28 WARN Utils: Service 'SparkUI' could not bind on port 23000. Attempting port 23001.
14/12/31 11:02:28 ERROR SparkUI: Failed to bind SparkUI
java.net.BindException: Address already in use: Service 'SparkUI' failed after 1 retries!
at sun.nio.ch.Net.bind0(Native Method)
at sun.nio.ch.Net.bind(Net.java:344)
at sun.nio.ch.Net.bind(Net.java:336)

I also observed the other ports on the driver and executors, and they behaved as expected too.

@SparkQA

SparkQA commented Dec 31, 2014

Test build #24928 has finished for PR 3841 at commit 396c226.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24928/
Test FAILed.

@WangTaoTheTonic
Contributor Author

retest this please.

@SparkQA

SparkQA commented Dec 31, 2014

Test build #24943 has started for PR 3841 at commit 396c226.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Dec 31, 2014

Test build #24943 has finished for PR 3841 at commit 396c226.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24943/
Test PASSed.

@WangTaoTheTonic
Contributor Author

@andrewor14 Could you take a look?

.flatMap(_.conf.getOption("spark.port.maxRetries"))
.map(_.toInt)
.getOrElse(16)
sys.props.get("spark.port.maxRetries").map(_.toInt).getOrElse(16)
Contributor

I'm worried that changing this to read from System properties rather than the SparkEnv will break standalone mode here. Have you confirmed that spark.port.maxRetries in standalone mode still works with this change?

In general we should prefer using SparkEnv over System properties, because the latter have all kinds of gotchas. There's a reason we created SparkEnv in the first place.
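A minimal sketch of the concern, under the assumption that the value is set only programmatically on a SparkConf (for example by the application) and never exported as a -D JVM property: reading through sys.props silently falls back to the default of 16.

```scala
import org.apache.spark.SparkConf

object ConfVsSysProps {
  def main(args: Array[String]): Unit = {
    // Set only on the SparkConf, not as a JVM system property.
    val conf = new SparkConf(loadDefaults = false).set("spark.port.maxRetries", "1")

    val fromConf  = conf.getOption("spark.port.maxRetries").map(_.toInt).getOrElse(16)
    val fromProps = sys.props.get("spark.port.maxRetries").map(_.toInt).getOrElse(16)

    println(s"read via SparkConf:  $fromConf")  // 1
    println(s"read via sys.props:  $fromProps") // 16 -- the setting is lost
  }
}
```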

Contributor

Yes, we should read from the conf, not from sys.props directly.

@ash211
Contributor

ash211 commented Jan 5, 2015

@WangTaoTheTonic I understand the problem you're observing and think it's specific to YARN. Without being super familiar with YARN, I think the approach this patch should take is to send the config option via a system property with a -D flag to the executor, and immediately put that system property into a stub SparkEnv object. When the real SparkEnv object comes over the wire, the two can be merged (or maybe the first can be dropped). This way we don't have to switch to reading from System properties in other places, like Utils.maxPortRetries.
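A rough sketch of the mechanism suggested here, with sys.props set programmatically to stand in for a real -Dspark.port.maxRetries flag on the executor's java command line; the stub-SparkEnv/merge step described above is not shown. A default SparkConf() (loadDefaults = true) copies spark.* system properties into the conf, which is what would make the flag visible early on the executor.

```scala
import org.apache.spark.SparkConf

object SystemPropertyPickup {
  def main(args: Array[String]): Unit = {
    // Stand-in for launching the JVM with -Dspark.port.maxRetries=4.
    sys.props("spark.port.maxRetries") = "4"

    // A default SparkConf loads every spark.* system property, so the
    // setting is available before the driver's full config arrives.
    val conf = new SparkConf()
    println(conf.getOption("spark.port.maxRetries")) // Some(4)
  }
}
```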

@@ -372,5 +372,5 @@ private[spark] object SparkConf {
/**
* Return whether the given config is a Spark port config.
*/
def isSparkPortConf(name: String): Boolean = name.startsWith("spark.") && name.endsWith(".port")
def isSparkPortConf(name: String): Boolean = name.startsWith("spark.") && name.contains(".port")
Contributor

You are misunderstanding what this method does. It looks for spark.*.port intentionally and should not match spark.port.maxRetries.

Contributor Author

We should send spark.port.maxRetries to the executor before it is launched; then the config will be used.

Contributor

I understand that, but this currently matches something like spark.portable.mushroom which has nothing to do with Spark ports. Maybe instead you want to do something like:

name.matches("spark\\..*\\.port") | name.startsWith("spark.port.")

Contributor Author

OK, I changed it, but used a more obvious way.

@andrewor14
Contributor

So @WangTaoTheTonic @JoshRosen, does changing portMaxRetries to a def solve this issue? I don't see a reason to go through the system properties.
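A self-contained sketch (not Spark's actual Utils code) of the val-versus-def distinction behind this suggestion: a val snapshots the setting at object-initialization time and can miss a value that only shows up later, while a def re-reads it on every call.

```scala
object ValVsDefTiming {
  // Evaluated once, when this object is first initialized.
  val maxRetriesVal: Int =
    sys.props.get("spark.port.maxRetries").map(_.toInt).getOrElse(16)

  // Re-evaluated on every call, so it sees properties set later.
  def maxRetriesDef: Int =
    sys.props.get("spark.port.maxRetries").map(_.toInt).getOrElse(16)

  def main(args: Array[String]): Unit = {
    println(maxRetriesVal)                   // 16 -- captured at object init
    sys.props("spark.port.maxRetries") = "1" // the setting arrives "late"
    println(maxRetriesVal)                   // still 16
    println(maxRetriesDef)                   // 1 -- re-read, picks it up
  }
}
```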

@SparkQA

SparkQA commented Jan 12, 2015

Test build #25414 has started for PR 3841 at commit bc6e1ec.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Jan 12, 2015

Test build #25410 has finished for PR 3841 at commit 67bcb46.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25410/
Test PASSed.

@SparkQA

SparkQA commented Jan 12, 2015

Test build #25414 has finished for PR 3841 at commit bc6e1ec.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25414/
Test PASSed.

@WangTaoTheTonic
Contributor Author

@andrewor14 Updated and did a simple test; it works.

@@ -57,7 +57,7 @@ private[spark] class HttpServer(
} else {
logInfo("Starting HTTP Server")
val (actualServer, actualPort) =
Utils.startServiceOnPort[Server](requestedPort, doStart, serverName)
Utils.startServiceOnPort[Server](requestedPort, doStart, new SparkConf(), serverName)
Contributor

We should pass this a conf that already exists instead of creating a new one. This requires changing the constructors of HttpServer and HttpFileServer. I believe there is already a conf everywhere else.
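A minimal sketch of the constructor change being asked for, using a hypothetical MiniServer in place of the real HttpServer: constructing a fresh SparkConf at the call site drops the caller's settings, while threading the existing conf through the constructor preserves them.

```scala
import org.apache.spark.SparkConf

// Hypothetical simplified stand-in -- not the real HttpServer signature.
class MiniServer(conf: SparkConf, name: String) {
  def maxRetries: Int =
    conf.getOption("spark.port.maxRetries").map(_.toInt).getOrElse(16)
  def describe: String = s"$name: max $maxRetries retries"
}

object ThreadTheConf {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf(loadDefaults = false).set("spark.port.maxRetries", "1")

    // Creating a fresh SparkConf at the call site loses the caller's setting...
    val detached = new MiniServer(new SparkConf(loadDefaults = false), "detached")
    // ...while passing the existing conf through the constructor keeps it.
    val attached = new MiniServer(conf, "attached")

    println(detached.describe) // detached: max 16 retries
    println(attached.describe) // attached: max 1 retries
  }
}
```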

@andrewor14
Contributor

@WangTaoTheTonic this looks much better than before. My new comments are fairly minor.

@SparkQA

SparkQA commented Jan 13, 2015

Test build #25453 has started for PR 3841 at commit 8cdf96d.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Jan 13, 2015

Test build #25453 has finished for PR 3841 at commit 8cdf96d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • val classServer = new HttpServer(conf, outputDir, new SecurityManager(conf), classServerPort, "HTTP class server")

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25453/
Test PASSed.

*/
def isSparkPortConf(name: String): Boolean = name.startsWith("spark.") && name.endsWith(".port")
def isSparkPortConf(name: String): Boolean = {
(name.startsWith("spark.") && name.endsWith(".port")) | name.startsWith("spark.port.")
Contributor

this needs to be ||!

Contributor

I will fix this myself.
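For reference, a sketch of the predicate in the form the reviewer asked for (with a short-circuiting ||; on Booleans, | produces the same result but always evaluates both sides), together with the example keys discussed above. The exact merged code may differ, since the fix was applied at merge time.

```scala
object IsSparkPortConfSketch {
  // The predicate with the short-circuiting || the reviewer asked for.
  def isSparkPortConf(name: String): Boolean =
    (name.startsWith("spark.") && name.endsWith(".port")) || name.startsWith("spark.port.")

  def main(args: Array[String]): Unit = {
    println(isSparkPortConf("spark.ui.port"))           // true  (spark.*.port)
    println(isSparkPortConf("spark.port.maxRetries"))   // true  (spark.port.*)
    println(isSparkPortConf("spark.portable.mushroom")) // false (neither pattern)
  }
}
```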

@andrewor14
Contributor

OK, LGTM. I will fix the last batch of comments myself when I merge this into master. Thanks.

asfgit closed this in f7741a9 on Jan 13, 2015
@JoshRosen
Contributor

@andrewor14 Should this be backported to 1.2.1?

asfgit pushed a commit that referenced this pull request Jan 21, 2015
https://issues.apache.org/jira/browse/SPARK-5006

I think the issue was introduced in #1777.

I haven't dug into Mesos's backend yet; maybe the same logic should be added there as well.

Author: WangTaoTheTonic <[email protected]>
Author: WangTao <[email protected]>

Closes #3841 from WangTaoTheTonic/SPARK-5006 and squashes the following commits:

8cdf96d [WangTao] indent thing
2d86d65 [WangTaoTheTonic] fix line length
7cdfd98 [WangTaoTheTonic] fit for new HttpServer constructor
61a370d [WangTaoTheTonic] some minor fixes
bc6e1ec [WangTaoTheTonic] rebase
67bcb46 [WangTaoTheTonic] put conf at 3rd position, modify suite class, add comments
f450cd1 [WangTaoTheTonic] startServiceOnPort will use a SparkConf arg
29b751b [WangTaoTheTonic] rebase as ExecutorRunnableUtil changed to ExecutorRunnable
396c226 [WangTaoTheTonic] make the grammar more like scala
191face [WangTaoTheTonic] invalid value name
62ec336 [WangTaoTheTonic] spark.port.maxRetries doesn't work

Conflicts:
	external/mqtt/src/test/scala/org/apache/spark/streaming/mqtt/MQTTStreamSuite.scala
@andrewor14
Contributor

OK, I also put it in branch-1.2.
