[SPARK-16711] YarnShuffleService doesn't re-init properly on YARN rolling upgrade #14718

tgravescs · 2016-08-19T14:23:57Z

The Spark YARN Shuffle Service doesn't re-initialize the application credentials early enough, which causes any other Spark executors trying to fetch from that node during a rolling upgrade to fail with "java.lang.NullPointerException: Password cannot be null if SASL is enabled". Right now the Spark shuffle service relies on the YARN NodeManager to re-register the applications; unfortunately, this happens after we open the port for other executors to connect. If other executors connect before the re-register, they get a NullPointerException, which isn't a retryable exception, and they fail pretty quickly. To solve this I added another leveldb file so that the service can save and re-initialize all the applications before opening the port for other executors to connect to it. Adding another leveldb was simpler from the code structure point of view.
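A minimal sketch of the ordering described above, not the code in this patch: secrets are persisted when an application registers and reloaded on start, before the service accepts connections. The class `RecoverableSecretsSketch`, its method names, and the use of a `java.util.Properties` file in place of leveldb are all assumptions made for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the shuffle service; all names are illustrative.
class RecoverableSecretsSketch {
  // Stands in for the extra leveldb file (assumption: a plain properties file).
  private final Path recoveryFile;
  private final Map<String, String> secrets = new ConcurrentHashMap<>();
  private boolean acceptingConnections = false;

  RecoverableSecretsSketch(Path recoveryFile) {
    this.recoveryFile = recoveryFile;
  }

  // On start: reload saved secrets first, and only then accept connections.
  void start() throws IOException {
    if (Files.exists(recoveryFile)) {
      Properties saved = new Properties();
      try (InputStream in = Files.newInputStream(recoveryFile)) {
        saved.load(in);
      }
      for (String appId : saved.stringPropertyNames()) {
        secrets.put(appId, saved.getProperty(appId));
      }
    }
    // A real service would bind its server port at this point.
    acceptingConnections = true;
  }

  // When an application registers: remember its secret and persist it.
  void registerApp(String appId, String secret) throws IOException {
    secrets.put(appId, secret);
    persist();
  }

  // When an application finishes: forget its secret and persist the removal.
  void unregisterApp(String appId) throws IOException {
    secrets.remove(appId);
    persist();
  }

  // Lookup used during authentication; null means the app is unknown.
  String secretFor(String appId) {
    if (!acceptingConnections) {
      throw new IllegalStateException("service not started");
    }
    return secrets.get(appId);
  }

  private void persist() throws IOException {
    Properties out = new Properties();
    out.putAll(secrets);
    try (OutputStream os = Files.newOutputStream(recoveryFile)) {
      out.store(os, "app secrets (sketch only)");
    }
  }
}
```

Because `start()` finishes loading before it flips `acceptingConnections`, a restarted instance can answer `secretFor` for applications registered before the restart. A real service would also need to protect the stored secrets, which this sketch does not attempt.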

Most of the code changes are moving things to a common util class.

Patch was tested manually on a YARN cluster while a rolling upgrade was happening and a Spark job was running. Without the patch I consistently get the NullPointerException; with the patch the job gets a few Connection refused exceptions, but the retries kick in and then it succeeds.
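To illustrate why the Connection refused failures recover while the NullPointerException does not, here is a rough sketch of a fetch loop that retries only I/O failures. `FetchRetrySketch`, `shouldRetry`, and `fetchWithRetries` are hypothetical names, and the rule that only `IOException` counts as transient is an assumption of this sketch rather than a statement about Spark's fetcher.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical retry helper; not Spark's implementation.
class FetchRetrySketch {
  // Assumption: only I/O failures (directly or as the cause) are transient.
  static boolean shouldRetry(Throwable t) {
    return t instanceof IOException || t.getCause() instanceof IOException;
  }

  // Runs the fetch, retrying transient failures up to maxRetries times.
  static <T> T fetchWithRetries(Callable<T> fetch, int maxRetries) throws Exception {
    int retriesUsed = 0;
    while (true) {
      try {
        return fetch.call();
      } catch (Exception e) {
        // Non-transient errors and exhausted retries are rethrown as-is.
        if (!shouldRetry(e) || retriesUsed >= maxRetries) {
          throw e;
        }
        retriesUsed++;
      }
    }
  }
}
```

Under this rule a `java.net.ConnectException` (a subclass of `IOException`) is retried until the restarted service is listening again, whereas a `NullPointerException` is rethrown on the first attempt.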


SparkQA · 2016-08-19T16:18:50Z

Test build #64072 has finished for PR 14718 at commit f0a5c56.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- public class LevelDBProvider
- public static class StoreVersion
- public static class AppId

SparkQA · 2016-08-19T17:15:00Z

Test build #64074 has finished for PR 14718 at commit 6db1ad6.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.


tgravescs · 2016-08-19T18:14:40Z

need to update the test to handle the new leveldb

SparkQA · 2016-08-19T21:19:05Z

Test build #64094 has finished for PR 14718 at commit 2643d56.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

steveloughran · 2016-08-20T20:19:11Z

Moving the jackson/leveldb dependencies isn't going to create problems on the yarn shuffle CP, is it? Given the versions aren't changing, I'm not too worried; I just want to make sure.

tgravescs · 2016-08-22T14:04:17Z

No, it all gets included in one assembly jar used by the nodemanagers (/spark-${project.version}-yarn-shuffle.jar)

tgravescs · 2016-08-29T19:06:21Z

ping @vanzin

vanzin · 2016-08-30T00:06:43Z

common/network-common/src/main/java/org/apache/spark/network/util/LevelDBProvider.java

+import com.fasterxml.jackson.annotation.JsonCreator;
+import com.fasterxml.jackson.annotation.JsonProperty;
+import com.fasterxml.jackson.databind.ObjectMapper;
+import com.google.common.base.Charsets;


Use StandardCharsets instead.
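For illustration, a small sketch of what the suggestion amounts to: the JDK's `java.nio.charset.StandardCharsets` provides the same UTF-8 constant without the Guava import. The class and method names below are hypothetical and are not taken from the patch.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper showing the JDK constant in place of Guava's Charsets.
class CharsetSketch {
  // Encodes a key string as UTF-8 bytes, e.g. for use as a DB key.
  static byte[] toBytes(String s) {
    return s.getBytes(StandardCharsets.UTF_8);
  }

  // Decodes UTF-8 bytes back into a string.
  static String fromBytes(byte[] bytes) {
    return new String(bytes, StandardCharsets.UTF_8);
  }
}
```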

vanzin · 2016-08-30T00:20:10Z

I was gonna complain about moving the dependency, but it seems like leveldb already leaks to the user's classpath, so well, damage is already done.

After reading the patch it kinda feels like using leveldb for this is a little overkill though, since it seems you just need to keep a list of executors.

steveloughran · 2016-08-30T09:47:47Z

LevelDB is JNI so you can't shade it; there's been some careful review so that YARN NMs and Spark shuffle are in sync here. It's jackson versions which break things.

tgravescs · 2016-08-30T14:21:24Z

thanks for the review, I'll fix up based on the comments. I don't follow your question or how you think the list of executors is acceptable. This is doing authentication; you need the secret to properly authenticate. Either way you have to store it somewhere, and we already use leveldb, which performs well, so I'm not sure what the concern is.

vanzin · 2016-08-30T16:13:04Z

> how you think the list of executors is acceptable

Ignore me, brain was fried from reading too many patches. Yes you need to record the secret because this is the shuffle service and it manages multiple applications...

vanzin · 2016-08-30T21:47:59Z

common/network-common/src/main/java/org/apache/spark/network/util/LevelDBProvider.java

+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.File;


nit: java should come before others.

vanzin · 2016-08-30T21:56:46Z

Logic looks ok, I'd just avoid the unneeded work to load the DB when auth is not enabled.
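One way to read this suggestion, sketched with hypothetical names: pass the DB-opening step as a supplier and invoke it only when authentication is on. `SecretDbGuardSketch` and `openIfAuthEnabled` are assumptions for illustration, not identifiers from the patch.

```java
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical guard; the real service wires this differently.
class SecretDbGuardSketch {
  // Opens the recovery DB only when authentication is enabled.
  // When auth is off, the supplier is never invoked, so no DB work happens.
  static <D> Optional<D> openIfAuthEnabled(boolean authEnabled, Supplier<D> openDb) {
    return authEnabled ? Optional.of(openDb.get()) : Optional.empty();
  }
}
```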

SparkQA · 2016-08-30T23:04:24Z

Test build #64674 has finished for PR 14718 at commit 0e39687.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

vanzin · 2016-08-31T17:32:25Z

Looks good. It'd be nice to update YarnShuffleServiceSuite to cover this scenario (e.g. make sure the app's secret is available before initializeApplication is called).

SparkQA · 2016-08-31T17:32:31Z

Test build #64722 has finished for PR 14718 at commit c4f58e8.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

tgravescs · 2016-08-31T18:51:16Z

Jenkins, test this please

SparkQA · 2016-08-31T20:37:54Z

Test build #64729 has finished for PR 14718 at commit c4f58e8.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

tgravescs · 2016-08-31T20:46:36Z

we seem to be having random test failures.
I'll see if I can add a test for this.

tgravescs · 2016-09-01T19:21:47Z

I remember now why I hadn't added tests: I was a bit hesitant to expose the secretManager. I'll add a basic sanity test that checks whether the file is set, to make sure it was init'd in time.

tgravescs · 2016-09-01T19:21:52Z

Jenkins, test this please

SparkQA · 2016-09-01T21:10:58Z

Test build #64796 has finished for PR 14718 at commit c4f58e8.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-09-01T21:48:45Z

Test build #64797 has finished for PR 14718 at commit 5319981.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

vanzin · 2016-09-02T17:41:51Z

I thought you were going to merge it yourself, but since you didn't... merging to master / 2.0.

vanzin · 2016-09-02T17:42:41Z

No luck merging to 2.0.

tgravescs · 2016-09-06T13:43:08Z

thanks, I'll put up a separate pr for branch-2.0

…ling upgrade

The Spark YARN Shuffle Service doesn't re-initialize the application credentials early enough, which causes any other Spark executors trying to fetch from that node during a rolling upgrade to fail with "java.lang.NullPointerException: Password cannot be null if SASL is enabled". Right now the Spark shuffle service relies on the YARN NodeManager to re-register the applications; unfortunately, this happens after we open the port for other executors to connect. If other executors connect before the re-register, they get a NullPointerException, which isn't a retryable exception, and they fail pretty quickly. To solve this I added another leveldb file so that the service can save and re-initialize all the applications before opening the port for other executors to connect to it. Adding another leveldb was simpler from the code structure point of view.

Most of the code changes are moving things to a common util class.

Patch was tested manually on a YARN cluster while a rolling upgrade was happening and a Spark job was running. Without the patch I consistently get the NullPointerException; with the patch the job gets a few Connection refused exceptions, but the retries kick in and then it succeeds.

Author: Thomas Graves <[email protected]>

Closes apache#14718 from tgravescs/SPARK-16711.

Conflicts:
common/network-shuffle/pom.xml
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolver.java
common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java

zsxwing · 2017-07-10T20:01:06Z

common/network-common/pom.xml

@@ -45,6 +45,22 @@
      <artifactId>commons-lang3</artifactId>
    </dependency>

+    <dependency>
+      <groupId>org.fusesource.leveldbjni</groupId>


@tgravescs Why not move LevelDBProvider and this dependency to common/network-shuffle/? It's not used in common/network-common/.

tgravescs · 2017-07-11

What do you mean? Looking below, it's used in common/network-common/src/main/java/org/apache/spark/network/util/LevelDBProvider.java

tgravescs · 2017-07-11

Oh, maybe I misunderstood: do you mean why not move it all there? I think the only reason was that it was a utility class that could be used by others. Both network-shuffle and network-yarn use it now.

zsxwing · 2017-07-11

I think network-yarn depends on network-shuffle, so the utility class can be put into network-shuffle.

zsxwing · 2017-07-11

NVM. Just recalled that @vanzin may want to use leveldb in the history server.

vanzin · 2017-07-11

That's sort of orthogonal, though. My code pulls leveldb directly, not transitively through the network libs.

Thomas Graves added 2 commits August 18, 2016 18:59

[SPARK-16711] YarnShuffleService doesn't re-init properly on YARN rol…

f0a5c56

…ling upgrade

remove unused imports

6db1ad6

Merge branch 'master' of https://github.com/apache/spark into SPARK-1…

8b4ba9d

…6711

Close the leveldb on stop

2643d56

vanzin reviewed Aug 30, 2016

review comments

0e39687

vanzin reviewed Aug 30, 2016

move logic to createSecretManager and some review comments

c4f58e8

Add basic unit test

5319981

asfgit closed this in e79962f Sep 2, 2016

zsxwing reviewed Jul 10, 2017
