Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-16711] YarnShuffleService doesn't re-init properly on YARN rolling upgrade #14718

Closed
wants to merge 7 commits into from

Conversation

tgravescs
Copy link
Contributor

The Spark Yarn Shuffle Service doesn't re-initialize the application credentials early enough which causes any other spark executors trying to fetch from that node during a rolling upgrade to fail with "java.lang.NullPointerException: Password cannot be null if SASL is enabled". Right now the spark shuffle service relies on the Yarn nodemanager to re-register the applications, unfortunately this is after we open the port for other executors to connect. If other executors connected before the re-register they get a null pointer exception which isn't a re-tryable exception and cause them to fail pretty quickly. To solve this I added another leveldb file so that it can save and re-initialize all the applications before opening the port for other executors to connect to it. Adding another leveldb was simpler from the code structure point of view.

Most of the code changes are moving things to common util class.

Patch was tested manually on a Yarn cluster with rolling upgrade was happing while spark job was running. Without the patch I consistently get the NullPointerException, with the patch the job gets a few Connection refused exceptions but the retries kick in and the it succeeds.

@SparkQA
Copy link

SparkQA commented Aug 19, 2016

Test build #64072 has finished for PR 14718 at commit f0a5c56.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • public class LevelDBProvider
    • public static class StoreVersion
    • public static class AppId

@SparkQA
Copy link

SparkQA commented Aug 19, 2016

Test build #64074 has finished for PR 14718 at commit 6db1ad6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs
Copy link
Contributor Author

need to update the test to handle the new levedb

@SparkQA
Copy link

SparkQA commented Aug 19, 2016

Test build #64094 has finished for PR 14718 at commit 2643d56.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@steveloughran
Copy link
Contributor

Moving the jackson/leveldb dependencies isn't going to create problems on the yarn shuffle CP are they? Given the versions aren't changing, I'm not too worried —I just want to make sure

@tgravescs
Copy link
Contributor Author

No, it all gets including into one assembly jar used by the nodemanagers (/spark-${project.version}-yarn-shuffle.jar)

@tgravescs
Copy link
Contributor Author

ping @vanzin

import com.fasterxml.jackson.annotation.JsonCreator;
import com.fasterxml.jackson.annotation.JsonProperty;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.google.common.base.Charsets;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use StandardCharsets instead.

@vanzin
Copy link
Contributor

vanzin commented Aug 30, 2016

I was gonna complain about moving the dependency, but it seems like leveldb already leaks to the user's classpath, so well, damage is already done.

After reading the patch it kinda feels like using leveldb for this is a little overkill though, since it seems you just need to keep a list of executors.

@steveloughran
Copy link
Contributor

LevelDB is JNI so you can't shade it; there's been some careful review so that YARN NMs and Spark shuffle are in sync here. It's jackson versions which break things.

@tgravescs
Copy link
Contributor Author

thanks for the review, I'll fix up based on the comments. I don't follow your question or how you think the list of executors is acceptable. This is doing authentication, you need the secret to properly authenticate. Either way you have to store it somewhere and we already use leveldb which performs well so I'm not sure the concern.

@vanzin
Copy link
Contributor

vanzin commented Aug 30, 2016

how you think the list of executors is acceptable

Ignore me, brain was fried from reading too many patches. Yes you need to record the secret because this is the shuffle service and it manages multiple applications...

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.File;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: java should come before others.

@vanzin
Copy link
Contributor

vanzin commented Aug 30, 2016

Logic looks ok, I'd just avoid the unneeded work to load the DB when auth is not enabled.

@SparkQA
Copy link

SparkQA commented Aug 30, 2016

Test build #64674 has finished for PR 14718 at commit 0e39687.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Copy link
Contributor

vanzin commented Aug 31, 2016

Looks good. It'd be nice to update YarnShuffleServiceSuite to cover this scenario (e.g. make sure the app's secret is available before initializeApplication is called).

@SparkQA
Copy link

SparkQA commented Aug 31, 2016

Test build #64722 has finished for PR 14718 at commit c4f58e8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs
Copy link
Contributor Author

Jenkins, test this please

@SparkQA
Copy link

SparkQA commented Aug 31, 2016

Test build #64729 has finished for PR 14718 at commit c4f58e8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs
Copy link
Contributor Author

we seem to be having random test failures.
I'll see if I can add a test for this.

@tgravescs
Copy link
Contributor Author

I remember now why I hadn't added tests, I was a bit hesitant to expose the secretManager. I'll add a basic sanity test to make the file is set or not to make sure it was init'd in time.

@tgravescs
Copy link
Contributor Author

Jenkins, test this please

@SparkQA
Copy link

SparkQA commented Sep 1, 2016

Test build #64796 has finished for PR 14718 at commit c4f58e8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Sep 1, 2016

Test build #64797 has finished for PR 14718 at commit 5319981.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Copy link
Contributor

vanzin commented Sep 2, 2016

I thought you were going to merge it yourself, but since you didn't... merging to master / 2.0.

@vanzin
Copy link
Contributor

vanzin commented Sep 2, 2016

No luck merging to 2.0.

@asfgit asfgit closed this in e79962f Sep 2, 2016
@tgravescs
Copy link
Contributor Author

thanks, I'll put up a separate pr for branch-2.0

tgravescs pushed a commit to tgravescs/spark that referenced this pull request Sep 6, 2016
…ling upgrade

The Spark Yarn Shuffle Service doesn't re-initialize the application credentials early enough which causes any other spark executors trying to fetch from that node during a rolling upgrade to fail with "java.lang.NullPointerException: Password cannot be null if SASL is enabled".  Right now the spark shuffle service relies on the Yarn nodemanager to re-register the applications, unfortunately this is after we open the port for other executors to connect. If other executors connected before the re-register they get a null pointer exception which isn't a re-tryable exception and cause them to fail pretty quickly. To solve this I added another leveldb file so that it can save and re-initialize all the applications before opening the port for other executors to connect to it.  Adding another leveldb was simpler from the code structure point of view.

Most of the code changes are moving things to common util class.

Patch was tested manually on a Yarn cluster with rolling upgrade was happing while spark job was running. Without the patch I consistently get the NullPointerException, with the patch the job gets a few Connection refused exceptions but the retries kick in and the it succeeds.

Author: Thomas Graves <[email protected]>

Closes apache#14718 from tgravescs/SPARK-16711.

Conflicts:
	common/network-shuffle/pom.xml
	common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolver.java
	common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java
@@ -45,6 +45,22 @@
<artifactId>commons-lang3</artifactId>
</dependency>

<dependency>
<groupId>org.fusesource.leveldbjni</groupId>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tgravescs Why not move LevelDBProvider and this dependency to common/network-shuffle/? It's not used in common/network-common/.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean? looking below its used in common/network-common/src/main/java/org/apache/spark/network/util/LevelDBProvider.java

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh maybe I misunderstood, why not you mean move it all there? I think the only reason was it was a utility class that could be used by others. Both network-shuffle and network-yarn use it now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think network-yarn depends on network-shuffle, so the utility class can be put into network-shuffle.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NVM. Just recall that @vanzin may want to use leveldb in the history server.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's sort of orthogonal, though. My code pulls leveldb directly, not transitively through the network libs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants