[SPARK-7003] Improve reliability of connection failure detection between Netty block transfer service endpoints #5584

Closed
wants to merge 2 commits into from

aarondav
Contributor

Currently we rely on the assumption that an exception will be raised and the channel closed if two endpoints cannot communicate over a Netty TCP channel. However, this guarantee does not hold in all network environments, and SPARK-6962 seems to point to a case where only the server side of the connection detected a fault.

This patch improves robustness of fetch/rpc requests by having an explicit timeout in the transport layer which closes the connection if there is a period of inactivity while there are outstanding requests.

NB: Excluding the testing-related code, this patch adds only around 50 lines.
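
The timeout rule described above can be illustrated with a small, self-contained sketch. This is not the code from the patch: the class name `IdleRequestTracker`, its methods, and the choice to pass timestamps in explicitly are all assumptions made for illustration, and the sketch treats any sent request or received response as activity. In a real Netty pipeline, a periodic idle event (for example, one produced by Netty's `IdleStateHandler`) could drive the `shouldClose` check; that wiring is omitted here.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical sketch, not the patch's implementation: decides whether a
 * connection should be closed because it has been inactive for longer than
 * a timeout while requests are still outstanding.
 */
final class IdleRequestTracker {
  /** Maximum tolerated inactivity, in nanoseconds. */
  private final long timeoutNs;
  /** Number of requests sent that have not yet received a response. */
  private final AtomicInteger outstanding = new AtomicInteger(0);
  /** Timestamp (in nanoseconds) of the last request sent or response received. */
  private final AtomicLong timeOfLastActivityNs;

  IdleRequestTracker(long timeoutMs, long nowNs) {
    this.timeoutNs = TimeUnit.MILLISECONDS.toNanos(timeoutMs);
    this.timeOfLastActivityNs = new AtomicLong(nowNs);
  }

  /** Records that a request was sent at the given time. */
  void onRequestSent(long nowNs) {
    outstanding.incrementAndGet();
    timeOfLastActivityNs.set(nowNs);
  }

  /** Records that a response arrived at the given time. */
  void onResponseReceived(long nowNs) {
    outstanding.decrementAndGet();
    timeOfLastActivityNs.set(nowNs);
  }

  /**
   * Returns true only if requests are outstanding AND the connection has been
   * inactive for longer than the timeout. An idle connection with nothing
   * outstanding is left alone.
   */
  boolean shouldClose(long nowNs) {
    return outstanding.get() > 0
        && nowNs - timeOfLastActivityNs.get() > timeoutNs;
  }

  public static void main(String[] args) {
    // Timestamps are passed in explicitly so the example is deterministic;
    // real code would likely use System.nanoTime().
    IdleRequestTracker tracker = new IdleRequestTracker(100, 0L);
    long ms = TimeUnit.MILLISECONDS.toNanos(1);

    System.out.println(tracker.shouldClose(200 * ms)); // idle, nothing outstanding
    tracker.onRequestSent(200 * ms);
    System.out.println(tracker.shouldClose(250 * ms)); // outstanding, within timeout
    System.out.println(tracker.shouldClose(350 * ms)); // outstanding, past timeout
  }
}
```

The key design point is the conjunction in `shouldClose`: inactivity alone is not treated as a failure, since an idle connection with no pending work is healthy. Only silence while a reply is expected triggers a close.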

@aarondav
Contributor Author

cc @rxin

@SparkQA

SparkQA commented Apr 20, 2015

Test build #30572 has finished for PR 5584 at commit aa5278b.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • public class MapConfigProvider extends ConfigProvider
  • This patch does not change any dependencies.

@@ -50,13 +50,17 @@

private final Map<Long, RpcResponseCallback> outstandingRpcs;

private AtomicLong timeOfLastRequestNs;

Can you document the semantics of this field and how it would be used?

And can it be final?

@aarondav
Contributor Author

All comments addressed.

@rxin
Contributor

rxin commented Apr 20, 2015

LGTM.

@SparkQA

SparkQA commented Apr 20, 2015

Test build #30573 has finished for PR 5584 at commit 37ce656.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • public class MapConfigProvider extends ConfigProvider
  • This patch does not change any dependencies.

@SparkQA

SparkQA commented Apr 20, 2015

Test build #30575 has finished for PR 5584 at commit 8699680.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • public class MapConfigProvider extends ConfigProvider
  • This patch does not change any dependencies.

@aarondav
Contributor Author

Merging into master. We may consider backporting to 1.3 if it turns out this fixes SPARK-6962, which is a pretty nasty bug for those who run into it.

@asfgit asfgit closed this in 968ad97 Apr 20, 2015
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015

Author: Aaron Davidson <[email protected]>

Closes apache#5584 from aarondav/timeout and squashes the following commits:

8699680 [Aaron Davidson] Address Reynold's comments
37ce656 [Aaron Davidson] [SPARK-7003] Improve reliability of connection failure detection between Netty block transfer service endpoints