[SPARK-2677][SPARK-2717]BasicBlockFetchIterator#next can wait forever #1619

Closed
wants to merge 1 commit into from


witgo
Contributor

@witgo witgo commented Jul 28, 2014

No description provided.

@SparkQA

SparkQA commented Jul 28, 2014

QA tests have started for PR 1619. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17280/consoleFull

@SparkQA

SparkQA commented Jul 28, 2014

QA results for PR 1619:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17280/consoleFull

@pwendell
Contributor

Can you create a test for this? I'm not sure what happens here if the timeout is encountered.

@sarutak
Member

sarutak commented Jul 29, 2014

@witgo @pwendell I had already noticed that ConnectionManager has no timeout configuration, but a timeout for ConnectionManager does not resolve this issue: the channel used to receive the ack is implemented with non-blocking I/O, and SO_TIMEOUT only affects reads after a connection has been established. So, if the remote executor hangs, it cannot establish connections with the fetching executors.

Additionally, BasicBlockFetcherIterator waits on LinkedBlockingQueue#take (result.take), so we should put a FetchResult object whose size is -1 into the result queue of BasicBlockFetcherIterator.
(A FetchResult whose size is -1 means the fetch failed.)
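
To illustrate the point about the blocked `take`, here is a minimal Java sketch. It is not the code from this PR or from Spark: `SentinelQueueSketch`, the nested `FetchResult` stand-in, the `next` helper, and the block id are all hypothetical names chosen for the example. The only assumption carried over from the discussion is that a result with size -1 signals a failed fetch.

```java
import java.util.concurrent.LinkedBlockingQueue;

class SentinelQueueSketch {
    // Hypothetical stand-in for FetchResult; only the fields needed here.
    static final class FetchResult {
        final String blockId;
        final long size; // -1 marks a failed fetch, as described above

        FetchResult(String blockId, long size) {
            this.blockId = blockId;
            this.size = size;
        }

        boolean failed() {
            return size == -1L;
        }
    }

    // Consumer side: blocks in take() until some result arrives.
    // If the producer never enqueues anything, this waits forever.
    static FetchResult next(LinkedBlockingQueue<FetchResult> results) {
        try {
            return results.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        LinkedBlockingQueue<FetchResult> results = new LinkedBlockingQueue<>();

        // Producer side: the fetch failed, so enqueue a failure sentinel
        // instead of staying silent and leaving the consumer blocked.
        Thread producer = new Thread(
            () -> results.add(new FetchResult("shuffle_0_0_0", -1L)));
        producer.start();

        FetchResult r = next(results);
        producer.join();
        System.out.println(r.blockId + " failed=" + r.failed());
    }
}
```

The design point is only that every failure path has to enqueue something; otherwise the thread parked in `take()` has no way to learn that the fetch will never complete.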

I think remote errors can be classified into the following 2 cases.

  1. The remote executor hangs
    In this case, we need a timeout for the fetch request (not a read timeout).
    I'm trying to resolve this case in [SPARK-2677] BasicBlockFetchIterator#next can wait forever #1632

  2. The remote executor does not hang, but an error occurred
    In this case, the remote executor should send a message indicating that an error occurred on the remote executor.
    I'm trying to resolve this case in [SPARK-2583] ConnectionManager cannot distinguish whether error occurred or not #1490
    This is ongoing.
    Can anyone review this too?
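
The two cases above can be sketched roughly as follows. This is a hedged illustration in Java, not the approach taken in #1632 or #1490: `FetchTimeoutSketch`, `PendingFetch`, `sendWithTimeout`, and `awaitResult` are hypothetical names, and the queue holds plain sizes instead of real FetchResult objects. The assumption is that each fetch request gets its own timer, and whichever event happens first (an ack, a remote error message, or the timeout) decides the result.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

class FetchTimeoutSketch {
    static final long FAILED = -1L; // size -1 means the fetch failed

    // Hypothetical record of one outstanding fetch request.
    static final class PendingFetch {
        private final AtomicBoolean done = new AtomicBoolean(false);
        private final LinkedBlockingQueue<Long> results;

        PendingFetch(LinkedBlockingQueue<Long> results) {
            this.results = results;
        }

        // Only the first completion (ack, remote error, or timeout) is enqueued.
        boolean complete(long size) {
            if (done.compareAndSet(false, true)) {
                results.add(size);
                return true;
            }
            return false;
        }
    }

    // Registers a request-level timeout: if nothing completes the request
    // within timeoutMs, a failure result is enqueued on its behalf.
    static PendingFetch sendWithTimeout(ScheduledExecutorService timer,
                                        LinkedBlockingQueue<Long> results,
                                        long timeoutMs) {
        PendingFetch pending = new PendingFetch(results);
        timer.schedule(() -> { pending.complete(FAILED); },
                       timeoutMs, TimeUnit.MILLISECONDS);
        return pending;
    }

    // Waits up to waitMs for a result; returns null if none arrived.
    static Long awaitResult(LinkedBlockingQueue<Long> results, long waitMs) {
        try {
            return results.poll(waitMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
    }

    public static void main(String[] args) {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        LinkedBlockingQueue<Long> results = new LinkedBlockingQueue<>();
        try {
            // Case 1: the remote side hangs, so only the timer completes the request.
            sendWithTimeout(timer, results, 100);
            System.out.println("case 1 size: " + awaitResult(results, 5_000));

            // Case 2: the remote side reports an error well before the timeout.
            PendingFetch pending = sendWithTimeout(timer, results, 10_000);
            pending.complete(FAILED);
            System.out.println("case 2 size: " + awaitResult(results, 5_000));
        } finally {
            timer.shutdownNow(); // also cancels the still-pending case 2 timer
        }
    }
}
```

The compare-and-set in `complete` is there so that a late ack or a late timeout cannot enqueue a second result for the same request.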

@SparkQA

SparkQA commented Jul 29, 2014

QA tests have started for PR 1619. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17350/consoleFull

@witgo
Contributor Author

witgo commented Jul 29, 2014

@sarutak I think adding a heartbeat detection mechanism is a good solution.

@SparkQA

SparkQA commented Jul 29, 2014

QA results for PR 1619:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17350/consoleFull

@witgo
Contributor Author

witgo commented Jul 29, 2014

@sarutak ConnectionManager.scala#L259 deals with the situation where a connection cannot be established.

@SparkQA

SparkQA commented Jul 29, 2014

QA tests have started for PR 1619. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17356/consoleFull

@SparkQA

SparkQA commented Jul 29, 2014

QA results for PR 1619:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17356/consoleFull

@witgo witgo changed the title from "[WIP][SPARK-2677]BasicBlockFetchIterator#next can wait forever" to "[SPARK-2677][SPARK-2717]BasicBlockFetchIterator#next can wait forever" on Jul 30, 2014
@SparkQA

SparkQA commented Jul 30, 2014

QA tests have started for PR 1619. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17449/consoleFull

@SparkQA

SparkQA commented Jul 30, 2014

QA tests have started for PR 1619. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17450/consoleFull

@SparkQA

SparkQA commented Jul 30, 2014

QA tests have started for PR 1619. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17451/consoleFull

@SparkQA

SparkQA commented Jul 30, 2014

QA results for PR 1619:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17449/consoleFull

@SparkQA

SparkQA commented Jul 30, 2014

QA results for PR 1619:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17450/consoleFull

@SparkQA

SparkQA commented Jul 30, 2014

QA results for PR 1619:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17451/consoleFull

@SparkQA

SparkQA commented Jul 31, 2014

QA tests have started for PR 1619. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17580/consoleFull

@SparkQA

SparkQA commented Jul 31, 2014

QA results for PR 1619:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17580/consoleFull

@SparkQA

SparkQA commented Jul 31, 2014

QA tests have started for PR 1619. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17583/consoleFull

@SparkQA

SparkQA commented Jul 31, 2014

QA results for PR 1619:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17583/consoleFull

@SparkQA

SparkQA commented Aug 7, 2014

QA tests have started for PR 1619. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18097/consoleFull

@SparkQA

SparkQA commented Aug 7, 2014

QA results for PR 1619:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18097/consoleFull

@witgo witgo closed this Aug 17, 2014
@witgo witgo deleted the SPARK-2677 branch August 17, 2014 00:52