This repository has been archived by the owner on Sep 26, 2019. It is now read-only.

[PAN-2305] Detect stalled world state downloads #875

Merged: 5 commits merged on Feb 18, 2019


@mbaxter (Contributor) commented Feb 15, 2019

PR description

This PR updates WorldStateDownloader to track failure counts for each node requested from the network. If a node is retried more than a configurable number of times, the download fails.
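A minimal sketch of the per-node failure tracking described above, assuming node requests are identified by a hash string and held in a simple FIFO queue. The class, method, and parameter names (`StalledDownloadSketch`, `maxNodeRequestRetries`, `onRequestFailed`) are hypothetical and are not taken from the Pantheon `WorldStateDownloader`. A failed node is counted and re-queued at the back; once its failure count exceeds the limit, the download is treated as stalled.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only: class, method, and parameter names are not taken from Pantheon.
public class StalledDownloadSketch {

  /** Signals that a node appears to be unavailable on the network. */
  static class StalledDownloadException extends RuntimeException {
    StalledDownloadException(String message) {
      super(message);
    }
  }

  private final int maxNodeRequestRetries; // the configurable limit (assumed name)
  private final Deque<String> pending = new ArrayDeque<>(); // node hashes awaiting a request
  private final Map<String, Integer> failureCounts = new HashMap<>();

  StalledDownloadSketch(int maxNodeRequestRetries) {
    this.maxNodeRequestRetries = maxNodeRequestRetries;
  }

  void enqueue(String nodeHash) {
    pending.addLast(nodeHash);
  }

  /** Returns the next node to request, or null if nothing is pending. */
  String next() {
    return pending.pollFirst();
  }

  /** Counts the failure, then either re-queues the node at the back or fails the download. */
  void onRequestFailed(String nodeHash) {
    int failures = failureCounts.merge(nodeHash, 1, Integer::sum);
    if (failures > maxNodeRequestRetries) {
      throw new StalledDownloadException(
          "Node " + nodeHash + " failed " + failures + " times; world state may be unavailable");
    }
    pending.addLast(nodeHash); // back to the "end of the line"
  }

  /** A node that was eventually retrieved no longer needs a failure count. */
  void onRequestSucceeded(String nodeHash) {
    failureCounts.remove(nodeHash);
  }
}
```

With `maxNodeRequestRetries = 2`, a node is requested once and retried twice; the third failure throws.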

There is one potential issue with this approach. When a node request fails, it is added back to the queue and so moves to the "end of the line". So we could end up doing a lot of processing before we retry the node enough times to finally decide that the current world state isn't available. If this ends up being problematic, one fix might be to track failed nodes in a separate queue so we can process them more quickly.
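The "separate queue" fix suggested above could look roughly like the following sketch. It assumes the same hash-string node identifiers, and all names are invented for illustration. Failed nodes go into their own queue, which is drained before any fresh work, so a node that is genuinely missing reaches the retry limit after a handful of requests instead of after a full pass over the main queue.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of the "separate queue for failed nodes" idea; names are invented.
public class RetryQueueSketch {
  private final Deque<String> freshNodes = new ArrayDeque<>(); // never-requested node hashes
  private final Deque<String> failedNodes = new ArrayDeque<>(); // nodes awaiting a retry
  private final Map<String, Integer> failureCounts = new HashMap<>();
  private final int maxNodeRequestRetries;

  RetryQueueSketch(int maxNodeRequestRetries) {
    this.maxNodeRequestRetries = maxNodeRequestRetries;
  }

  void enqueue(String nodeHash) {
    freshNodes.addLast(nodeHash);
  }

  /** Serves retries before fresh work so a missing node reaches the limit quickly. */
  String next() {
    String retry = failedNodes.pollFirst();
    return retry != null ? retry : freshNodes.pollFirst();
  }

  /** Returns false once the node has exceeded the retry limit, i.e. the download should fail. */
  boolean onRequestFailed(String nodeHash) {
    int failures = failureCounts.merge(nodeHash, 1, Integer::sum);
    if (failures > maxNodeRequestRetries) {
      return false;
    }
    failedNodes.addLast(nodeHash);
    return true;
  }
}
```

Here a failed node "a" is served again ahead of fresh nodes "b" and "c", so the stall is detected before any further fresh requests are made.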

@mbaxter requested a review from @ajsutton on February 15, 2019 21:43
@ajsutton (Contributor) left a comment


LGTM. The simpler option I'd trialled was to count the number of requests since we last received a node. It wasn't bad, but tuning the limit was challenging. I think this is probably a better approach since it looks specifically for a node that's no longer available, which is the more accurate signal.
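For comparison, the simpler heuristic described in this review can be sketched as a single global counter. This is a hypothetical illustration with invented names, not the code that was trialled. Because the counter only tracks whether any node arrived recently, it cannot tell which node is missing, whereas the per-node failure counts in this PR point directly at a node that is no longer available.

```java
// Hypothetical illustration of the simpler heuristic described in the review; names are invented.
public class ProgressWatchdogSketch {
  private final int maxRequestsWithoutProgress; // the global limit that must be tuned
  private int requestsSinceLastNode = 0;

  ProgressWatchdogSketch(int maxRequestsWithoutProgress) {
    this.maxRequestsWithoutProgress = maxRequestsWithoutProgress;
  }

  void onRequestSent() {
    requestsSinceLastNode++;
  }

  /** Any node arriving counts as progress and resets the counter. */
  void onNodeReceived() {
    requestsSinceLastNode = 0;
  }

  /** True once more requests than the limit have gone out with no node received. */
  boolean isStalled() {
    return requestsSinceLastNode > maxRequestsWithoutProgress;
  }
}
```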

@mbaxter merged commit de6a382 into PegaSysEng:master on Feb 18, 2019
rain-on pushed a commit to rain-on/pantheon that referenced this pull request Feb 18, 2019