[NC-1273] Start of fast sync downloader #613
Conversation
… full sync once that completes. Currently the FastSyncDownloader immediately fails with FAST_SYNC_UNSUPPORTED.
…n of fast sync actions.
…ad of having to check the return value is success constantly.
… it can return a single header, handle retrying and provide a place to do additional validation.
@@ -80,6 +81,7 @@ public void fullSyncFromGenesis() throws Exception {
}

@Test
@Ignore("Fast sync implementation in progress.")
Note that NC-2178 has been filed to re-enable this once the configuration options are fully exposed (currently it fails because it's waiting for 5 peers and the test times out before fast sync does).
…ock) agree on the block.
syncConfig, protocolSchedule, protocolContext, ethContext, syncState, ethTasksTimer);

ChainHeadTracker.trackChainHeadForPeers(
    ethContext, protocolSchedule, protocolContext.getBlockchain(), syncConfig, ethTasksTimer);
if (syncConfig.syncMode().equals(SyncMode.FAST)) {
if (syncConfig.syncMode() == SyncMode.FAST) {
Curious - why prefer == here?
Purely ease of reading, and because I was confused by the .equals for a bit, thinking it wasn't actually an enum.
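As a side note, a small standalone sketch (not code from the PR; the nested enum is a stand-in) showing that the two comparisons agree for enum constants, and that == also tolerates a null value:

public class EnumComparisonExample {
  // Stand-in enum for this sketch only.
  enum SyncMode { FULL, FAST }

  public static void main(final String[] args) {
    final SyncMode mode = SyncMode.FAST;

    // Enum constants are singletons, so reference equality and equals() agree.
    System.out.println(mode == SyncMode.FAST);      // true
    System.out.println(mode.equals(SyncMode.FAST)); // true

    // With a null value, == is simply false, whereas calling .equals would throw.
    final SyncMode missing = null;
    System.out.println(missing == SyncMode.FAST);   // false, no NullPointerException
  }
}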
if (error != null) {
  LOG.error("Fast sync failed. Switching to full sync.", error);
}
if (!result.isPresent()) {
This looks like a new pattern - we usually return any errors via the throwable. I think this pattern is a little counter-intuitive because the future can appear to complete "successfully" (non-exceptionally) when it actually failed.
True, it was more meaningful in earlier drafts and got whittled away to this, which isn't particularly useful at all. Switched to exceptions.
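For illustration, a hypothetical sketch of that pattern (not the PR's actual code; FastSyncException and selectPivotBlock here are stand-ins): failing the future exceptionally keeps the success path unambiguous.

import java.util.concurrent.CompletableFuture;

public class ExceptionalCompletionExample {

  static class FastSyncException extends RuntimeException {
    FastSyncException(final String message) {
      super(message);
    }
  }

  static CompletableFuture<Long> selectPivotBlock(final boolean peerAvailable) {
    if (!peerAvailable) {
      // Complete exceptionally rather than "successfully" with an empty result,
      // so callers can't mistake a failure for success.
      final CompletableFuture<Long> failed = new CompletableFuture<>();
      failed.completeExceptionally(new FastSyncException("No peer available to select a pivot from"));
      return failed;
    }
    return CompletableFuture.completedFuture(1_000_000L);
  }

  public static void main(final String[] args) {
    selectPivotBlock(false)
        .exceptionally(
            error -> {
              System.out.println("Fast sync failed: " + error.getMessage());
              return null; // this is where a fallback to full sync would go
            });
  }
}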
LOG.warn(
    "Maximum wait time for fast sync reached but no peers available. Continuing to wait for any available peer.");
final WaitForPeerTask waitForPeerTask = WaitForPeerTask.create(ethContext, ethTasksTimer);
return ethContext.getScheduler().scheduleSyncWorkerTask(waitForPeerTask::run);
Why are you scheduling this in the worker pool? Also, I don't think we should be running any tasks that can't timeout. If we can't find any peers for some reason, we could just hang here waiting forever.
If we can't find any peers then there is literally nothing we can do but wait. If we added a timeout, when it's reached we'd simply have to go back to waiting for a peer again, so why not just keep waiting? We can't even fall back to full sync because it needs a peer to sync from as well.
It went to the worker pool for consistency but it doesn't actually need to. Changed to call directly.
Mostly I'm just thinking about how debugging would go in this case. It might look like the synchronizer has hung if it just stops printing any logs. If we did add a timeout, I would just go back and start a new WaitForPeerTask when the timeout triggered, but at least it's more obvious that it's still trying to sync. Not a huge deal though.
Yeah that's a good point. I've changed it to repeatedly wait for peers so it prints the "Waiting for 1 peers" message repeatedly while it's waiting.
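Roughly what that retry-and-re-log loop looks like, as a self-contained toy (WaitForPeerTask is stubbed out here; the log text echoes the one quoted above, and the retry structure is a sketch rather than the PR's implementation):

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;

public class RetryUntilPeerExample {
  private static final AtomicInteger attempts = new AtomicInteger();

  // Stand-in for WaitForPeerTask: times out twice, then "finds" a peer.
  static CompletableFuture<Void> runWaitForPeerTask() {
    final CompletableFuture<Void> result = new CompletableFuture<>();
    if (attempts.incrementAndGet() < 3) {
      result.completeExceptionally(new TimeoutException("No peer connected"));
    } else {
      result.complete(null);
    }
    return result;
  }

  static CompletableFuture<Void> waitForAnyPeer() {
    System.out.println("Waiting for 1 peers.");
    return runWaitForPeerTask()
        .handle((result, error) -> error)
        .thenCompose(
            error -> {
              if (error instanceof TimeoutException) {
                // Retry, which re-logs the waiting message so it doesn't look hung.
                return waitForAnyPeer();
              }
              return CompletableFuture.<Void>completedFuture(null);
            });
  }

  public static void main(final String[] args) {
    waitForAnyPeer().join();
    System.out.println("Peer available, continuing fast sync.");
  }
}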
.handle(
    (waitResult, error) -> {
      if (ExceptionUtils.rootCause(error) instanceof TimeoutException) {
        if (ethContext.getEthPeers().bestPeer().isPresent()) {
Are you just checking that we have any peers here? If so, getEthPeers().availablePeerCount() > 0 might be clearer.
Done.
import com.google.common.base.MoreObjects;

public class FastSyncState {
There's another FastSyncState in tech.pegasys.pantheon.ethereum.eth.sync.state - these 2 classes should be merged.
Removed the old one which wasn't used anywhere.
return fastSyncActions
    .waitForSuitablePeers()
    .thenApply(state -> fastSyncActions.selectPivotBlock())
    .thenCompose(fastSyncActions::downloadPivotBlockHeader)
I'm thinking about how the full sync downloader works and how that maps over to this downloader. I think for the most part the flow should be the same. In the full sync downloader, the first thing we do is find the "sync target". We use this sync target to make sure that we're downloading a coherent chain (downloading the chain based off of hashes from our best peer rather than block numbers). I think we probably want to download the pivot block based on a header that we get from our target peer. We can still "confirm" that block header, but the confirmation would be based on the header / hash rather than block number.
The risk is that we'll pick a header that's on a non-canonical chain. If we request it by hash, other clients will likely have it and return it "confirming" it, but what we actually want is to check that they consider the canonical block at that number to be the same.
Bear in mind that when you fast-sync to a block, you can't apply any re-org that goes back before the fast-sync point. This was a problem in the Ropsten fork where geth clients wound up fast-syncing off Parity nodes and so got stuck on the incorrect chain with no ability to re-org back even when the correct chain eventually had a greater total difficulty.
That's a good point! Definitely makes sense to query for the pivot based on block number. I guess my main point is just that whatever sync target we pick needs to agree that the pivot we find is on the main chain. I was looking at this from the perspective of sync target, but just as valid to choose the pivot first and make sure our sync target is consistent with that.
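A tiny illustration of that last point (hypothetical types only, not Pantheon's peer API): whichever peer ends up as the sync target should report the same canonical header at the pivot height as the pivot we confirmed by block number.

import java.util.Objects;
import java.util.Optional;

public class SyncTargetPivotCheck {

  // Hypothetical minimal view of a peer, just for this sketch.
  interface PeerView {
    Optional<String> canonicalHeaderHashAt(long blockNumber);
  }

  // True only if the candidate sync target agrees that our confirmed pivot
  // header is its canonical block at that height.
  static boolean targetAgreesOnPivot(
      final PeerView candidateSyncTarget, final long pivotBlockNumber, final String pivotHash) {
    return candidateSyncTarget
        .canonicalHeaderHashAt(pivotBlockNumber)
        .map(hash -> Objects.equals(hash, pivotHash))
        .orElse(false);
  }
}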
.filter(peer -> peer.chainState().getEstimatedHeight() >= pivotBlockNumber)
.collect(Collectors.toList());

final int confirmationsRequired = peersToQuery.size() / 2 + 1;
I think we might want to have a cap on how many peers we want to query for this. For example, if we have 50 peers, do we really need 25 confirmations?
I think we want to be as certain as possible that it's the right node before fast syncing - it's a big point of weakness because we're trusting that the pivot block we select is correct; we can't perform any real checks that it actually is right. I was actually wondering if we should abort if any of our peers returned a header that didn't match, and maybe have a minimum number of peers that confirm the block (currently we might have 10 peers connected but 9 of them haven't reached the pivot block yet).
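One way to reconcile the two concerns (illustrative numbers and names only; nothing here is fixed by the PR): cap how many peers are queried, require a minimum absolute number of confirmations, and abort on any disagreement.

import java.util.List;

public class PivotConfirmationPolicy {

  // Illustrative values only.
  private static final int MAX_PEERS_TO_QUERY = 5;
  private static final int MIN_CONFIRMATIONS = 3;

  // Query at most MAX_PEERS_TO_QUERY peers, treat any mismatch as fatal, and
  // require at least MIN_CONFIRMATIONS matching answers before trusting the pivot.
  static boolean pivotConfirmed(final List<String> headerHashesFromPeers, final String pivotHash) {
    final List<String> sampled =
        headerHashesFromPeers.subList(0, Math.min(MAX_PEERS_TO_QUERY, headerHashesFromPeers.size()));
    if (sampled.stream().anyMatch(hash -> !hash.equals(pivotHash))) {
      return false; // a disagreeing peer means the pivot can't be trusted
    }
    return sampled.size() >= MIN_CONFIRMATIONS;
  }
}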
}

@Override
protected boolean isRetryableError(final Throwable error) {
I think we should probably fold the implementation of this method into AbstractRetryingPeerTask, along with the assignPeer() functionality in CompleteBlocksTask and that implementation of isRetryableError. The difference in that implementation is that if there is an assigned peer, and that peer is missing or errors out, there's no reason to retry.
Done.
this.pivotBlockNumber = pivotBlockNumber;
}

public static GetPivotBlockHeaderFromPeerTask forPivotBlock(
I would name this task more generically - something like RetryingGetHeadersFromPeerByNumberTask since the functionality isn't actually tied to the pivot. Better yet, you could probably make a completely generic retrying wrapper that takes a supplier for the retryable task.
I think a generic retrying wrapper would be nice but I've had a couple of goes and keep hitting issues making it work. It's starting to introduce delegation vs the current inheritance approach, which I generally like, but it keeps leading to more knock-on effects. I'm not keen to take on that big a change as part of this work and I suspect the better approach is to just move to something like RxJava, which appears to give us a lot of benefit. For now I have renamed this to RetryingGetHeaderFromPeerByNumberTask.
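For what it's worth, a bare-bones version of the supplier-based wrapper might look like the following. This is a sketch only; it ignores scheduling, back-off and the peer accounting the real tasks need, which is where the knock-on effects mentioned above come from.

import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

public class RetryingTaskWrapper {

  // Runs the supplied asynchronous task, retrying up to maxRetries times when
  // it completes exceptionally.
  static <T> CompletableFuture<T> withRetries(
      final Supplier<CompletableFuture<T>> taskSupplier, final int maxRetries) {
    return taskSupplier
        .get()
        .handle(
            (result, error) -> {
              if (error == null) {
                return CompletableFuture.completedFuture(result);
              }
              if (maxRetries > 0) {
                return withRetries(taskSupplier, maxRetries - 1);
              }
              return RetryingTaskWrapper.<T>failedFuture(error);
            })
        .thenCompose(future -> future);
  }

  private static <T> CompletableFuture<T> failedFuture(final Throwable error) {
    final CompletableFuture<T> failed = new CompletableFuture<>();
    failed.completeExceptionally(error);
    return failed;
  }
}

Note that the recursive retry runs on whichever thread completes the failed attempt, so a real version would also need to hand retries back to the scheduler.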
"Fast sync timed out before minimum peer count was reached. Continuing with reduced peers."); | ||
result.complete(null); | ||
} else { | ||
waitForAnyPeer() |
I think I would just have the action fail at this point and get handled at a higher level.
The problem with that is you wind up having to duplicate most of this handle lambda at the higher level, so you'd basically have to have this in FastSyncDownloader:
private CompletableFuture<Void> ensurePeersAvailable() {
final CompletableFuture<Void> result = new CompletableFuture<>();
fastSyncActions
.waitForSuitablePeers()
.handle(
(waitResult, error) -> {
if (ExceptionUtils.rootCause(error) instanceof TimeoutException) {
fastSyncActions
.waitForAnyPeer()
.thenAccept(result::complete)
.exceptionally(
taskError -> {
result.completeExceptionally(error);
return null;
});
} else if (error != null) {
result.completeExceptionally(error);
} else {
result.complete(null);
}
return null;
});
return result;
}
I was thinking we'd just cut the waitForAnyPeer() part - seems a little iffy to continue with fast sync if we don't have enough peers. Then the calling code could decide whether to run another round of waiting, or to abort fast sync. Anyway - just thinking out loud.
Yeah, it's a bit of a toss-up. The penalty for not waiting is a month-long full sync, so it's unlikely the user would want that, but maybe they'd want to be strict about waiting for a minimum number of peers.
We probably should also allow the user to specifically set a pivot block by hash and wait indefinitely until that block is available. That would be the most secure option since then users can guarantee they sync onto a chain they actually trust.
…n optional error return. Remove unused FastSyncState.
… it can be reused.
…eer to connect to avoid the appearance of hanging forever.
PR description
Introduces a FastSyncDownloader and the first few steps of the fast sync process. Specifically:
- SynchronizerConfiguration
- SynchronizerConfiguration
Together this gets us to the point where the world state download could be started if we wanted to see how it integrates and be able to test it in the real client.
The --sync-mode flag has been re-enabled but set to hidden so that we can test fast sync progress without it being fully exposed as a supported and working option yet. If fast sync fails or reaches the end of what we've implemented, it falls back to full sync.