[NC-1273] Start of fast sync downloader #613
Conversation
… full sync once that completes. Currently the FastSyncDownloader immediately fails with FAST_SYNC_UNSUPPORTED.
…n of fast sync actions.
…ad of having to check the return value is success constantly.
… it can return a single header, handle retrying and provide a place to do additional validation.
@@ -80,6 +81,7 @@ public void fullSyncFromGenesis() throws Exception {
}

@Test
@Ignore("Fast sync implementation in progress.")
Note that NC-2178 has been filed to re-enable this once the configuration options are fully exposed (currently it fails because it's waiting for 5 peers and the test times out before fast sync does).
…ock) agree on the block.
syncConfig, protocolSchedule, protocolContext, ethContext, syncState, ethTasksTimer);

ChainHeadTracker.trackChainHeadForPeers(
    ethContext, protocolSchedule, protocolContext.getBlockchain(), syncConfig, ethTasksTimer);
if (syncConfig.syncMode().equals(SyncMode.FAST)) {
if (syncConfig.syncMode() == SyncMode.FAST) {
Curious - why prefer == here?
Purely ease of reading, and because I was confused by the .equals for a bit, thinking it wasn't actually an enum.
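As a side note, a small standalone sketch (not code from the PR; the nested enum is a stand-in) showing that the two comparisons agree for enum constants, and that == also tolerates a null value:

public class EnumComparisonExample {
  // Stand-in enum for this sketch only.
  enum SyncMode { FULL, FAST }

  public static void main(final String[] args) {
    final SyncMode mode = SyncMode.FAST;

    // Enum constants are singletons, so reference equality and equals() agree.
    System.out.println(mode == SyncMode.FAST);      // true
    System.out.println(mode.equals(SyncMode.FAST)); // true

    // With a null value, == is simply false, whereas calling .equals would throw.
    final SyncMode missing = null;
    System.out.println(missing == SyncMode.FAST);   // false, no NullPointerException
  }
}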
if (error != null) {
  LOG.error("Fast sync failed. Switching to full sync.", error);
}
if (!result.isPresent()) {
This looks like a new pattern - we usually return any errors via the throwable. I think this pattern is a little counter-intuitive because the future can appear to complete "successfully" (non-exceptionally) when it actually failed.
True, it was more meaningful in earlier drafts and got whittled away to this, which isn't particularly useful at all. Switched to exceptions.
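For illustration, a hypothetical sketch of that pattern (not the PR's actual code; FastSyncException and selectPivotBlock here are stand-ins): failing the future exceptionally keeps the success path unambiguous.

import java.util.concurrent.CompletableFuture;

public class ExceptionalCompletionExample {

  static class FastSyncException extends RuntimeException {
    FastSyncException(final String message) {
      super(message);
    }
  }

  static CompletableFuture<Long> selectPivotBlock(final boolean peerAvailable) {
    if (!peerAvailable) {
      // Complete exceptionally rather than "successfully" with an empty result,
      // so callers can't mistake a failure for success.
      final CompletableFuture<Long> failed = new CompletableFuture<>();
      failed.completeExceptionally(new FastSyncException("No peer available to select a pivot from"));
      return failed;
    }
    return CompletableFuture.completedFuture(1_000_000L);
  }

  public static void main(final String[] args) {
    selectPivotBlock(false)
        .exceptionally(
            error -> {
              System.out.println("Fast sync failed: " + error.getMessage());
              return null; // this is where a fallback to full sync would go
            });
  }
}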
LOG.warn(
    "Maximum wait time for fast sync reached but no peers available. Continuing to wait for any available peer.");
final WaitForPeerTask waitForPeerTask = WaitForPeerTask.create(ethContext, ethTasksTimer);
return ethContext.getScheduler().scheduleSyncWorkerTask(waitForPeerTask::run);
Why are you scheduling this in the worker pool? Also, I don't think we should be running any tasks that can't timeout. If we can't find any peers for some reason, we could just hang here waiting forever.
If we can't find any peers then there is literally nothing we can do but wait. If we added a timeout, when it's reached we'd simply have to go back to waiting for a peer again, so why not just keep waiting? We can't even fall back to full sync because it needs a peer to sync from as well.
It went to the worker pool for consistency but it doesn't actually need to. Changed to call directly.
Mostly I'm just thinking about how debugging would go in this case. It might look like the synchronizer has hung if it just stops printing any logs. If we did add a timeout, I would just go back and start a new WaitForPeerTask when the timeout triggered, but at least it's more obvious that it's still trying to sync. Not a huge deal though.
Yeah that's a good point. I've changed it to repeatedly wait for peers so it prints the "Waiting for 1 peers" message repeatedly while it's waiting.
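Roughly what that retry-and-re-log loop looks like, as a self-contained toy (WaitForPeerTask is stubbed out here; the log text echoes the one quoted above, and the retry structure is a sketch rather than the PR's implementation):

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;

public class RetryUntilPeerExample {
  private static final AtomicInteger attempts = new AtomicInteger();

  // Stand-in for WaitForPeerTask: times out twice, then "finds" a peer.
  static CompletableFuture<Void> runWaitForPeerTask() {
    final CompletableFuture<Void> result = new CompletableFuture<>();
    if (attempts.incrementAndGet() < 3) {
      result.completeExceptionally(new TimeoutException("No peer connected"));
    } else {
      result.complete(null);
    }
    return result;
  }

  static CompletableFuture<Void> waitForAnyPeer() {
    System.out.println("Waiting for 1 peers.");
    return runWaitForPeerTask()
        .handle((result, error) -> error)
        .thenCompose(
            error -> {
              if (error instanceof TimeoutException) {
                // Retry, which re-logs the waiting message so it doesn't look hung.
                return waitForAnyPeer();
              }
              return CompletableFuture.<Void>completedFuture(null);
            });
  }

  public static void main(final String[] args) {
    waitForAnyPeer().join();
    System.out.println("Peer available, continuing fast sync.");
  }
}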
.handle(
    (waitResult, error) -> {
      if (ExceptionUtils.rootCause(error) instanceof TimeoutException) {
        if (ethContext.getEthPeers().bestPeer().isPresent()) {
Are you just checking that we have any peers here? If so, getEthPeers().availablePeerCount() > 0 might be clearer.
Done.
import com.google.common.base.MoreObjects;

public class FastSyncState {
There's another FastSyncState in tech.pegasys.pantheon.ethereum.eth.sync.state - these 2 classes should be merged.
Removed the old one which wasn't used anywhere.
return fastSyncActions
    .waitForSuitablePeers()
    .thenApply(state -> fastSyncActions.selectPivotBlock())
    .thenCompose(fastSyncActions::downloadPivotBlockHeader)
I'm thinking about how the full sync downloader works and how that maps over to this downloader. I think for the most part the flow should be the same. In the full sync downloader, the first thing we do is find the "sync target". We use this sync target to make sure that we're downloading a coherent chain (downloading the chain based off of hashes from our best peer rather than block numbers). I think we probably want to download the pivot block based on a header that we get from our target peer. We can still "confirm" that block header, but the confirmation would be based on the header / hash rather than block number.
The risk is that we'll pick a header that's on a non-canonical chain. If we request it by hash, other clients will likely have it and return it "confirming" it, but what we actually want is to check that they consider the canonical block at that number to be the same.
Bear in mind that when you fast-sync to a block, you can't apply any re-org that goes back before the fast-sync point. This was a problem in the Ropsten fork where geth clients wound up fast-syncing off Parity nodes and so got stuck on the incorrect chain with no ability to re-org back even when the correct chain eventually had a greater total difficulty.
That's a good point! Definitely makes sense to query for the pivot based on block number. I guess my main point is just that whatever sync target we pick needs to agree that the pivot we find is on the main chain. I was looking at this from the perspective of sync target, but just as valid to choose the pivot first and make sure our sync target is consistent with that.
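A tiny illustration of that last point (hypothetical types only, not Pantheon's peer API): whichever peer ends up as the sync target should report the same canonical header at the pivot height as the pivot we confirmed by block number.

import java.util.Objects;
import java.util.Optional;

public class SyncTargetPivotCheck {

  // Hypothetical minimal view of a peer, just for this sketch.
  interface PeerView {
    Optional<String> canonicalHeaderHashAt(long blockNumber);
  }

  // True only if the candidate sync target agrees that our confirmed pivot
  // header is its canonical block at that height.
  static boolean targetAgreesOnPivot(
      final PeerView candidateSyncTarget, final long pivotBlockNumber, final String pivotHash) {
    return candidateSyncTarget
        .canonicalHeaderHashAt(pivotBlockNumber)
        .map(hash -> Objects.equals(hash, pivotHash))
        .orElse(false);
  }
}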
.filter(peer -> peer.chainState().getEstimatedHeight() >= pivotBlockNumber)
.collect(Collectors.toList());

final int confirmationsRequired = peersToQuery.size() / 2 + 1;
I think we might want to have a cap on how many peers we want to query for this. For example, if we have 50 peers, do we really need 25 confirmations?
I think we want to be as certain as possible that it's the right node before fast syncing - it's a big point of weakness because we're trusting that the pivot block we select is correct; we can't perform any real checks that it actually is right. I was actually wondering if we should abort if any of our peers returned a header that didn't match, and maybe have a minimum number of peers that confirm the block (currently we might have 10 peers connected but 9 of them haven't reached the pivot block yet).
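One way to reconcile the two concerns (illustrative numbers and names only; nothing here is fixed by the PR): cap how many peers are queried, require a minimum absolute number of confirmations, and abort on any disagreement.

import java.util.List;

public class PivotConfirmationPolicy {

  // Illustrative values only.
  private static final int MAX_PEERS_TO_QUERY = 5;
  private static final int MIN_CONFIRMATIONS = 3;

  // Query at most MAX_PEERS_TO_QUERY peers, treat any mismatch as fatal, and
  // require at least MIN_CONFIRMATIONS matching answers before trusting the pivot.
  static boolean pivotConfirmed(final List<String> headerHashesFromPeers, final String pivotHash) {
    final List<String> sampled =
        headerHashesFromPeers.subList(0, Math.min(MAX_PEERS_TO_QUERY, headerHashesFromPeers.size()));
    if (sampled.stream().anyMatch(hash -> !hash.equals(pivotHash))) {
      return false; // a disagreeing peer means the pivot can't be trusted
    }
    return sampled.size() >= MIN_CONFIRMATIONS;
  }
}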
}

@Override
protected boolean isRetryableError(final Throwable error) {
I think we should probably fold the implementation of this method into AbstractRetryingPeerTask, along with the assignPeer() functionality in CompleteBlocksTask and that implementation of isRetryableError. The difference in that implementation is that if there is an assigned peer, and that peer is missing or errors out, there's no reason to retry.
Done.
this.pivotBlockNumber = pivotBlockNumber;
}

public static GetPivotBlockHeaderFromPeerTask forPivotBlock(
I would name this task more generically - something like RetryingGetHeadersFromPeerByNumberTask since the functionality isn't actually tied to the pivot. Better yet, you could probably make a completely generic retrying wrapper that takes a supplier for the retryable task.
I think a generic retrying wrapper would be nice but I've had a couple of goes and keep hitting issues making it work. It's starting to introduce delegation vs the current inheritance approach, which I generally like, but it keeps leading to more knock-on effects. I'm not keen to take on that big a change as part of this work and I suspect the better approach is to just move to something like RxJava, which appears to give us a lot of benefit. For now I have renamed this to RetryingGetHeaderFromPeerByNumberTask.
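For what it's worth, a bare-bones version of the supplier-based wrapper might look like the following. This is a sketch only; it ignores scheduling, back-off and the peer accounting the real tasks need, which is where the knock-on effects mentioned above come from.

import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

public class RetryingTaskWrapper {

  // Runs the supplied asynchronous task, retrying up to maxRetries times when
  // it completes exceptionally.
  static <T> CompletableFuture<T> withRetries(
      final Supplier<CompletableFuture<T>> taskSupplier, final int maxRetries) {
    return taskSupplier
        .get()
        .handle(
            (result, error) -> {
              if (error == null) {
                return CompletableFuture.completedFuture(result);
              }
              if (maxRetries > 0) {
                return withRetries(taskSupplier, maxRetries - 1);
              }
              return RetryingTaskWrapper.<T>failedFuture(error);
            })
        .thenCompose(future -> future);
  }

  private static <T> CompletableFuture<T> failedFuture(final Throwable error) {
    final CompletableFuture<T> failed = new CompletableFuture<>();
    failed.completeExceptionally(error);
    return failed;
  }
}

Note that the recursive retry runs on whichever thread completes the failed attempt, so a real version would also need to hand retries back to the scheduler.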
"Fast sync timed out before minimum peer count was reached. Continuing with reduced peers."); | ||
result.complete(null); | ||
} else { | ||
waitForAnyPeer() |
I think I would just have the action fail at this point and get handled at a higher level.
The problem with that is you wind up having to duplicate most of this handle lambda at the higher level, so you'd basically have to have this in FastSyncDownloader:
private CompletableFuture<Void> ensurePeersAvailable() {
final CompletableFuture<Void> result = new CompletableFuture<>();
fastSyncActions
.waitForSuitablePeers()
.handle(
(waitResult, error) -> {
if (ExceptionUtils.rootCause(error) instanceof TimeoutException) {
fastSyncActions
.waitForAnyPeer()
.thenAccept(result::complete)
.exceptionally(
taskError -> {
result.completeExceptionally(error);
return null;
});
} else if (error != null) {
result.completeExceptionally(error);
} else {
result.complete(null);
}
return null;
});
return result;
}
I was thinking we'd just cut the waitForAnyPeer() part - seems a little iffy to continue with fast sync if we don't have enough peers. Then the calling code could decide whether to run another round of waiting, or to abort fast sync. Anyway - just thinking out loud.
Yeah, it's a bit of a toss-up. The penalty for not waiting is a month-long full sync, so it's unlikely the user would want that, but maybe they'd want to be strict about waiting for a minimum number of peers.
We probably should also allow the user to specifically set a pivot block by hash and wait indefinitely until that block is available. That would be the most secure option since then users can guarantee they sync onto a chain they actually trust.
…n optional error return. Remove unused FastSyncState.
… it can be reused.
…eer to connect to avoid the appearance of hanging forever.
PR description
Introduces a FastSyncDownloader and the first few steps of the fast sync process. Specifically:
- SynchronizerConfiguration
- SynchronizerConfiguration
Together this gets us to the point where the world state download could be started if we wanted to see how it integrates and be able to test it in the real client.
The --sync-mode flag has been re-enabled but set to hidden so that we can test fast sync progress without it being fully exposed as a supported and working option yet. If fast sync fails or reaches the end of what we've implemented, it falls back to full sync.