
harden and speed up block sync #3358

Merged: 2 commits into unstable, Feb 7, 2022
Conversation

arnetheduck (Member):

The `GetBlockBy*` server implementation currently reads SSZ bytes from the
database, deserializes them into a Nim object, then serializes them right
back to SSZ - here, we eliminate the deserialize/serialize steps and send the
bytes straight to the network. Unfortunately, the snappy recoding must still
be done because of differences in framing.
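The byte pass-through can be sketched as follows; this is a minimal illustration in Python, with JSON standing in for SSZ and a dict standing in for the database (all names here are hypothetical, not nimbus-eth2 APIs):

```python
import json

# Hypothetical stand-ins: JSON plays the role of SSZ, a dict the role of the
# block database. Function names are illustrative only.
db = {"root": json.dumps({"slot": 1, "body": "data"}).encode()}

def serve_block_roundtrip(root: str) -> bytes:
    # Old path: decode the stored bytes into an object, then re-encode them.
    block = json.loads(db[root])       # deserialize
    return json.dumps(block).encode()  # serialize right back

def serve_block_raw(root: str) -> bytes:
    # New path: send the stored bytes straight to the network.
    return db[root]

assert serve_block_roundtrip("root") == serve_block_raw("root")
```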

Also, the quota system makes one giant request for quota right before
sending all blocks - this means that a 1024-block request is "paused" for a
long time, then all blocks are sent at once, causing a spike in database
reads that may make the requesting client time out before any block is sent.
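The pacing change can be sketched like this; a hedged illustration with hypothetical names (`acquire_quota` is a stand-in, not the real quota API):

```python
# Instead of one giant up-front quota request for e.g. 1024 blocks, quota is
# requested per block, so sending starts immediately and database reads are
# spread out over the lifetime of the request.

def send_blocks_upfront(blocks, acquire_quota):
    acquire_quota(len(blocks))  # long pause, then a burst of sends
    return list(blocks)

def send_blocks_paced(blocks, acquire_quota):
    sent = []
    for block in blocks:
        acquire_quota(1)        # pay as you go, one block at a time
        sent.append(block)
    return sent

requests = []
assert send_blocks_paced([10, 20, 30], requests.append) == [10, 20, 30]
assert requests == [1, 1, 1]    # quota drawn incrementally, not in one lump
```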

Finally, on the reading side we make several copies of blocks as they
travel through various queues - this was not noticeable before, but it
becomes a problem in two cases: bellatrix blocks are up to 10 MB (instead
of roughly 30-40 KB), and when backfilling, we process a lot more of them, a
lot faster.
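The cost of copying at each queue hop can be illustrated as follows; a Python sketch in which an explicit `deepcopy` models Nim's value-copy semantics (Python itself hands out references by default, so the reference-passing variant is the natural one here):

```python
import copy
from collections import deque

# Illustrative only: a large block moving through two queues. The deepcopy
# models a full value copy at each hop; handing the same reference through
# the queue avoids the duplication entirely.
block = {"slot": 1, "payload": bytearray(1024)}  # stand-in for a ~10 MB block

def hop_with_copy(q_in: deque, q_out: deque) -> None:
    q_out.append(copy.deepcopy(q_in.popleft()))  # one full copy per hop

def hop_by_reference(q_in: deque, q_out: deque) -> None:
    q_out.append(q_in.popleft())                 # same object, zero copies

q1, q2 = deque([block]), deque()
hop_by_reference(q1, q2)
assert q2[0] is block                            # nothing was duplicated
```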

  • fix status comparisons for nodes syncing from genesis (harden status message handling #3327 was a bit
    too hard)
  • don't hit database at all for post-altair slots in GetBlock v1
    requests

github-actions (bot) commented Feb 3, 2022:

Unit Test Results

12 files ±0   806 suites ±0   31m 28s ⏱️ (-2m 5s)
1 652 tests ±0   1 606 ✔️ (+2)   46 💤 (-2)   0 ❌ (±0)
9 665 runs ±0   9 565 ✔️ (+4)   100 💤 (-4)   0 ❌ (±0)

Results for commit adc1682. Comparison against base commit 49282e9.


```nim
# TODO Semantically, this request should return a non-ref, but doing so
# runs into extreme inefficiency due to the compiler introducing
# hidden copies - in future nim versions, this should be revisited
# TODO This code is more complicated than it needs to be, in order to please
```
zah (Contributor) commented Feb 4, 2022:

How would this code be simpler? It already looks pretty devoid of boilerplate or any other elements obscuring the intention.

arnetheduck (Member, Author):

Writing a client function manually, or separately from the server, in a way that has the right signature (which is different from the server's signature, because of bytes and ref and so on) would be a lot easier to read and audit as a consumer of the code (such as the REST layer).

Basically, a client for an RPC and a server implementation are two separate entities, and it's useful to think of them that way as well: it's wrong, for example, that in order to consume an RPC I have to import everything that's needed to implement the server. Most of the time, when I'm looking for the client code, the first reaction is to look at this implementation.

zah (Contributor) commented Feb 4, 2022:

Most people reading this comment will assume it claims that the body of the function below is more complicated than it needs to be. Perhaps you could be a bit more specific in the comment.

Otherwise, what you describe above is not "simpler" - it will be more complicated in the sense that it requires more modules and definitions that must be kept in sync, but I do agree that it optimizes for the metrics you cite.

arnetheduck (Member, Author):

well, even this code is more complicated: it uses a typed MultipleChunksResponse with a type that is never used for forming the response - instead, we go into "raw" mode with a special function whose only role is to throw away the unused type. A straightforward implementation would get rid of these indirections as well.

Complexity includes the complexity of understanding the code, not just writing it - in fact, I tend to find it a good investment to maximise simplicity for the reader at the expense of the writer, so that the code ends up "globally" simpler: choosing what information to hide and what to make explicit has a direct effect on the perceived complexity of using the code.

arnetheduck (Member, Author):

```nim
else:
  (await peer.beaconBlocksByRoot(BlockRootsList items)).map() do (blcks: seq[phase0.SignedBeaconBlock]) -> auto:
    blcks.mapIt(ForkedSignedBeaconBlock.init(it))
  (await beaconBlocksByRoot(peer, BlockRootsList items)).map(
```
zah (Contributor) commented Feb 4, 2022:

Still quite a lot of copying going on here. Would you like the network layer to allocate the blocks as refs directly? We can also bite the bullet and start using refs inside the branches of ForkedSignedBeaconBlock to avoid more copying in other places as well.

arnetheduck (Member, Author):

> start using refs inside the branches of ForkedSignedBeaconBlock to avoid more copying in other places as well.

this goes a bit too far - it would infect the spec code with the weak guarantees of ref. Already, the tests needed changing because of the mutability differences between value types and reference types, and this is not a source of bugs I'd like to have to consider inside the security moat that the chaindag is.

also, it's actually not that hard to make the code avoid these copies as well - but semantically, what I'm really waiting for is move + lent support in the language. With that in place, we can write semantically correct code and avoid most copies in a principled way. The profiler says that this copy is not really an issue, but should it turn into one, there's a simple way to avoid it as well, especially if we drop the v1 request (which should be safe to do).

zah (Contributor):

Technically, it's possible to use ref fields inside the "Forked" types and still present a similar external view of the data if we introduce accessors in place of the currently directly exposed fields. The accessors would dereference the ref fields and preserve the property that an immutable Forked object can provide you only with immutable fields.
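The accessor idea might look roughly like this in Python; a hedged sketch where the class shapes and names are hypothetical, not the actual nimbus-eth2 types:

```python
# Hypothetical sketch: the forked wrapper stores a shared reference, but
# exposes the data only through a read-only accessor, so callers never see
# (or copy, or mutate) the ref itself.

class Phase0Block:
    def __init__(self, slot: int) -> None:
        self.slot = slot

class ForkedSignedBeaconBlock:
    def __init__(self, phase0_ref: Phase0Block) -> None:
        self._phase0 = phase0_ref     # shared ref, never copied

    @property
    def slot(self) -> int:
        return self._phase0.slot      # accessor dereferences the ref

shared = Phase0Block(slot=42)
a = ForkedSignedBeaconBlock(shared)
b = ForkedSignedBeaconBlock(shared)   # second view over the same data
assert a.slot == b.slot == 42
assert a._phase0 is b._phase0         # no copy was made
```

The read-only property mirrors the invariant described above: an immutable wrapper hands out only immutable views.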

arnetheduck (Member, Author):

well, it's a good workaround to keep in mind should we find that we need it - at this point, even at backfill speeds, we really don't. It's also only needed in asynchronous layers, in general - in non-async code, we're already pretty good at avoiding copies. This is another reason to prefer explicit ref Xxx over ref object: we can be precise about where a reference is necessary (either for semantic or for efficiency reasons).

The sync manager code in particular was vulnerable to this effect because it was shuffling seq:s of blocks around in several queues - it would in fact have been possible to use a ref seq[Forked... instead, for the same efficiency benefit.

all that said, I'm not strongly opposed to the idea, but I find it an ugly hack / workaround that I would only introduce if forced, since better options (hopefully) will work in future nim:s.

This also reminds me of the importance of having `let a = b.x` be an alias, and not a copy - this is a critical change that needs to happen in nim for all my beautiful theory to become practical - that, and fixing all the unnecessary-temporary bugs, of which there are plenty.
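For contrast, Python already gives `a = b.x` the alias semantics being asked for from Nim's `let`; a small illustration (the names here are purely illustrative):

```python
# In Python, reading an attribute binds a reference to the field's object
# rather than copying it - the aliasing behaviour the comment wants for
# `let a = b.x` in Nim.

class Holder:
    def __init__(self) -> None:
        self.x = bytearray(b"block bytes")

b = Holder()
a = b.x               # alias: no copy of the payload is made
a[0:5] = b"BLOCK"
assert a is b.x                           # same underlying object
assert b.x == bytearray(b"BLOCK bytes")   # mutation visible through b.x
```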

arnetheduck (Member, Author):

in writing this comment, I started liking the ref seq idea even more, fwiw ;)

zah merged commit c7abc97 into unstable on Feb 7, 2022.
zah deleted the harden-block-sync branch on Feb 7, 2022 at 17:20.