op-node: Use resubscriber context when watching head #8128

ajsutton · 2023-11-10T03:34:16Z

Description

The L1Heads subscription was using the resourcesCtx rather than the context supplied by the ResubscribeErr function which led to a deadlock when calling l1HeadsSub.Unsubscribe and prevented op-node from shutting down cleanly.

Tests

Manually tested shutdown.

Additional context

Not addressed by this, there is also a fairly long delay if you interrupt op-node shortly after it starts up because the FindL2Heads process has no way of being interrupted so you continue getting a bunch of Walking back L1Block by hash logs and op-node only exits once that completes.

Metadata

op-node does not respond to SIGINT or SIGTERM #8086

Fixes deadlock during shutdown when unsubscribing from l1 head events.

ajsutton · 2023-11-10T03:37:19Z

@protolambda I have to admit that despite having spent a lot of time trying to understand ResubscribeErr I still have basically no idea how it really works or why using resourcesCtx wouldn't work here given we do cancel it before trying to unsubscribe L1 heads. But it definitely gets stuck in unsubscribe for me without this change and works with it.

Stack dump of where it gets stuck:
tmp.txt

codecov · 2023-11-10T03:53:06Z

Codecov Report

Merging #8128 (0464cae) into develop (314b041) will decrease coverage by 3.92%.
The diff coverage is n/a.

@@             Coverage Diff             @@
##           develop    #8128      +/-   ##
===========================================
- Coverage    53.28%   49.36%   -3.92%     
===========================================
  Files          162       86      -76     
  Lines         6171     1963    -4208     
  Branches       966      440     -526     
===========================================
- Hits          3288      969    -2319     
+ Misses        2744      916    -1828     
+ Partials       139       78      -61

Flag	Coverage Δ
cannon-go-tests	`?`
chain-mon-tests	`?`
common-ts-tests	`?`
contracts-bedrock-tests	`49.36% <ø> (-10.80%)`	⬇️
contracts-ts-tests	`?`
core-utils-tests	`?`
sdk-next-tests	`?`
sdk-tests	`?`

Flags with carried forward coverage won't be shown. Click here to find out more.

see 96 files with indirect coverage changes

Inphi · 2023-11-10T04:04:41Z

The deadlock issue was a bug in go-ethereum's Resubscribe that's been fixed. Though there isn't yet a geth release containing it.

Inphi · 2023-11-10T04:59:30Z

op-node/node/node.go

@@ -171,10 +171,13 @@ func (n *OpNode) initL1(ctx context.Context, cfg *Config) error {

 	// Keep subscribed to the L1 heads, which keeps the L1 maintainer pointing to the best headers to sync
 	n.l1HeadsSub = event.ResubscribeErr(time.Second*10, func(ctx context.Context, err error) (event.Subscription, error) {
+		if errors.Is(err, context.Canceled) {


an issue could occur during network reconnects. Not totally sure, but iirc the context will end up being canceled. So the op-node will fail to receive head changes in that event.
Worth confirming this won't occur locally or by adding a test.

You appear to be right. What's the point of passing in a context that can't actually be used??? If you don't ignore Canceled errors then it winds up in an infinite loop because Unsubscribe cancels the context which prevents new resubscribe attempts from succeeding and then calls this method again in an infinite loop.

Late response. But the passed in context lets users to prevent a resubscribe based the parent context, like during shutdown. Problem here is both the shutdown signal and other sources of cancelations are observed the same way. eth.WatchHeadChanges already check for both cases. So once we have the deadlock fixes in it should resolve itself.

ajsutton · 2023-11-12T23:44:00Z

Closing - will cherry-pick the upstream fix instead (ethereum-optimism/op-geth#183).

op-node: Use resubscriber context when watching head

0464cae

Fixes deadlock during shutdown when unsubscribing from l1 head events.

ajsutton requested a review from a team as a code owner November 10, 2023 03:34

ajsutton requested a review from protolambda November 10, 2023 03:34

Inphi reviewed Nov 10, 2023

View reviewed changes

ajsutton added the M-do-not-merge Meta: Do not merge label Nov 10, 2023

ajsutton closed this Nov 12, 2023

sebastianst mentioned this pull request Nov 10, 2023

op-node: Improve shutdown behavior #10936

Closed

ajsutton deleted the aj/shutdown branch October 23, 2024 22:33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

op-node: Use resubscriber context when watching head #8128

op-node: Use resubscriber context when watching head #8128

ajsutton commented Nov 10, 2023

ajsutton commented Nov 10, 2023

codecov bot commented Nov 10, 2023 •

edited

Loading

Inphi commented Nov 10, 2023

Inphi Nov 10, 2023

ajsutton Nov 10, 2023

Inphi Nov 14, 2023

ajsutton commented Nov 12, 2023

op-node: Use resubscriber context when watching head #8128

op-node: Use resubscriber context when watching head #8128

Conversation

ajsutton commented Nov 10, 2023

ajsutton commented Nov 10, 2023

codecov bot commented Nov 10, 2023 • edited Loading

Codecov Report

Inphi commented Nov 10, 2023

Inphi Nov 10, 2023

Choose a reason for hiding this comment

ajsutton Nov 10, 2023

Choose a reason for hiding this comment

Inphi Nov 14, 2023

Choose a reason for hiding this comment

ajsutton commented Nov 12, 2023

codecov bot commented Nov 10, 2023 •

edited

Loading