op-node: Use resubscriber context when watching head #8128
Conversation
Fixes deadlock during shutdown when unsubscribing from l1 head events.
@protolambda I have to admit that despite having spent a lot of time trying to understand

Stack dump of where it gets stuck:
Codecov Report

```diff
@@            Coverage Diff             @@
##           develop    #8128      +/-   ##
===========================================
- Coverage    53.28%   49.36%    -3.92%
===========================================
  Files          162       86      -76
  Lines         6171     1963    -4208
  Branches       966      440     -526
===========================================
- Hits          3288      969    -2319
+ Misses        2744      916    -1828
+ Partials       139       78      -61
```

Flags with carried forward coverage won't be shown.
The deadlock issue was a bug in go-ethereum's
```diff
@@ -171,10 +171,13 @@ func (n *OpNode) initL1(ctx context.Context, cfg *Config) error {

 	// Keep subscribed to the L1 heads, which keeps the L1 maintainer pointing to the best headers to sync
 	n.l1HeadsSub = event.ResubscribeErr(time.Second*10, func(ctx context.Context, err error) (event.Subscription, error) {
+		if errors.Is(err, context.Canceled) {
```
An issue could occur during network reconnects. Not totally sure, but IIRC the context will end up being canceled, so the op-node would fail to receive head changes in that event.
Worth confirming this won't occur, either locally or by adding a test.
You appear to be right. What's the point of passing in a context that can't actually be used??? If you don't ignore `Canceled` errors, you wind up in an infinite loop: `Unsubscribe` cancels the context, which prevents new resubscribe attempts from succeeding, and then the resubscribe function just gets called again with the same error.
Late response. But the passed-in context lets users prevent a resubscribe based on the parent context, e.g. during shutdown. The problem here is that both the shutdown signal and other sources of cancelation are observed the same way. `eth.WatchHeadChanges`
already checks for both cases, so once we have the deadlock fixes in it should resolve itself.
Closing - will cherry-pick the upstream fix instead (ethereum-optimism/op-geth#183).
Description

The L1Heads subscription was using the `resourcesCtx` rather than the context supplied by the `ResubscribeErr` function, which led to a deadlock when calling `l1HeadsSub.Unsubscribe` and prevented op-node from shutting down cleanly.

Tests
Manually tested shutdown.
Additional context

Not addressed by this: there is also a fairly long delay if you interrupt op-node shortly after it starts up, because the `FindL2Heads` process has no way of being interrupted, so you continue getting a bunch of `Walking back L1Block by hash` logs and op-node only exits once that completes.

Metadata

`SIGINT` or `SIGTERM`: #8086