Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

op-node: Use resubscriber context when watching head #8128

Closed
wants to merge 1 commit into from

Conversation

ajsutton
Copy link
Contributor

Description

The L1Heads subscription was using the resourcesCtx rather than the context supplied by the ResubscribeErr function which led to a deadlock when calling l1HeadsSub.Unsubscribe and prevented op-node from shutting down cleanly.

Tests

Manually tested shutdown.

Additional context

Not addressed by this, there is also a fairly long delay if you interrupt op-node shortly after it starts up because the FindL2Heads process has no way of being interrupted so you continue getting a bunch of Walking back L1Block by hash logs and op-node only exits once that completes.

Metadata

Fixes deadlock during shutdown when unsubscribing from l1 head events.
@ajsutton ajsutton requested a review from a team as a code owner November 10, 2023 03:34
@ajsutton ajsutton requested a review from protolambda November 10, 2023 03:34
@ajsutton
Copy link
Contributor Author

@protolambda I have to admit that despite having spent a lot of time trying to understand ResubscribeErr I still have basically no idea how it really works or why using resourcesCtx wouldn't work here given we do cancel it before trying to unsubscribe L1 heads. But it definitely gets stuck in unsubscribe for me without this change and works with it.

Stack dump of where it gets stuck:
tmp.txt

Copy link

codecov bot commented Nov 10, 2023

Codecov Report

Merging #8128 (0464cae) into develop (314b041) will decrease coverage by 3.92%.
The diff coverage is n/a.

@@             Coverage Diff             @@
##           develop    #8128      +/-   ##
===========================================
- Coverage    53.28%   49.36%   -3.92%     
===========================================
  Files          162       86      -76     
  Lines         6171     1963    -4208     
  Branches       966      440     -526     
===========================================
- Hits          3288      969    -2319     
+ Misses        2744      916    -1828     
+ Partials       139       78      -61     
Flag Coverage Δ
cannon-go-tests ?
chain-mon-tests ?
common-ts-tests ?
contracts-bedrock-tests 49.36% <ø> (-10.80%) ⬇️
contracts-ts-tests ?
core-utils-tests ?
sdk-next-tests ?
sdk-tests ?

Flags with carried forward coverage won't be shown. Click here to find out more.

see 96 files with indirect coverage changes

@Inphi
Copy link
Contributor

Inphi commented Nov 10, 2023

The deadlock issue was a bug in go-ethereum's Resubscribe that's been fixed. Though there isn't yet a geth release containing it.

@@ -171,10 +171,13 @@ func (n *OpNode) initL1(ctx context.Context, cfg *Config) error {

// Keep subscribed to the L1 heads, which keeps the L1 maintainer pointing to the best headers to sync
n.l1HeadsSub = event.ResubscribeErr(time.Second*10, func(ctx context.Context, err error) (event.Subscription, error) {
if errors.Is(err, context.Canceled) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

an issue could occur during network reconnects. Not totally sure, but iirc the context will end up being canceled. So the op-node will fail to receive head changes in that event.
Worth confirming this won't occur locally or by adding a test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You appear to be right. What's the point of passing in a context that can't actually be used??? If you don't ignore Canceled errors then it winds up in an infinite loop because Unsubscribe cancels the context which prevents new resubscribe attempts from succeeding and then calls this method again in an infinite loop.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Late response. But the passed in context lets users to prevent a resubscribe based the parent context, like during shutdown. Problem here is both the shutdown signal and other sources of cancelations are observed the same way. eth.WatchHeadChanges already check for both cases. So once we have the deadlock fixes in it should resolve itself.

@ajsutton ajsutton added the M-do-not-merge Meta: Do not merge label Nov 10, 2023
@ajsutton
Copy link
Contributor Author

Closing - will cherry-pick the upstream fix instead (ethereum-optimism/op-geth#183).

@ajsutton ajsutton closed this Nov 12, 2023
@ajsutton ajsutton deleted the aj/shutdown branch October 23, 2024 22:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
M-do-not-merge Meta: Do not merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants