Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor the peer connection code to avoid failure handling bugs #1610

Closed
3 tasks
teor2345 opened this issue Jan 18, 2021 · 4 comments · Fixed by #1721
Closed
3 tasks

Refactor the peer connection code to avoid failure handling bugs #1610

teor2345 opened this issue Jan 18, 2021 · 4 comments · Fixed by #1721
Labels
A-rust Area: Updates to Rust code C-bug Category: This is a bug C-cleanup Category: This is a cleanup I-panic Zebra panics with an internal error message

Comments

@teor2345
Copy link
Contributor

teor2345 commented Jan 18, 2021

TODO

  • work out why the refactor sends more errors to the syncer
  • merge these changes
  • follow-up by making this code a lot easier to maintain - split functions, map complex matches to custom enums

Existing Work

PR #1817

Is your feature request related to a problem? Please describe.

In #1531 and #1600, we fixed some failure handling bugs in the zebra-network peer connection code.

But the way the code is structured risks introducing more of these bugs in future.

We should refactor this code, to make these bugs impossible.

Describe the solution you'd like

  • make fail_with take an owned Connection struct
  • if other functions also need to take an owned struct, they should return ownership on success, and call fail_with on error
  • split the failure cleanup code out of the task loop into its own separate function
  • at the end of fail_with, call the cleanup code, and end the task
  • abolish the Failed state, and delete or refactor all the code that uses it

Describe alternatives you've considered

We could do nothing, but that risks future bugs when developers add new failure modes, but don't actually test those failure modes.

We could create tests for every possible error path, but that's going to take ongoing effort.

Additional context

The panic bugs caused by this badly structured code are #1510 and #1599

@teor2345 teor2345 added C-bug Category: This is a bug A-rust Area: Updates to Rust code C-cleanup Category: This is a cleanup S-needs-triage Status: A bug report needs triage labels Jan 18, 2021
@teor2345 teor2345 added P-Low and removed S-needs-triage Status: A bug report needs triage labels Jan 28, 2021
@teor2345 teor2345 added this to the 2021 Sprint 3 milestone Feb 9, 2021
@teor2345
Copy link
Contributor Author

teor2345 commented Feb 9, 2021

@yaahc and I had a chat, she'll work on this.

@teor2345
Copy link
Contributor Author

Marking this as high, because we're actually seeing the bug happen reasonably frequently.

@teor2345
Copy link
Contributor Author

We haven't seen these bugs in a long while.

A big refactor is really risky, so I'm not sure if we should do it.
I'd prefer to do smaller refactors after NU5 activation.

I think there's a much simpler alternative to this ticket for now:
#3109

@mpguerra mpguerra added P-Medium and removed P-High labels Nov 29, 2021
@teor2345
Copy link
Contributor Author

We have other tickets that would fix the same issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-rust Area: Updates to Rust code C-bug Category: This is a bug C-cleanup Category: This is a cleanup I-panic Zebra panics with an internal error message
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants