Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update go-libp2p to 0.12.0 #8015

Merged
merged 19 commits into from
Dec 14, 2020

Conversation

Stebalien
Copy link
Contributor

What type of PR is this?

Other

What does this PR do? Why is it needed?

go-libp2p 0.12.0 made some significant changes to the stream interfaces around stream closing:

  • Close now closes in both directions and frees the stream. However, unlike FullClose did, it doesn't wait for the remote peer to respond with an EOF.
  • To close for writing, call CloseWrite (like one would on a TCP connection, etc.).

This patch:

  • Replaces calls to FullClose with Close where appropriate.
  • Replaces calls to Close with CloseWrite where appropriate.
  • Removes redundant Close calls.
  • Calls Reset to where appropriate to indicate that the request/response was aborted. Unlike Close, this will not flush and will not cause the remote peer to read an EOF. Instead, the remote peer will read an ErrReset error.
  • Ensures we always either close or reset streams. Send wasn't closing the stream on some error paths.
  • Now that stream closing is async, we explicitly wait for a response when "hanging up" on a peer (so we don't hang up before they receive our response/goodbye message).

Design Question 1

Given that all requests/responses are length delimited, the use of Reset is optional. The upsides of using reset are:

  • It's often faster than calling close (and flushing).
  • It lets the other side know that you didn't finish sending your response.

The downside is that the choice between reset/close isn't always obvious and resetting a stream after writing will cause the write to be lost (by design).

If you'd rather just stick with close and avoid reset entirely, I'm happy to rework the patch to do that.

Design Question 2

This patch makes stream closing async to more closely mimic a TCP connection. However, this can cause issues when a close is immediately followed by a disconnect. I'm currently working around this issue by mimicing the old FullClose (now closeStreamAndWait) where necessary.

If desired, I can go back to synchronously closing everywhere, at the cost of losing out on some potential (maybe) latency improvements.

Other

I'm not sure if you want an issue for this, and/or what kind of issue you're looking for (given that this isn't either a feature or a bug fix). Please advise.

@Stebalien Stebalien requested a review from a team as a code owner December 2, 2020 02:36
@CLAassistant
Copy link

CLAassistant commented Dec 2, 2020

CLA assistant check
All committers have signed the CLA.

traceutil.AnnotateError(span, err)
stream.Reset()
return nil, err
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We now close the stream for writing after "sending" a metadata request. I'm pretty sure the issue here was due to a bug that I fixed in go-libp2p 0.12.0. Please make sure to test this.

Background:

Protocol negotiation in libp2p is usually lazy. This lets us bundle the first write with the protocol headers. Unfortunately, there was a bug in go-libp2p where calling Close before reading or writing would close the stream for writing without flushing the protocol headers. This has since been fixed.

if baseTopic != RPCMetaDataTopic {
if _, err := s.Encoding().EncodeWithMaxLength(stream, message); err != nil {
traceutil.AnnotateError(span, err)
stream.Reset()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we were potentially leaking streams here in the past (although I can't see how this would happen in practice unless we had a local bug)

@@ -56,6 +55,9 @@ func writeErrorResponseToStream(responseCode byte, reason string, stream libp2pc
log.WithError(err).Debug("Could not generate a response error")
} else if _, err := stream.Write(resp); err != nil {
log.WithError(err).Debugf("Could not write to stream")
} else {
// If sending the error message succeeded, close to send an EOF.
stream.Close()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can remove explicit calls to close like this if we avoid using reset (design question 1). I'm calling close here to make sure the error message is flushed and written before we abort.

beacon-chain/sync/initial-sync/initial_sync_test.go Outdated Show resolved Hide resolved
ctx, span := trace.StartSpan(ctx, "sync.rpc")
defer span.End()
span.AddAttributes(trace.StringAttribute("topic", topic))
span.AddAttributes(trace.StringAttribute("peer", stream.Conn().RemotePeer().Pretty()))
log := log.WithField("peer", stream.Conn().RemotePeer().Pretty())
// Check before hand that peer is valid.
if s.p2p.Peers().IsBad(stream.Conn().RemotePeer()) {
closeStream(stream, log)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're walking away, we might as well just reset the stream.

@@ -21,11 +21,6 @@ import (
func (s *Service) beaconBlocksByRangeRPCHandler(ctx context.Context, msg interface{}, stream libp2pcore.Stream) error {
ctx, span := trace.StartSpan(ctx, "sync.BeaconBlocksByRangeHandler")
defer span.End()
defer func() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm now explicitly closing on success. If we don't close the stream, the main RPC handler will reset it.

@@ -117,6 +112,7 @@ func (s *Service) beaconBlocksByRangeRPCHandler(ctx context.Context, msg interfa
// wait for ticker before resuming streaming blocks to remote peer.
<-ticker.C
}
closeStream(stream, log)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we have the RPC handler always close the stream instead of resetting, we can remove this. Up to you.

log.WithError(err).Debugf("Could not reset stream with protocol %s", stream.Protocol())
}
}()
defer closeStream(stream, log)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note: I never bother calling Reset after Send because Send closes for writing. There are no writes to abort at this point.

@Stebalien
Copy link
Contributor Author

In terms of tests/building, someone with a functional build environment will need to sync the deps with bazel. It's not working on my arch machine.

Copy link
Member

@nisdas nisdas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much for doing this and cleaning them up ! Had a few thoughts on the PR

Given that all requests/responses are length delimited, the use of Reset is optional. The upsides of using reset are:

It's often faster than calling close (and flushing).
It lets the other side know that you didn't finish sending your response.
The downside is that the choice between reset/close isn't always obvious and resetting a stream after writing will cause the write to be lost (by design).

If you'd rather just stick with close and avoid reset entirely, I'm happy to rework the patch to do that.

The main reason we were using closing of streams instead of resets in request or responses is because there would be situations where the remote peer wouldn't appropriately close the streams on their end. So the resetted stream would technically still be held in memory and not
cleaned up by the GC. I wouldn't be averse to having only stream resets for request/responses if we can guarantee that all those streams would eventually be properly cleaned up.

This patch makes stream closing async to more closely mimic a TCP connection. However, this can cause issues when a close is immediately followed by a disconnect. I'm currently working around this issue by mimicing the old FullClose (now closeStreamAndWait) where necessary.

If desired, I can go back to synchronously closing everywhere, at the cost of losing out on some potential (maybe) latency improvements.

I think this is fine, as long as we handle cases where we need to disconnect immediately and therefore follow a synchronous pattern, handling stream closing async is fine for our case.

beacon-chain/sync/error.go Show resolved Hide resolved
beacon-chain/sync/error.go Outdated Show resolved Hide resolved
beacon-chain/sync/initial-sync/initial_sync_test.go Outdated Show resolved Hide resolved
@Stebalien
Copy link
Contributor Author

The main reason we were using closing of streams instead of resets in request or responses is because there would be situations where the remote peer wouldn't appropriately close the streams on their end. So the resetted stream would technically still be held in memory and not
cleaned up by the GC.

That's strange... A reset stream should be cleaned up immediately. Are you sure this wasn't Close versus FullClose? Previously, if you called Close, the stream was only closed for writing and wouldn't be garbage collected until it an EOF was read. Now, both close and Reset garbage collect the stream immediately.

Copy link
Member

@prestonvanloon prestonvanloon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this change! It certainly saves us a good bit of time updating libp2p.

This repo has strict rules on error handling. If these APIs return an error, it must be handled appropriately.

beacon-chain/p2p/sender.go Outdated Show resolved Hide resolved
beacon-chain/p2p/sender.go Outdated Show resolved Hide resolved
beacon-chain/p2p/testing/p2p.go Outdated Show resolved Hide resolved
beacon-chain/p2p/testing/p2p.go Outdated Show resolved Hide resolved
beacon-chain/p2p/testing/p2p.go Outdated Show resolved Hide resolved
beacon-chain/sync/error.go Outdated Show resolved Hide resolved
beacon-chain/sync/error.go Outdated Show resolved Hide resolved
beacon-chain/sync/rpc.go Outdated Show resolved Hide resolved
beacon-chain/sync/initial-sync/initial_sync_test.go Outdated Show resolved Hide resolved
beacon-chain/sync/initial-sync/initial_sync_test.go Outdated Show resolved Hide resolved
go-libp2p 0.12.0 made some significant changes to the stream interfaces around
stream closing:

* Close now closes in both directions and frees the stream. However, unlike
FullClose did, it doesn't _wait_ for the remote peer to respond with an EOF.
* To close for writing, call CloseWrite (like one would on a TCP connection, etc.).

This patch:

* Replaces calls to FullClose with Close where appropriate.
* Replaces calls to Close with CloseWrite where appropriate.
* Removes redundant Close calls.
* Calls Reset to where appropriate to indicate that the request/response was
  aborted. Unlike Close, this will not flush and will not cause the remote peer
  to read an EOF. Instead, the remote peer will read an ErrReset error.
* Ensures we always either close or reset streams. Send wasn't closing the
  stream on some error paths.
* Now that stream closing is async, we explicitly wait for a response when
  "hanging up" on a peer (so we don't hang up before they receive our
  response/goodbye message).
}

// Close stream for writing.
if err := stream.Close(); err != nil {
if err := stream.CloseWrite(); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Build is complaining about this.

beacon-chain/p2p/sender.go:45:18: stream.CloseWrite undefined (type network.Stream has no field or method CloseWrite)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have to assume this is some kind of bazel thing. CloseWrite is implemented on the stream interface.

@@ -101,6 +101,7 @@ func (p *TestP2P) ReceiveRPC(topic string, msg proto.Message) {

n, err := p.Encoding().EncodeWithMaxLength(s, msg)
if err != nil {
_ = s.Reset()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem appropriate. Why ignore stream.Reset errors?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. They're not useful. We only call Reset when we run into some other more relevant error. At that point, we don't really care about the stream (other than freeing resources).
  2. Reset is guaranteed to free resources, even if it returns an error.
  3. No stream multiplexer actually returns an error on reset. We return errors for completeness, but if
    I were to redesign these interfaces from scratch, I probably wouldn't.

WORKSPACE Outdated Show resolved Hide resolved
@Stebalien
Copy link
Contributor Author

@prestonvanloon I rebased on develop and lost your commit. I believe I've revived the relevant parts, but please check.

@prestonvanloon
Copy link
Member

@Stebalien All good! I ran gazelle again and it only had 1 thing to update on deps.bzl.

Build is still failing from unhandled errors.

compilepkg: nogo: errors found by nogo during build-time code analysis:
/home/preston/.cache/bazel/_bazel_preston/6e220e5720b8243fd605645eb29a29ba/sandbox/linux-sandbox/131/execroot/prysm/beacon-chain/p2p/sender.go:39:4: Unhandled error for function call (github.com/libp2p/go-libp2p-core/mux.MuxedStream).Reset
/home/preston/.cache/bazel/_bazel_preston/6e220e5720b8243fd605645eb29a29ba/sandbox/linux-sandbox/131/execroot/prysm/beacon-chain/p2p/sender.go:47:3: Unhandled error for function call (github.com/libp2p/go-libp2p-core/mux.MuxedStream).Reset
Target //beacon-chain:beacon-chain failed to build

@Stebalien
Copy link
Contributor Author

Build is still failing from unhandled errors.

On the latest version? I'm explicitly ignoring those.

@Stebalien
Copy link
Contributor Author

Ok, it looks like the linter doesn't understand (or intentionally ignores?) the syntax for explicitly dropping a return value. That's really annoying.

@prestonvanloon
Copy link
Member

Ok, it looks like the linter doesn't understand (or intentionally ignores?) the syntax for explicitly dropping a return value. That's really annoying.

This is by design. Ignoring an error isn't handling the error. You'll have to log it or return it or do something with it.

Comment on lines +39 to +40
_err := stream.Reset()
_ = _err
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't great. Why does stream.Reset() return an error if we aren't going to handle it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It returns an error just in case there might be something useful to return (and as I said, it probably should never have returned an error in the first place). We can just log it, but that won't actually do anything in practice.

@nisdas
Copy link
Member

nisdas commented Dec 3, 2020

That's strange... A reset stream should be cleaned up immediately. Are you sure this wasn't Close versus FullClose? Previously, if you called Close, the stream was only closed for writing and wouldn't be garbage collected until it an EOF was read. Now, both close and Reset garbage collect the stream immediately.

Ah nope this was reset vs FullClose, after a period of time non garbage collected streams would be a non trivial part of our heap which is why we eventually went with a full close. I will test this PR out in runtime and get back to you on it, if there are any issues with it they should be obvious. We changed this a while back, so the particular issue I am referencing might already be patched in master.

@Stebalien
Copy link
Contributor Author

Ah, I think I know what the issue was. We had a bug in yamux where closing an "in flight" (not yet opened) stream could leak it. Reset would just close it immediately but FullClose would end up waiting for the stream to no longer be "in flight" before completely closing it.

@prestonvanloon prestonvanloon added this to the v1.0.5 milestone Dec 5, 2020
nisdas
nisdas previously approved these changes Dec 8, 2020
Copy link
Member

@nisdas nisdas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM ! Thanks for doing this, its nice to be able to handle this with just stream resets, makes our code cleaner.

@nisdas nisdas dismissed stale reviews from prestonvanloon and themself via d31334a December 11, 2020 06:47
terencechain
terencechain previously approved these changes Dec 14, 2020
nisdas
nisdas previously approved these changes Dec 14, 2020
prestonvanloon
prestonvanloon previously approved these changes Dec 14, 2020
@prestonvanloon prestonvanloon dismissed stale reviews from nisdas, terencechain, and themself via 2670100 December 14, 2020 16:49
prestonvanloon
prestonvanloon previously approved these changes Dec 14, 2020
terencechain
terencechain previously approved these changes Dec 14, 2020
@prestonvanloon prestonvanloon dismissed stale reviews from terencechain and themself via b1f21ea December 14, 2020 17:05
@prylabs-bulldozer prylabs-bulldozer bot merged commit 2428880 into prysmaticlabs:develop Dec 14, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants