Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: datarace in the handshake #17

Closed
wants to merge 3 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
16 changes: 9 additions & 7 deletions .travis.yml
Original file line number Diff line number Diff line change
@@ -1,19 +1,21 @@
sudo: false
os:
- linux
- osx
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did we disable macos for a reason?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OSX is delaying builds on Travis so I disable it in repos that don't have code that might depend on the system.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it.


language: go

go:
- 1.7.1
- 1.8

install: true

script:
install:
- make deps
- gx-go rewrite
- go get github.com/mattn/goveralls
- goveralls -service=travis-ci

script:
- gx test -v -race -coverprofile=coverage.txt -covermode=atomic .

after_success:
- bash <(curl -s https://codecov.io/bash)

cache:
directories:
Expand Down
3 changes: 3 additions & 0 deletions codecov.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
coverage:
range: "50...100"
comment: off
40 changes: 28 additions & 12 deletions lazy.go
Original file line number Diff line number Diff line change
Expand Up @@ -36,13 +36,13 @@ func NewMultistream(c io.ReadWriteCloser, proto string) Multistream {
type lazyConn struct {
rhandshake bool // only accessed by 'Read' should not call read async

rhlock sync.Mutex
rhlock sync.RWMutex
rhsync bool //protected by mutex
rerr error

whandshake bool

whlock sync.Mutex
whlock sync.RWMutex
whsync bool
werr error

Expand All @@ -51,7 +51,11 @@ type lazyConn struct {
}

func (l *lazyConn) Read(b []byte) (int, error) {
if !l.rhandshake {
l.rhlock.RLock()
rhandshake := l.rhandshake
l.rhlock.RUnlock()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check was intentionally racy to avoid atomic operations on the fast path. Go routines should erroneously conclude that the handshake hasn't happened at most once (and the synchronization inside the handshake functions should prevent double handshakes).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that it might be only case when races are "good" but it screws over race test in all code that uses this. This means that we can't run race detector on big part of go-libp2p and go-ipfs code unless we tolerate false positives.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, there's got to be a way to tell go this is "fine" (hopefully without writing a C extension). I know I've seen code in the go standard library that does this...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, we can probably use internals/race. However, we can probably get a way with using sync.Once without losing much, if any, performance.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't use internal/race internal packages are locked and can be used only used buy children of the package. In this case only buy stdlib.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to do it previously with Once but I run into some problems. One of them was that I had to implement AsyncDo (simple) so not every time new goroutine is started when we want to run something once and then I got some lockup but I could not figure out what was causing it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why even do it in a goroutine? We need to wait for it to finish anyways. Can't we just have a lazyInitialize function and call once.Do(msm.lazyInitialize)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could have two Once's, one for the read handshake, one for the write handshake (where each handshake side would asynchronously fire off the other side of the handshake (in a once)).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, sorry for the massive delay...


if !rhandshake {
go l.writeHandshake()
err := l.readHandshake()
if err != nil {
Expand All @@ -69,13 +73,17 @@ func (l *lazyConn) Read(b []byte) (int, error) {
}

func (l *lazyConn) readHandshake() error {
l.rhlock.Lock()
defer l.rhlock.Unlock()

l.rhlock.RLock()
rhsync := l.rhsync
l.rhlock.RUnlock()
// if we've already done this, exit
if l.rhsync {
if rhsync {
return l.rerr
}

l.rhlock.Lock()
defer l.rhlock.Unlock()

l.rhsync = true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is racy. You can't check, unlock, then relock and expect the state to stay the same.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, I think I missed because I restructured this code twice while writing it reverting to simpler solution.

Let's figure out #17 (comment) first.


for _, proto := range l.protos {
Expand All @@ -96,13 +104,17 @@ func (l *lazyConn) readHandshake() error {
}

func (l *lazyConn) writeHandshake() error {
l.whlock.Lock()
defer l.whlock.Unlock()

if l.whsync {
l.whlock.RLock()
whsync := l.whsync
l.whlock.RUnlock()
// if we've already done this, exit
if whsync {
return l.werr
}

l.whlock.Lock()
defer l.whlock.Unlock()

l.whsync = true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same race here.


buf := bufio.NewWriter(l.con)
Expand All @@ -119,7 +131,11 @@ func (l *lazyConn) writeHandshake() error {
}

func (l *lazyConn) Write(b []byte) (int, error) {
if !l.whandshake {
l.whlock.RLock()
whandshake := l.whandshake
l.whlock.RUnlock()

if !whandshake {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, intentionally a racy fast-path.

go l.readHandshake()
err := l.writeHandshake()
if err != nil {
Expand Down