Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

#31 resurfacing (?) #170

Closed
jxsl13 opened this issue Feb 6, 2023 · 9 comments · Fixed by #171
Closed

#31 resurfacing (?) #170

jxsl13 opened this issue Feb 6, 2023 · 9 comments · Fixed by #171
Assignees
Milestone

Comments

@jxsl13
Copy link

jxsl13 commented Feb 6, 2023

Hi,

I'm currently building a derived library with reconnect handling and my tests fail due to the already closed issue:

#31

In that issue tests were fixed with t.Cleanup but the actual data race was, as far as I can see, never tackled (?)
Where the assignment to allocator does not seem to be threadsafe (goroutinesafe) resulting in data race detections.

Code lines:
https://github.com/rabbitmq/amqp091-go/blob/main/connection.go#L555

https://github.com/rabbitmq/amqp091-go/blob/main/connection.go#L977

entrypoint for asynchronous memory access (both routines access c.allocator:
https://github.com/rabbitmq/amqp091-go/blob/main/connection.go#L268-L269

Do I do anything incorrectly?
(test case, which is highly dependent on the go runtime scheduler thus not always triggering the data race: https://github.com/jxsl13/amqpx/blob/batch-handler/pool/connection_test.go#L82)

Any feedback is appreciated.
Thanks for your time.

==================
WARNING: DATA RACE
Write at 0x00c0003e7450 by goroutine 56:
  github.com/rabbitmq/amqp091-go.(*Connection).openComplete()
      /home/jxsl13/go/pkg/mod/github.com/rabbitmq/[email protected]/connection.go:977 +0x150
  github.com/rabbitmq/amqp091-go.(*Connection).openVhost()
      /home/jxsl13/go/pkg/mod/github.com/rabbitmq/[email protected]/connection.go:962 +0x1c9
  github.com/rabbitmq/amqp091-go.(*Connection).openTune()
      /home/jxsl13/go/pkg/mod/github.com/rabbitmq/[email protected]/connection.go:948 +0xc66
  github.com/rabbitmq/amqp091-go.(*Connection).openStart()
      /home/jxsl13/go/pkg/mod/github.com/rabbitmq/[email protected]/connection.go:885 +0x66f
  github.com/rabbitmq/amqp091-go.(*Connection).open()
      /home/jxsl13/go/pkg/mod/github.com/rabbitmq/[email protected]/connection.go:857 +0xc4
  github.com/rabbitmq/amqp091-go.Open()
      /home/jxsl13/go/pkg/mod/github.com/rabbitmq/[email protected]/connection.go:269 +0x664
  github.com/rabbitmq/amqp091-go.DialConfig()
      /home/jxsl13/go/pkg/mod/github.com/rabbitmq/[email protected]/connection.go:250 +0x904
  github.com/jxsl13/amqpx/pool.(*Connection).connect()
      /home/jxsl13/Development/amqpx/pool/connection.go:188 +0x33a
  github.com/jxsl13/amqpx/pool.(*Connection).Connect()
      /home/jxsl13/Development/amqpx/pool/connection.go:176 +0x90
  github.com/jxsl13/amqpx/pool.NewConnection()
      /home/jxsl13/Development/amqpx/pool/connection.go:98 +0x999
  github.com/jxsl13/amqpx/pool_test.TestNewConnectionDisconnect.func1()
      /home/jxsl13/Development/amqpx/pool/connection_test.go:97 +0x26b
  github.com/jxsl13/amqpx/pool_test.TestNewConnectionDisconnect.func2()
      /home/jxsl13/Development/amqpx/pool/connection_test.go:115 +0x47

Previous write at 0x00c0003e7450 by goroutine 335:
  github.com/rabbitmq/amqp091-go.(*Connection).shutdown.func1()
      /home/jxsl13/go/pkg/mod/github.com/rabbitmq/[email protected]/connection.go:555 +0x4f9
  sync.(*Once).doSlow()
      /home/jxsl13/sdk/go1.19/src/sync/once.go:74 +0x101
  sync.(*Once).Do()
      /home/jxsl13/sdk/go1.19/src/sync/once.go:65 +0x46
  github.com/rabbitmq/amqp091-go.(*Connection).shutdown()
      /home/jxsl13/go/pkg/mod/github.com/rabbitmq/[email protected]/connection.go:522 +0x84
  github.com/rabbitmq/amqp091-go.(*Connection).reader()
      /home/jxsl13/go/pkg/mod/github.com/rabbitmq/[email protected]/connection.go:662 +0x444
  github.com/rabbitmq/amqp091-go.Open.func1()
      /home/jxsl13/go/pkg/mod/github.com/rabbitmq/[email protected]/connection.go:268 +0x58

Goroutine 56 (running) created at:
  github.com/jxsl13/amqpx/pool_test.TestNewConnectionDisconnect()
      /home/jxsl13/Development/amqpx/pool/connection_test.go:94 +0xee
  testing.tRunner()
      /home/jxsl13/sdk/go1.19/src/testing/testing.go:1446 +0x216
  testing.(*T).Run.func1()
      /home/jxsl13/sdk/go1.19/src/testing/testing.go:1493 +0x47

Goroutine 335 (finished) created at:
  github.com/rabbitmq/amqp091-go.Open()
      /home/jxsl13/go/pkg/mod/github.com/rabbitmq/[email protected]/connection.go:268 +0x629
  github.com/rabbitmq/amqp091-go.DialConfig()
      /home/jxsl13/go/pkg/mod/github.com/rabbitmq/[email protected]/connection.go:250 +0x904
  github.com/jxsl13/amqpx/pool.(*Connection).connect()
      /home/jxsl13/Development/amqpx/pool/connection.go:188 +0x33a
  github.com/jxsl13/amqpx/pool.(*Connection).Connect()
      /home/jxsl13/Development/amqpx/pool/connection.go:176 +0x90
  github.com/jxsl13/amqpx/pool.NewConnection()
      /home/jxsl13/Development/amqpx/pool/connection.go:98 +0x999
  github.com/jxsl13/amqpx/pool_test.TestNewConnectionDisconnect.func1()
      /home/jxsl13/Development/amqpx/pool/connection_test.go:97 +0x26b
  github.com/jxsl13/amqpx/pool_test.TestNewConnectionDisconnect.func2()
      /home/jxsl13/Development/amqpx/pool/connection_test.go:115 +0x47
@Zerpet Zerpet self-assigned this Feb 7, 2023
@Zerpet
Copy link
Contributor

Zerpet commented Feb 7, 2023

Your test is triggering quite an edge case, where the background frame reader receives an error and returns, before Open even completes. I agree it should not race.

Zerpet added a commit that referenced this issue Feb 7, 2023
Related to #170. There is an edge case where the background frame
handler receives an error before `Open` completes. In this case, there
is a data race to assign the value of the allocator. The destructor in
shutdown is already a critical section. Adding a tiny critical section
in openComplete to protect the allocator.

Signed-off-by: Aitor Perez Cedres <[email protected]>
@Zerpet
Copy link
Contributor

Zerpet commented Feb 7, 2023

I've pushed a fix in commit c503330, also in branch issue-170. Can you validate this fix before I submit a PR?

@Zerpet Zerpet added this to the 1.7.0 milestone Feb 7, 2023
@jxsl13
Copy link
Author

jxsl13 commented Feb 7, 2023

I think you will have to also add locking around https://github.com/rabbitmq/amqp091-go/blob/main/connection.go#L555
this one, as the sync.Once is a different mutex than c.m

I don't know the lifetime of the c.allocator object in order to see whether there are any other places where it is accessed across goroutines. In such a case it might be good to have a single or multiple helper methods that lock access to the allocator, do whatever needs to be done on the allocator and then return the result.

@Zerpet
Copy link
Contributor

Zerpet commented Feb 7, 2023

The mutex c.m in the destructor is the same mutex as in c.m in openComplete; in both cases, we access a field in the Connection struct, which is the pointer receiver of Connection. The destructor is already locked as shown here:

amqp091-go/connection.go

Lines 522 to 524 in c503330

c.destructor.Do(func() {
c.m.Lock()
defer c.m.Unlock()

@jxsl13
Copy link
Author

jxsl13 commented Feb 7, 2023

I see, my bad. Will test it later (Y)

@jxsl13
Copy link
Author

jxsl13 commented Feb 7, 2023

@Zerpet looks good. I re-ran my test 20 times and could not reproduce.

Edit:
Ran another 20 times and the test failed... will look into it..

@jxsl13
Copy link
Author

jxsl13 commented Feb 7, 2023

==================
WARNING: DATA RACE
Write at 0x00c0003b80c8 by goroutine 112:
  github.com/rabbitmq/amqp091-go.(*Connection).openTune()
      /home/jxsl13/go/pkg/mod/github.com/rabbitmq/[email protected]/connection.go:917 +0x6c4
  github.com/rabbitmq/amqp091-go.(*Connection).openStart()
      /home/jxsl13/go/pkg/mod/github.com/rabbitmq/[email protected]/connection.go:885 +0x6af
  github.com/rabbitmq/amqp091-go.(*Connection).open()
      /home/jxsl13/go/pkg/mod/github.com/rabbitmq/[email protected]/connection.go:857 +0xc4
  github.com/rabbitmq/amqp091-go.Open()
      /home/jxsl13/go/pkg/mod/github.com/rabbitmq/[email protected]/connection.go:269 +0x664
  github.com/rabbitmq/amqp091-go.DialConfig()
      /home/jxsl13/go/pkg/mod/github.com/rabbitmq/[email protected]/connection.go:250 +0x904
  github.com/jxsl13/amqpx/pool.(*Connection).connect()
      /home/jxsl13/Development/amqpx/pool/connection.go:188 +0x337
  github.com/jxsl13/amqpx/pool.(*Connection).Connect()
      /home/jxsl13/Development/amqpx/pool/connection.go:176 +0x90
  github.com/jxsl13/amqpx/pool.NewConnection()
      /home/jxsl13/Development/amqpx/pool/connection.go:98 +0x999
  github.com/jxsl13/amqpx/pool_test.TestNewConnectionDisconnect.func1()
      /home/jxsl13/Development/amqpx/pool/connection_test.go:97 +0x11b
  github.com/jxsl13/amqpx/pool_test.TestNewConnectionDisconnect.func2()
      /home/jxsl13/Development/amqpx/pool/connection_test.go:115 +0x47

Previous read at 0x00c0003b80c8 by goroutine 138:
  github.com/rabbitmq/amqp091-go.(*Connection).shutdown.func1()
      /home/jxsl13/go/pkg/mod/github.com/rabbitmq/[email protected]/connection.go:555 +0x413
  sync.(*Once).doSlow()
      /usr/lib/go-1.20/src/sync/once.go:74 +0x101
  sync.(*Once).Do()
      /usr/lib/go-1.20/src/sync/once.go:65 +0x46
  github.com/rabbitmq/amqp091-go.(*Connection).shutdown()
      /home/jxsl13/go/pkg/mod/github.com/rabbitmq/[email protected]/connection.go:522 +0x84
  github.com/rabbitmq/amqp091-go.(*Connection).reader()
      /home/jxsl13/go/pkg/mod/github.com/rabbitmq/[email protected]/connection.go:662 +0x435
  github.com/rabbitmq/amqp091-go.Open.func1()
      /home/jxsl13/go/pkg/mod/github.com/rabbitmq/[email protected]/connection.go:268 +0x58

Goroutine 112 (running) created at:
  github.com/jxsl13/amqpx/pool_test.TestNewConnectionDisconnect()
      /home/jxsl13/Development/amqpx/pool/connection_test.go:94 +0xe8
  testing.tRunner()
      /usr/lib/go-1.20/src/testing/testing.go:1576 +0x216
  testing.(*T).Run.func1()
      /usr/lib/go-1.20/src/testing/testing.go:1629 +0x47

Goroutine 138 (finished) created at:
  github.com/rabbitmq/amqp091-go.Open()
      /home/jxsl13/go/pkg/mod/github.com/rabbitmq/[email protected]/connection.go:268 +0x629
  github.com/rabbitmq/amqp091-go.DialConfig()
      /home/jxsl13/go/pkg/mod/github.com/rabbitmq/[email protected]/connection.go:250 +0x904
  github.com/jxsl13/amqpx/pool.(*Connection).connect()
      /home/jxsl13/Development/amqpx/pool/connection.go:188 +0x337
  github.com/jxsl13/amqpx/pool.(*Connection).Connect()
      /home/jxsl13/Development/amqpx/pool/connection.go:176 +0x90
  github.com/jxsl13/amqpx/pool.NewConnection()
      /home/jxsl13/Development/amqpx/pool/connection.go:98 +0x999
  github.com/jxsl13/amqpx/pool_test.TestNewConnectionDisconnect.func1()
      /home/jxsl13/Development/amqpx/pool/connection_test.go:97 +0x11b
  github.com/jxsl13/amqpx/pool_test.TestNewConnectionDisconnect.func2()
      /home/jxsl13/Development/amqpx/pool/connection_test.go:115 +0x47
==================
    toxiproxy_client_test.go:125: time=2023-02-07 18:31:32.470490147 +0100 CET m=+1.375569022 level=debug, msg=enabled rabbitmq connection, 
    testing.go:1446: race detected during execution of test
--- FAIL: TestNewConnectionDisconnect (34.14s)
=== NAME  
    testing.go:1446: race detected during execution of test

In order to reproduce:

git clone https://github.com/jxsl13/amqpx.git
cd amqpx
git checkout -b batch-handler

# requires docker & docker-compose (starts a rabbitmq & toxy-prtoxy container)
make environment

# optionally you may update your branch dependency
go get github.com/rabbitmq/amqp091-go@issue-170

go test -timeout 900s -run ^TestNewConnectionDisconnect$ github.com/jxsl13/amqpx/pool -v -race -count=40

Zerpet added a commit that referenced this issue Feb 8, 2023
Related to #170. There is an edge case where the TCP connection between
client-server is setup, and AMQP handshake starts, up to the point right
after sending Tune frame. At this point, the background frame reader
receives an error from the TCP socket and starts the shutdown sequence.
At the same time, the Tune function continues (as it has not completed)
and attempts to set the ChannelMax field in the connection struct. At
the same time, the shutdown sequence initiated by the error in the frame
handler reads the ChannelMax field. This creates a race.

A potential solution is to add a critical section in tune to protect
access to ChannelMax field. The destructor in the shutdown sequence is
already a critical section, protected by the struct mutex.

Signed-off-by: Aitor Perez Cedres <[email protected]>
@Zerpet
Copy link
Contributor

Zerpet commented Feb 8, 2023

I've run the tests 3 times with -count=40 and I can't reproduce the issue. I've pushed a small fix 0ecb414 to the branch, adding a critical section in tune to protect the access to ChannelMax. Can you try out this new patch?

@jxsl13
Copy link
Author

jxsl13 commented Feb 8, 2023

I also cannot reproduce it with the new version.
I guess that should fix this issue.
Thanks for the fast respone.

lukebakken pushed a commit that referenced this issue Feb 8, 2023
* Lock allocator assignment in openComplete

Related to #170. There is an edge case where the background frame
handler receives an error before `Open` completes. In this case, there
is a data race to assign the value of the allocator. The destructor in
shutdown is already a critical section. Adding a tiny critical section
in openComplete to protect the allocator.

Signed-off-by: Aitor Perez Cedres <[email protected]>

* Add a critical section in openTune

Related to #170. There is an edge case where the TCP connection between
client-server is setup, and AMQP handshake starts, up to the point right
after sending Tune frame. At this point, the background frame reader
receives an error from the TCP socket and starts the shutdown sequence.
At the same time, the Tune function continues (as it has not completed)
and attempts to set the ChannelMax field in the connection struct. At
the same time, the shutdown sequence initiated by the error in the frame
handler reads the ChannelMax field. This creates a race.

A potential solution is to add a critical section in tune to protect
access to ChannelMax field. The destructor in the shutdown sequence is
already a critical section, protected by the struct mutex.

Signed-off-by: Aitor Perez Cedres <[email protected]>

---------

Signed-off-by: Aitor Perez Cedres <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants