-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BUGFIX] Fix race condition in freeport #7567
[BUGFIX] Fix race condition in freeport #7567
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I found a race :)
@i0rek I fixed with your suggestion |
Ah I see the issue now. I think there's still a slight problem with goroutine lifespans with your suggestion. I'm testing locally and will report back. |
@rboyer the test is running for 1h+ now without any failure with
|
It's a very thin edge case condition, but this version in the PR only signals to the goroutine to stop, but does not actually wait for that to happen. I suspect that you could get the same panic if the machine was very burdened causing time to slow down a lot. Here's an alternate implementation that actually waits:
|
@rboyer your solution is more elegant and probably safer, I agree. I applied your patch |
@rboyer While I was not able to crash it for more than 1h with previous implementation, with you patch it fails after ~100 tests being run in a timeout:
|
1222f4f
to
209fbc1
Compare
@rboyer Found the issue: in you patch, you did watch for goroutine to end while having the mutex. |
I'm having terrifying memories come back to me around agent service registration locking. Relevant subset of changed code from the last snippet:
Contractually it shouldn't be true that there are |
This removes a race condition in reset since pendingPorts can be set to nil in reset() If ticker is hit at wrong time, it would crash the unit test. We ensure in reset to avoid this race condition by cancelling the goroutine using killTicker chan. We also properly clean up eveything, so garbage collector can work as expected. To reproduce existing bug: `while go test -timeout 30s github.com/hashicorp/consul/sdk/freeport -run '^(Test.*)$'; do go clean -testcache; done` Will crash after a few 10s runs on my machine. Error could be seen in unit tests sometimes: [INFO] freeport: resetting the freeport package state panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0x1125536] goroutine 25 [running]: container/list.(*List).Len(...) /usr/local/Cellar/go/1.14/libexec/src/container/list/list.go:66 github.com/hashicorp/consul/sdk/freeport.checkFreedPortsOnce() /Users/p.souchay/go/src/github.com/hashicorp/consul/sdk/freeport/freeport.go:157 +0x86 github.com/hashicorp/consul/sdk/freeport.checkFreedPorts() /Users/p.souchay/go/src/github.com/hashicorp/consul/sdk/freeport/freeport.go:147 +0x71 created by github.com/hashicorp/consul/sdk/freeport.initialize /Users/p.souchay/go/src/github.com/hashicorp/consul/sdk/freeport/freeport.go:113 +0x2cf FAIL github.com/hashicorp/consul/sdk/freeport 1.607s
209fbc1
to
d095a5d
Compare
@rboyer DONE as you suggested in last commit |
grrrr.. unstable tests |
668e6b7
to
a417900
Compare
…d merge upstream. Due to hashicorp#7562
a417900
to
81b9e83
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
This removes a race condition in reset since pendingPorts can be set to nil in reset()
If ticker is hit at wrong time, it would crash the unit test.
We ensure in reset to avoid this race condition by cancelling the goroutine using
killTicker chan.
We also properly clean up eveything, so garbage collector can work as expected.
To reproduce existing bug:
while go test -timeout 30s github.com/hashicorp/consul/sdk/freeport -run '^(Test.*)$'; do go clean -testcache; done
Will crash after a few 10s runs on my machine.
Error could be seen in unit tests sometimes:
[INFO] freeport: resetting the freeport package state
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0x1125536]
goroutine 25 [running]:
container/list.(*List).Len(...)
/usr/local/Cellar/go/1.14/libexec/src/container/list/list.go:66
github.com/hashicorp/consul/sdk/freeport.checkFreedPortsOnce()
/Users/p.souchay/go/src/github.com/hashicorp/consul/sdk/freeport/freeport.go:157 +0x86
github.com/hashicorp/consul/sdk/freeport.checkFreedPorts()
/Users/p.souchay/go/src/github.com/hashicorp/consul/sdk/freeport/freeport.go:147 +0x71
created by github.com/hashicorp/consul/sdk/freeport.initialize
/Users/p.souchay/go/src/github.com/hashicorp/consul/sdk/freeport/freeport.go:113 +0x2cf
FAIL github.com/hashicorp/consul/sdk/freeport 1.607s