-
Notifications
You must be signed in to change notification settings - Fork 4
Fix re-auth hang. #1
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not 100% happy that we are hardcoding test flags into the code, but I understand that e.g. rewriting the test to use mocks here would have cost us much more time/effort.
I have left some comments, let me know how you see it. Thanks!
zk/conn_test.go
Outdated
select { | ||
case <-finish: | ||
return | ||
case <-time.After(5 * sessionTimeout): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if e.g. StartTestCluster
takes longer than 10s?
zk/conn_test.go
Outdated
if err != nil { | ||
panic(err) | ||
} | ||
for conn.State() != StateHasSession { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we time-limit this wait somehow?
zk/conn.go
Outdated
@@ -102,6 +102,10 @@ type Conn struct { | |||
setWatchLimit int | |||
setWatchCallback func([]*setWatchesRequest) | |||
|
|||
// Debug (for recurring re-auth hang) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's mention "test" here, just like it is the case for reconnectLatch
field.
zk/conn.go
Outdated
@@ -400,6 +431,10 @@ func (c *Conn) resendZkAuth(reauthReadyChan chan struct{}) { | |||
} | |||
|
|||
for _, cred := range c.creds { | |||
if shouldCancel() { | |||
c.logger.Printf("Cancel rer-submitting credentials") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/rer/re/
zk/conn.go
Outdated
@@ -476,6 +520,18 @@ func (c *Conn) loop() { | |||
wg.Add(1) | |||
go func() { | |||
<-reauthChan | |||
// This condition exists for signaling purposes, that the test | |||
// `TestRecurringReAuthHang` was successful. The previous call | |||
// `<-reauthChan` was not blocking. That means the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/not blocking/did not block/
zk/conn.go
Outdated
// See `TestRecurringReAuthHang` | ||
if c.shouldDebugCloseRecvLoop() { | ||
select { | ||
case <-c.debugReauthDone: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we try to read from this channel here?
The test is reading from it as well, in what it seems a wait for close(c.debugReauthDone)
which happens in the default:
case below. Can we drop the select
statement altogether?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Due to the use of goroutines scheduler it is possible that ZK client will try to reconnect to ZK before we close the ZK connection in tests. In that case the test would panic()
if second attempt to connect was executed and c. debugReauthDone
was closed.
zk/conn_test.go
Outdated
case <-finish: | ||
return | ||
case <-time.After(5 * sessionTimeout): | ||
panic("expected not hang") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are using here panic
to signal that the test has failed.
What do you think about re-structuring the test a bit. Instead of launching a go subroutine that panic after some time, we push the:
+ currentServer := conn.Server()
+ conn.setDebugCloseRecvLoop(true)
+ zkC.StopServer(currentServer)
+ // wait connect to new zookeeper.
+ for conn.Server() == currentServer && conn.State() != StateHasSession {
+ time.Sleep(100 * time.Millisecond)
+ }
Block into a goroutine and wait in the test's goroutine for it to finish, or time out. I think that we could even get away without a goroutine and just do the wait-and-timeout in the block:
+ for conn.Server() == currentServer && conn.State() != StateHasSession {
+ time.Sleep(100 * time.Millisecond)
+ }
This could simplify the test.
@vespian I've addressed your comments. |
e23213b
to
6b4cd28
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🐑
// to reconnect multiple times before cleanly closing the | ||
// test. This select here is to prevent closing | ||
// `c.debugReauthDone` channel twice during the test and | ||
// panic. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
zk/conn_test.go
Outdated
|
||
select { | ||
case <-ctx.Done(): | ||
t.Fatal("Failed to connect to reconnect ZK") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Library failed to reconnect to ZK in a timely manner.
?
Manually cherry-pick patches from samuel#181