Skip to content
This repository has been archived by the owner on Oct 23, 2024. It is now read-only.

Fix re-auth hang. #1

Merged
merged 4 commits into from
Jan 30, 2018
Merged

Fix re-auth hang. #1

merged 4 commits into from
Jan 30, 2018

Conversation

mhrabovcin
Copy link

Manually cherry-pick patches from samuel#181

Copy link

@vespian vespian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not 100% happy that we are hardcoding test flags into the code, but I understand that e.g. rewriting the test to use mocks here would have cost us much more time/effort.

I have left some comments, let me know how you see it. Thanks!

zk/conn_test.go Outdated
select {
case <-finish:
return
case <-time.After(5 * sessionTimeout):
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if e.g. StartTestCluster takes longer than 10s?

zk/conn_test.go Outdated
if err != nil {
panic(err)
}
for conn.State() != StateHasSession {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we time-limit this wait somehow?

zk/conn.go Outdated
@@ -102,6 +102,10 @@ type Conn struct {
setWatchLimit int
setWatchCallback func([]*setWatchesRequest)

// Debug (for recurring re-auth hang)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's mention "test" here, just like it is the case for reconnectLatch field.

zk/conn.go Outdated
@@ -400,6 +431,10 @@ func (c *Conn) resendZkAuth(reauthReadyChan chan struct{}) {
}

for _, cred := range c.creds {
if shouldCancel() {
c.logger.Printf("Cancel rer-submitting credentials")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/rer/re/

zk/conn.go Outdated
@@ -476,6 +520,18 @@ func (c *Conn) loop() {
wg.Add(1)
go func() {
<-reauthChan
// This condition exists for signaling purposes, that the test
// `TestRecurringReAuthHang` was successful. The previous call
// `<-reauthChan` was not blocking. That means the
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/not blocking/did not block/

zk/conn.go Outdated
// See `TestRecurringReAuthHang`
if c.shouldDebugCloseRecvLoop() {
select {
case <-c.debugReauthDone:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we try to read from this channel here?

The test is reading from it as well, in what it seems a wait for close(c.debugReauthDone) which happens in the default: case below. Can we drop the select statement altogether?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Due to the use of goroutines scheduler it is possible that ZK client will try to reconnect to ZK before we close the ZK connection in tests. In that case the test would panic() if second attempt to connect was executed and c. debugReauthDone was closed.

zk/conn_test.go Outdated
case <-finish:
return
case <-time.After(5 * sessionTimeout):
panic("expected not hang")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are using here panic to signal that the test has failed.

What do you think about re-structuring the test a bit. Instead of launching a go subroutine that panic after some time, we push the:

+	currentServer := conn.Server()
+	conn.setDebugCloseRecvLoop(true)
+	zkC.StopServer(currentServer)
+	// wait connect to new zookeeper.
+	for conn.Server() == currentServer && conn.State() != StateHasSession {
+		time.Sleep(100 * time.Millisecond)
+	}

Block into a goroutine and wait in the test's goroutine for it to finish, or time out. I think that we could even get away without a goroutine and just do the wait-and-timeout in the block:

+	for conn.Server() == currentServer && conn.State() != StateHasSession {
+		time.Sleep(100 * time.Millisecond)
+	}

This could simplify the test.

@mhrabovcin
Copy link
Author

@vespian I've addressed your comments.

@mhrabovcin mhrabovcin force-pushed the mh/reauth branch 3 times, most recently from e23213b to 6b4cd28 Compare January 30, 2018 13:51
Copy link

@vespian vespian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🐑

// to reconnect multiple times before cleanly closing the
// test. This select here is to prevent closing
// `c.debugReauthDone` channel twice during the test and
// panic.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

zk/conn_test.go Outdated

select {
case <-ctx.Done():
t.Fatal("Failed to connect to reconnect ZK")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Library failed to reconnect to ZK in a timely manner.?

@vespian vespian merged commit 695728a into master Jan 30, 2018
@vespian vespian deleted the mh/reauth branch January 30, 2018 14:01
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants