ClientV3: Add integration test for server timeout #10211
Conversation
/retest
@spzala: The following test failed; say /retest to rerun it.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Codecov Report
@@ Coverage Diff @@
## master #10211 +/- ##
==========================================
+ Coverage 71.59% 71.71% +0.11%
==========================================
Files 391 391
Lines 36386 36386
==========================================
+ Hits 26051 26094 +43
+ Misses 8514 8481 -33
+ Partials 1821 1811 -10
Continue to review full report at Codecov.
}
clus.Members[1].Blackhole()
time.Sleep(10 * time.Second)
// remove blackhole but connection should be unavailable now
// remove blackhole but keepalive ping should trigger server to close server-to-client connection
?
Sure, that's a better detail.
time.Sleep(10 * time.Second)
// remove blackhole but connection should be unavailable now
clus.Members[1].Unblackhole()
if _, err = clus.Client(1).Put(context.TODO(), "foo1", "bar1"); err != nil {
_, err = clus.Client(1).Put(context.TODO(), "foo1", "bar1")
if err == nil {
t.Error("rpc error expected")
}
ev, ok := status.FromError(err)
...
?
Sure, that sounds good. Thanks @gyuho!
)

func TestServerGRPCKeepAliveTimeout(t *testing.T) {
Also document what this is testing?
Yup, that sounds good.
// TestServerGRPCKeepAliveTimeout tests ServerParameters keepalive ping on server-side that
// after having pinged for keepalive check, the server waits for a duration of Timeout and
// if no activity is seen even after that the connection is closed. The keepalive takes few
// seconds to get in impact so typically the the time it takes to close the connection is
s/the the/the/
:)
:) Ahh, Yup. Thanks @gyuho
@@ -30,6 +30,11 @@ import (
	"google.golang.org/grpc/status"
)

// TestServerGRPCKeepAliveTimeout tests ServerParameters keepalive ping on server-side that
// after having pinged for keepalive check, the server waits for a duration of Timeout and
// if no activity is seen even after that the connection is closed. The keepalive takes few
if no client activity is seen from clients, the connection is closed on server-side
, to make it clear it's server-side operation that's being tested.
a few nits to make comment clearer, then LGTM thanks!
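For context, the server-side keepalive settings this test exercises come from gRPC's `keepalive.ServerParameters`. A minimal sketch of how a server might be configured with them; the specific durations below are illustrative assumptions, not the values etcd's test harness uses:

```go
// Sketch only: Time is how long the server waits before pinging an idle
// client; Timeout is how long it then waits for a ping ack before closing
// the server-to-client connection.
import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func newServer() *grpc.Server {
	return grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{
		Time:    2 * time.Second, // assumed value, for illustration
		Timeout: 1 * time.Second, // assumed value, for illustration
	}))
}
```

Because the ping and the timeout window are sequential, a blackholed client is closed only after roughly Time plus Timeout has elapsed, which is why the test's wait is noticeably longer than the configured timeout.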
/assign
/lgtm
	t.Fatal(err)
}
clus.Members[1].Blackhole()
time.Sleep(10 * time.Second)
Looks like the test is flaky with GRPCKeepAliveTimeout set to 1 sec and waiting time set to 10 sec. Can we maybe reduce the GRPCKeepAliveTimeout?
https://travis-ci.com/etcd-io/etcd/jobs/154059557#L936
Sure, thanks! Yes, I noticed that another Travis job passed but this one failed. It's known that gRPC requires extra time (as I noted in the test doc), so I'm keeping the wait at 10 sec but reducing the timeout. I tested it multiple times locally and it looks good; hopefully that will be the case with Travis too :)
Starting 'integration' pass at Thu Oct 25 08:59:54 EDT 2018
Running integration tests...
=== RUN TestServerGRPCKeepAliveTimeout
--- PASS: TestServerGRPCKeepAliveTimeout (10.28s)
=== RUN TestServerGRPCKeepAliveTimeout
--- PASS: TestServerGRPCKeepAliveTimeout (10.28s)
=== RUN TestServerGRPCKeepAliveTimeout
--- PASS: TestServerGRPCKeepAliveTimeout (10.20s)
PASS
ok go.etcd.io/etcd/clientv3/integration 30.796s
Finished 'integration' pass at Thu Oct 25 09:00:42 EDT 2018
	Endpoints: []string{eps[0]},
}

cli, err := clientv3.New(ccfg)
Do we need cli in this test, given that NewClusterV3() already created clients?
yes, the client is created here. I believe we need this for the desired config. Thanks!
Sorry, I was not able to understand. Do you mean we need this for the desired config for the cluster? My understanding is that the keepAlive parameters were configured on server side. And clus.clients are configured inside NewClusterV3.
Yes, that's my understanding too: to override the defaults used in NewClusterV3. In this particular case it's just the endpoint, though, and it's more readable this way. Hope that makes sense, but I am open to suggestions. Thanks!
I think the endpoints of clus.clients are configured in NewClusterV3:
https://github.com/etcd-io/etcd/blob/master/integration/cluster.go#L740
It is not clear to me what override means. cli and clus.client(1) are two different ClientV3 instances.
Agree, I tried to fix your concern in the new commit. Thanks!
New changes are detected. LGTM label has been removed.
BTW, the test is still flaky with the reduced
/approve cancel
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: If they are not already assigned, you can assign the PR to them by writing. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
@spzala Yeah,
// close server-to-client connection.
clus.Members[1].Unblackhole()
_, err = clus.Client(1).Put(context.TODO(), "foo1", "bar1")
// TODO: keepalive sometimes doesn't work on first attempt.
@gyuho hi, somewhat inspired by this https://github.com/etcd-io/etcd/blob/master/clientv3/integration/black_hole_test.go#L197 I have added a for loop here. Also, the info log should help show how often the test iterates. Thanks!
Hmmm... when I run the test locally it usually passes on the first attempt, and sometimes takes more tries on the third run, but here it's failing again :( (Outside the tests, it runs fine when I manually drop the client connection as well.)
@gyuho so this time it did pass the build. Seems like 1 CPU is always fine, but 2 and 4 don't pass on the first attempt (on my local env it actually does pass mostly). Can we keep this solution and watch its behavior (I will be watching)? Thanks!
@gyuho I guess we need a better approach. I'm thinking about it, but if you have a suggestion please let me know. Thanks!!
@gyuho hi, I was at the KubeCon Shanghai and got back today. Please take a look at my previous comment when you get a chance? Thanks!!
@spzala Sorry for the delay. Tests are still failing. Can you try increasing the sleep duration and see if that helps?
Thanks @gyuho, and no problem! OK, I did, but let me play a little more with it.
@gyuho hi, increasing the sleep time didn't make much difference in behavior on my local env, so I'm taking a slightly different approach. Please take a look and let's see how the build goes :). Thanks!
@gyuho, hi, fyi, so I don't see a build failure this time except the suggestion on changing
Can you add "ServerParameters" to ".words"?
Partially Fixes etcd-io#8645
@gyuho thanks, that helped!
@gyuho hi, please let me know what you think of it, no rush :). I think the build failure is now not due to the changes. Thanks!
eps := []string{clus.Members[0].GRPCAddr()}
ccfg := clientv3.Config{
	Endpoints: []string{eps[0]},
Just Endpoints: []string{clus.Members[0].GRPCAddr()}
?
}
break
} else {
	t.Logf("info: expected an error, received none.")
Just t.Log("expected an error from server, but received none")
@spzala Sorry for the delay. @jingyih @hexfusion Can you take a look at this as well? Thanks!
@gyuho no problem at all :-) and thanks!! @hexfusion hmm.. on my local VM it usually works for me (not always, but failure is only once in a while), but yes, I am up for exploring it further. Thanks!
@jingyih @hexfusion as we discussed at the KubeCon, if you can try to solve the mystery with this flaky test in your local env and co-author the PR, that would be great. The test seems to pass the first time but then it fails. Thanks!!
Been traveling but will dig in after the new year.
--
Best,
Sam Batschelet
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Add integration test for server timeout.
Partially Fixes #8645