
ClientV3: Add integration test for server timeout #10211

Closed
spzala wants to merge 5 commits into master from the keepalivetest branch

Conversation

Member

@spzala spzala commented Oct 24, 2018

Add integration test for server timeout.

Partially Fixes #8645

@spzala
Member Author

spzala commented Oct 24, 2018

/retest

@gyuho-bot

@spzala: The following test failed, say /retest to rerun them all:

Test name       Commit   Details  Rerun command
test-presubmit  7de789e  link     /test pull-etcd-fmt

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@codecov-io

codecov-io commented Oct 24, 2018

Codecov Report

Merging #10211 into master will increase coverage by 0.11%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #10211      +/-   ##
==========================================
+ Coverage   71.59%   71.71%   +0.11%     
==========================================
  Files         391      391              
  Lines       36386    36386              
==========================================
+ Hits        26051    26094      +43     
+ Misses       8514     8481      -33     
+ Partials     1821     1811      -10
Impacted Files Coverage Δ
pkg/testutil/recorder.go 77.77% <0%> (-3.71%) ⬇️
clientv3/leasing/txn.go 88.09% <0%> (-3.18%) ⬇️
raft/progress.go 94.17% <0%> (-1.95%) ⬇️
clientv3/balancer/grpc1.7-health.go 57.84% <0%> (-1.46%) ⬇️
proxy/grpcproxy/watch.go 88.55% <0%> (-1.21%) ⬇️
clientv3/watch.go 91.1% <0%> (-1.06%) ⬇️
etcdserver/server.go 74.39% <0%> (ø) ⬆️
etcdserver/v3_server.go 78.71% <0%> (ø) ⬆️
pkg/adt/interval_tree.go 88.58% <0%> (+0.6%) ⬆️
raft/raft.go 92.14% <0%> (+0.65%) ⬆️
... and 13 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update dedae6e...332e0a8. Read the comment docs.

}
clus.Members[1].Blackhole()
time.Sleep(10 * time.Second)
// remove blackhole but connection should be unavailable now
Contributor

// remove blackhole but keepalive ping should trigger server to close server-to-client connection?

Member Author

Sure, that's a better level of detail.

time.Sleep(10 * time.Second)
// remove blackhole but connection should be unavailable now
clus.Members[1].Unblackhole()
if _, err = clus.Client(1).Put(context.TODO(), "foo1", "bar1"); err != nil {
Contributor

_, err = clus.Client(1).Put(context.TODO(), "foo1", "bar1")
if err == nil {
  t.Error("rpc error expected")
}
ev, ok := status.FromError(err)
...

?

Member Author

Sure, that sounds good. Thanks @gyuho!
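A fuller version of the check sketched above might look like the following; it assumes the google.golang.org/grpc status and codes packages are imported in the test file, and the specific expected code (codes.Unavailable) is an illustrative guess rather than something confirmed in this thread.

// Expect the Put to fail once the server has closed the connection, then
// inspect the gRPC status carried by the error.
_, err = clus.Client(1).Put(context.TODO(), "foo1", "bar1")
if err == nil {
	t.Fatal("expected an rpc error, got none")
}
ev, ok := status.FromError(err)
if !ok {
	t.Fatalf("expected a gRPC status error, got %v", err)
}
if ev.Code() != codes.Unavailable {
	t.Errorf("expected %v, got %v", codes.Unavailable, ev.Code())
}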

)

func TestServerGRPCKeepAliveTimeout(t *testing.T) {
Contributor

Also document what this is testing?

Member Author

Yup, that sounds good.

// TestServerGRPCKeepAliveTimeout tests ServerParameters keepalive ping on server-side that
// after having pinged for keepalive check, the server waits for a duration of Timeout and
// if no activity is seen even after that the connection is closed. The keepalive takes few
// seconds to get in impact so typically the the time it takes to close the connection is
Contributor

s/the the/the/ :)

Member Author

:) Ahh, Yup. Thanks @gyuho
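For context, the server-side behavior the doc comment describes comes from gRPC's keepalive.ServerParameters (Time and Timeout). A minimal standalone sketch of how a gRPC server enables it; the values and the bare server setup are illustrative, not the ones used by etcd or this PR.

package main

import (
	"log"
	"net"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func main() {
	// After Time of no activity on a connection the server sends an HTTP/2 ping;
	// if the client does not acknowledge it within Timeout, the server closes
	// the connection.
	srv := grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{
		Time:    2 * time.Second, // illustrative values only
		Timeout: 1 * time.Second,
	}))

	lis, err := net.Listen("tcp", "localhost:0")
	if err != nil {
		log.Fatal(err)
	}
	if err := srv.Serve(lis); err != nil {
		log.Fatal(err)
	}
}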

@@ -30,6 +30,11 @@ import (
"google.golang.org/grpc/status"
)

// TestServerGRPCKeepAliveTimeout tests ServerParameters keepalive ping on server-side that
// after having pinged for keepalive check, the server waits for a duration of Timeout and
// if no activity is seen even after that the connection is closed. The keepalive takes few
Contributor

if no client activity is seen from clients, the connection is closed on server-side, to make it clear it's server-side operation that's being tested.

Contributor

@gyuho gyuho left a comment

A few nits to make the comment clearer, then LGTM. Thanks!

@gyuho
Contributor

gyuho commented Oct 25, 2018

/assign

@gyuho
Contributor

gyuho commented Oct 25, 2018

/lgtm

t.Fatal(err)
}
clus.Members[1].Blackhole()
time.Sleep(10 * time.Second)
Contributor

Looks like the test is flaky with GRPCKeepAliveTimeout set to 1 sec and waiting time set to 10 sec. Can we maybe reduce the GRPCKeepAliveTimeout?
https://travis-ci.com/etcd-io/etcd/jobs/154059557#L936

Member Author

@spzala spzala Oct 25, 2018

Sure, thanks! Yes, I noticed that another Travis job passed but this one failed. It's known that gRPC keepalive needs some extra time (as I also noted in the test doc comment), so I'm keeping the wait at 10 sec but reducing the timeout. I tested it multiple times locally and it looks good; hopefully that will be the case with Travis too :)

Starting 'integration' pass at Thu Oct 25 08:59:54 EDT 2018
Running integration tests...
=== RUN   TestServerGRPCKeepAliveTimeout
--- PASS: TestServerGRPCKeepAliveTimeout (10.28s)
=== RUN   TestServerGRPCKeepAliveTimeout
--- PASS: TestServerGRPCKeepAliveTimeout (10.28s)
=== RUN   TestServerGRPCKeepAliveTimeout
--- PASS: TestServerGRPCKeepAliveTimeout (10.20s)
PASS
ok  	go.etcd.io/etcd/clientv3/integration	30.796s
Finished 'integration' pass at Thu Oct 25 09:00:42 EDT 2018
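The knob under discussion is the server-side keepalive timeout that the integration cluster passes through to gRPC. A minimal sketch, assuming the GRPCKeepAliveInterval/GRPCKeepAliveTimeout fields of integration.ClusterConfig that similar keepalive tests in this repository use; the values are illustrative, not the PR's final settings.

clus := integration.NewClusterV3(t, &integration.ClusterConfig{
	Size:                  3,
	GRPCKeepAliveInterval: 2 * time.Second, // server pings after this much inactivity
	GRPCKeepAliveTimeout:  1 * time.Second, // then closes the connection if the ping is not acked
})
defer clus.Terminate(t)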

Endpoints: []string{eps[0]},
}

cli, err := clientv3.New(ccfg)
Contributor

Do we need cli in this test, given that NewClusterV3() already created clients?

Member Author

@spzala spzala Oct 25, 2018

Yes, the client is created here. I believe we need it for the desired config. Thanks!

Contributor

Sorry, I was not able to follow. Do you mean we need this for the desired config for the cluster? My understanding is that the keepAlive parameters were configured on the server side, and clus.clients are configured inside NewClusterV3.

Member Author

Yes, that's my understanding too: to override the defaults used in NewClusterV3. In this particular case it's just the endpoint, though, and it's more readable this way. I hope that makes sense, but I am open to suggestions. Thanks!

Contributor

I think the endpoints of clus.clients are configured in NewClusterV3:
https://github.com/etcd-io/etcd/blob/master/integration/cluster.go#L740

It is not clear to me what 'override' means here. cli and clus.Client(1) are two different clientv3 instances.

Member Author

Agreed. I have tried to address your concern in the new commit. Thanks!
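Roughly, the two alternatives being weighed look like this (a sketch that assumes the surrounding test's clus and t; the names cliA/cliB are only for illustration):

// Option A: reuse the client that NewClusterV3 already created for member 1;
// its endpoints were configured inside NewClusterV3.
cliA := clus.Client(1)
_ = cliA

// Option B (what the patch originally did): build a separate clientv3 client
// pinned to member 0's endpoint.
cliB, err := clientv3.New(clientv3.Config{
	Endpoints: []string{clus.Members[0].GRPCAddr()},
})
if err != nil {
	t.Fatal(err)
}
defer cliB.Close()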

@gyuho-bot gyuho-bot removed the lgtm label Oct 25, 2018
@gyuho-bot

New changes are detected. LGTM label has been removed.

@jingyih
Contributor

jingyih commented Oct 26, 2018

BTW, the test is still flaky with the reduced GRPCKeepAliveTimeout. Might need to investigate further?

@gyuho
Contributor

gyuho commented Oct 28, 2018

/approve cancel

@gyuho-bot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: gyuho

If they are not already assigned, you can assign the PR to them by writing /assign @gyuho in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gyuho
Contributor

gyuho commented Oct 28, 2018

@spzala Yeah, TestServerGRPCKeepAliveTimeout is flaky. Can you take a look?

@spzala
Member Author

spzala commented Oct 29, 2018

@gyuho @jingyih yup, it is flaky and strange. It looks good when I run the test on my local machine. I will investigate further; if you have any ideas or suggestions, please let me know. Thanks! (And sorry for the delay, I was away in NJ for the last 3 days for a family event.)

// close server-to-client connection.
clus.Members[1].Unblackhole()
_, err = clus.Client(1).Put(context.TODO(), "foo1", "bar1")
// TODO: keepalive sometimes doesn't work on first attempt.
Member Author

@gyuho hi, somewhat inspired by https://github.com/etcd-io/etcd/blob/master/clientv3/integration/black_hole_test.go#L197, I have added a for loop here. Also, the info log should help show how often the test iterates. Thanks!
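A sketch of the retry approach described above, continuing the surrounding test (clus, t, context, and time are assumed from it); the retry count and pause are illustrative.

// The keepalive-driven close does not always surface on the first RPC, so
// retry a bounded number of times and stop as soon as the expected error appears.
var lastErr error
for i := 0; i < 5; i++ {
	_, lastErr = clus.Client(1).Put(context.TODO(), "foo1", "bar1")
	if lastErr != nil {
		break
	}
	t.Logf("expected an error from server, but received none (attempt %d)", i)
	time.Sleep(time.Second)
}
if lastErr == nil {
	t.Fatal("expected the server to have closed the connection, but all Puts succeeded")
}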

Member Author

Hmmm... when I run the test locally it usually passes on the first attempt (and occasionally takes more attempts on the third run of the test), but here it's failing again :( (Outside the tests, it also runs fine when I manually drop the client connection.)

Member Author

@gyuho so this time it did pass the build. It seems like 1 CPU is always fine, but 2 and 4 don't pass on the first attempt (on my local env it actually passes most of the time). Can we keep this solution and watch it (I will be watching) to see how it behaves? Thanks!

Member Author

@gyuho I guess we need a better approach. I'm thinking about it, but if you have a suggestion please let me know. Thanks!!

@spzala
Member Author

spzala commented Nov 17, 2018

@gyuho hi, I was at KubeCon Shanghai and got back today. Could you please take a look at my previous comment when you get a chance? Thanks!!

@gyuho
Contributor

gyuho commented Nov 26, 2018

@spzala Sorry for delay.

Tests are still failing. Can you try increasing the sleep duration and see if that helps?

@spzala
Member Author

spzala commented Nov 27, 2018

Thanks @gyuho, and no problem! OK, I did, but let me play with it a little more.

@spzala
Member Author

spzala commented Nov 29, 2018

@gyuho hi, increasing the sleep time didn't make much difference in behavior on my local env, so I'm taking a slightly different approach. Please take a look and let's see how the build goes :). Thanks!

@spzala
Member Author

spzala commented Nov 30, 2018

@gyuho, hi, FYI, I don't see a build failure this time, except the spell-check suggestion to change ServerParameters -> Server Parameters. Thanks!

@gyuho
Contributor

gyuho commented Nov 30, 2018

Can you add "ServerParameters" to ".words"?

@spzala spzala force-pushed the keepalivetest branch 2 times, most recently from 1f95e95 to de0ecf3 on November 30, 2018 20:05
@spzala
Member Author

spzala commented Dec 2, 2018

Can you add "ServerParameters" to ".words"?

@gyuho thanks, that helped!

@spzala
Member Author

spzala commented Dec 5, 2018

@gyuho hi, please let me know what you think of it, no rush :). I think the build failure is now unrelated to these changes. Thanks!


eps := []string{clus.Members[0].GRPCAddr()}
ccfg := clientv3.Config{
Endpoints: []string{eps[0]},
Contributor

Just Endpoints: []string{clus.Members[0].GRPCAddr()}?

}
break
} else {
t.Logf("info: expected an error, received none.")
Contributor

@gyuho gyuho Dec 5, 2018

Just t.Log("expected an error from server, but received none")

@gyuho
Contributor

gyuho commented Dec 5, 2018

@spzala Sorry for delay.

@jingyih @hexfusion Can you take a look at this as well?

Thanks!

@hexfusion
Contributor

@gyuho @spzala locally I am still having a 50%+ failure rate on this test and would like to explore further.
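For anyone trying to reproduce the flake locally, one common approach is simply to rerun the single test many times; the package path below is taken from the test log earlier in this thread, and the count is arbitrary.

go test -v -run TestServerGRPCKeepAliveTimeout -count=20 go.etcd.io/etcd/clientv3/integration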

@spzala
Member Author

spzala commented Dec 6, 2018

@gyuho no problem at all :-) and thanks!! @hexfusion hmm, on my local VM it usually works for me (not always, but it only fails once in a while), but yes, I am up for exploring it further. Thanks!

@spzala
Member Author

spzala commented Dec 27, 2018

@jingyih @hexfusion as we discussed at KubeCon, if you can try to solve the mystery of this flaky test in your local env and co-author the PR, that would be great. The test seems to pass the first time but then fails. Thanks!!

@hexfusion
Contributor

hexfusion commented Dec 27, 2018 via email

@stale

stale bot commented Apr 6, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Apr 6, 2020
@stale stale bot closed this Apr 27, 2020

Successfully merging this pull request may close these issues.

grpc keepalive: test server-to-client HTTP/2 pings
6 participants