
ClientV3: Add integration test for server timeout #10211

Closed
spzala wants to merge 5 commits into master from the keepalivetest branch

Conversation

Member

@spzala spzala commented Oct 24, 2018

Add integration test for server timeout.

Partially Fixes #8645

@spzala
Member Author

spzala commented Oct 24, 2018

/retest

@gyuho-bot

@spzala: The following test failed, say /retest to rerun them all:

Test name       Commit   Details  Rerun command
test-presubmit  7de789e  link     /test pull-etcd-fmt

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@codecov-io

codecov-io commented Oct 24, 2018

Codecov Report

Merging #10211 into master will increase coverage by 0.11%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #10211      +/-   ##
==========================================
+ Coverage   71.59%   71.71%   +0.11%     
==========================================
  Files         391      391              
  Lines       36386    36386              
==========================================
+ Hits        26051    26094      +43     
+ Misses       8514     8481      -33     
+ Partials     1821     1811      -10
Impacted Files Coverage Δ
pkg/testutil/recorder.go 77.77% <0%> (-3.71%) ⬇️
clientv3/leasing/txn.go 88.09% <0%> (-3.18%) ⬇️
raft/progress.go 94.17% <0%> (-1.95%) ⬇️
clientv3/balancer/grpc1.7-health.go 57.84% <0%> (-1.46%) ⬇️
proxy/grpcproxy/watch.go 88.55% <0%> (-1.21%) ⬇️
clientv3/watch.go 91.1% <0%> (-1.06%) ⬇️
etcdserver/server.go 74.39% <0%> (ø) ⬆️
etcdserver/v3_server.go 78.71% <0%> (ø) ⬆️
pkg/adt/interval_tree.go 88.58% <0%> (+0.6%) ⬆️
raft/raft.go 92.14% <0%> (+0.65%) ⬆️
... and 13 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update dedae6e...332e0a8. Read the comment docs.

}
clus.Members[1].Blackhole()
time.Sleep(10 * time.Second)
// remove blackhole but connection should be unavailable now
Contributor

// remove blackhole but keepalive ping should trigger server to close server-to-client connection?

Member Author

Sure, that's a better level of detail.

time.Sleep(10 * time.Second)
// remove blackhole but connection should be unavailable now
clus.Members[1].Unblackhole()
if _, err = clus.Client(1).Put(context.TODO(), "foo1", "bar1"); err != nil {
Contributor

_, err = clus.Client(1).Put(context.TODO(), "foo1", "bar1")
if err == nil {
  t.Error("rpc error expected")
}
ev, ok := status.FromError(err)
...

?

Member Author

Sure, that sounds good. Thanks @gyuho!
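A fuller version of the check sketched above might look like the following; it assumes the google.golang.org/grpc status and codes packages are imported in the test file, and the specific expected code (codes.Unavailable) is an illustrative guess rather than something confirmed in this thread.

// Expect the Put to fail once the server has closed the connection, then
// inspect the gRPC status carried by the error.
_, err = clus.Client(1).Put(context.TODO(), "foo1", "bar1")
if err == nil {
	t.Fatal("expected an rpc error, got none")
}
ev, ok := status.FromError(err)
if !ok {
	t.Fatalf("expected a gRPC status error, got %v", err)
}
if ev.Code() != codes.Unavailable {
	t.Errorf("expected %v, got %v", codes.Unavailable, ev.Code())
}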

)

func TestServerGRPCKeepAliveTimeout(t *testing.T) {
Contributor

Also document what this is testing?

Member Author

Yup, that sounds good.

// TestServerGRPCKeepAliveTimeout tests ServerParameters keepalive ping on server-side that
// after having pinged for keepalive check, the server waits for a duration of Timeout and
// if no activity is seen even after that the connection is closed. The keepalive takes few
// seconds to get in impact so typically the the time it takes to close the connection is
Contributor

s/the the/the/ :)

Member Author

:) Ahh, Yup. Thanks @gyuho
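For context, the server-side behavior the doc comment describes comes from gRPC's keepalive.ServerParameters (Time and Timeout). A minimal standalone sketch of how a gRPC server enables it; the values and the bare server setup are illustrative, not the ones used by etcd or this PR.

package main

import (
	"log"
	"net"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func main() {
	// After Time of no activity on a connection the server sends an HTTP/2 ping;
	// if the client does not acknowledge it within Timeout, the server closes
	// the connection.
	srv := grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{
		Time:    2 * time.Second, // illustrative values only
		Timeout: 1 * time.Second,
	}))

	lis, err := net.Listen("tcp", "localhost:0")
	if err != nil {
		log.Fatal(err)
	}
	if err := srv.Serve(lis); err != nil {
		log.Fatal(err)
	}
}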

@@ -30,6 +30,11 @@ import (
"google.golang.org/grpc/status"
)

// TestServerGRPCKeepAliveTimeout tests ServerParameters keepalive ping on server-side that
// after having pinged for keepalive check, the server waits for a duration of Timeout and
// if no activity is seen even after that the connection is closed. The keepalive takes few
Contributor

if no client activity is seen from clients, the connection is closed on server-side, to make it clear it's server-side operation that's being tested.

Contributor

@gyuho gyuho left a comment

A few nits to make the comment clearer, then LGTM. Thanks!

@gyuho
Contributor

gyuho commented Oct 25, 2018

/assign

@gyuho
Contributor

gyuho commented Oct 25, 2018

/lgtm

t.Fatal(err)
}
clus.Members[1].Blackhole()
time.Sleep(10 * time.Second)
Contributor

Looks like the test is flaky with GRPCKeepAliveTimeout set to 1 sec and waiting time set to 10 sec. Can we maybe reduce the GRPCKeepAliveTimeout?
https://travis-ci.com/etcd-io/etcd/jobs/154059557#L936

Member Author

@spzala spzala Oct 25, 2018

Sure, thanks! Yes, I noticed that another Travis job passed but this one failed. It's known that gRPC keepalive needs some extra time (as I also noted in the test doc comment), so I'm keeping the wait at 10 sec but reducing the timeout. I tested it multiple times locally and it looks good; hopefully that will be the case with Travis too :)

Starting 'integration' pass at Thu Oct 25 08:59:54 EDT 2018
Running integration tests...
=== RUN   TestServerGRPCKeepAliveTimeout
--- PASS: TestServerGRPCKeepAliveTimeout (10.28s)
=== RUN   TestServerGRPCKeepAliveTimeout
--- PASS: TestServerGRPCKeepAliveTimeout (10.28s)
=== RUN   TestServerGRPCKeepAliveTimeout
--- PASS: TestServerGRPCKeepAliveTimeout (10.20s)
PASS
ok  	go.etcd.io/etcd/clientv3/integration	30.796s
Finished 'integration' pass at Thu Oct 25 09:00:42 EDT 2018
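The knob under discussion is the server-side keepalive timeout that the integration cluster passes through to gRPC. A minimal sketch, assuming the GRPCKeepAliveInterval/GRPCKeepAliveTimeout fields of integration.ClusterConfig that similar keepalive tests in this repository use; the values are illustrative, not the PR's final settings.

clus := integration.NewClusterV3(t, &integration.ClusterConfig{
	Size:                  3,
	GRPCKeepAliveInterval: 2 * time.Second, // server pings after this much inactivity
	GRPCKeepAliveTimeout:  1 * time.Second, // then closes the connection if the ping is not acked
})
defer clus.Terminate(t)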

Endpoints: []string{eps[0]},
}

cli, err := clientv3.New(ccfg)
Contributor

Do we need cli in this test, given that NewClusterV3() already created clients?

Member Author

@spzala spzala Oct 25, 2018

Yes, the client is created here. I believe we need it for the desired config. Thanks!

Contributor

Sorry, I was not able to follow. Do you mean we need this for the desired config for the cluster? My understanding is that the keepAlive parameters were configured on the server side, and clus.clients are configured inside NewClusterV3.

Member Author

Yes, that's my understanding too: to override the defaults used in NewClusterV3. In this particular case it's just the endpoint, though, and it's more readable this way. I hope that makes sense, but I am open to suggestions. Thanks!

Contributor

I think the endpoints of clus.clients are configured in NewClusterV3:
https://github.com/etcd-io/etcd/blob/master/integration/cluster.go#L740

It is not clear to me what 'override' means here. cli and clus.Client(1) are two different clientv3 instances.

Member Author

Agreed. I have tried to address your concern in the new commit. Thanks!
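Roughly, the two alternatives being weighed look like this (a sketch that assumes the surrounding test's clus and t; the names cliA/cliB are only for illustration):

// Option A: reuse the client that NewClusterV3 already created for member 1;
// its endpoints were configured inside NewClusterV3.
cliA := clus.Client(1)
_ = cliA

// Option B (what the patch originally did): build a separate clientv3 client
// pinned to member 0's endpoint.
cliB, err := clientv3.New(clientv3.Config{
	Endpoints: []string{clus.Members[0].GRPCAddr()},
})
if err != nil {
	t.Fatal(err)
}
defer cliB.Close()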

@gyuho-bot gyuho-bot removed the lgtm label Oct 25, 2018
@gyuho-bot

New changes are detected. LGTM label has been removed.

@jingyih
Contributor

jingyih commented Oct 26, 2018

BTW, the test is still flaky with the reduced GRPCKeepAliveTimeout. Might need to investigate further?

@gyuho
Contributor

gyuho commented Oct 28, 2018

/approve cancel

@gyuho-bot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: gyuho

If they are not already assigned, you can assign the PR to them by writing /assign @gyuho in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gyuho
Contributor

gyuho commented Oct 28, 2018

@spzala Yeah, TestServerGRPCKeepAliveTimeout is flaky. Can you take a look?

@spzala
Member Author

spzala commented Oct 29, 2018

@gyuho @jingyih yup, it is flaky and strange. It looks good when I run the test on my local machine. I will investigate further; if you have any ideas or suggestions, please let me know. Thanks! (And sorry for the delay, I was away in NJ for the last 3 days for a family event.)

// close server-to-client connection.
clus.Members[1].Unblackhole()
_, err = clus.Client(1).Put(context.TODO(), "foo1", "bar1")
// TODO: keepalive sometimes doesn't work on first attempt.
Member Author

@gyuho hi, somewhat inspired by https://github.com/etcd-io/etcd/blob/master/clientv3/integration/black_hole_test.go#L197, I have added a for loop here. Also, the info log should help show how often the test iterates. Thanks!
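A sketch of the retry approach described above, continuing the surrounding test (clus, t, context, and time are assumed from it); the retry count and pause are illustrative.

// The keepalive-driven close does not always surface on the first RPC, so
// retry a bounded number of times and stop as soon as the expected error appears.
var lastErr error
for i := 0; i < 5; i++ {
	_, lastErr = clus.Client(1).Put(context.TODO(), "foo1", "bar1")
	if lastErr != nil {
		break
	}
	t.Logf("expected an error from server, but received none (attempt %d)", i)
	time.Sleep(time.Second)
}
if lastErr == nil {
	t.Fatal("expected the server to have closed the connection, but all Puts succeeded")
}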

Member Author

Hmmm... when I run the test locally it usually passes on the first attempt (and occasionally takes more attempts on the third run of the test), but here it's failing again :( (Outside the tests, it also runs fine when I manually drop the client connection.)

Member Author

@gyuho so this time it did pass the build. It seems like 1 CPU is always fine, but 2 and 4 don't pass on the first attempt (on my local env it actually passes most of the time). Can we keep this solution and watch it (I will be watching) to see how it behaves? Thanks!

Member Author

@gyuho I guess we need a better approach. I'm thinking about it, but if you have a suggestion please let me know. Thanks!!

@spzala
Member Author

spzala commented Nov 17, 2018

@gyuho hi, I was at KubeCon Shanghai and got back today. Could you please take a look at my previous comment when you get a chance? Thanks!!

@gyuho
Contributor

gyuho commented Nov 26, 2018

@spzala Sorry for delay.

Tests are still failing. Can you try increasing the sleep duration and see if that helps?

@spzala
Member Author

spzala commented Nov 27, 2018

Thanks @gyuho, and no problem! OK, I did, but let me play with it a little more.

@spzala
Member Author

spzala commented Nov 29, 2018

@gyuho hi, increasing the sleep time didn't make much difference in behavior on my local env, so I'm taking a slightly different approach. Please take a look and let's see how the build goes :). Thanks!

@spzala
Member Author

spzala commented Nov 30, 2018

@gyuho, hi, FYI, I don't see a build failure this time, except the spell-check suggestion to change ServerParameters -> Server Parameters. Thanks!

@gyuho
Contributor

gyuho commented Nov 30, 2018

Can you add "ServerParameters" to ".words"?

@spzala spzala force-pushed the keepalivetest branch 2 times, most recently from 1f95e95 to de0ecf3 on November 30, 2018 20:05
@spzala
Member Author

spzala commented Dec 2, 2018

Can you add "ServerParameters" to ".words"?

@gyuho thanks, that helped!

@spzala
Member Author

spzala commented Dec 5, 2018

@gyuho hi, please let me know what you think of it, no rush :). I think the build failure is now unrelated to these changes. Thanks!


eps := []string{clus.Members[0].GRPCAddr()}
ccfg := clientv3.Config{
Endpoints: []string{eps[0]},
Contributor

Just Endpoints: []string{clus.Members[0].GRPCAddr()}?

}
break
} else {
t.Logf("info: expected an error, received none.")
Contributor

@gyuho gyuho Dec 5, 2018

Just t.Log("expected an error from server, but received none")

@gyuho
Contributor

gyuho commented Dec 5, 2018

@spzala Sorry for delay.

@jingyih @hexfusion Can you take a look at this as well?

Thanks!

@hexfusion
Contributor

@gyuho @spzala locally I am still having a 50%+ failure rate on this test and would like to explore further.
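For anyone trying to reproduce the flake locally, one common approach is simply to rerun the single test many times; the package path below is taken from the test log earlier in this thread, and the count is arbitrary.

go test -v -run TestServerGRPCKeepAliveTimeout -count=20 go.etcd.io/etcd/clientv3/integration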

@spzala
Member Author

spzala commented Dec 6, 2018

@gyuho no problem at all :-) and thanks!! @hexfusion hmm, on my local VM it usually works for me (not always, but it only fails once in a while), but yes, I am up for exploring it further. Thanks!

@spzala
Member Author

spzala commented Dec 27, 2018

@jingyih @hexfusion as we discussed at KubeCon, if you can try to solve the mystery of this flaky test in your local env and co-author the PR, that would be great. The test seems to pass the first time but then fails. Thanks!!

@hexfusion
Contributor

hexfusion commented Dec 27, 2018 via email

@stale

stale bot commented Apr 6, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Apr 6, 2020
@stale stale bot closed this Apr 27, 2020

Successfully merging this pull request may close these issues.

grpc keepalive: test server-to-client HTTP/2 pings
6 participants