
clientv3, etcdmain, proxy: support authed RPCs with grpcproxy #8289

Merged
merged 4 commits from mitake:auth-proxy into etcd-io:master on Sep 22, 2017

Conversation

@mitake (Contributor) commented Jul 20, 2017

This commit lets grpcproxy support authed RPCs. Auth tokens supplied
by clients are now forwarded to etcdserver by grpcproxy.

Fix #8247

/cc @zhuqunzhou @heyitsanthony

This PR is still WIP because I'm considering how to protect proxy-based watch. It isn't as simple as the other RPCs because of its streaming and broadcast nature.

@zhuqunzhou could you try it for your purpose? Authed KV requests through the proxy should work now.

@@ -175,3 +181,33 @@ func (cp *clusterProxy) MemberList(ctx context.Context, r *pb.MemberListRequest)
resp := (pb.MemberListResponse)(*mresp)
return &resp, err
}

func (ap *authedClusterProxy) MemberAdd(ctx context.Context, r *pb.MemberAddRequest) (*pb.MemberAddResponse, error) {
token := getAuthTokenFromClient(ctx)
@heyitsanthony (Contributor):
these are very invasive; would it be possible to use a unary interceptor instead?

@mitake (Contributor Author):
ok, I'll try the interceptor mechanism.
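For context, here is a minimal sketch of what such a unary client interceptor in the proxy could look like. This is illustrative, not the PR's exact code; in particular, the "token" gRPC metadata key is an assumption.

package grpcproxy

import (
	"context"

	"google.golang.org/grpc"
	"google.golang.org/grpc/metadata"
)

// authTokenForwardingInterceptor copies a client-supplied auth token from the
// proxy's incoming gRPC metadata onto the outgoing call to the etcd server,
// so every unary RPC carries the caller's credentials.
// Sketch only: the "token" metadata key is an assumption.
func authTokenForwardingInterceptor(ctx context.Context, method string, req, reply interface{},
	cc *grpc.ClientConn, invoker grpc.UnaryInvoker, opts ...grpc.CallOption) error {
	if md, ok := metadata.FromIncomingContext(ctx); ok {
		if ts := md["token"]; len(ts) > 0 {
			ctx = metadata.NewOutgoingContext(ctx, metadata.Pairs("token", ts[0]))
		}
	}
	return invoker(ctx, method, req, reply, cc, opts...)
}

Such an interceptor could be wired in with grpc.WithUnaryInterceptor when the proxy dials the backend, which keeps the per-RPC proxy handlers untouched.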

@mitake force-pushed the auth-proxy branch 2 times, most recently from f84dc07 to 0fdbaef on July 24, 2017 at 09:53
@mitake (Contributor Author) commented Jul 24, 2017

@heyitsanthony I followed the interceptor approach, and it is quite convenient! I'm still considering how to handle the watch & auth combination. Possibly OCC-style checking (1. check permission with a range request, 2. watch until something happens, 3. check the permission again) would work, but I'm not sure this is the best solution yet.

@heyitsanthony (Contributor):
@mitake as a first attempt, the watch permission checking could simply check at time of watch creation like the server does now

@mitake (Contributor Author) commented Jul 25, 2017

@heyitsanthony I think checking only at watch creation doesn't work well, because the results from the watch can be forwarded to other clients, and those clients could belong to different users.

For the same reason, caching range results for serializable requests in grpcproxy wouldn't be able to coexist with auth. Probably an option for turning off the caching would be required?

@mitake (Contributor Author) commented Jul 25, 2017

Or would it be reasonable to let JWT tokens carry permission information and check it in the proxy? https://github.com/coreos/etcd/blob/master/auth/jwt.go#L66

Clearly it bloats the tokens, but it would be useful for handling watch in the proxy. The simple token isn't meant for serious use, and CN-based auth cannot work with the proxy because clients shouldn't share a single CN.

@heyitsanthony (Contributor):
@mitake I was thinking that at watch creation time the proxy could get user/role information from etcd to determine whether the user is authorized to read the keys it's requesting, before attaching to any watch that's already open.

Caching could be made safe by having a separate cache per user.
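A rough sketch of what a per-user cache could look like; the types and method names below are placeholders for illustration, not grpcproxy's actual cache API.

package grpcproxy

import "sync"

// userCaches keeps one independent cache per authenticated username, so a
// response cached for one user is never served to another. The interface{}
// value stands in for whatever the proxy caches for serializable ranges;
// this is a sketch, not the proxy's real cache implementation.
type userCaches struct {
	mu     sync.Mutex
	caches map[string]map[string]interface{} // username -> (request key -> cached response)
}

func newUserCaches() *userCaches {
	return &userCaches{caches: make(map[string]map[string]interface{})}
}

func (u *userCaches) get(user, reqKey string) (interface{}, bool) {
	u.mu.Lock()
	defer u.mu.Unlock()
	resp, ok := u.caches[user][reqKey]
	return resp, ok
}

func (u *userCaches) put(user, reqKey string, resp interface{}) {
	u.mu.Lock()
	defer u.mu.Unlock()
	if u.caches[user] == nil {
		u.caches[user] = make(map[string]interface{})
	}
	u.caches[user][reqKey] = resp
}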

@mitake (Contributor Author) commented Jul 27, 2017

@heyitsanthony even if a user has the permission at first, the permission can be dropped after the watch RPC is issued. So I thought the OCC-style double checking would be required.

Per-user caching seems nice. But this idea has a similar problem to the above: a user can be deleted and a new user with the same name can be created.

Of course they are corner cases :)

@heyitsanthony (Contributor) commented Jul 27, 2017

> A user can be deleted and a new user with the same name can be created.

Perhaps there's some way to associate it with auth tokens? If the auth token changes, check to see if the user still has authorization. Serving cached data after token expiration should be OK since all the data in the cache was acquired using that expired token in the first place.

> even if a user has the permission at first, the permission can be dropped after the watch RPC is issued.

The etcd server has this same issue, right? A user can create a watch and the permission can be dropped but it'll keep streaming keys. I'm not too worried about this for the time being; binding the permission at the time of acquiring a resource is common, although not ideal. Or am I missing something?

A stream interceptor could handle cutting off the watch. At the moment, I can't think of a simple way to make it consistent with auth that wouldn't involve issuing auth rpcs for every watch response.

@mitake (Contributor Author) commented Jul 27, 2017

@heyitsanthony on second thought, watch RPCs don't involve Raft, so very strict checking would be overkill. As you say, the stream interceptor approach seems reasonable. I'll try it. Thanks for your comments.

@mitake (Contributor Author) commented Jul 27, 2017

BTW, I'm still thinking about grpcproxy's caching mechanism (it is not related to watch). If a Range request is serializable, the proxy tries to return its cached result. But this cached result wouldn't be invalidated by Put requests that bypass the proxy (e.g. Puts served by etcd directly or by other proxies). These semantics are weaker than serializable. Is this OK?

@heyitsanthony (Contributor):
It's too weak for many purposes. #7624 would get serializability for real.

@mitake (Contributor Author) commented Jul 27, 2017

I see, thanks for the pointer.

@mitake (Contributor Author) commented Jul 28, 2017

@heyitsanthony the latest version supports the watch RPC. It couldn't be implemented in as non-intrusive a manner as the other RPCs, for two reasons:

  1. The newly added stream client interceptor (https://github.com/coreos/etcd/pull/8289/files#diff-4b018c1eecd18acda2df8c535d6bda8dR57) couldn't obtain a context that has an auth token in its metadata. For this reason, the call site of the watch RPC manually attaches the token, like this: https://github.com/coreos/etcd/pull/8289/files#diff-6b34059c09af13ee01755344d47bf9b0R62 (see the sketch after this list).
  2. A client can join an existing broadcast that was already authorized for its first client. In that case, a client that doesn't have permission on the watched range could read the events unless it is authorized individually. I added a checking mechanism for this, here: https://github.com/coreos/etcd/pull/8289/files#diff-36718647f54de9509ca1898255badf11R223
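To make point 1 concrete, here is a minimal sketch of attaching the forwarded token at the watch call site. getAuthTokenFromClient is assumed to behave like the helper visible in the diff above, the "token" metadata key is an assumption, and this is not the PR's exact code.

package grpcproxy

import (
	"context"

	"google.golang.org/grpc/metadata"
)

// getAuthTokenFromClient extracts the token that the client attached to its
// request to the proxy (a sketch of the helper seen in this PR's diff).
func getAuthTokenFromClient(ctx context.Context) string {
	if md, ok := metadata.FromIncomingContext(ctx); ok {
		if ts := md["token"]; len(ts) > 0 {
			return ts[0]
		}
	}
	return ""
}

// forwardAuthToken copies the client's token (if any) onto the context the
// proxy uses for its own watch stream to the etcd server. Stream RPCs need
// this explicit step because, as described above, the stream client
// interceptor does not see the incoming token metadata.
func forwardAuthToken(cctx, clientStreamCtx context.Context) context.Context {
	if token := getAuthTokenFromClient(clientStreamCtx); token != "" {
		cctx = metadata.NewOutgoingContext(cctx, metadata.Pairs("token", token))
	}
	return cctx
}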

@mitake (Contributor Author) commented Jul 28, 2017

@heyitsanthony I'll add test cases covering the new changes. Can we agree on the intrusive change required for supporting watch?

@heyitsanthony (Contributor) commented Jul 31, 2017

@mitake this approach looks OK. Try removing the !cluster_proxy build tag in https://github.com/coreos/etcd/blob/master/e2e/ctl_v3_auth_test.go#L16 to enable proxy e2e auth tests.

@mitake (Contributor Author) commented Aug 1, 2017

@heyitsanthony ah I see. I revived the tests with the latest commit.

@mitake (Contributor Author) commented Aug 24, 2017

@heyitsanthony it seems that the previous CI failures were caused by slowness of the server. I rebased and updated this PR. Could you check whether the latest CI run goes well?

@mitake (Contributor Author) commented Aug 24, 2017

hmm... the CIs are producing errors I haven't seen before. I'm checking them.

@mitake (Contributor Author) commented Aug 25, 2017

@heyitsanthony I found a silly mistake in handling the snapshot RPC and fixed it in the new commit (9e6e21c). It should solve the CI failures. Could you take a look once CI passes?

@mitake (Contributor Author) commented Aug 25, 2017

Something is still broken in integration tests... I'm looking at it.

@mitake (Contributor Author) commented Aug 28, 2017

@heyitsanthony I found that the clientv3 integration pass was failing because of a too-short timeout value. I changed it in the latest commit: 22abc7a. I want to see what happens on the CI servers with the change.

@mitake (Contributor Author) commented Aug 28, 2017

@heyitsanthony I checked the latest CI failures (which produce very lengthy logs!) and found that the bloated error logs came from the failure of TestV3WatchFromCurrentRevision. The failure paths of the test don't clean up resources, including clients, so the subsequent tests' AfterTest() detects the leaked goroutines, which are not related to them.

The failure of TestV3WatchFromCurrentRevision seems to be caused by a CPU shortage or something like that, because the test passes on my local environment. Is it possible to restart the CIs?

@mitake force-pushed the auth-proxy branch 3 times, most recently from 75c163e to c16785c on August 31, 2017 at 09:38
@mitake force-pushed the auth-proxy branch 2 times, most recently from 0019a99 to 267b9f5 on September 1, 2017 at 09:16
@mitake (Contributor Author) commented Sep 1, 2017

@heyitsanthony the CIs are still failing, but the latest failures don't seem related to this PR:

  1. build failure? https://jenkins-etcd-public.prod.coreos.systems/job/etcd-coverage/2208/console
  2. TestV3CorruptAlarm failure: https://jenkins-etcd-public.prod.coreos.systems/job/etcd-ci-ppc64/2092/console . I couldn't reproduce it on my local environment in 10 iterations.
  3. TestV3KVInflightRangeRequests failure: https://jenkins-etcd-public.prod.coreos.systems/job/etcd-proxy/2610/console . Could it be caused by boltdb? I couldn't reproduce this failure in my environment either.
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xd7b100]

goroutine 117006 [running]:
github.com/coreos/bbolt.(*Tx).Bucket(0x0, 0x19ae35a, 0x3, 0x3, 0x1a299c0)
	/home/jenkins/workspace/etcd-proxy/gopath/src/github.com/coreos/bbolt/tx.go:101 +0x30
github.com/coreos/etcd/mvcc/backend.(*readTx).UnsafeRange(0xc4203ad220, 0x19ae35a, 0x3, 0x3, 0xc42a1f6580, 0x11, 0x12, 0x0, 0x0, 0x0, ...)
	/home/jenkins/workspace/etcd-proxy/gopath/src/github.com/coreos/etcd/mvcc/backend/read_tx.go:75 +0x826
github.com/coreos/etcd/mvcc.(*storeTxnRead).rangeKeys(0xc423880d80, 0xc4227fa600, 0x12, 0x20, 0x0, 0x0, 0x0, 0x2, 0x0, 0x0, ...)
	/home/jenkins/workspace/etcd-proxy/gopath/src/github.com/coreos/etcd/mvcc/kvstore_txn.go:140 +0x826
github.com/coreos/etcd/mvcc.(*storeTxnRead).Range(0xc423880d80, 0xc4227fa600, 0x12, 0x20, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc4238a3100, ...)
	/home/jenkins/workspace/etcd-proxy/gopath/src/github.com/coreos/etcd/mvcc/kvstore_txn.go:45 +0xe3
github.com/coreos/etcd/mvcc.(*txnReadWrite).Range(0xc4203b40a0, 0xc4227fa600, 0x12, 0x20, 0x0, 0x0, 0x0, 0x0, 0x0, 0x452600, ...)
	<autogenerated>:30 +0xf7
github.com/coreos/etcd/mvcc.(*metricsTxnWrite).Range(0xc423880db0, 0xc4227fa600, 0x12, 0x20, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc423cb2e00, ...)
	/home/jenkins/workspace/etcd-proxy/gopath/src/github.com/coreos/etcd/mvcc/metrics_txn.go:38 +0x12f
github.com/coreos/etcd/etcdserver.(*applierV3backend).Range(0xc423ef8860, 0x19be600, 0xc423880db0, 0xc42d468000, 0x0, 0x0, 0x0)
	/home/jenkins/workspace/etcd-proxy/gopath/src/github.com/coreos/etcd/etcdserver/apply.go:263 +0x323
github.com/coreos/etcd/etcdserver.(*EtcdServer).Range.func2()
	/home/jenkins/workspace/etcd-proxy/gopath/src/github.com/coreos/etcd/etcdserver/v3_server.go:99 +0xa6
github.com/coreos/etcd/etcdserver.(*EtcdServer).doSerialize(0xc42014f680, 0x7f58d5f733b8, 0xc422a5f620, 0xc4227fa920, 0xc4238a3548, 0x5, 0x5bb819bddda23564)
	/home/jenkins/workspace/etcd-proxy/gopath/src/github.com/coreos/etcd/etcdserver/v3_server.go:531 +0x15a
github.com/coreos/etcd/etcdserver.(*EtcdServer).Range(0xc42014f680, 0x7f58d5f733b8, 0xc422a5f620, 0xc42d468000, 0x0, 0x12e1220, 0x48674c)
	/home/jenkins/workspace/etcd-proxy/gopath/src/github.com/coreos/etcd/etcdserver/v3_server.go:100 +0x25a
github.com/coreos/etcd/etcdserver/api/v3rpc.(*kvServer).Range(0xc420931d00, 0x7f58d5f733b8, 0xc422a5f620, 0xc42d468000, 0x12dc180, 0xc4238a3600, 0x19bfb00)
	/home/jenkins/workspace/etcd-proxy/gopath/src/github.com/coreos/etcd/etcdserver/api/v3rpc/key.go:50 +0xbc
github.com/coreos/etcd/etcdserver/api/v3rpc.(*quotaKVServer).Range(0xc420931e00, 0x7f58d5f733b8, 0xc422a5f620, 0xc42d468000, 0xc420931e00, 0x5, 0x13eacfe)
	<autogenerated>:26 +0x87
github.com/coreos/etcd/etcdserver/etcdserverpb._KV_Range_Handler.func1(0x7f58d5f733b8, 0xc422a5f620, 0x13ac220, 0xc42d468000, 0xc420da8f00, 0xc429eebb80, 0x7, 0xc429eebb87)
	/home/jenkins/workspace/etcd-proxy/gopath/src/github.com/coreos/etcd/etcdserver/etcdserverpb/rpc.pb.go:3308 +0xa2
github.com/grpc-ecosystem/go-grpc-prometheus.UnaryServerInterceptor(0x7f58d5f733b8, 0xc422a5f620, 0x13ac220, 0xc42d468000, 0xc4227fa8c0, 0xc4227fa900, 0xc420012c00, 0xc420012c00, 0x47fc00, 0xc420012c00)
	/home/jenkins/workspace/etcd-proxy/gopath/src/github.com/grpc-ecosystem/go-grpc-prometheus/server.go:29 +0xfb
github.com/coreos/etcd/etcdserver/api/v3rpc.newUnaryInterceptor.func1(0x7f58d5f733b8, 0xc422a5f620, 0x13ac220, 0xc42d468000, 0xc4227fa8c0, 0xc4227fa900, 0x0, 0x437387, 0xc420012c00, 0xc420012c00)
	/home/jenkins/workspace/etcd-proxy/gopath/src/github.com/coreos/etcd/etcdserver/api/v3rpc/interceptor.go:57 +0x106
github.com/coreos/etcd/etcdserver/etcdserverpb._KV_Range_Handler(0x12dc180, 0xc420931e00, 0x7f58d5f733b8, 0xc422a5f620, 0xc420da8eb0, 0xc42ab81ca0, 0x0, 0x0, 0xc423ca0800, 0x0)
	/home/jenkins/workspace/etcd-proxy/gopath/src/github.com/coreos/etcd/etcdserver/etcdserverpb/rpc.pb.go:3310 +0x1fc
google.golang.org/grpc.(*Server).processUnaryRPC(0xc420189b80, 0x19c2b60, 0xc4221f5080, 0xc42049d000, 0xc429072e70, 0x19fcde0, 0xc429fbe5a0, 0x0, 0x0)
	/home/jenkins/workspace/etcd-proxy/gopath/src/google.golang.org/grpc/server.go:781 +0x1222
google.golang.org/grpc.(*Server).handleStream(0xc420189b80, 0x19c2b60, 0xc4221f5080, 0xc42049d000, 0xc429fbe5a0)
	/home/jenkins/workspace/etcd-proxy/gopath/src/google.golang.org/grpc/server.go:981 +0x150f
google.golang.org/grpc.(*Server).serveStreams.func1.1(0xc4238da340, 0xc420189b80, 0x19c2b60, 0xc4221f5080, 0xc42049d000)
	/home/jenkins/workspace/etcd-proxy/gopath/src/google.golang.org/grpc/server.go:551 +0xb7
created by google.golang.org/grpc.(*Server).serveStreams.func1
	/home/jenkins/workspace/etcd-proxy/gopath/src/google.golang.org/grpc/server.go:552 +0xb9

Possibly the 2nd and 3rd problems need their own issues?

Also, I resolved all failing tests related to this PR. The main reason for the failures was that the newly added permission checking in grpcproxy broke the watch API's semantics. It took a long time because of two test cases, so I summarize them below:

  1. TestWatchWithRequireLeader: the newly added permission checking (watchProxyStream.checkPermissionForWatch) issued its range request as a linearizable one, which doesn't work when the cluster has no leader. It is solved by turning on the Serializable flag of the range request here: https://github.com/coreos/etcd/pull/8289/files#diff-36718647f54de9509ca1898255badf11R212 (a sketch of such a check follows after this list).
  2. TestWatchErrConnClosed: the permission checking was returning its own WatchResponse even when an error was not related to permissions. That breaks the semantics, so I added a new condition limiting the WatchResponse path: https://github.com/coreos/etcd/pull/8289/files#diff-36718647f54de9509ca1898255badf11R230.
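As a sketch of such a serializable permission check, expressed through clientv3 for brevity (the reviewed fragment further below shows the request fields Serializable, Key and RangeEnd being set directly): the proxy reads over the watched key range with the caller's credentials attached to ctx. The function name, and the CountOnly/Limit options that a later review comment also suggests, are illustrative assumptions.

package grpcproxy

import (
	"context"

	"github.com/coreos/etcd/clientv3"
)

// checkWatchPermission verifies that the caller identified by ctx may read the
// watched key range before the proxy attaches it to a (possibly shared) watch
// broadcast. Serializable() keeps the check working even when the cluster has
// no leader; CountOnly/Limit avoid fetching the actual data. Sketch only.
func checkWatchPermission(ctx context.Context, kv clientv3.KV, key, rangeEnd string) error {
	_, err := kv.Get(ctx, key,
		clientv3.WithRange(rangeEnd),
		clientv3.WithSerializable(),
		clientv3.WithCountOnly(),
		clientv3.WithLimit(1),
	)
	// a permission-denied error here means the caller may not read this range
	return err
}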

@mitake (Contributor Author) commented Sep 13, 2017

@heyitsanthony I looked at the latest CI failures for a while, but they don't seem related to this PR (not 100% sure, though). Could you take a look when you have time?

@heyitsanthony (Contributor) left a comment:
minor nits

Serializable: true,
Key: key,
RangeEnd: rangeEnd,
}
@heyitsanthony (Contributor):

CountOnly: 1, Limit: 1 so this doesn't fetch a lot of data

@mitake (Contributor Author):

Thanks, I'll address the comments soon.

@@ -59,6 +59,12 @@ func newWatchBroadcast(wp *watchProxy, w *watcher, update func(*watchBroadcast))
clientv3.WithCreatedNotify(),
}

// Forward a token from client to server.
token := getAuthTokenFromClient(w.wps.stream.Context())
@heyitsanthony (Contributor):

this code happens twice, should probably have a function like:
cctx = withClientAuthToken(cctx, w.wps.stream.Context())

@@ -0,0 +1,64 @@
// Copyright 2016 The etcd Authors
@heyitsanthony (Contributor):

2017

package grpcproxy

import (
"golang.org/x/net/context"
@heyitsanthony (Contributor):

"context"

Commit: This commit lets grpcproxy support authed RPCs. Auth tokens supplied by clients are now forwarded to etcdserver by grpcproxy.

@mitake changed the title from "WIP: clientv3, etcdmain, proxy: support authed RPCs with grpcproxy" to "clientv3, etcdmain, proxy: support authed RPCs with grpcproxy" on Sep 20, 2017

Commit: This commit lets grpcproxy handle authed watch. The main changes are:
1. forwarding the token of a new broadcast client
2. checking the permission of a new client that joins an existing broadcast

Commit: Like the previous commit 10f783e, this commit lets grpcproxy forward an auth token supplied by its client in an explicit manner. snapshot is a stream RPC, so this step is required just as it is for watch.
@mitake (Contributor Author) commented Sep 20, 2017

@heyitsanthony I updated the PR for the latest comments; could you take a look?

It seems that the CIs are failing because of environment-specific issues.

  1. jenkins-cov: TestV3AuthMemberUpdate is failing because of an error during etcd cluster creation: https://jenkins-etcd-public.prod.coreos.systems/job/etcd-coverage/2365/console
  2. jenkins-proxy-ci: failed to launch the etcd cluster: https://jenkins-etcd-public.prod.coreos.systems/job/etcd-proxy/2773/console
  3. semaphoreci: CI was interrupted by a machine reboot: https://semaphoreci.com/coreos/etcd/branches/pull-request-8289/builds/15
--- PASS: TestCtlV3Migrate (1.10s)
=== RUN   TestCtlV3MoveLeader

Broadcast message from root@semaphore-1707
	(unknown) at 6:39 ...

The system is going down for power off NOW!

--- PASS: TestCtlV3MoveLeader (2.18s)
=== RUN   TestCtlV3RoleAdd

Probably 1 and 2 were caused by a resource shortage? I couldn't reproduce the errors in my env, of course.

@xiang90 (Contributor) commented Sep 21, 2017

lgtm. CI failure is not related. Merging.

@xiang90 (Contributor) commented Sep 21, 2017

@mitake Actually I reran semaphoreci. It is still failing, so it is probably a legit issue.

@gyuho (Contributor) commented Sep 21, 2017

The e2e tests produce errors like "Error: rpc error: code = Unavailable desc = grpc: the connection is closing". Same as #8581. I am investigating it.

@mitake (Contributor Author) commented Sep 22, 2017

@gyuho @xiang90 I'm not sure the latest error in semaphore is the same as #8581. In this PR, the error is caused by TestCtlV2BackupSnapshot; it is not related to TestCtlV3DelTimeout. In addition, the tests that follow TestCtlV2BackupSnapshot in this PR fail during etcd cluster creation. Could it be a full disk or something like that on semaphore? I cannot reproduce the test errors on my local env. Can you reproduce the errors on your env, @gyuho?

@mitake (Contributor Author) commented Sep 22, 2017

On the master branch, I could reproduce a somewhat similar (but different) e2e failure, shown below:

=== RUN   TestCtlV2BackupV3Snapshot
--- FAIL: TestCtlV2BackupV3Snapshot (2.10s)
        ctl_v2_test.go:280: read /dev/ptmx: input/output error (expected "bar", got ["Error:  100: Key not found (/foo1) [6]\r\n"])
=== RUN   TestCtlV2AuthWithCommonName
--- FAIL: TestCtlV2AuthWithCommonName (0.01s)
        ctl_v2_test.go:510: could not start etcd process cluster (read /dev/ptmx: input/output error)
=== RUN   TestCtlV2ClusterHealth
--- FAIL: TestCtlV2ClusterHealth (0.06s)
        ctl_v2_test.go:510: could not start etcd process cluster (read /dev/ptmx: input/output error)
=== RUN   TestCtlV3Alarm
--- FAIL: TestCtlV3Alarm (0.03s)
        ctl_v3_test.go:137: could not start etcd process cluster (read /dev/ptmx: input/output error)
...

A single test (in this case, TestCtlV2BackupV3Snapshot) fails, and then the subsequent tests also fail... I'm investigating this problem.

@gyuho (Contributor) commented Sep 22, 2017

@mitake I can confirm it's not related to this change. It's from our recent client balancer patch.

@xiang90 (Contributor) commented Sep 22, 2017

ok. let us merge this in.

@gyuho (Contributor) left a comment:
lgtm.

@mitake (Contributor Author) commented Sep 22, 2017

@xiang90 @gyuho thanks, then I'm merging it.

@mitake merged commit 6515a1d into etcd-io:master on Sep 22, 2017
@mitake deleted the auth-proxy branch on September 22, 2017 at 07:14