feature: gossip encryption key rotation #336
Conversation
Looks awesome! Thanks for tackling this! The workflow LGTM; let me know when you feel it's complete and I'll give it a once-over.
I'm still hacking around on this. I realized while testing in multi-DC setups that the implementation was a bit flawed, and should probably more closely resemble the UserEvent RPC methods so that forwarding key events to remote datacenters is possible. I'll take a crack at that and update here.
@ryanuber Oh, since using the event on the WAN won't rotate the remote LANs?
@armon Yeah, exactly. Initially I was thinking we would just have new keys gossiped to all of the nodes and a collective response returned, which seems desirable from the operator's perspective. The alternative is to have something like a
Yeah, I like your initial approach. You're right, it's just a lot nicer to have the gossip handle it. I think everything you've proposed will work fine.
```go
// server in each datacenter. This will only error for RPC-related errors.
// Otherwise, application-level errors are returned inside of the inner
// response objects.
func (s *Server) keyringRPC(
```
I think this method is more generally useful than just keyrings. I like having a helper to just "broadcast" the RPC to all the known DCs. Maybe we can make it something like a globalRPC instead of keyringRPC?
Also I think we should move the RPC methods to consul/rpc.go
I took a pass at generalizing this. I ended up using an interface for the RPC replies, structs.CompoundResponse, which has an Add(interface{}) method for appending responses. The reason for this is that we are expecting one response per DC, and we will just never know what type of response struct to expect back. Doing a type assertion for it inside of globalRPC felt ghetto.
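For reference, a minimal sketch of what such an interface and a keyring-specific reply type might look like. Only the CompoundResponse name and its Add(interface{}) method come from the discussion above; the KeyringResponse/KeyringResponses types and their fields are purely illustrative.

```go
package structs

// CompoundResponse is an interface for reply types that collect one
// response per datacenter.
type CompoundResponse interface {
	// Add appends a single datacenter's reply to the aggregate.
	Add(interface{})
}

// KeyringResponse is a hypothetical reply from a single datacenter's
// keyring operation; the field names here are illustrative only.
type KeyringResponse struct {
	Datacenter string
	Messages   map[string]string // per-node messages, if any
	Keys       map[string]int    // key -> number of nodes holding it
	Error      string
}

// KeyringResponses aggregates per-datacenter replies and satisfies
// CompoundResponse.
type KeyringResponses struct {
	Responses []*KeyringResponse
}

// Add appends one datacenter's reply. Keeping the type assertion here
// means a generic RPC broadcaster never has to know the concrete type.
func (r *KeyringResponses) Add(v interface{}) {
	r.Responses = append(r.Responses, v.(*KeyringResponse))
}
```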
I think this makes sense. Another option is that the response could be a pointer to a slice and we just append responses to it, but this I think is cleaner.
Awesome. Just gave it a once-over and it looks really good. There is a bit of reshuffling of methods, but it looks super solid. I think the only broader feedback is: let's enable the keyring by default when encryption is enabled. This way the user experience is even nicer: "it just works".
Thanks, @armon! I always appreciate your code reviews. I left a few comments inline above regarding the generalized globalRPC method.
This all makes sense! Looking good. I noticed that globalRPC guards against the single-DC case, but I wonder if we should just have globalRPC also invoke for the local DC. That way the caller of globalRPC doesn't need the split logic of calling the local DC plus the remote DCs; it just calls globalRPC and assumes all the DCs (including the local one) were hit.
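Something like the following shape would let every caller treat local and remote datacenters uniformly. This is only a sketch under assumptions: the forwardDC helper, the remoteDatacenters enumeration, and the newReply factory parameter are placeholders rather than what this PR necessarily does.

```go
// globalRPC forwards an RPC to one server in every known datacenter,
// including the local one, and collects the replies into a single
// CompoundResponse.
func (s *Server) globalRPC(method string, args interface{},
	newReply func() interface{}, replies structs.CompoundResponse) error {

	// Start with the local datacenter so callers don't need separate
	// local + remote code paths.
	dcs := append([]string{s.config.Datacenter}, s.remoteDatacenters()...)

	for _, dc := range dcs {
		// Allocate a fresh reply for each datacenter and forward the call.
		reply := newReply()
		if err := s.forwardDC(method, dc, args, reply); err != nil {
			return err
		}
		replies.Add(reply)
	}
	return nil
}
```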
I had an idea last night: allow passing the path to a file on agent start. If the indicated file exists on agent startup, we just consume a key out of it to seed the keyring and then delete the file. Otherwise, if the file is missing, the agent just starts normally. That would allow us to always start Consul in the same way, without having to put a key in our config or pass it on the command line, and it disambiguates how you go about initializing the keyring. I imagine we could also just have the seed file be generated with … What do you think?
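A tiny sketch of what that seed-file consumption might look like at agent startup. The function name, the hypothetical agent package, and the install callback are all illustrative, not actual Consul APIs.

```go
package agent

import (
	"io/ioutil"
	"os"
	"strings"
)

// loadKeyringSeed consumes a single gossip key from the file at path, if
// it exists, hands it to install to seed the keyring, and then deletes
// the file so the key doesn't linger on disk. A missing file is not an
// error; the agent simply starts up normally.
func loadKeyringSeed(path string, install func(key string) error) error {
	data, err := ioutil.ReadFile(path)
	if os.IsNotExist(err) {
		return nil
	}
	if err != nil {
		return err
	}
	key := strings.TrimSpace(string(data))
	if err := install(key); err != nil {
		return err
	}
	return os.Remove(path)
}
```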
Actually, thinking about it a little more, this probably isn't any better than just shipping the keyring file directly to its final spot. I'll think more about this. |
Yeah this is a hard one... I think as a user, you want to be able to specify
Thoughts?
Looks like you read my mind - that's exactly what I just pushed. We append but don't change primary if
@armon I think this is better now! There are just a couple of things I want to run by you.
I might have just answered my own question. I think 2 is just because the WAN pool in this scenario is 5x larger than the local little LAN pools and involves remote members. Probably a non-issue.
Awesome! I'll give it another look over soon. I just discussed 1 with @mitchellh, and he had a good idea: let's only use the …
With 2, it's just that the WAN timing is much slower. Gossip just takes longer, so I think that is fine.
With 3, that sounds great. Makes sense.
Sounds good! I made some adjustments to get this working - I'll fix the tests to accommodate them tonight.
…en using agent -encrypt
Just rebased this thing against master. I'll do a sanity check on it as well, but tests are all passing so I think we should be OK.
Feel free to merge on in!
Awesome, this thing is a beast!
Since Serf supports symmetric key rotation for gossip messages, Consul can leverage this to allow key rotation for both the LAN and WAN gossip pools.
Because of the multiple gossip pools and a few other semantic differences between Serf and Consul, there are a few things to note about this implementation:
- The command is named consul keyring to avoid confusion with TLS keys or KV keys.
- There are differences in behavior when running consul keyring on a server vs. a client.
- The -encrypt argument still works with unmodified behavior.

Some examples of the feature in action in a cluster with one server and one client are in this gist.
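On the server vs. client point above: servers are members of both the LAN and WAN gossip pools, while clients only join the LAN pool, so a keyring change on a server has to be applied to both Serf instances. A rough sketch of what that could look like through Serf's key manager; the helper and its wiring are assumptions, with only serf.Serf, KeyManager, and InstallKey being Serf's actual API.

```go
package consul

import "github.com/hashicorp/serf/serf"

// installKeyBothPools is an illustrative helper showing how a server
// would install a new gossip encryption key into both of its Serf pools.
// A client would only touch its LAN pool.
func installKeyBothPools(serfLAN, serfWAN *serf.Serf, key string) error {
	// Install into the LAN pool first.
	if _, err := serfLAN.KeyManager().InstallKey(key); err != nil {
		return err
	}
	// Servers also participate in the WAN pool, so apply the change there too.
	_, err := serfWAN.KeyManager().InstallKey(key)
	return err
}
```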
I've still got a few more tests and docs to write, but would like feedback on the overall flow of how this is working in its current form.
Related to #246