EC2 Auto-Discovery: EICE support #35504
Conversation
lgtm, although I don't feel like I have enough experience to truly help
@Joerger Can you please take a look when you get a chance?
I think it's almost there but I left a few more comments for you to consider.
lib/services/watcher.go
Outdated
// FillNamesFromEC2Instances iterates over all nodes (cached) and fills in the fetchedEC2Instances value for nodes
// that already exist in the cluster.
// It uses the AWS AccountID/InstanceID labels to detect nodes already present.
func (n *nodeCollector) FillNamesFromEC2Instances(ctx context.Context, fetchedEC2Instances map[string]string) {
Two comments here:
- I'm not sure the method that deals with EC2-specific and discovery-specific logic belongs on the generic node watcher. Can this be a helper function where it's used?
- Does this method need to have a side-effect of modifying the passed-in map? Can it return the result instead and have the caller process it?
+1. This logic should be moved to whatever needs to know EC2-specific details, calling GetNodes to perform the iteration over all the currently cached nodes.
Changed 👍
lib/services/watcher.go
Outdated
accountID, accountOK := labels[types.AWSAccountIDLabel]
instanceID, instanceOK := labels[types.AWSInstanceIDLabel]
// Checking only for the subkind is not enough because users can manually remove labels from agentless nodes.
// Account/Instance IDs are required for comparing against new nodes; nodes without those labels are discarded.
if accountOK && instanceOK && server.GetSubKind() == types.SubKindOpenSSHEICENode {
	nodeEC2Key := types.ServerInfoNameFromAWS(accountID, instanceID)
I think this can be tucked inside the Server resource as a couple of helper methods, which would make this logic clearer. For example, an IsEIC() method that returns true if the server is an EIC node:
func (s *Server) IsEIC() bool {
	labels := s.GetAllLabels()
	_, hasAccountIDLabel := labels[types.AWSAccountIDLabel]
	_, hasInstanceIDLabel := labels[types.AWSInstanceIDLabel]
	return s.GetSubKind() == types.SubKindOpenSSHEICENode && hasAccountIDLabel && hasInstanceIDLabel
}
And then update the GetName() method to return the appropriate name constructed by ServerInfoNameFromAWS for EIC servers.
IsEICE method created 👍
As for GetName, I would rather create a ServerInfoName method which returns the expected ServerInfo name: the aws-<account-id>-<instance-id> format if it's an AWS Node, or the generic si-<node-name> format otherwise.
lib/services/watcher.go
Outdated
if accountOK && instanceOK && server.GetSubKind() == types.SubKindOpenSSHEICENode {
	nodeEC2Key := types.ServerInfoNameFromAWS(accountID, instanceID)
	if _, found := fetchedEC2Instances[nodeEC2Key]; found {
		fetchedEC2Instances[nodeEC2Key] = server.GetName()
Looks like we're building a map from the constructed EC2 name to the server name. Why is it needed? Would we have to do this if we updated the GetName() method to return the correct name, like in my proposal above?
GetName() must return the Node's spec.name, which is currently a UUID.
Creating the ServerInfoName method simplified things a little bit.
Let me know what you think.
lib/srv/discovery/discovery.go
Outdated
fetchedEC2Instances := make(map[string]string, len(instances.Instances))
for _, ec2Instance := range instances.Instances {
	fetchedEC2Instances[types.ServerInfoNameFromAWS(instances.AccountID, ec2Instance.InstanceID)] = ""
}
s.nodeWatcher.FillNamesFromEC2Instances(s.ctx, fetchedEC2Instances)
This is a bit strange. We're filling up a map with empty values, and then passing it to a function that fills it with actual values. Can we just make a single helper method that returns the appropriate map, without side effects?
Can you please check the new version?
It still mutates a map, but the map never leaves the method now.
lib/utils/spreadwork/spreadwork.go
Outdated
// If ctx is Done, return without processing the remaining chunks.
case <-ctx.Done():
	return nil
case <-conf.clock.After(conf.BatchInterval):
I would create a ticker and "defer ticker.Stop()" it explicitly. I think Go has done some work recently to automatically garbage collect tickers but just in case.
I would maybe use a timer instead of a ticker. If the apply function above takes longer than one tick it would result in immediately unblocking this instead of actually spreading out the work at a constant interval.
Changed to a ticker.
"If the apply function above takes longer than one tick it would result in immediately unblocking this instead of actually spreading out the work at a constant interval."
We define a Max Duration so that we don't exceed a given amount of time. Having to wait exactly BatchInterval between chunks means we would most likely exceed that Max Duration value. We can exceed it anyway, because we are never guaranteed how long the applyFn is going to take, but this way the excess time should be much smaller. If applyFn takes close to 0s, then it shouldn't matter whether we use a ticker or a timer.
This spreadwork package can evolve to support other scenarios:
- start a goroutine that processes a given chunk at periodic intervals (one that wouldn't wait for the current chunk to be processed)
- always wait X amount of time between chunks (what you proposed)
versus what we currently have, which does sequential processing and never starts the next chunk before time.Now()+BatchInterval (whether the BatchInterval is spent just sleeping or in actual processing/applyFn time).
lib/srv/discovery/discovery.go
Outdated
existingEICENodes := make(map[string]existingNodeInfo)

s.nodeWatcher.GetNodes(s.ctx, func(n services.Node) bool {
	if !n.IsEICE() {
		return false
	}
	existingEICENodes[n.ServerInfoName()] = existingNodeInfo{
		name:       n.GetName(),
		expiration: n.GetMetadata().Expires,
	}
	return false
})
I'm not sure why we need this extra existingEICENodes variable. The logic sounds like it should be pretty simple:
- Make a Server object from the AWS instance we fetched from the cloud. You already do this below.
- Get the respective Server resource from the cache. This resource has expiration info in it.
- Compare both Server resources (we have comparison functions already):
  - If they're the same and the Server's expiration is far in the future, no-op.
  - If they're different, add to "nodesToUpsert".
Should not require any extra supporting variables, no?
"Get the respective Server resource from the cache. This resource has expiration info in it."
NodeWatcher only provides iteration methods; there's no method to retrieve a single Server. Even if there were, we would have to look up by the Server's Name, because servers are stored in a map keyed by Name. The problem is that the Server's Name is not deterministic: it is a UUID. So, when we create a Server object from an AWS instance, it will have a random UUID. To find out whether that Server already exists, we have to iterate over all Teleport Servers and check whether there's an EICE Node with the same AccountID/InstanceID (we don't use the Name). If one exists, we replace the converted Server's name with the found one.
Possible alternatives:
- store another map in NodeWatcher keyed by the ServerInfo format
- use the ServerInfo format as the Node's name
Instead of the ServerInfo format it could be any other format that uniquely identifies the node using the AWS AccountID and InstanceID.
I prefer the current version over the alternatives above, but please let me know what you think of them and whether you have another alternative.
Thank you
lgtm
lib/srv/discovery/discovery.go
Outdated
for _, eiceServer := range existingEICEServersList {
	si, err := types.ServerInfoForServer(eiceServer)
	if err != nil {
		s.Log.Debugf("failed to convert Server %q to ServerInfo: %v", eiceServer.GetName(), err)
Suggested change:
- s.Log.Debugf("failed to convert Server %q to ServerInfo: %v", eiceServer.GetName(), err)
+ s.Log.Debugf("Failed to convert Server %q to ServerInfo: %v", eiceServer.GetName(), err)
@marcoandredinis See the table below for backport results.
Add support for EICE mode in EC2 Auto-Discovery
If an Integration is being used, then default to EICE mode.
Every time the Discovery Service finds EC2 instances, it upserts them with an expiration of now+90m.
This ends up being similar to the heartbeat that exists for Teleport SSH Nodes.
Labels are always in sync (they can lag by as much as the Discovery Service poll interval), and when an EC2 instance is removed or the Matcher no longer matches the instance, the Node is removed (this can lag up to 90 minutes, per the expiration).
Pre-req:
Create an Integration and add your first EC2 instance (this ensures your permissions are correct and that the EC2 Instance Connect Endpoint is created for that VPC).
Now, start a DiscoveryService (no need if you are on cloud/dev) and then create the following DiscoveryConfig:
Ensure the discovery_group is the same as the one you have running on the DiscoveryService (for cloud that's going to be cloud-discovery-group). You should see the newly created Nodes (it can take up to five minutes).
Demo
Fixes #34291