
EC2 Auto-Discovery: EICE support #35504

Merged: 7 commits into master from marco/autodiscovery_eice on Mar 26, 2024

Conversation

@marcoandredinis (Contributor) commented on Dec 7, 2023

Add support for EICE mode in EC2 Auto-Discovery

If an Integration is being used, then default to EICE mode.

Every time Discovery finds EC2 instances, it upserts them with an expiration of now+90m.
This ends up working like the heartbeat that exists for Teleport SSH Nodes.
Labels are always kept in sync (they can lag by up to the Discovery Service poll interval), and when an EC2 instance is removed or the Matcher no longer matches it, the Node is removed (with a lag of up to 90 minutes, per the expiration).
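
To illustrate the idea, here is a minimal sketch of the expiry-as-heartbeat pattern. The discoveredInstance type and upsertNode helper are hypothetical stand-ins, not the actual Teleport API:

package main

import (
  "fmt"
  "time"
)

// discoveredInstance is a stand-in for an EC2 instance returned by one
// Discovery poll.
type discoveredInstance struct {
  AccountID  string
  InstanceID string
}

// upsertNode stands in for the real backend upsert. Because every poll
// re-upserts with a fresh expiry, a node whose instance disappears (or no
// longer matches) simply ages out once the TTL elapses.
func upsertNode(inst discoveredInstance, expires time.Time) {
  fmt.Printf("upsert %s/%s, expires %s\n", inst.AccountID, inst.InstanceID, expires.Format(time.RFC3339))
}

func main() {
  const nodeTTL = 90 * time.Minute
  found := []discoveredInstance{{AccountID: "278576220453", InstanceID: "i-05723b41571a653e9"}}
  for _, inst := range found {
    upsertNode(inst, time.Now().Add(nodeTTL))
  }
}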

Pre-req:
Create an Integration and add your first EC2 instance; this ensures your permissions are correct and that the EC2 Instance Connect Endpoint is created for that VPC.
Now, start a DiscoveryService (no need if you are on cloud/dev) and then create the following DiscoveryConfig:

kind: discovery_config
version: v1
metadata:
  name: dc001
spec:
  discovery_group: "prod-resources"
  aws:
    - types: ["ec2"]
      regions: ["eu-west-2"]
      tags:
        "*": "*"
      integration: teleportdev

Ensure the discovery_group matches the one running on the DiscoveryService (for cloud that's going to be cloud-discovery-group).

You should see the newly created Nodes (it can take up to five minutes).

Demo


2024-01-10T17:32:04Z DEBU [DISCOVERY] EC2 instances discovered (AccountID: 278576220453, Instances: [i-05723b41571a653e9]), starting installation pid:1137334.1 discovery/discovery.go:950

Fixes #34291

@marcoandredinis added the aws, discover, and backport/branch/v14 labels on Dec 7, 2023
@marcoandredinis force-pushed the marco/autodiscovery_eice branch 4 times, most recently from 4b11240 to b52dac9, on December 12, 2023 18:41
@marcoandredinis force-pushed the marco/autodiscovery_eice branch from b52dac9 to 38b82e0 on December 22, 2023 14:34
@marcoandredinis force-pushed the marco/autodiscovery_eice branch 6 times, most recently from 7aec204 to adf799f, on January 10, 2024 17:32
@marcoandredinis added the no-changelog label on Jan 10, 2024
@marcoandredinis marked this pull request as ready for review on January 10, 2024 17:33
github-actions bot requested review from Joerger and kimlisa on January 10, 2024 17:33
@marcoandredinis force-pushed the marco/autodiscovery_eice branch from adf799f to 9e3f147 on January 16, 2024 17:15
@kimlisa (Contributor) left a comment


lgtm, although i don't feel like i have enough experience to truly help

lib/cloud/aws/tags_helpers.go (resolved)
lib/srv/discovery/discovery.go (outdated, resolved)
lib/srv/discovery/reconciler.go (resolved)
@marcoandredinis (Contributor, Author):
@Joerger Can you please take a look when you get a chance?

@marcoandredinis force-pushed the marco/autodiscovery_eice branch 3 times, most recently from 85a34ef to f6d99ce, on February 5, 2024 18:18
@marcoandredinis requested review from r0mant and removed the request for Joerger on February 8, 2024 16:57
@marcoandredinis force-pushed the marco/autodiscovery_eice branch 4 times, most recently from 8eaa158 to 4805441, on March 18, 2024 11:40
@r0mant (Collaborator) left a comment


I think it's almost there but I left a few more comments for you to consider.

// FillNamesFromEC2Instances iterates over all nodes (cached) and fills in the fetchedEC2Instances value for nodes
// that already exist in the cluster.
// It uses the AWS AccountID/InstanceID labels to detect nodes already present.
func (n *nodeCollector) FillNamesFromEC2Instances(ctx context.Context, fetchedEC2Instances map[string]string) {
Collaborator:

Two comments here:

  • I'm not sure the method that deals with EC2-specific and discovery-specific logic belongs on the generic node watcher. Can this be a helper function where it's used?
  • Does this method need to have a side-effect of modifying the passed-in map? Can it return the result instead and have the caller process it?

Contributor:

+1. This logic should be moved to whatever needs to know ec2 specific details and call GetNodes to perform the iteration over all the currently cached nodes.

Contributor (Author):

Changed 👍

Comment on lines 1787 to 1792
accountID, accountOK := labels[types.AWSAccountIDLabel]
instanceID, instanceOK := labels[types.AWSInstanceIDLabel]
// Checking only for the subkind is not enough because users can manually remove labels from agentless nodes.
// Account/Instance IDs are required for comparing against new nodes, nodes without those labels are discarded.
if accountOK && instanceOK && server.GetSubKind() == types.SubKindOpenSSHEICENode {
  nodeEC2Key := types.ServerInfoNameFromAWS(accountID, instanceID)
Collaborator:

I think this can be tucked inside the Server resource as a couple of helper methods, which would make the logic clearer. For example, IsEIC() that returns true if the server is EIC:

func (s *Server) IsEIC() bool {
  labels := s.GetAllLabels()
  _, hasAccountIDLabel := labels[types.AWSAccountIDLabel]
  _, hasInstanceIDLabel := labels[types.AWSInstanceIDLabel]
  return s.GetSubKind() == types.SubKindOpenSSHEICENode && hasAccountIDLabel && hasInstanceIDLabel
}

And then update the GetName() method to return the appropriate name constructed by ServerInfoNameFromAWS for EIC servers.

Contributor (Author):

IsEICE method created 👍

As for GetName, I would rather create a ServerInfoName method which returns the expected ServerInfo name: the aws-<account-id>-<instance-id> format if it's an AWS Node, or the generic si-<node-name> otherwise.
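
Roughly, the naming scheme would look like this (a sketch only; the standalone function and its signature are illustrative, the real method would live on the Server type):

package main

import "fmt"

// serverInfoName mirrors the proposal: AWS nodes get a deterministic
// aws-<account-id>-<instance-id> name, everything else gets the generic
// si-<node-name> form.
func serverInfoName(accountID, instanceID, nodeName string) string {
  if accountID != "" && instanceID != "" {
    return fmt.Sprintf("aws-%s-%s", accountID, instanceID)
  }
  return fmt.Sprintf("si-%s", nodeName)
}

func main() {
  fmt.Println(serverInfoName("278576220453", "i-05723b41571a653e9", "")) // aws-278576220453-i-05723b41571a653e9
  fmt.Println(serverInfoName("", "", "my-node"))                         // si-my-node
}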

if accountOK && instanceOK && server.GetSubKind() == types.SubKindOpenSSHEICENode {
  nodeEC2Key := types.ServerInfoNameFromAWS(accountID, instanceID)
  if _, found := fetchedEC2Instances[nodeEC2Key]; found {
    fetchedEC2Instances[nodeEC2Key] = server.GetName()
Collaborator:

Looks like we're building a map from the constructed EC2 name to the server name. Why is it needed? Would we still have to do this if we updated the GetName() method to return the correct name, as in my proposal above?

Contributor (Author):

GetName() must return the Node's spec.name, which is currently a UUID.
Creating the ServerInfoName simplified things a little bit.
Let me know what you think.

lib/srv/discovery/discovery.go (outdated, resolved)
Comment on lines 835 to 839
fetchedEC2Instances := make(map[string]string, len(instances.Instances))
for _, ec2Instance := range instances.Instances {
  fetchedEC2Instances[types.ServerInfoNameFromAWS(instances.AccountID, ec2Instance.InstanceID)] = ""
}
s.nodeWatcher.FillNamesFromEC2Instances(s.ctx, fetchedEC2Instances)
Collaborator:

This is a bit strange. We're filling up a map with empty values, then passing it to a function that fills it with actual values. Can we just have a single helper method that returns the appropriate map, without side effects?

Contributor (Author):

Can you please check the new version?
It still mutates the map, but the map never leaves the method now.
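
For illustration, a sketch of that shape, where the map is built and returned by a single helper so the mutation never escapes (names and types are hypothetical, not the actual code):

package main

import "fmt"

// ec2KeysToNodeNames builds the lookup map internally and returns it, so no
// caller-visible map is mutated. cachedNames stands in for the node watcher
// cache, mapping the deterministic EC2 key to the existing node name.
func ec2KeysToNodeNames(accountID string, instanceIDs []string, cachedNames map[string]string) map[string]string {
  out := make(map[string]string, len(instanceIDs))
  for _, id := range instanceIDs {
    key := fmt.Sprintf("aws-%s-%s", accountID, id)
    // An empty value means the instance has no matching Teleport node yet.
    out[key] = cachedNames[key]
  }
  return out
}

func main() {
  cache := map[string]string{"aws-278576220453-i-05723b41571a653e9": "some-node-uuid"}
  fmt.Println(ec2KeysToNodeNames("278576220453", []string{"i-05723b41571a653e9", "i-0abc"}, cache))
}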

// If ctx is Done, return without processing the remaining chunks.
case <-ctx.Done():
  return nil
case <-conf.clock.After(conf.BatchInterval):
Collaborator:

I would create a ticker and "defer ticker.Stop()" it explicitly. I think Go has done some work recently to automatically garbage collect tickers but just in case.

Contributor:

I would maybe use a timer instead of a ticker. If the apply function above takes longer than one tick it would result in immediately unblocking this instead of actually spreading out the work at a constant interval.

Contributor (Author):

Changed to a ticker.

> If the apply function above takes longer than one tick it would result in immediately unblocking this instead of actually spreading out the work at a constant interval.

We define a Max Duration so that we don't exceed a given amount of time.
Having to wait exactly BatchInterval duration means we would most likely exceed that Max Duration value.

We can exceed it anyway, because we are never guaranteed on how long the applyFn is going to take.
But this way, the exceeding time should be much less.
If applyFn takes close to 0s, then it shouldn't matter whether we use a ticker or a timer.

This spreadwork can evolve to support other scenarios:

  • start a goroutine that will process a given chunk at periodic intervals (one that wouldn't wait for the current chunk to be processed)
  • always wait X amount of time between chunks (what you proposed)

Versus what we currently have, which does sequential processing and never starts the next chunk before time.Now()+BatchInterval (whether that interval is just a sleep or actual processing/applyFn time).
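
A condensed sketch of that current behavior (names are illustrative, not the real lib/utils/spreadwork API):

package main

import (
  "context"
  "fmt"
  "time"
)

// applyOverTime processes chunks sequentially, waiting on a ticker between
// chunks: if applyFn finishes early the loop blocks until the next tick; if
// applyFn overruns a tick the loop proceeds immediately, which keeps the
// total time close to len(chunks)*batchInterval.
func applyOverTime(ctx context.Context, batchInterval time.Duration, chunks [][]string, applyFn func([]string)) error {
  ticker := time.NewTicker(batchInterval)
  defer ticker.Stop()
  for i, chunk := range chunks {
    applyFn(chunk)
    if i == len(chunks)-1 {
      return nil
    }
    select {
    case <-ctx.Done():
      // Return without processing the remaining chunks.
      return ctx.Err()
    case <-ticker.C:
    }
  }
  return nil
}

func main() {
  chunks := [][]string{{"i-1", "i-2"}, {"i-3"}}
  _ = applyOverTime(context.Background(), 100*time.Millisecond, chunks, func(c []string) {
    fmt.Println("processing", c)
  })
}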

lib/utils/spreadwork/spreadwork.go (resolved)
lib/srv/discovery/discovery.go (outdated, resolved)
lib/srv/discovery/discovery.go (outdated, resolved)
lib/utils/spreadwork/spreadwork.go (outdated, resolved)
lib/services/watcher.go (outdated, resolved)
lib/utils/spreadwork/spreadwork_test.go (outdated, resolved)
api/types/server.go (outdated, resolved)
api/types/server.go (outdated, resolved)
lib/srv/discovery/discovery.go (outdated, resolved)
Comment on lines 840 to 851
existingEICENodes := make(map[string]existingNodeInfo)

s.nodeWatcher.GetNodes(s.ctx, func(n services.Node) bool {
  if !n.IsEICE() {
    return false
  }
  existingEICENodes[n.ServerInfoName()] = existingNodeInfo{
    name:       n.GetName(),
    expiration: n.GetMetadata().Expires,
  }
  return false
})
Collaborator:

I'm not sure why we need this extra existingEICENodes variable. The logic sounds like it should be pretty simple:

  • Make Server object from the AWS instance we fetched from cloud. You already do this below.
  • Get respective Server resource from the cache. This resource has expiration info in it.
  • Compare both Server resources (we have comparison functions already):
    • If they're the same and Server's expiration is far in future, no-op.
    • If they're different, add to "nodesToUpsert".

Should not require any extra supporting variables, no?

Contributor (Author):

> Get respective Server resource from the cache. This resource has expiration info in it.

NodeWatcher only provides iteration methods; there's no method to retrieve a single Server.
Even if one existed, we would need the Server's Name, because Servers are stored in a map keyed by their Name.

The problem is that a Server's Name is not deterministic: it is a UUID.
So, when we create a Server object from an AWS instance, it gets a random UUID.
To determine whether that Server already exists, we have to iterate over all Teleport Servers and check whether there's an EICE Node with the same AccountID/InstanceID (we don't use the Name).
If one exists, we replace the converted Server's name with the found one.

Possible alternatives:

  • store another map in NodeWatcher where we use the ServerInfo format as key
  • use the ServerInfo format as Node's name

Instead of the ServerInfo format, it could be any other format that uniquely identifies the node using the AWS AccountID and InstanceID.

I prefer the current version as opposed to the alternatives above.
But please let me know what you think of them and if you have another alternative.

Thank you

@marcoandredinis force-pushed the marco/autodiscovery_eice branch from f0f3bf2 to 93a621a on March 26, 2024 11:54
@r0mant (Collaborator) left a comment

lgtm

for _, eiceServer := range existingEICEServersList {
  si, err := types.ServerInfoForServer(eiceServer)
  if err != nil {
    s.Log.Debugf("failed to convert Server %q to ServerInfo: %v", eiceServer.GetName(), err)
Collaborator:

Suggested change:
- s.Log.Debugf("failed to convert Server %q to ServerInfo: %v", eiceServer.GetName(), err)
+ s.Log.Debugf("Failed to convert Server %q to ServerInfo: %v", eiceServer.GetName(), err)

@marcoandredinis force-pushed the marco/autodiscovery_eice branch from 93a621a to eebdaa2 on March 26, 2024 15:48
@marcoandredinis added this pull request to the merge queue on Mar 26, 2024
Merged via the queue into master with commit 455632e on Mar 26, 2024
36 checks passed
@marcoandredinis deleted the marco/autodiscovery_eice branch on March 26, 2024 16:34
@public-teleport-github-review-bot: @marcoandredinis See the table below for backport results.

Branch | Result
branch/v15 | Create PR

Labels: aws (Used for AWS Related Issues), backport/branch/v15, discover (Issues related to Teleport Discover), discovery, no-changelog (Indicates that a PR does not require a changelog entry), size/md
Projects: None yet
Development: Successfully merging this pull request may close these issues:
EC2 Instance Connect Endpoint (ICE) with EC2 Instance Discovery
7 participants