
EC2 Auto-Discovery: EICE support #35504

Merged: 7 commits into master from marco/autodiscovery_eice on Mar 26, 2024

Conversation

@marcoandredinis (Contributor) commented on Dec 7, 2023

Add support for EICE mode in EC2 Auto-Discovery

If an Integration is being used, then default to EICE mode.

Every time Discovery finds EC2 instances, it upserts them with an expiration of now+90m.
This ends up working like the heartbeat that exists for Teleport SSH Nodes.
Labels are always kept in sync (they can lag by up to the Discovery Service poll interval), and when an EC2 instance is removed or the Matcher no longer matches it, the Node is removed (with a lag of up to 90 minutes, per the expiration).
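
To illustrate the idea, here is a minimal sketch of the expiry-as-heartbeat pattern. The discoveredInstance type and upsertNode helper are hypothetical stand-ins, not the actual Teleport API:

package main

import (
  "fmt"
  "time"
)

// discoveredInstance is a stand-in for an EC2 instance returned by one
// Discovery poll.
type discoveredInstance struct {
  AccountID  string
  InstanceID string
}

// upsertNode stands in for the real backend upsert. Because every poll
// re-upserts with a fresh expiry, a node whose instance disappears (or no
// longer matches) simply ages out once the TTL elapses.
func upsertNode(inst discoveredInstance, expires time.Time) {
  fmt.Printf("upsert %s/%s, expires %s\n", inst.AccountID, inst.InstanceID, expires.Format(time.RFC3339))
}

func main() {
  const nodeTTL = 90 * time.Minute
  found := []discoveredInstance{{AccountID: "278576220453", InstanceID: "i-05723b41571a653e9"}}
  for _, inst := range found {
    upsertNode(inst, time.Now().Add(nodeTTL))
  }
}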

Pre-req:
Create an Integration and add your first EC2 instance; this ensures your permissions are correct and that the EC2 Instance Connect Endpoint is created for that VPC.
Now, start a DiscoveryService (no need if you are on cloud/dev) and then create the following DiscoveryConfig:

kind: discovery_config
version: v1
metadata:
  name: dc001
spec:
  discovery_group: "prod-resources"
  aws:
    - types: ["ec2"]
      regions: ["eu-west-2"]
      tags:
        "*": "*"
      integration: teleportdev

Ensure the discovery_group matches the one running on the DiscoveryService (for cloud that's going to be cloud-discovery-group).

You should see the newly created Nodes (it can take up to five minutes).

Demo


2024-01-10T17:32:04Z DEBU [DISCOVERY] EC2 instances discovered (AccountID: 278576220453, Instances: [i-05723b41571a653e9]), starting installation pid:1137334.1 discovery/discovery.go:950

Fixes #34291

@marcoandredinis added the aws, discover, and backport/branch/v14 labels on Dec 7, 2023
@marcoandredinis force-pushed the marco/autodiscovery_eice branch 4 times, most recently from 4b11240 to b52dac9, on December 12, 2023 18:41
@marcoandredinis force-pushed the marco/autodiscovery_eice branch from b52dac9 to 38b82e0 on December 22, 2023 14:34
@marcoandredinis force-pushed the marco/autodiscovery_eice branch 6 times, most recently from 7aec204 to adf799f, on January 10, 2024 17:32
@marcoandredinis added the no-changelog label on Jan 10, 2024
@marcoandredinis marked this pull request as ready for review on January 10, 2024 17:33
github-actions bot requested review from Joerger and kimlisa on January 10, 2024 17:33
@marcoandredinis force-pushed the marco/autodiscovery_eice branch from adf799f to 9e3f147 on January 16, 2024 17:15
@kimlisa (Contributor) left a comment


lgtm, although i don't feel like i have enough experience to truly help

lib/cloud/aws/tags_helpers.go (resolved)
lib/srv/discovery/discovery.go (outdated, resolved)
lib/srv/discovery/reconciler.go (resolved)
@marcoandredinis (Contributor, Author):
@Joerger Can you please take a look when you get a chance?

@marcoandredinis force-pushed the marco/autodiscovery_eice branch 3 times, most recently from 85a34ef to f6d99ce, on February 5, 2024 18:18
@marcoandredinis requested review from r0mant and removed the request for Joerger on February 8, 2024 16:57
@marcoandredinis force-pushed the marco/autodiscovery_eice branch 4 times, most recently from 8eaa158 to 4805441, on March 18, 2024 11:40
@r0mant (Collaborator) left a comment


I think it's almost there but I left a few more comments for you to consider.

// FillNamesFromEC2Instances iterates over all nodes (cached) and fills in the fetchedEC2Instances value for nodes
// that already exist in the cluster.
// It uses the AWS AccountID/InstanceID labels to detect nodes already present.
func (n *nodeCollector) FillNamesFromEC2Instances(ctx context.Context, fetchedEC2Instances map[string]string) {
Collaborator:

Two comments here:

  • I'm not sure the method that deals with EC2-specific and discovery-specific logic belongs on the generic node watcher. Can this be a helper function where it's used?
  • Does this method need to have a side-effect of modifying the passed-in map? Can it return the result instead and have the caller process it?

Contributor:

+1. This logic should be moved to whatever needs to know ec2 specific details and call GetNodes to perform the iteration over all the currently cached nodes.

Contributor (Author):

Changed 👍

Comment on lines 1787 to 1792
accountID, accountOK := labels[types.AWSAccountIDLabel]
instanceID, instanceOK := labels[types.AWSInstanceIDLabel]
// Checking only for the subkind is not enough because users can manually remove labels from agentless nodes.
// Account/Instance IDs are required for comparing against new nodes, nodes without those labels are discarded.
if accountOK && instanceOK && server.GetSubKind() == types.SubKindOpenSSHEICENode {
  nodeEC2Key := types.ServerInfoNameFromAWS(accountID, instanceID)
Collaborator:

I think this can be tucked inside the Server resource as a couple of helper methods, which would make the logic clearer. For example, IsEIC() that returns true if the server is EIC:

func (s *Server) IsEIC() bool {
  labels := s.GetAllLabels()
  _, hasAccountIDLabel := labels[types.AWSAccountIDLabel]
  _, hasInstanceIDLabel := labels[types.AWSInstanceIDLabel]
  return s.GetSubKind() == types.SubKindOpenSSHEICENode && hasAccountIDLabel && hasInstanceIDLabel
}

And then update the GetName() method to return the appropriate name constructed by ServerInfoNameFromAWS for EIC servers.

Contributor (Author):

IsEICE method created 👍

As for GetName, I would rather create a ServerInfoName method which returns the expected ServerInfo name: the aws-<account-id>-<instance-id> format if it's an AWS Node, or the generic si-<node-name> otherwise.
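
Roughly, the naming scheme would look like this (a sketch only; the standalone function and its signature are illustrative, the real method would live on the Server type):

package main

import "fmt"

// serverInfoName mirrors the proposal: AWS nodes get a deterministic
// aws-<account-id>-<instance-id> name, everything else gets the generic
// si-<node-name> form.
func serverInfoName(accountID, instanceID, nodeName string) string {
  if accountID != "" && instanceID != "" {
    return fmt.Sprintf("aws-%s-%s", accountID, instanceID)
  }
  return fmt.Sprintf("si-%s", nodeName)
}

func main() {
  fmt.Println(serverInfoName("278576220453", "i-05723b41571a653e9", "")) // aws-278576220453-i-05723b41571a653e9
  fmt.Println(serverInfoName("", "", "my-node"))                         // si-my-node
}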

if accountOK && instanceOK && server.GetSubKind() == types.SubKindOpenSSHEICENode {
  nodeEC2Key := types.ServerInfoNameFromAWS(accountID, instanceID)
  if _, found := fetchedEC2Instances[nodeEC2Key]; found {
    fetchedEC2Instances[nodeEC2Key] = server.GetName()
Collaborator:

Looks like we're building a map from the constructed EC2 name to the server name. Why is it needed? Would we still have to do this if we updated the GetName() method to return the correct name, as in my proposal above?

Contributor (Author):

GetName() must return the Node's spec.name, which is currently a UUID.
Creating the ServerInfoName simplified things a little bit.
Let me know what you think.

lib/srv/discovery/discovery.go (outdated, resolved)
Comment on lines 835 to 839
fetchedEC2Instances := make(map[string]string, len(instances.Instances))
for _, ec2Instance := range instances.Instances {
  fetchedEC2Instances[types.ServerInfoNameFromAWS(instances.AccountID, ec2Instance.InstanceID)] = ""
}
s.nodeWatcher.FillNamesFromEC2Instances(s.ctx, fetchedEC2Instances)
Collaborator:

This is a bit strange. We're filling up a map with empty values, then passing it to a function that fills it with actual values. Can we just have a single helper method that returns the appropriate map, without side effects?

Contributor (Author):

Can you please check the new version?
It still mutates the map, but the map never leaves the method now.
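
For illustration, a sketch of that shape, where the map is built and returned by a single helper so the mutation never escapes (names and types are hypothetical, not the actual code):

package main

import "fmt"

// ec2KeysToNodeNames builds the lookup map internally and returns it, so no
// caller-visible map is mutated. cachedNames stands in for the node watcher
// cache, mapping the deterministic EC2 key to the existing node name.
func ec2KeysToNodeNames(accountID string, instanceIDs []string, cachedNames map[string]string) map[string]string {
  out := make(map[string]string, len(instanceIDs))
  for _, id := range instanceIDs {
    key := fmt.Sprintf("aws-%s-%s", accountID, id)
    // An empty value means the instance has no matching Teleport node yet.
    out[key] = cachedNames[key]
  }
  return out
}

func main() {
  cache := map[string]string{"aws-278576220453-i-05723b41571a653e9": "some-node-uuid"}
  fmt.Println(ec2KeysToNodeNames("278576220453", []string{"i-05723b41571a653e9", "i-0abc"}, cache))
}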

// If ctx is Done, return without processing the remaining chunks.
case <-ctx.Done():
  return nil
case <-conf.clock.After(conf.BatchInterval):
Collaborator:

I would create a ticker and "defer ticker.Stop()" it explicitly. I think Go has done some work recently to automatically garbage collect tickers but just in case.

Contributor:

I would maybe use a timer instead of a ticker. If the apply function above takes longer than one tick it would result in immediately unblocking this instead of actually spreading out the work at a constant interval.

Contributor (Author):

Changed to a ticker.

> If the apply function above takes longer than one tick it would result in immediately unblocking this instead of actually spreading out the work at a constant interval.

We define a Max Duration so that we don't exceed a given amount of time.
Having to wait exactly BatchInterval duration means we would most likely exceed that Max Duration value.

We can exceed it anyway, because we are never guaranteed on how long the applyFn is going to take.
But this way, the exceeding time should be much less.
If applyFn takes close to 0s, then it shouldn't matter whether we use a ticker or a timer.

This spreadwork can evolve to support other scenarios:

  • start a goroutine that will process a given chunk at periodic intervals (one that wouldn't wait for the current chunk to be processed)
  • always wait X amount of time between chunks (what you proposed)

Versus what we currently have, which does sequential processing and never starts the next chunk before time.Now()+BatchInterval (whether that interval is just a sleep or actual processing/applyFn time).
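
A condensed sketch of that current behavior (names are illustrative, not the real lib/utils/spreadwork API):

package main

import (
  "context"
  "fmt"
  "time"
)

// applyOverTime processes chunks sequentially, waiting on a ticker between
// chunks: if applyFn finishes early the loop blocks until the next tick; if
// applyFn overruns a tick the loop proceeds immediately, which keeps the
// total time close to len(chunks)*batchInterval.
func applyOverTime(ctx context.Context, batchInterval time.Duration, chunks [][]string, applyFn func([]string)) error {
  ticker := time.NewTicker(batchInterval)
  defer ticker.Stop()
  for i, chunk := range chunks {
    applyFn(chunk)
    if i == len(chunks)-1 {
      return nil
    }
    select {
    case <-ctx.Done():
      // Return without processing the remaining chunks.
      return ctx.Err()
    case <-ticker.C:
    }
  }
  return nil
}

func main() {
  chunks := [][]string{{"i-1", "i-2"}, {"i-3"}}
  _ = applyOverTime(context.Background(), 100*time.Millisecond, chunks, func(c []string) {
    fmt.Println("processing", c)
  })
}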

lib/utils/spreadwork/spreadwork.go (resolved)
lib/srv/discovery/discovery.go (outdated, resolved)
lib/srv/discovery/discovery.go (outdated, resolved)
lib/utils/spreadwork/spreadwork.go (outdated, resolved)
lib/services/watcher.go (outdated, resolved)
lib/utils/spreadwork/spreadwork_test.go (outdated, resolved)
api/types/server.go (outdated, resolved)
api/types/server.go (outdated, resolved)
lib/srv/discovery/discovery.go (outdated, resolved)
Comment on lines 840 to 851
existingEICENodes := make(map[string]existingNodeInfo)

s.nodeWatcher.GetNodes(s.ctx, func(n services.Node) bool {
  if !n.IsEICE() {
    return false
  }
  existingEICENodes[n.ServerInfoName()] = existingNodeInfo{
    name:       n.GetName(),
    expiration: n.GetMetadata().Expires,
  }
  return false
})
Collaborator:

I'm not sure why we need this extra existingEICENodes variable. The logic sounds like it should be pretty simple:

  • Make Server object from the AWS instance we fetched from cloud. You already do this below.
  • Get respective Server resource from the cache. This resource has expiration info in it.
  • Compare both Server resources (we have comparison functions already):
    • If they're the same and Server's expiration is far in future, no-op.
    • If they're different, add to "nodesToUpsert".

Should not require any extra supporting variables, no?

Contributor (Author):

> Get respective Server resource from the cache. This resource has expiration info in it.

NodeWatcher only provides iteration methods; there's no method to retrieve a single Server.
Even if one existed, we would need the Server's Name, because Servers are stored in a map keyed by their Name.

The problem is that a Server's Name is not deterministic: it is a UUID.
So, when we create a Server object from an AWS instance, it gets a random UUID.
To determine whether that Server already exists, we have to iterate over all Teleport Servers and check whether there's an EICE Node with the same AccountID/InstanceID (we don't use the Name).
If one exists, we replace the converted Server's name with the found one.

Possible alternatives:

  • store another map in NodeWatcher where we use the ServerInfo format as key
  • use the ServerInfo format as Node's name

Instead of the ServerInfo format, it could be any other format that uniquely identifies the node using the AWS AccountID and InstanceID.

I prefer the current version as opposed to the alternatives above.
But please let me know what you think of them and if you have another alternative.

Thank you

@marcoandredinis force-pushed the marco/autodiscovery_eice branch from f0f3bf2 to 93a621a on March 26, 2024 11:54
@r0mant (Collaborator) left a comment

lgtm

for _, eiceServer := range existingEICEServersList {
  si, err := types.ServerInfoForServer(eiceServer)
  if err != nil {
    s.Log.Debugf("failed to convert Server %q to ServerInfo: %v", eiceServer.GetName(), err)
Collaborator:

Suggested change:
- s.Log.Debugf("failed to convert Server %q to ServerInfo: %v", eiceServer.GetName(), err)
+ s.Log.Debugf("Failed to convert Server %q to ServerInfo: %v", eiceServer.GetName(), err)

@marcoandredinis force-pushed the marco/autodiscovery_eice branch from 93a621a to eebdaa2 on March 26, 2024 15:48
@marcoandredinis added this pull request to the merge queue on Mar 26, 2024
Merged via the queue into master with commit 455632e on Mar 26, 2024
36 checks passed
@marcoandredinis deleted the marco/autodiscovery_eice branch on March 26, 2024 16:34
@public-teleport-github-review-bot: @marcoandredinis See the table below for backport results.

Branch | Result
branch/v15 | Create PR

Labels: aws (Used for AWS Related Issues), backport/branch/v15, discover (Issues related to Teleport Discover), discovery, no-changelog (Indicates that a PR does not require a changelog entry), size/md
Projects: None yet
Development: Successfully merging this pull request may close these issues:
EC2 Instance Connect Endpoint (ICE) with EC2 Instance Discovery
7 participants