
4.0 - Issue with EC2 Instance Metadata running inside Container #23110

Closed
kylegoch opened this issue Feb 10, 2022 · 36 comments · Fixed by #23191
Assignees
Labels
authentication Pertains to authentication; to the provider itself or otherwise. provider Pertains to the provider itself, rather than any interaction with AWS. regression Pertains to a degraded workflow resulting from an upstream patch or internal enhancement. upstream Addresses functionality related to the cloud provider.
Milestone

Comments

@kylegoch

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Terraform CLI and Terraform AWS Provider Version

v1.1.2

Affected Resource(s)

The provider itself

Terraform Configuration Files

Please include all Terraform configurations required to reproduce the bug. Bug reports without a functional reproduction may be closed without investigation.

provider "aws" {
  region = "us-east-2"

  assume_role {
    role_arn = "<redacted>"
  }
}

Debug Output

Panic Output

Expected Behavior

Terraform should plan and run using the EC2 metadata.

Actual Behavior

│ Error: error configuring Terraform AWS Provider: no valid credential sources for Terraform AWS Provider found.
│ 
│ Please see https://registry.terraform.io/providers/hashicorp/aws
│ for more information about providing credentials.
│ 
│ Error: no EC2 IMDS role found, operation error ec2imds: GetMetadata, canceled, context deadline exceeded
│ 
│ 
│   with provider["registry.terraform.io/hashicorp/aws"],
│   on configuration.tf line 11, in provider "aws":
│   11: provider "aws" {

Steps to Reproduce

  1. terraform plan

Important Factoids

Today when switching to v4.0, we discovered we could no longer run Terraform on EC2 instances that use the AWS Instance Metadata service. Running v4.0 locally works fine. But running the same terraform on an EC2 instance (such as for CICD) results in the error shown above.

Rolling back to 3.74.1 fixes the issue and all works as planned.

The instances in question are running both v1 and v2 of the Instance Metadata service.

@github-actions github-actions bot added the needs-triage Waiting for first response or review from a maintainer. label Feb 10, 2022
@ewbankkit ewbankkit added the provider Pertains to the provider itself, rather than any interaction with AWS. label Feb 10, 2022
@YakDriver YakDriver added upstream Addresses functionality related to the cloud provider. regression Pertains to a degraded workflow resulting from an upstream patch or internal enhancement. and removed needs-triage Waiting for first response or review from a maintainer. labels Feb 10, 2022
@ntman4real

@YakDriver so what is the fix? Whole lot of things are breaking at the moment. Should have pinned provider ver. but didn't.

FYI same issue as OP, works locally but when using assumed role via CICD EC2 runners, the issue exists.

@gdavison
Contributor

Thanks for reporting this, @kylegoch. Could you attach the debug log please? You can enable debug logging by setting the environment variable TF_LOG=DEBUG. For more information on logging, see https://www.terraform.io/internals/debugging

@gdavison gdavison self-assigned this Feb 10, 2022
@ntman4real

here is my debug log @gdavison

debug.txt

@opalmer

opalmer commented Feb 11, 2022

I was working on producing a debug log too and ran across something interesting. It does work in my case, but only outside of the Docker container that runs the Terraform job. The container runs on a host where the instance metadata service, just like @kylegoch's, is exposed, and the container has access to it via a bridge network. Here's an example of how that container is spun up:

sudo docker run --name test-container --network=bridge -ti alpine:latest /bin/sh

I wrote a small test program that I think reproduces the behavior I'm seeing too. On the host it runs fine but inside the container it produces:


Please see https://registry.terraform.io/providers/hashicorp/aws
for more information about providing credentials.

Error: no EC2 IMDS role found, operation error ec2imds: GetMetadata, canceled, context deadline exceeded


goroutine 1 [running]:
main.main()
	/Users/opalmer/go/src/github.com/terraform-providers/terraform-provider-aws/main.go:56 +0x595

Output from docker info and the test program itself is attached. Host network wise we're not doing anything special with iptables that could be causing this behavior. I'll continue digging on my side to see if there's anything else I can track down that might be helpful.

@opalmer

opalmer commented Feb 11, 2022

... and of course it would be helpful if I actually included that program I mentioned haha.

debug.tar.gz

@breser

breser commented Feb 11, 2022

This is happening because v4.0.0 is using IMDSv2. IMDSv2 requires a PUT to retrieve the token. There is a setting that limits the number of hops that the response to that PUT will go before being dropped by the network, httpPutResponseHopLimit, the default for this setting is 1.

This means that if you are more than one network hop away from the IMDS, you will get these errors. The most common reason is that you're running in a Docker container.

The solution is to increase the hop count to at least 2. Running the following command will fix this:
aws ec2 modify-instance-metadata-options --instance-id "$INSTANCE_ID" --http-put-response-hop-limit 2 --http-endpoint enabled

This change is very vaguely mentioned in the release notes for v4.0.0:

provider: Updates AWS authentication to use AWS SDK for Go v2 https://aws.github.io/aws-sdk-go-v2/docs/ (#20587)

This doesn't seem to impact the AWS CLI when running in a Docker container with the hop count set to 1 on a host that allows v1 and v2. I suspect it either falls back to v1 or tries v1 first, and so never fails to get the response from the PUT. But that's probably not the Terraform provider's issue so much as the AWS Go SDK's.

This is likely to impact a lot of people because many build systems use docker containers these days. I'd strongly recommend working to get the Go SDK to do something intelligent here.

@gdavison
Contributor

The Provider is using the AWS SDK for Go v2 for authentication. According to AWS documentation,

The AWS SDKs use IMDSv2 calls by default. If the IMDSv2 call receives no response, the SDK retries the call and, if still unsuccessful, uses IMDSv1. This can result in a delay. In a container environment, if the hop limit is 1, the IMDSv2 response does not return because going to the container is considered an additional network hop. To avoid the process of falling back to IMDSv1 and the resultant delay, in a container environment we recommend that you set the hop limit to 2.

The AWS SDK for Go v1 also tries IMDSv2 first, so it's not clear why it worked with earlier versions of the provider and fails with v4.0.

We can update our documentation and try to return a more helpful message.

@opalmer

opalmer commented Feb 11, 2022

@breser good call on that second hop, forgot about that with IMDS!

@gdavison, I'm going to echo what @breser suggests and work on figuring out how to make this fall back correctly. The test code I attached reproduced the issue but the following code also works out of the box inside a docker container:

package main

import (
	"context"
	"fmt"
	"time"

	"github.com/aws/aws-sdk-go-v2/config"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second*10)
	defer cancel()

	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		panic(err)
	}

	creds, err := cfg.Credentials.Retrieve(ctx)
	if err != nil {
		panic(err)
	}

	fmt.Println(creds)
}

This is using the same version of the AWS Go SDK v2 that the terraform provider is using. In fact, the above is what I dropped directly into main.go after cloning down this project and checking out the v4.0.0 tag.

If the default configuration from the AWS SDK works around this issue, then I believe the provider should as well. I suspect it's specifically something to do with how the AWS config is generated in github.com/hashicorp/aws-sdk-go-base. Another reason to fix this: if you do something unusual with your network, the number of hops can change, easily breaking Terraform and requiring another instance-level modification. If someone has specifically disabled the fallback on their host, that's one thing; but if it's enabled and available to Terraform, the provider should take advantage of it after trying the better option (IMDSv2) first.

@Grummfy

Grummfy commented Feb 11, 2022

Tip for whoever is blocked by this: roll back to v3, see #20433 and use

terraform {
  required_providers {
    aws = {
      version = "~> 3.0"
    }
  }
}

@FernandoMiguel
Contributor

Anyone using an ASG needs

  metadata_options {
    http_endpoint               = "enabled"
    http_tokens                 = "required"
    http_put_response_hop_limit = 2
    instance_metadata_tags      = "enabled"
  }

https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/launch_template#metadata-options
if you are using the community ASG module, https://github.com/terraform-aws-modules/terraform-aws-autoscaling#input_metadata_options

@rexsuecia

rexsuecia commented Feb 11, 2022

I ran into the same; my conclusion was that provider 4.0.0 no longer respects the credentials set in the environment (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY).

I run a number of projects on GitLab CI/CD and have used the same approach for years, but today it broke. I have the secrets stored in group-level environment variables.

When it does not find the credentials in the environment, it looks for "profiles" in ~/.aws/credentials etc. When that fails, it tries the metadata service (which I do not have, since I do not run on EC2; I think GitLab runs on GCP) and hence fails miserably.

My workaround was to put the creds in a file like:
mkdir -p ~/.aws && echo -e "[profile-name]\naws_access_key_id = $AWS_ACCESS_KEY_ID\naws_secret_access_key = $AWS_SECRET_ACCESS_KEY\n" > ~/.aws/credentials

And finally, carefully clean out these credentials: rm -rf ~/.aws/credentials

Far from optimal but doing so it works, for me.

FWIW

@YakDriver
Member

@rexsuecia Thank you for reporting this additional aspect. The provider should still respect the access and secret key env vars. Will you create a separate issue for that so we can track it?

@rexsuecia

@YakDriver I wish I had the time to do that, but the issue submission process is so cumbersome I simply cannot in the near term. I have 20+ projects that need patching to handle this (and the other interesting breaking changes in v4 ;-) ), so I am more than busy this weekend, and I bet you will have released a fixed 4.0.1 before I have even created a reproducible gist.

@aissarmurad

Tip for whoever is blocked by this: roll back to v3, see #20433 and use

terraform {
  required_providers {
    aws = {
      version = "~> 3.0"
    }
  }
}

Most of the time, the workaround will be

terraform {
  required_providers {
     aws = {
       version = "~> 3"
     }
  }
}

@FernandoMiguel
Contributor

Tip for whoever is blocked by this: roll back to v3, see #20433 and use

terraform {
  required_providers {
    aws = {
      version = "~> 3.0"
    }
  }
}

Most of the time, the workaround will be

terraform {
  required_providers {
     aws = {
       version = "~> 3"
     }
  }
}

That will not work.
~> 3 means 3 or pretty much anything above it, because the major version is the rightmost component and is allowed to increment.
You need ~> 3.0 so it allows any release within major version 3.

@aissarmurad

@FernandoMiguel according to the Terraform documentation

~>: Allows only the rightmost version component to increment. For example, to allow new patch releases within a specific minor release, use the full version number: ~> 1.0.4 will allow installation of 1.0.5 and 1.0.10 but not 1.1.0. This is usually called the pessimistic constraint operator.

Reference
https://www.terraform.io/language/expressions/version-constraints

@FernandoMiguel
Contributor

@FernandoMiguel according to the Terraform documentation

~>: Allows only the rightmost version component to increment. For example, to allow new patch releases within a specific minor release, use the full version number: ~> 1.0.4 will allow installation of 1.0.5 and 1.0.10 but not 1.1.0. This is usually called the pessimistic constraint operator.

Reference

https://www.terraform.io/language/expressions/version-constraints

That's exactly what I said 😉
What do you think happens if you pin 3 only without a dot zero?

@gdavison
Contributor

Thanks for your patience, everyone. We're investigating what has changed between v4.0 and previous versions that causes this to fail inside containers now.

If you have other authentication issues that are not related to using the EC2 Instance Metadata Service from inside a Container, please open a new issue so that they can be tracked separately.

@gdavison gdavison changed the title 4.0 - Issue with EC2 Instance Metadata 4.0 - Issue with EC2 Instance Metadata running inside Container Feb 11, 2022
@gdavison
Contributor

@opalmer thanks for your investigation. When the instance is configured to use either IMDSv1 or IMDSv2, the sample code succeeds, but when IMDSv2 is required, the sample code fails with

panic: no EC2 IMDS role found, operation error ec2imds: GetMetadata, canceled, context deadline exceeded

@opalmer and @kylegoch, can you paste the output of aws ec2 describe-instances --instance-ids <instance id> | jq '.Reservations[0].Instances[0].MetadataOptions'

@gdavison
Contributor

I've just tried the provider v3 authentication flow in a container with both IMDSv1 and IMDSv2, which succeeds, and requiring IMDSv2, which fails.

@breser

breser commented Feb 12, 2022

This was happening for me with machines that had IMDSv1 and IMDSv2 enabled (taken from a AWS Config snapshot that I pulled down trying to investigate this issue yesterday):
"metadataOptions": { "state": "applied", "httpTokens": "optional", "httpPutResponseHopLimit": 1, "httpEndpoint": "enabled" },

@gdavison
Contributor

@breser was Terraform running in a container? Can you share the contents of your provider configuration block, please?

provider "aws" {
  ...
}

@breser

breser commented Feb 12, 2022

Yes, running in a container:

provider "aws" {
  region  = var.region
  assume_role {
    role_arn = "arn:aws:iam::${var.account_id}:role/RoleName"
  }
}

Using terraform 0.13.7 (yes I know it's old).

Provider versions from the init output:

Initializing provider plugins...
- Finding hashicorp/template versions matching ">= 2.1.2"...
- Finding hashicorp/aws versions matching ">= 2.55.0"...
- Installing hashicorp/template v2.2.0...
- Installed hashicorp/template v2.2.0 (signed by HashiCorp)
- Installing hashicorp/aws v4.0.0...
- Installed hashicorp/aws v4.0.0 (signed by HashiCorp)

@FernandoMiguel
Contributor

Thanks. Totally agree this will impact a lot of people. Especially those who terminate build servers to save costs on the weekends like me. I will have to find a way to run the aws cli command every time the instance starts

Why?
It's a simple metadata option you pass to the launch config of that VM. It's not even a cloud-init change. So, super simple.

@FernandoMiguel
Contributor

Thanks. Totally agree this will impact a lot of people. Especially those who terminate build servers to save costs on the weekends like me. I will have to find a way to run the aws cli command every time the instance starts

Why? It's a simple metadata option you pass to the launch config of that VM. It's not even a cloud-init change. So, super simple.

Yes, but I have multiple Service Catalog templates that spin up Runners for multiple projects. I will have to find a way to add that option to the template

Welcome to managing infra with code.
What would you do if you had to add an extra EBS volume?

@dr-travis

The following solution works for me.

Change the paths to aws config file and credential file from:

provider "aws" {
  region = "us-east-2"
  shared_config_files=["~/.aws/config"] # Or $HOME/.aws/config
  shared_credentials_files = ["~/.aws/credentials"] # Or $HOME/.aws/credentials
  profile = "default"
}

to

provider "aws" {
  region = "us-east-2"
  shared_config_files=["/Users/me/.aws/config"]
  shared_credentials_files = ["/Users/me/.aws/credentials"]
  profile = "default"
}

@willthames

willthames commented Feb 14, 2022

Thanks. Totally agree this will impact a lot of people. Especially those who terminate build servers to save costs on the weekends like me. I will have to find a way to run the aws cli command every time the instance starts

Why? It's a simple metadata option you pass to the launch config of that VM. It's not even a cloud-init change. So, super simple.

Yes, but I have multiple Service Catalog templates that spin up Runners for multiple projects. I will have to find a way to add that option to the template

Welcome to managing infra with code. What would you do if you had to add an extra EBS volume?

This reply seems unnecessarily dismissive.

In our case, our agents are managed by terraform cloud - we're paying hashicorp good money to avoid having to manage terraform workers - and we don't have the level of access to be able to configure metadata settings.

Edit: oops, the issue is occurring on agents running on our infrastructure, which I do have control of.

@FernandoMiguel
Contributor

FernandoMiguel commented Feb 14, 2022 via email

@mccartney

For people using the EC2 plugin in Jenkins who configure Jenkins as YAML code, this line (shown as the last one) helps:

          - description: "my worker"
            type: Z1d6xlarge
[...]
            associatePublicIp: true
            metadataHopsLimit: 2

@chris-peterson
Contributor

surprised to see all the mentions of "fixing up existing instances" with various CLI incantations.

the bread and butter of terraform is immutable infrastructure.

IMO, the right sustainable fix is to modify metadata_options in your terraform source(s). this will vary slightly based on how you are creating instances, but the various mechanisms support similar functionality:

The field to pay special attention to is http_put_response_hop_limit which should be changed from its default (1) to 2 (for most cases)

In my case, we were using launch configurations, adding the following to the aws_launch_configuration that creates our infrastructure builders got things back to ✅

  metadata_options {
    http_endpoint               = "enabled"
    http_tokens                 = "optional"
    http_put_response_hop_limit = 2
  }

@fabionovais

The following solution works for me.

Change the paths to aws config file and credential file from:

provider "aws" {
  region = "us-east-2"
  shared_config_files=["~/.aws/config"] # Or $HOME/.aws/config
  shared_credentials_files = ["~/.aws/credentials"] # Or $HOME/.aws/credentials
  profile = "default"
}

to

provider "aws" {
  region = "us-east-2"
  shared_config_files=["/Users/me/.aws/config"]
  shared_credentials_files = ["/Users/me/.aws/credentials"]
  profile = "default"
}

thanks @dr-travis, this solution was OK for me

@Akupsmee

The following solution works for me.

Change the paths to aws config file and credential file from:

provider "aws" {
  region = "us-east-2"
  shared_config_files=["~/.aws/config"] # Or $HOME/.aws/config
  shared_credentials_files = ["~/.aws/credentials"] # Or $HOME/.aws/credentials
  profile = "default"
}

to

provider "aws" {
  region = "us-east-2"
  shared_config_files=["/Users/me/.aws/config"]
  shared_credentials_files = ["/Users/me/.aws/credentials"]
  profile = "default"
}

worked for me after running the "aws configure" command

@github-actions

This functionality has been released in v4.1.0 of the Terraform AWS Provider. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading.

For further feature requests or bug reports with this functionality, please create a new GitHub issue following the template. Thank you!

@software-engr-full-stack

I don't know if this will help. I've been using Terraform Cloud as my back end. I changed the workspace execution mode from "remote" to "local" and it worked. I didn't change any versions. I'm using whatever version terraform init installed which was hashicorp/aws v4.6.0 as of this writing.

@github-actions

github-actions bot commented May 7, 2022

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators May 7, 2022