Plugin did not respond error - when using terraform apply and connected to company VPN #30532

Closed
mpigram opened this issue Feb 15, 2022 · 19 comments
Labels
bug v1.1 Issues (primarily bugs) reported against v1.1 releases waiting-response An issue/pull request is waiting for a response from the community

Comments

@mpigram

mpigram commented Feb 15, 2022

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Terraform (and AzureRM Provider) Version

Terraform v1.1.5
on darwin_amd64

  • provider registry.terraform.io/hashicorp/azuread v1.2.2
  • provider registry.terraform.io/hashicorp/azurerm v2.96.0
  • provider registry.terraform.io/hashicorp/random v3.1.0

Affected Resource(s)

  • No resources affected; the provider just will not authenticate when connected to our VPN

Terraform Configuration Files

terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "=2.96.0"
    }
    azuread = {
      source  = "hashicorp/azuread"
      version = "~> 1.2.2"
    }
  }
}

provider "azurerm" {
  subscription_id = var.subscription_id
  features {
    key_vault {
      purge_soft_delete_on_destroy = true
    }
  }
}

provider "azurerm" {
  alias           = "parent"
  subscription_id = var.dns_subscription_id
  skip_provider_registration = true
  features {
    key_vault {
      purge_soft_delete_on_destroy = true
    }
  }
}

Debug Output

2022-02-16T14:17:25.402Z [DEBUG] provider.terraform-provider-azurerm_v2.96.0_x5: AzureRM Response Error: Get "https://management.azure.com/subscriptions/b091e1a3-5af3-482d-b245-5734af84f707/providers?api-version=2016-02-01": dial tcp: lookup management.azure.com on 10.20.3.1:53: cannot unmarshal DNS message for https://management.azure.com/subscriptions/b091e1a3-5af3-482d-b245-5734af84f707/providers?api-version=2016-02-01: timestamp=2022-02-16T14:17:25.402Z

Panic Output

╷
│ Error: operation canceled
│ 
│ 
╵
╷
│ Error: Plugin did not respond
│ 
│   with provider["registry.terraform.io/hashicorp/azurerm"],
│   on provider.tf line 14, in provider "azurerm":
│   14: provider "azurerm" {
│ 
│ The plugin encountered an error, and failed to respond to the plugin.(*GRPCProvider).ConfigureProvider call. The plugin logs may contain more details.
╵
╷
│ Error: Plugin did not respond
│ 
│   with module.create_dns_names.provider["registry.terraform.io/hashicorp/azurerm"].parent,
│   on modules/dns/main.tf line 1, in provider "azurerm":
│    1: provider "azurerm" {
│ 
│ The plugin encountered an error, and failed to respond to the plugin.(*GRPCProvider).ConfigureProvider call. The plugin logs may contain more details.
╵
╷
│ Error: Plugin did not respond
│ 
│   with module.create_redis.provider["registry.terraform.io/hashicorp/azurerm"],
│   on modules/redis/provider.tf line 1, in provider "azurerm":
│    1: provider "azurerm" {
│ 
│ The plugin encountered an error, and failed to respond to the plugin.(*GRPCProvider).ConfigureProvider call. The plugin logs may contain more details.
╵
╷
│ Error: Plugin did not respond
│ 
│   with module.create_storage_account.provider["registry.terraform.io/hashicorp/azurerm"],
│   on modules/storage/provider.tf line 1, in provider "azurerm":
│    1: provider "azurerm" {
│ 
│ The plugin encountered an error, and failed to respond to the plugin.(*GRPCProvider).ConfigureProvider call. The plugin logs may contain more details.

Expected Behaviour

An apply should just start and give the option to answer yes or no to the deployment.

Actual Behaviour

It hangs on the apply command, and I have to cancel twice to stop it. It then outputs the error message I pasted under Panic Output.

Steps to Reproduce

Connect to my company's VPN, run terraform apply.

  1. terraform apply

Important Factoids

References

@tombuildsstuff
Contributor

This appears to be a Terraform Core issue, since it appears to be unable to communicate with the plugin - transferring this there.

@tombuildsstuff tombuildsstuff transferred this issue from hashicorp/terraform-provider-azurerm Feb 16, 2022
@tombuildsstuff tombuildsstuff added the new new issue not yet triaged label Feb 16, 2022
@crw crw added the bug label Feb 16, 2022
@crw
Contributor

crw commented Feb 16, 2022

@tombuildsstuff Wouldn't this message indicate a problem resolving DNS? What I am trying to understand is how this (apparent) DNS resolution failure ties back to the GRPC failures in the "Panic Output". Thanks for any additional info.

Debug Output

2022-02-16T14:17:25.402Z [DEBUG] provider.terraform-provider-azurerm_v2.96.0_x5: AzureRM Response Error: Get "https://management.azure.com/subscriptions/b091e1a3-5af3-482d-b245-5734af84f707/providers?api-version=2016-02-01": dial tcp: lookup management.azure.com on 10.20.3.1:53: cannot unmarshal DNS message for https://management.azure.com/subscriptions/b091e1a3-5af3-482d-b245-5734af84f707/providers?api-version=2016-02-01: timestamp=2022-02-16T14:17:25.402Z

@apparentlymart
Contributor

It does seem like there is a mixture of things here and so it's not clear yet which is the root cause, but we can see for certain that Terraform Core was able to launch the provider and it was able to start doing its work, because we see a log line from it.

This other set of messages seems to suggest a timeout where ConfigureProvider didn't return before Terraform Core hit a deadline. It isn't clear to me yet why a DNS error in the provider would lead to a timeout in that response, rather than the provider just passing the error back to Terraform Core to be reported in the normal way.

We have seen symptoms similar to these Terraform Core error messages before on Windows systems where firewall software blocks Terraform Core from connecting to the plugin's RPC interface, so if that DNS error is the result of something the provider does during its own init, before getting any requests from Terraform Core, that could suggest that there are two separate problems here, both being reported concurrently: the provider can't access this service due to a DNS problem, and Terraform Core can't access the plugin due to a firewall.

Both of these can be reasonable outcomes from an interfering firewall or similar network middlebox though, so if that is the cause of both then unfortunately there may be nothing we can do on our end, and instead it would mean configuring that firewall/etc differently to allow Terraform to do its work. 😖

@mpigram
Author

mpigram commented Feb 17, 2022

@apparentlymart that final paragraph is very frustrating if that's the case! I'll have a word with my IT guys and see if they have changed any rules regarding the firewall + VPN.

Thanks for looking into this in the meantime and apologies if I haven't got this error across properly.

PS - I'm on Mac, so the Windows system theory can be forgotten, unless there is a similar issue on Mac.

@apparentlymart
Contributor

Hi @mpigram,

For the Azure provider error in particular, it could be that the particular version of the Azure provider you are using is compiled in such a way that it isn't able to access the macOS system resolver, which has been a common problem in the past for Go-based software on macOS, as discussed in golang/go#12524. Building an executable which has correct DNS behavior on macOS requires some special care in Go toolchain usage, and requires building on macOS itself rather than cross-compiling. Recent versions of Terraform CLI should be built to support this correctly, but since the provider plugin releases are separate I can't be sure whether that particular Azure provider release will exhibit correct DNS behavior on macOS.

If that is the cause of the DNS-related error then it may be best for us to split this issue in two parts and move the DNS-related issue back into the Azure provider repository, since we won't be able to do anything to improve that situation by changes in Terraform Core.

I'm still curious to understand why a DNS error in the provider would lead to a timeout reported by Terraform Core, so I think there is still something to be understood here, but hopefully this DNS-resolver-related issue is something we can more quickly determine without a lot of deep debugging, if the Azure provider team knows which of the provider releases have correct macOS resolver support.

@mpigram
Author

mpigram commented Feb 17, 2022

@apparentlymart Thanks for the breakdown

Would rolling back a version or two help with the golang compile issue on the Azure Provider, as a fix for the time being?

Feel free to split this up accordingly, I'm happy with that! 👍

@apparentlymart
Contributor

Hi @mpigram,

I'm unfortunately not a macOS user myself and so I've been struggling a little to try to prove either way whether my theory about the cause of the DNS resolution failure holds. It seems like determining that requires running an executable on a macOS system with certain special environment variables enabled, and those environment variables cause the Go runtime to emit extra information that would not be visible in the context of a Terraform plugin because its output streams are not connected to the terminal.

I think I will need to halt investigating here for now and let one of my colleagues who does use macOS -- or, alternatively, is familiar enough with the Azure provider release process to know whether it's built with CGo enabled on macOS -- confirm or deny my theory.

Since I don't have a way to test whether a particular executable is built differently, I can't say for certain whether there will be another version of the provider that you could use at this time. 😖

@crw
Contributor

crw commented Feb 18, 2022

@tombuildsstuff do you know the answer to the question? If not we can refer to the release eng team.

"is familiar enough with the Azure provider release process to know whether it's built with CGo enabled on macOS"

@tombuildsstuff
Contributor

@crw all of the Provider binaries are cross-compiled on Linux (with CGO enabled, we don't manually enable/disable this iirc) - we do not build on macOS at this time.

@mpigram
Author

mpigram commented Mar 7, 2022

@tombuildsstuff this has only been happening recently...maybe 2-3 weeks. Is this due to a recent update then or would you say this would've been the expected behaviour for a while?

@tombuildsstuff
Contributor

@mpigram nothing's changed within the Provider on that front in the last ~4 months: https://github.com/hashicorp/terraform-provider-azurerm/blob/main/.go-version - although we'll likely be updating that in the near future fwiw.

Since you mention this has happened in the last few weeks - have you updated any surrounding software (macOS/Terraform Core etc) during that time period / are you running any endpoint security software which may be intercepting/delaying the launch of the Provider?

@apparentlymart
Contributor

If the provider's macOS releases are cross-compiled from Linux then I think golang/go#12524 is the most likely root cause here: unless taking some very unusual steps in the build process (such as the things I was summarizing in golang/go#12524 (comment)), there isn't really any practical way to produce an executable which has correct macOS DNS resolution behavior when cross-compiling, because the macOS C toolchain is only available on macOS itself, so a CGo-enabled build from a Linux system would fail to find the necessary headers.


Out of curiosity I just tried it on my own Linux system and it seems that the problem is more fundamental than just headers for me; the build process seems to be including a -arch x86_64 argument that the C compilers available to me on Linux don't support:

$ GOOS=darwin GOARCH=amd64 CGO_ENABLED=1 go build .
# runtime/cgo
gcc: error: x86_64: No such file or directory
gcc: error: unrecognized command line option '-arch'
$ CC=clang GOOS=darwin GOARCH=amd64 CGO_ENABLED=1 go build -o provider .
# runtime/cgo
clang: error: argument unused during compilation: '-arch x86_64' [-Werror,-Wunused-command-line-argument]

As far as I know, for any program that resolves hostnames the only supported way to produce a correctly-functioning executable for macOS is to build on macOS. 😖

@crw crw removed the new new issue not yet triaged label Mar 9, 2022
@archoversight

The "cannot unmarshal DNS message" issue should be resolved with the latest Go release, since this was fixed: golang/go#51127

The split DNS issue on macOS is not the issue.

@archoversight

"@crw all of the Provider binaries are cross-compiled on Linux (with CGO enabled, we don't manually enable/disable this iirc) - we do not build on macOS at this time."

CGO is automatically disabled for cross building...

@tombuildsstuff
Contributor

This PR tracks upgrading the AzureRM Provider to use Go 1.18: hashicorp/terraform-provider-azurerm#15902

Whilst we can't commit to a timeframe for building/cross-compiling from macOS, this is something we've got planned fwiw

@apparentlymart
Contributor

Thanks for confirming, @tombuildsstuff.

At this point then it seems like what remains for this issue is to determine why this problem appeared as the "Plugin did not respond" error, rather than as e.g. a DNS resolution error from the provider.

We can see in the debug output that the Azure provider did start up, did try to make an outgoing request, and did get back a DNS resolution failure:

2022-02-16T14:17:25.402Z [DEBUG] provider.terraform-provider-azurerm_v2.96.0_x5: AzureRM Response Error: Get "https://management.azure.com/subscriptions/b091e1a3-5af3-482d-b245-5734af84f707/providers?api-version=2016-02-01": dial tcp: lookup management.azure.com on 10.20.3.1:53: cannot unmarshal DNS message for https://management.azure.com/subscriptions/b091e1a3-5af3-482d-b245-5734af84f707/providers?api-version=2016-02-01: timestamp=2022-02-16T14:17:25.402Z

What should typically happen in that case is that the provider would return a similar error message back to Terraform Core and then Terraform Core would show it, but in this case it seems like the real error got swallowed somewhere and Terraform Core treated it as a generic timeout instead.

That could either be a bug in Terraform Core or a bug in the provider. It would be a bug in Terraform Core if the provider did return the error but Terraform Core didn't handle it. It would be a bug in the provider if the provider itself swallowed the error and deadlocked itself, rather than reporting the error.

Given that we don't yet have a way to reproduce this outside of the system where it was originally seen, I think a next step here would be to try to identify where in the Azure provider that error emerges and review how the provider handles it. If we can see a clear path from the Azure SDK (presumably) generating the error to the provider returning it then that would suggest that Terraform Core is the one responsible for the problem.

@tombuildsstuff
Contributor

The latest release of the AzureRM Provider (3.1.0) builds using Go 1.18 - can you take a look and see if that solves this for you @mpigram?

@crw crw added the waiting-response An issue/pull request is waiting for a response from the community label Apr 11, 2022
@apparentlymart apparentlymart added the v1.1 Issues (primarily bugs) reported against v1.1 releases label Sep 16, 2022
@apparentlymart
Contributor

Hi again @mpigram!

We didn't hear back from you after the request to try this with AzureRM Provider 3.1.0, so I'm going to close this under the assumption that this is no longer a problem for you.

Reviewing the discussion above it seems to me that the only concrete problem we were able to establish was the Azure provider itself interacting with some Go standard library bugs, and the Azure provider team has attempted to fix the part of the problem identified above by building with a newer version of Go.

There was also the question of whether there's a Terraform Core bug preventing the error from the provider from being shown as a real error rather than as a communication error, but we've not heard any other reports of similar problems in other situations and we don't have an isolated reproduction of it here, so I don't expect we'll be able to make any further progress on this as a Terraform Core issue. We've not yet confirmed that the provider-side problem is fixed, but if not then the Azure provider repository would be a better place to continue discussing that.

If you're someone else finding this comment some time later because you've encountered an error with similar error text, I'd suggest starting by reporting an issue against the provider identified in the error message, in the provider's own GitHub repository. If a provider team is then able to use such a report as an example of Terraform Core swallowing an error diagnostic returned by the provider, I'd be grateful if that team would open a new issue in this repository showing the reproduction case, and then we'll investigate further. Thanks!

@apparentlymart apparentlymart closed this as not planned Won't fix, can't repro, duplicate, stale Sep 29, 2022
@github-actions
Contributor

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 30, 2022

5 participants