Plugin did not respond error - when using terraform apply and connected to company VPN #30532
Comments
This appears to be a Terraform Core issue, since it appears to be unable to communicate with the plugin - transferring this there. |
@tombuildsstuff Wouldn't this message indicate a problem resolving DNS? What I am trying to understand is how this (apparent) DNS resolution failure ties back to the GRPC failures in the "Panic Output". Thanks for any additional info. Debug Output
It does seem like there is a mixture of things here and so it's not clear yet which is the root cause, but we can see for certain that Terraform Core was able to launch the provider and it was able to start doing its work, because we see a log line from it. This other set of messages seems to suggest a timeout where We have seen symptoms similar to these Terraform Core error messages before on Windows systems where firewall software blocks Terraform Core from connecting to the plugin's RPC interface, so if that DNS error is the result of something the provider does during its own init, before getting any requests from Terraform Core, that could suggest that there are two separate problems here, both being reported concurrently: the provider can't access this service due to a DNS problem, and Terraform Core can't access the plugin due to a firewall. Both of these can be reasonable outcomes from an interfering firewall or similar network middlebox though, so if that is the cause of both then unfortunately there may be nothing we can do on our end, and instead it would mean configuring that firewall/etc differently to allow Terraform to do its work. 😖 |
@apparentlymart that final paragraph is very frustrating if that's the case! I'll have a word with my IT guys and see if they have changed any rules regarding the firewall + VPN. Thanks for looking into this in the meantime and apologies if I haven't got this error across properly. PS - I'm on Mac, so the Windows system theory can be forgotten, unless there is a similar issue on Mac. |
Hi @mpigram, For the Azure provider error in particular, it could be that the particular version of the Azure provider you are using is compiled in such a way that it isn't able to access the macOS system resolver, which has been a common problem in the past for Go-based software on macOS, as discussed in golang/go#12524. Building an executable which has correct DNS behavior on macOS requires some special care in Go toolchain usage, and requires building on macOS itself rather than cross-compiling. Recent versions of Terraform CLI should be built to support this correctly, but since the provider plugin releases are separate I can't be sure whether that particular Azure provider release will exhibit correct DNS behavior on macOS. If that is the cause of the DNS-related error then it may be best for us to split this issue in two parts and move the DNS-related issue back into the Azure provider repository, since we won't be able to do anything to improve that situation by changes in Terraform Core. I'm still curious to understand why a DNS error in the provider would lead to a timeout reported by Terraform Core, so I think there is still something to be understood here, but hopefully this DNS-resolver-related issue is something we can more quickly determine without a lot of deep debugging, if the Azure provider team knows which of the provider releases have correct macOS resolver support. |
@apparentlymart Thanks for the breakdown. Would rolling back a version or two help with the Go compile issue on the Azure Provider, as a fix for the time being? Feel free to split this up accordingly, I'm happy with that! 👍 |
Hi @mpigram, I'm unfortunately not a macOS user myself and so I've been struggling a little to try to prove either way whether my theory about the cause of the DNS resolution failure holds. It seems like determining that requires running an executable on a macOS system with certain special environment variables enabled, and those environment variables cause the Go runtime to emit extra information that would not be visible in the context of a Terraform plugin because its output streams are not connected to the terminal. I think I will need to halt investigating here for now and let one of my colleagues who does use macOS -- or, alternatively, is familiar enough with the Azure provider release process to know whether it's built with CGo enabled on macOS -- confirm or deny my theory. Since I don't have a way to test whether a particular executable is built differently, I can't say for certain whether there will be another version of the provider that you could use at this time. 😖 |
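For anyone who wants to test the resolver theory outside of a plugin, where the output streams are attached to a terminal, the following is a minimal sketch rather than anything taken from the provider. It resolves one hostname twice, once with the process's default resolver and once with Go's pure-Go resolver forced through `net.Resolver{PreferGo: true}`. The default hostname `example.com` is a placeholder; pass the host from your own debug output as the first argument. The helper names `resolvers` and `lookup` are made up for this example.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"os"
	"time"
)

// namedResolver pairs a label with the resolver it describes.
type namedResolver struct {
	name string
	r    *net.Resolver
}

// resolvers returns the two resolvers to compare: whatever this binary
// uses by default, and one that asks for Go's built-in DNS client.
func resolvers() []namedResolver {
	return []namedResolver{
		{"default", net.DefaultResolver},
		{"pure-go", &net.Resolver{PreferGo: true}},
	}
}

// lookup resolves host with r, giving up after five seconds.
func lookup(r *net.Resolver, host string) ([]string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	return r.LookupHost(ctx, host)
}

func main() {
	host := "example.com" // placeholder, not the host from this issue
	if len(os.Args) > 1 {
		host = os.Args[1]
	}
	for _, nr := range resolvers() {
		addrs, err := lookup(nr.r, host)
		if err != nil {
			fmt.Printf("%-8s FAILED: %v\n", nr.name, err)
			continue
		}
		fmt.Printf("%-8s ok: %v\n", nr.name, addrs)
	}
}
```

Running it on and off the VPN, for example with `GODEBUG=netdns=1 go run . <hostname>`, should also make the Go runtime print which resolver it selected, as described in the `net` package documentation. If only the pure-Go path fails while connected to the VPN, that would be consistent with the split-DNS theory, though it would not prove that the provider binary behaves the same way.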
@tombuildsstuff do you know the answer to the question? If not we can refer to the release eng team.
@crw all of the Provider binaries are cross-compiled on Linux (with CGO enabled, we don't manually enable/disable this iirc) - we do not build on macOS at this time. |
@tombuildsstuff this has only been happening recently...maybe 2-3 weeks. Is this due to a recent update then or would you say this would've been the expected behaviour for a while? |
@mpigram nothing's changed within the Provider on that front in the last ~4 months: https://github.com/hashicorp/terraform-provider-azurerm/blob/main/.go-version - although we'll likely be updating that in the near future fwiw. Since you mention this has happened in the last few weeks - have you updated any surrounding software (macOS/Terraform Core etc) during that time period / are you running any endpoint security software which may be intercepting/delaying the launch of the Provider? |
If the provider's macOS releases are cross-compiled from Linux then I think golang/go#12524 is the most likely root cause here: unless taking some very unusual steps in the build process (such as the things I was summarizing in golang/go#12524 (comment)), there isn't really any practical way to produce an executable which has correct macOS DNS resolution behavior when cross-compiling, because the macOS C toolchain is only available on macOS itself, so a CGo-enabled build from a Linux system would fail to find the necessary headers. Out of curiosity I just tried it on my own Linux system and it seems that the problem is more fundamental than just headers for me; the build process seems to be including a
$ GOOS=darwin GOARCH=amd64 CGO_ENABLED=1 go build .
# runtime/cgo
gcc: error: x86_64: No such file or directory
gcc: error: unrecognized command line option '-arch'
$ CC=clang GOOS=darwin GOARCH=amd64 CGO_ENABLED=1 go build -o provider .
# runtime/cgo
clang: error: argument unused during compilation: '-arch x86_64' [-Werror,-Wunused-command-line-argument]
As far as I know, for any program that resolves hostnames the only supported way to produce a correctly-functioning executable for macOS is to build on macOS. 😖 |
The unmarshall DNS issue should be resolved with the latest Go release, since this was fixed in golang/go#51127. The split DNS issue on macOS is not the issue. |
CGO is automatically disabled for cross building... |
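Whether cgo was actually enabled for a given build can be checked from inside the program instead of being inferred from the build environment. The sketch below assumes Go 1.18 or newer, where `runtime/debug.ReadBuildInfo` exposes build settings such as `CGO_ENABLED`. The helper `buildSetting` is a name invented for this example.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

// buildSetting looks up one key in the build settings embedded in the
// running binary. The second result is false when the key is absent,
// for example because the binary carries no build information.
func buildSetting(key string) (string, bool) {
	info, ok := debug.ReadBuildInfo()
	if !ok {
		return "", false
	}
	for _, s := range info.Settings {
		if s.Key == key {
			return s.Value, true
		}
	}
	return "", false
}

func main() {
	fmt.Printf("target: %s/%s\n", runtime.GOOS, runtime.GOARCH)
	if v, ok := buildSetting("CGO_ENABLED"); ok {
		fmt.Println("CGO_ENABLED =", v)
	} else {
		fmt.Println("CGO_ENABLED not recorded in this binary")
	}
}
```

Building this with `GOOS=darwin GOARCH=amd64 go build .` on Linux and running the result on a Mac should show what the toolchain chose when `CGO_ENABLED` was not set explicitly. For a binary that is already built, `go version -m <path-to-binary>` prints the same settings, provided the binary was built with a Go release that records them.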
This PR tracks upgrading the AzureRM Provider to use Go 1.18: hashicorp/terraform-provider-azurerm#15902. Whilst we can't commit to a timeframe for building/cross-compiling from macOS, this is something we've got planned fwiw |
Thanks for confirming, @tombuildsstuff. At this point then it seems like what remains for this issue is to determine why this problem appeared as the "Plugin did not respond" error, rather than as e.g. a DNS resolution error from the provider. We can see in the debug output that the Azure provider did start up, did try to make an outgoing request, and did get back a DNS resolution failure:
What should typically happen in that case is that the provider would return a similar error message back to Terraform Core and then Terraform Core would show it, but in this case it seems like the real error got swallowed somewhere and Terraform Core treated it as a generic timeout instead. That could either be a bug in Terraform Core or a bug in the provider. It would be a bug in Terraform Core if the provider did return the error but Terraform Core didn't handle it. It would be a bug in the provider if the provider itself swallowed the error and deadlocked itself, rather than reporting the error. Given that we don't yet have a way to reproduce this outside of the system where it was originally seen, I think a next step here would be to try to identify where in the Azure provider that error emerges and review how the provider handles it. If we can see a clear path from the Azure SDK (presumably) generating the error to the provider returning it then that would suggest that Terraform Core is the one responsible for the problem. |
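The two outcomes described above can be illustrated with a small standalone sketch. It is not Terraform Core or provider code: `callWithTimeout`, `describe`, `failsWithDNS`, `stuck`, and the hostname `management.example.invalid` are all hypothetical names made up for this example. One operation returns a wrapped `*net.DNSError`, which the caller can recognise with `errors.As` and report as a DNS failure. The other never returns, so the caller can only report a generic timeout, which is roughly what "Plugin did not respond" looks like from the outside.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"time"
)

// shortTimeout keeps the demonstration quick; real callers wait far longer.
const shortTimeout = 50 * time.Millisecond

// callWithTimeout runs op in a goroutine and returns its error, or a
// wrapped deadline error if op has not finished after d.
func callWithTimeout(d time.Duration, op func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(context.Background(), d)
	defer cancel()
	done := make(chan error, 1) // buffered so a late op never blocks on send
	go func() { done <- op(ctx) }()
	select {
	case err := <-done:
		return err
	case <-ctx.Done():
		return fmt.Errorf("operation did not respond: %w", ctx.Err())
	}
}

// describe turns an error into the message a caller might show.
func describe(err error) string {
	var dnsErr *net.DNSError
	switch {
	case err == nil:
		return "ok"
	case errors.As(err, &dnsErr):
		return "dns failure for " + dnsErr.Name
	case errors.Is(err, context.DeadlineExceeded):
		return "timeout"
	default:
		return "other error"
	}
}

// failsWithDNS models a well-behaved operation: it hits a DNS failure
// and returns it, wrapped, to the caller.
func failsWithDNS(context.Context) error {
	cause := &net.DNSError{Err: "no such host", Name: "management.example.invalid"}
	return fmt.Errorf("building client: %w", cause)
}

// stuck models an operation that swallows its error and never returns.
func stuck(context.Context) error {
	select {}
}

func main() {
	fmt.Println(describe(callWithTimeout(shortTimeout, failsWithDNS)))
	fmt.Println(describe(callWithTimeout(shortTimeout, stuck)))
}
```

In the first case the original cause survives wrapping and can be shown to the user. In the second case the information is lost before it reaches the caller, so no amount of error handling on the calling side can recover it, which is why tracing where the error goes inside the provider is a sensible next step.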
The latest release of the AzureRM Provider (3.1.0) builds using Go 1.18 - can you take a look and see if that solves this for you @mpigram? |
Hi again @mpigram! We didn't hear back from you after the request to try this with AzureRM Provider 3.1.0, so I'm going to close this under the assumption that this is no longer a problem for you. Reviewing the discussion above it seems to me that the only concrete problem we were able to establish was the Azure provider itself interacting with some Go standard library bugs, and the Azure provider team has attempted to fix the part of the problem identified above by building with a newer version of Go. There was also the question of whether there's a Terraform Core bug preventing the error from the provider from being shown as a real error rather than as a communication error, but we've not heard any other reports of similar problems in other situations and we don't have an isolated reproduction of it here, so I don't expect we'll be able to make any further progress on this as a Terraform Core issue. We've not yet confirmed that the provider-side problem is fixed, but if not then the Azure provider repository would be a better place to continue discussing that. If you're someone else finding this comment some time later because you've encountered an error with similar error text, I'd suggest starting by reporting an issue against the provider identified in the error message, in the provider's own GitHub repository. If a provider team is then able to use such a report as an example of Terraform Core swallowing an error diagnostic returned by the provider, I'd be grateful if that team would open a new issue in this repository showing the reproduction case, and then we'll investigate further. Thanks! |
I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. |
Community Note
Terraform (and AzureRM Provider) Version
Terraform v1.1.5
on darwin_amd64
Affected Resource(s)
Terraform Configuration Files
Debug Output
Panic Output
Expected Behaviour
An apply should start and then prompt for yes or no before deploying.
Actual Behaviour
It hangs on the apply command and I have to cancel twice to stop it. Then it outputs the error message I pasted in the panic output.
Steps to Reproduce
Connect to my company's VPN, run terraform apply.
terraform apply
Important Factoids
References