-
Notifications
You must be signed in to change notification settings - Fork 17.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
net: DNS broken on darwin without cgo (1.13 regression) #31705
Comments
My best guess is f6b42a5 ("net: use libSystem bindings for DNS resolution on macos if cgo is unavailable"). We might need some more test coverage. Or a no-cgo darwin builder. |
FWIW I can't seem to reproduce with a binary compiled on a 10.14.4 mac. I tested building with cgo disabled and setting |
I built on Linux and ran on a Mac, without setting any special environment variables. |
Could this be because when you cross compile on Linux the linker doesn't have access to libSystem? Not sure how this done for every other binding when there's cross compilation Also, this is with CGO enabled, netcgo not specified, cross compiled for darwin on linux? |
We just disabled cgo support for darwin/386 (per #31751) so we now have a CGO_ENABLED=0 Mac builder, which now hits this issue. Which is good in that we can reproduce it. Looks like it's stuck in DNS queries, so f6b42a5 looks implicated. https://build.golang.org/log/289a154e730768cccbc64dd0ea2af16b4b48db88
|
I don't think this should matter. We don't actually need access to |
Change https://golang.org/cl/174637 mentions this issue: |
Also remove dead nacl-arm. It hasn't run in ages. And update netbsd comment about why 386 doesn't run. And correct its VM image name. Updates golang/go#31705 Updates golang/go#31726 Change-Id: I9de4605f34a052d0a84684fca098388d75602a82 Reviewed-on: https://go-review.googlesource.com/c/build/+/174637 Reviewed-by: Brad Fitzpatrick <[email protected]>
This seems to be the same failure on the https://build.golang.org/log/d3e5f74f9924af70e34b2de05f64cb9cdfb41310 |
My theory is that this has to do with the build constraints here. Perhaps CgoEnabled isn't checked correctly? I know that cgo is disabled when cross compiling. Is reproducing this just cross compiling from macos to linux? |
You can reproduce this on a Mac, without cross compiling. Just build with CGO_ENABLED=0. |
Easy way to repro on a mac: Works:
Hangs:
The hang is somewhere inside The path that hangs is:
Note that the libSystem call to @randall77, any ideas? |
When I run the test above and look at logs I seem to run into error messages related to I wonder if the issue occurs when the binary is code signed. |
I didn't sign any binary. I'm not aware of any part of the Go build process that automatically signs binaries, either? |
I'm suggesting that a signed binary might succeed while an unsigned one will get blocked by the kernel. Sorry for the confusion. I'll test my theory. |
Can some share a reproducible |
(the code signing logs were not an issue) I modified |
Looking at the build tags. Lines 5 to 6 in 14491a2
but the stub which is getting invoked is the one with
which was added as part of f6b42a5#diff-612b16c746d269bd07f162f3fd6ea47eR6 The test is still expecting to reach This "fixes" the issue by making the tags match up:
I'm not sure what the correct solution is, but it doesn't seem like this test should pass on darwin anymore, since it's expecting the placeholder, not an actual function call.
|
If we run the test on
The @groob in addition to the build tag change, that test should also run if on darwin AND cgo is disabled AND netgo is enabled. Regardless it doesn't explain why res_search is hanging instead of causing the test to fail. Finally have time to investigate this now, sorry for being MIA. |
It's strange that a regular go program with the same calls as the test doesn't hang. Is anyone familiar with how go test binaries link/load differently from normal executables (if at all)? |
Still only affects test binaries. |
Been sidetracked on this, but I just tried a test with a test and regular binary each just calling The question for me right now is how can we get logs/debug the linker so we can figure out what the difference is between Anyone have particularly intimate knowledge of |
You can see exactly what the go tool is doing by using the |
@groob should the 10.15 crash be a separate bug? |
@rsc possibly, it's still the net.res_search call so I was making a note in here. 10.15 is still a very early beta, I'll keep this in mind and file bugs against Go as we get closer in the test cycle for macOS and file bugs then. |
When running a go build with
while
|
Change https://golang.org/cl/180779 mentions this issue: |
Change https://golang.org/cl/180781 mentions this issue: |
Using delve to debug C code is probably not a good strategy. (It's a Go debugger, not a C debugger.)
This makes pretty clear that the function has pushed a 64 kB stack frame onto the stack. So how big is the stack anyway?
That TODO is sounding pretty good. Sent CL 180779. After that CL I get
That seems to be something different. Also found that we were not passing a NUL-terminated string to res_search, but maybe it was accidentally NUL-terminated, because explicitly fixing that did not help. (CL 180781). |
I intermittently get |
@groob, the main thread was running with an 8 MB stack and all other threads were running with 64 kB. The intermittency was based on which stack you got. The 8 MB one would run to completion and return the error; the smaller ones would fault in C due to the stack overflow and then fault again in Go trying to understand the C fault. |
I thought y'all just wanted me to make the test not hang. Actually passing is a higher bar. Will try. :-) |
@rsc |
OK, so at least part of the problem is that if you write a C program calling res_search, you actually get symbol res_9_search from /usr/lib/libresolv.dylib, while we are using res_search from /usr/lib/system/libsystem_info.dylib. Does anyone know why we are using the libsystem_info version? |
@grantseltzer, I notice CL 166297 has no tests. What did you use to test it? Also, the CL claimed to be fixing #12524, which was a very old bug asking for support for the /etc/resolver directory. My Mac has no /etc/resolver directory. Also, the code that looks in /etc/resolver appears to be in libresolv, but as noted in my previous comment Go is using libsystem_info. I have not chased down whether libsystem_info ever ends up back in libresolv, but so far I've seen no evidence that it does. So it may be that the new code does not work with /etc/resolver. I can't tell. |
I don't think the res_search in /usr/lib/system/libsystem_info.dylib ever makes it to res_9_search in /usr/lib/libresolv.dylib. The following files all just reference each other, there's no path to libresolv.dylib from our root (libSystem.B.dylib):
|
It wouldn't be hard to try adding libresolv.dylib to our imports and renaming our resolver from res_search to res_9_search. |
Change https://golang.org/cl/180838 mentions this issue: |
I can't figure out what CL 166297 intended to fix or why it was important to improve anything in non-cgo-only mode. I have a bunch of fixes for it that make it resolve names successfully, which I will send out, but then I will send a CL deleting it entirely. If at some later point someone wants to bring it back, that's fine, provided they explain why. |
Change https://golang.org/cl/180843 mentions this issue: |
Change https://golang.org/cl/180842 mentions this issue: |
@rsc the idea is so that darwin dns logic can be used instead of the netgo library for when cgo is disabled. There's a lot of overlap but the The problem I see is that when distributing binaries to darwin hosts it's likely most common to disable cgo. This means that before CL 166297 tools like the ones by hashicorp would not support In terms of testing, shamefully I was just doing it manually but will gladly work to write up tests. I'm taking a look at your CLs now, thanks for the help and feedback! |
@grantseltzer, I don't believe non-cgo builds of such tools are working at all today; literally all DNS lookups seem to fail, not just ones involving /etc/resolver. Even once Go-side bugs are fixed, libsystem_info's res_search fails at PTR and CNAME queries; it may also not be thread safe; and it appears not to pay any attention to /etc/resolver. Perhaps switching to libresolv will help; perhaps not. If you'd like to reintroduce a revised version of the code for Go 1.14, that's fine. For Go 1.13, though, we'll revert things to the way they were in Go 1.12. Thanks. |
At least one libc call we make (res_search, which calls _mdns_query and then mdns_item_call) pushes a 64 kB stack frame onto the stack. Then it faults on the guard page. Use the default system stack size, under the assumption that the C code being called is compatible with that stack size. For #31705. Change-Id: I1b0bfc2e54043c49f0709255988ef920ce30ee82 Reviewed-on: https://go-review.googlesource.com/c/go/+/180779 Run-TryBot: Russ Cox <[email protected]> Reviewed-by: Ian Lance Taylor <[email protected]>
@rsc although your changes in 180842 make a lot of sense to me, i'm quite sure that they at least work enough to honor the With that in mind, it is a lot of added complexity for a corner case. I'll investigate to see how else I can get this to work, even if you are removing the code for now. Thanks to you as well. |
This code was added in April in CL 166297, for #12524. This CL fixes the following problems in the code: - The test for failure in the assembly stubs checked for 64-bit -1 instead of 32-bit -1 to decide to fetch errno. - These C routines (res_init and res_search) don't set errno anyway, so the Go code using errno to decide success is incorrect. (The routines set h_errno, which is a racy global variable that can't safely be consulted, storing values in a different error space.) - The Go call passed res_search a non-NUL-terminated name. - The C res_search rejects calls asking for TypeALL as opposed to more specific answers like TypeA/TypeAAAA/TypeCNAME, breaking cgoLookupHost in all cases and cgoLookupIP except with IP-version-specific networks. - The DNS response packet was parsed twice, once with msg.Unpack (discarded), and once with the lower-level dnsmessage.Parser. The Parser loop was missing a call to p.SkipAllQuestions, with the result that no DNS response packet would ever parse successfully. - The parsing of the DNS response answers, if reached, behaved as if that the AResource and AAAAResource record contained textual IP addresses, while in fact they contain binary ones. The calls to parseIPv4 and parseIPv6 therefore would always returns nil, so that no useful result would be returned from the resolver. With these fixes, cgoLookupIP can correctly resolve google.com and return both the A and AAAA addresses. Even after fixing all these things, TestGoLookupIP still fails, because it is testing that in non-cgo builds the cgo stubs correctly report "I can't handle the lookup", and as written the code intentionally violates that expectation. This CL adds new direct tests of the pseudo-cgo routines. The direct IP address lookups succeed, but the CNAME query causes res_search to hang, and the PTR query fails unconditionally (a trivial C program confirms these behaviors are due to res_search itself). Traditionally, res_search is only intended for single-threaded use. It is unclear whether this one is safe for use from multiple goroutines. If you run net.test under lldb, that causes syslog messages to be printed to standard error suggesting double-free bugs: 2019-06-05 19:52:43.505246-0400 net.test[6256:6831076] dnssd_clientstub DNSServiceRefDeallocate called with invalid DNSServiceRef 0x5c000f0 FFFFFFFF DDDDDDDD 2019-06-05 19:52:43.505274-0400 net.test[6256:6831076] dnssd_clientstub DNSServiceRefDeallocate called with invalid DNSServiceRef 0x5c000f0 FFFFFFFF DDDDDDDD 2019-06-05 19:52:43.505303-0400 net.test[6256:6831076] dnssd_clientstub DNSServiceRefDeallocate called with invalid DNSServiceRef 0x5c000f0 FFFFFFFF DDDDDDDD 2019-06-05 19:52:43.505329-0400 net.test[6256:6831076] dnssd_clientstub DNSServiceRefDeallocate called with invalid DNSServiceRef 0x5c000f0 FFFFFFFF DDDDDDDD This res_search is from libsystem_info; a normal C program would get res_search (#defined to res_9_search) from libresolv instead. It is unclear what the relation between the two is. Issue #12524 was about supporting the /etc/resolver directory tree, but only libresolv contains code for that; libsystem_info does not. So this code probably does not enable use of /etc/resolver. In short: - Before this CL, the code clearly had never run successfully. - The code appears not to improve upon the usual non-cgo fallback. - The code carries with it no tests of improved behavior. - The code breaks existing tests. - Calling res_search does not work for PTR/CNAME queries, so the code breaks existing behavior, even after this CL. - It's unclear whether res_search is safe to call from multiple threads. - It's unclear whether res_search is used by any other macOS programs. Given this, it probably makes sense to delete this code rather than rejigger the test. This CL fixes the code first, so that there is a working copy to bring back later if we find out that it really is necessary. For #31705. Change-Id: Id2e11e8ade43098b0f90dd4d16a62ca86a7a244a Reviewed-on: https://go-review.googlesource.com/c/go/+/180842 Run-TryBot: Russ Cox <[email protected]> Reviewed-by: Keith Randall <[email protected]>
I was testing some new code for the Go build system and found that a simple TCP dial doesn't work on Mac anymore, at least when the binary is cross-compiled.
Code is just:
... with a context.Background() for ctx.
It always times out after 10 seconds.
But if I redeploy the same code but built with Go 1.12.x, it works fine.
The text was updated successfully, but these errors were encountered: