Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net: DNS broken on darwin without cgo (1.13 regression) #31705

Closed
bradfitz opened this issue Apr 26, 2019 · 52 comments
Closed

net: DNS broken on darwin without cgo (1.13 regression) #31705

bradfitz opened this issue Apr 26, 2019 · 52 comments
Labels
FrozenDueToAge help wanted NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-Darwin release-blocker
Milestone

Comments

@bradfitz
Copy link
Contributor

I was testing some new code for the Go build system and found that a simple TCP dial doesn't work on Mac anymore, at least when the binary is cross-compiled.

Code is just:

var coordDialer = &net.Dialer{
        Timeout:   10 * time.Second,
        KeepAlive: 15 * time.Second,
}       

// dialCoordinatorTCP returns a TCP connection to the coordinator, making                                                                                                                                                                                          
// a CONNECT request to a proxy as a fallback.                                                                                                                                                                                                                     
func dialCoordinatorTCP(ctx context.Context, addr string) (net.Conn, error) {
        tcpConn, err := coordDialer.DialContext(ctx, "tcp", addr)

... with a context.Background() for ctx.

It always times out after 10 seconds.

But if I redeploy the same code but built with Go 1.12.x, it works fine.

@bradfitz bradfitz added help wanted OS-Darwin NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. release-blocker labels Apr 26, 2019
@bradfitz bradfitz added this to the Go1.13 milestone Apr 26, 2019
@bradfitz
Copy link
Contributor Author

My best guess is f6b42a5 ("net: use libSystem bindings for DNS resolution on macos if cgo is unavailable").

We might need some more test coverage. Or a no-cgo darwin builder.

/cc @ianlancetaylor @grantseltzer

@bradfitz bradfitz changed the title net: buildlet doesn't work on darwin-amd64 with Go master net: cross-compiled cgo-less buildlet doesn't work on darwin-amd64 with Go master Apr 26, 2019
@groob
Copy link
Contributor

groob commented Apr 28, 2019

FWIW I can't seem to reproduce with a binary compiled on a 10.14.4 mac. I tested building with cgo disabled and setting GODEBUG=netdns=go.

@bradfitz
Copy link
Contributor Author

I built on Linux and ran on a Mac, without setting any special environment variables.

@grantseltzer
Copy link
Contributor

grantseltzer commented Apr 29, 2019

Could this be because when you cross compile on Linux the linker doesn't have access to libSystem?

Not sure how this done for every other binding when there's cross compilation

Also, this is with CGO enabled, netcgo not specified, cross compiled for darwin on linux?

@bradfitz
Copy link
Contributor Author

@randall77?

@bradfitz
Copy link
Contributor Author

We just disabled cgo support for darwin/386 (per #31751) so we now have a CGO_ENABLED=0 Mac builder, which now hits this issue. Which is good in that we can reproduce it.

Looks like it's stuck in DNS queries, so f6b42a5 looks implicated.

https://build.golang.org/log/289a154e730768cccbc64dd0ea2af16b4b48db88

ok  	mime	0.017s
ok  	mime/multipart	0.365s
ok  	mime/quotedprintable	0.112s
panic: test timed out after 3m0s

goroutine 343 [running]:
testing.(*M).startAlarm.func1()
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/testing/testing.go:1380 +0xc5
created by time.goFunc
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/time/sleep.go:169 +0x31

goroutine 1 [chan receive, 2 minutes]:
testing.(*T).Run(0x11720f00, 0x239388, 0xf, 0x2482c8, 0x1)
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/testing/testing.go:964 +0x2c5
testing.runTests.func1(0x114da000)
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/testing/testing.go:1205 +0x54
testing.tRunner(0x114da000, 0x11498f10)
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/testing/testing.go:912 +0x90
testing.runTests(0x114a8020, 0x3eef00, 0xdf, 0xdf, 0x0)
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/testing/testing.go:1203 +0x227
testing.(*M).Run(0x11474200, 0x0)
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/testing/testing.go:1120 +0x111
net.TestMain(0x11474200)
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/net/main_test.go:52 +0x25
main.main()
	_testmain.go:552 +0xfa

goroutine 612 [chan receive, 2 minutes]:
testing.(*T).Parallel(0x11720aa0)
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/testing/testing.go:817 +0x18d
net.TestLookupGoogleSRV(0x11720aa0)
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/net/lookup_test.go:70 +0x29
testing.tRunner(0x11720aa0, 0x2482f0)
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/testing/testing.go:912 +0x90
created by testing.(*T).Run
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/testing/testing.go:963 +0x2a6

goroutine 613 [chan receive, 2 minutes]:
testing.(*T).Parallel(0x11720b40)
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/testing/testing.go:817 +0x18d
net.TestLookupGmailMX(0x11720b40)
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/net/lookup_test.go:119 +0x1f
testing.tRunner(0x11720b40, 0x2482d8)
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/testing/testing.go:912 +0x90
created by testing.(*T).Run
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/testing/testing.go:963 +0x2a6

goroutine 614 [chan receive, 2 minutes]:
testing.(*T).Parallel(0x11720be0)
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/testing/testing.go:817 +0x18d
net.TestLookupGmailNS(0x11720be0)
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/net/lookup_test.go:165 +0x1f
testing.tRunner(0x11720be0, 0x2482dc)
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/testing/testing.go:912 +0x90
created by testing.(*T).Run
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/testing/testing.go:963 +0x2a6

goroutine 615 [chan receive, 2 minutes]:
testing.(*T).Parallel(0x11720c80)
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/testing/testing.go:817 +0x18d
net.TestLookupGmailTXT(0x11720c80)
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/net/lookup_test.go:214 +0x29
testing.tRunner(0x11720c80, 0x2482e0)
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/testing/testing.go:912 +0x90
created by testing.(*T).Run
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/testing/testing.go:963 +0x2a6

goroutine 619 [running]:
	goroutine running on other thread; stack unavailable
created by testing.(*T).Run
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/testing/testing.go:963 +0x2a6
FAIL	net	180.028s

@randall77
Copy link
Contributor

Could this be because when you cross compile on Linux the linker doesn't have access to libSystem?

I don't think this should matter. We don't actually need access to libSystem to build a binary which dynamically links to it. Building on Linux and running on a Mac should work fine with regards to this feature.

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/174637 mentions this issue: dashboard: add darwin-amd64-nocgo config, remove nacl-386 trybot

gopherbot pushed a commit to golang/build that referenced this issue Apr 30, 2019
Also remove dead nacl-arm. It hasn't run in ages.

And update netbsd comment about why 386 doesn't run. And correct its
VM image name.

Updates golang/go#31705
Updates golang/go#31726

Change-Id: I9de4605f34a052d0a84684fca098388d75602a82
Reviewed-on: https://go-review.googlesource.com/c/build/+/174637
Reviewed-by: Brad Fitzpatrick <[email protected]>
@bcmills
Copy link
Contributor

bcmills commented May 6, 2019

This seems to be the same failure on the darwin-386-10_14 buildlet; retitling accordingly.

https://build.golang.org/log/d3e5f74f9924af70e34b2de05f64cb9cdfb41310

@bcmills bcmills changed the title net: cross-compiled cgo-less buildlet doesn't work on darwin-amd64 with Go master net: cross-compiled cgo-less buildlet doesn't work on darwin with Go master May 6, 2019
@grantseltzer
Copy link
Contributor

My theory is that this has to do with the build constraints here. Perhaps CgoEnabled isn't checked correctly? I know that cgo is disabled when cross compiling.

Is reproducing this just cross compiling from macos to linux?

@bradfitz bradfitz changed the title net: cross-compiled cgo-less buildlet doesn't work on darwin with Go master net: DNS broken on darwin without cgo (1.13 regression) May 6, 2019
@bradfitz
Copy link
Contributor Author

bradfitz commented May 6, 2019

Is reproducing this just cross compiling from macos to linux?

You can reproduce this on a Mac, without cross compiling. Just build with CGO_ENABLED=0.

@bradfitz
Copy link
Contributor Author

Easy way to repro on a mac:

Works:

barloga5k:net $ CGO_ENABLED=1 go test -v -short -run=TestGoLookupIP net 
=== RUN   TestGoLookupIPWithResolverConfig
--- PASS: TestGoLookupIPWithResolverConfig (0.03s)
=== RUN   TestGoLookupIPOrderFallbackToFile
--- PASS: TestGoLookupIPOrderFallbackToFile (0.00s)
PASS
ok  	net

Hangs:

barloga5k:net $ CGO_ENABLED=0 go test -v -short -run=TestGoLookupIP net 
=== RUN   TestGoLookupIPWithResolverConfig
--- PASS: TestGoLookupIPWithResolverConfig (0.03s)
=== RUN   TestGoLookupIPOrderFallbackToFile
--- PASS: TestGoLookupIPOrderFallbackToFile (0.00s)
=== RUN   TestGoLookupIP

(hang)

^\SIGQUIT: quit
PC=0x7fff7b71c1b2 m=0 sigcode=0

goroutine 0 [idle]:
runtime.pthread_cond_wait(0x149b788, 0x149b748, 0x7ffe00000000)
	/Users/bradfitz/go/src/runtime/sys_darwin.go:368 +0x39
runtime.semasleep(0xffffffffffffffff, 0x104afac)
	/Users/bradfitz/go/src/runtime/os_darwin.go:63 +0x85
runtime.notesleep(0x149b548)
	/Users/bradfitz/go/src/runtime/lock_sema.go:173 +0xe0
runtime.stopm()
	/Users/bradfitz/go/src/runtime/proc.go:1928 +0xc0
runtime.findrunnable(0xc00001e000, 0x0)
	/Users/bradfitz/go/src/runtime/proc.go:2391 +0x53f
runtime.schedule()
	/Users/bradfitz/go/src/runtime/proc.go:2524 +0x2be
runtime.park_m(0xc000062600)
	/Users/bradfitz/go/src/runtime/proc.go:2610 +0x9d
runtime.mcall(0x105a796)
	/Users/bradfitz/go/src/runtime/asm_amd64.s:318 +0x5b

goroutine 1 [chan receive]:
testing.(*T).Run(0xc000116200, 0x12ad59e, 0xe, 0x12bcfc0, 0x10a9c01)
	/Users/bradfitz/go/src/testing/testing.go:964 +0x377
testing.runTests.func1(0xc000116000)
	/Users/bradfitz/go/src/testing/testing.go:1210 +0x78
testing.tRunner(0xc000116000, 0xc00007ae08)
	/Users/bradfitz/go/src/testing/testing.go:912 +0xbf
testing.runTests(0xc00000c040, 0x14972c0, 0xdf, 0xdf, 0x0)
	/Users/bradfitz/go/src/testing/testing.go:1208 +0x2a7
testing.(*M).Run(0xc0000de000, 0x0)
	/Users/bradfitz/go/src/testing/testing.go:1125 +0x160
net.TestMain(0xc0000de000)
	/Users/bradfitz/go/src/net/main_test.go:52 +0x39
main.main()
	_testmain.go:554 +0x135

goroutine 73 [running]:
	goroutine running on other thread; stack unavailable
created by testing.(*T).Run
	/Users/bradfitz/go/src/testing/testing.go:963 +0x350

rax    0x104
rbx    0x2
rcx    0x7ffeefbff548
rdx    0xb00
rdi    0x149b788
rsi    0x290100002a00
rbp    0x7ffeefbff5d0
rsp    0x7ffeefbff548
r8     0x0
r9     0xa0
r10    0x0
r11    0x202
r12    0x149b788
r13    0x16
r14    0x290100002a00
r15    0x882f5c0
rip    0x7fff7b71c1b2
rflags 0x203
cs     0x7
fs     0x0
gs     0x0
FAIL	net	13.181s
FAIL

The hang is somewhere inside res_search.

The path that hangs is:

netgo_unix_test.go => cgo_darwin_stub.go : func cgoLookupIP (misleading name, supposed to only conditionally use cgo) => func resolverGetResources => res_search (defined in runtime).

Note that the libSystem call to res_init does succeed, at least, so it's not some libcCall cgo_import_dynamic problem in general.

@randall77, any ideas?

@groob
Copy link
Contributor

groob commented May 15, 2019

When I run the test above and look at logs I seem to run into error messages related to /var/db/DetachedSignatures, which is usually a code signing check. Similarly MacOS error: -67062.

I wonder if the issue occurs when the binary is code signed.

@bradfitz
Copy link
Contributor Author

I didn't sign any binary. I'm not aware of any part of the Go build process that automatically signs binaries, either?

@groob
Copy link
Contributor

groob commented May 15, 2019

I'm suggesting that a signed binary might succeed while an unsigned one will get blocked by the kernel. Sorry for the confusion.

I'll test my theory.

@groob
Copy link
Contributor

groob commented May 15, 2019

Can some share a reproducible func main example? I can reproduce by running the test, but I've tried multiple combinations of CGO_ENABLED and GODEBUG=netdns and I can't get the issue to show up that way.

@groob
Copy link
Contributor

groob commented May 15, 2019

(the code signing logs were not an issue)

I modified runtime/lookup_darwin.go to add a println and I can see that I'm actually getting to the res_search function, but the function returns successfully an my binary works fine.

@groob
Copy link
Contributor

groob commented May 15, 2019

Looking at the build tags.
The failing test is set to have !cgo netgo and darwin,

// +build !cgo netgo
// +build darwin dragonfly freebsd linux netbsd openbsd solaris

but the stub which is getting invoked is the one with

// +build !netgo,!cgo
// +build darwin

which was added as part of f6b42a5#diff-612b16c746d269bd07f162f3fd6ea47eR6

The test is still expecting to reach cgo_stub.go which now has !darwin

This "fixes" the issue by making the tags match up:

diff --git a/src/net/netgo_unix_test.go b/src/net/netgo_unix_test.go
index c672d3e8eb..3cd85d2ccd 100644
--- a/src/net/netgo_unix_test.go
+++ b/src/net/netgo_unix_test.go
@@ -3,7 +3,7 @@
 // license that can be found in the LICENSE file.

 // +build !cgo netgo
-// +build darwin dragonfly freebsd linux netbsd openbsd solaris
+// +build !darwin dragonfly freebsd linux netbsd openbsd solaris

 package net

I'm not sure what the correct solution is, but it doesn't seem like this test should pass on darwin anymore, since it's expecting the placeholder, not an actual function call.

	_, err, ok := cgoLookupIP(ctx, "ip", host)
	if ok {
		t.Errorf("cgoLookupIP must be a placeholder")

@grantseltzer
Copy link
Contributor

grantseltzer commented May 21, 2019

If we run the test on darwin with:

CGO_ENABLED=0 go test -v -short -run=TestGoLookupIP net

The cgoLookupIP in cgo_darwin_stub.go is being built which is correct.

@groob in addition to the build tag change, that test should also run if on darwin AND cgo is disabled AND netgo is enabled.

Regardless it doesn't explain why res_search is hanging instead of causing the test to fail. Finally have time to investigate this now, sorry for being MIA.

@grantseltzer
Copy link
Contributor

It's strange that a regular go program with the same calls as the test doesn't hang. Is anyone familiar with how go test binaries link/load differently from normal executables (if at all)?

@grantseltzer
Copy link
Contributor

Stepping through with delve shows that I can't get past PUSHQ BP in res_search_trampoline

Screen Shot 2019-05-22 at 3 23 14 PM

Only appears to be one thread in use as well.

dtruss shows a loop of __semwait_signal and then psynch_cvwait

@groob
Copy link
Contributor

groob commented Jun 3, 2019

Still only affects test binaries.

@grantseltzer
Copy link
Contributor

Been sidetracked on this, but I just tried a test with a test and regular binary each just calling net.LookupIP (with proper build flags/cgo disabled). _res_search, the linked routine, isn't in the test binary but it is in the regular one. This pretty much confirms the test binaries are not linking properly.

The question for me right now is how can we get logs/debug the linker so we can figure out what the difference is between go test -c and go build.

Anyone have particularly intimate knowledge of go test ?

@ianlancetaylor
Copy link
Member

You can see exactly what the go tool is doing by using the -x option.

@rsc
Copy link
Contributor

rsc commented Jun 5, 2019

@groob should the 10.15 crash be a separate bug?

@groob
Copy link
Contributor

groob commented Jun 5, 2019

@rsc possibly, it's still the net.res_search call so I was making a note in here.

10.15 is still a very early beta, I'll keep this in mind and file bugs against Go as we get closer in the test cycle for macOS and file bugs then.

@groob
Copy link
Contributor

groob commented Jun 5, 2019

When running a go build with -x I see that the go test output has the following output:

golang/pkg/tool/darwin_amd64/link -o $WORK/b001/net.test -importcfg $WORK/b001/importcfg.link -s -w -buildmode=exe

while go build is missing the -s -w arguments.

golang/pkg/tool/darwin_amd64/link -o $WORK/b001/exe/a.out -importcfg $WORK/b001/importcfg.link -buildmode=exe 

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/180779 mentions this issue: runtime: use default system stack size, not 64 kB, on non-cgo macOS

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/180781 mentions this issue: net: pass NUL-terminated host name to C res_search call on macOS

@rsc
Copy link
Contributor

rsc commented Jun 5, 2019

Using delve to debug C code is probably not a good strategy. (It's a Go debugger, not a C debugger.)

$ lldb net.test
(lldb) target create "net.test"
Current executable set to 'net.test' (x86_64).
(lldb) run -test.v -test.short -test.run=GoLookupIP
Process 71117 launched: '/Users/rsc/go/src/net/net.test' (x86_64)
Process 71117 stopped
* thread #2, stop reason = EXC_BAD_ACCESS (code=2, address=0x70000be5fd70)
    frame #0: 0x00007fff6eaae081 libsystem_info.dylib`mdns_item_call + 33
libsystem_info.dylib`mdns_item_call:
->  0x7fff6eaae081 <+33>: movq   %rdi, -0x10020(%rbp)
    0x7fff6eaae088 <+40>: movl   %esi, -0x10024(%rbp)
    0x7fff6eaae08e <+46>: movq   %rdx, -0x10030(%rbp)
    0x7fff6eaae095 <+53>: movq   %rcx, -0x10038(%rbp)
Target 0: (net.test) stopped.
(lldb) bt
* thread #2, stop reason = EXC_BAD_ACCESS (code=2, address=0x70000be5fd70)
  * frame #0: 0x00007fff6eaae081 libsystem_info.dylib`mdns_item_call + 33
    frame #1: 0x00007fff6eab8b79 libsystem_info.dylib`_mdns_query + 169
    frame #2: 0x00007fff6eab8d96 libsystem_info.dylib`res_search + 54
    frame #3: 0x000000000405ea1d net.test`runtime.res_search_trampoline + 29
    frame #4: 0x000000000405d3e0 net.test`runtime.asmcgocall + 112
    frame #5: 0x000000000400bd94 net.test`net.res_search + 180
    frame #6: 0x0000000004139418 net.test`net.resolverGetResources + 376
    frame #7: 0x0000000004138e21 net.test`net.cgoLookupIP + 481
    frame #8: 0x00000000041c7fdf net.test`net.TestGoLookupIP + 207
    ...
(lldb) disas
libsystem_info.dylib`mdns_item_call:
    0x7fff6eaae060 <+0>:   pushq  %rbp
    0x7fff6eaae061 <+1>:   movq   %rsp, %rbp
    0x7fff6eaae064 <+4>:   subq   $0x10180, %rsp            ; imm = 0x10180 
    0x7fff6eaae06b <+11>:  movq   0x18(%rbp), %rax
    0x7fff6eaae06f <+15>:  movl   0x10(%rbp), %r10d
    0x7fff6eaae073 <+19>:  movq   0x3687af9e(%rip), %r11    ; (void *)0x00007fffa531d070: __stack_chk_guard
    0x7fff6eaae07a <+26>:  movq   (%r11), %r11
    0x7fff6eaae07d <+29>:  movq   %r11, -0x8(%rbp)
->  0x7fff6eaae081 <+33>:  movq   %rdi, -0x10020(%rbp)
    0x7fff6eaae088 <+40>:  movl   %esi, -0x10024(%rbp)
    0x7fff6eaae08e <+46>:  movq   %rdx, -0x10030(%rbp)
    0x7fff6eaae095 <+53>:  movq   %rcx, -0x10038(%rbp)

This makes pretty clear that the function has pushed a 64 kB stack frame onto the stack. So how big is the stack anyway?

	// Set the stack size we want to use.  64KB for now.
	// TODO: just use OS default size?
	const stackSize = 1 << 16
	if pthread_attr_setstacksize(&attr, stackSize) != 0 {
		write(2, unsafe.Pointer(&failthreadcreate[0]), int32(len(failthreadcreate)))
		exit(1)
	}
	//mSysStatInc(&memstats.stacks_sys, stackSize) //TODO: do this?

That TODO is sounding pretty good. Sent CL 180779.

After that CL I get

--- FAIL: TestGoLookupIP (0.20s)
    netgo_unix_test.go:25: parsing/packing of this type isn't available yet
FAIL

That seems to be something different.

Also found that we were not passing a NUL-terminated string to res_search, but maybe it was accidentally NUL-terminated, because explicitly fixing that did not help. (CL 180781).

@groob
Copy link
Contributor

groob commented Jun 5, 2019

I intermittently get netgo_unix_test.go:25: parsing/packing of this type isn't available yet while running this test even before the changes in the CL above.

@rsc
Copy link
Contributor

rsc commented Jun 5, 2019

@groob, the main thread was running with an 8 MB stack and all other threads were running with 64 kB. The intermittency was based on which stack you got. The 8 MB one would run to completion and return the error; the smaller ones would fault in C due to the stack overflow and then fault again in Go trying to understand the C fault.

@rsc
Copy link
Contributor

rsc commented Jun 5, 2019

I thought y'all just wanted me to make the test not hang. Actually passing is a higher bar. Will try. :-)

@zchee
Copy link
Contributor

zchee commented Jun 5, 2019

@rsc
FYI, RES_DEBUG='debug' CGO_ENABLED=0 go test -v ... will display the _mdns_search log.(but just a little)

@rsc
Copy link
Contributor

rsc commented Jun 5, 2019

OK, so at least part of the problem is that if you write a C program calling res_search, you actually get symbol res_9_search from /usr/lib/libresolv.dylib, while we are using res_search from /usr/lib/system/libsystem_info.dylib.

Does anyone know why we are using the libsystem_info version?

@rsc
Copy link
Contributor

rsc commented Jun 5, 2019

@grantseltzer, I notice CL 166297 has no tests. What did you use to test it?

Also, the CL claimed to be fixing #12524, which was a very old bug asking for support for the /etc/resolver directory. My Mac has no /etc/resolver directory. Also, the code that looks in /etc/resolver appears to be in libresolv, but as noted in my previous comment Go is using libsystem_info. I have not chased down whether libsystem_info ever ends up back in libresolv, but so far I've seen no evidence that it does. So it may be that the new code does not work with /etc/resolver. I can't tell.

@randall77
Copy link
Contributor

I don't think the res_search in /usr/lib/system/libsystem_info.dylib ever makes it to res_9_search in /usr/lib/libresolv.dylib. The following files all just reference each other, there's no path to libresolv.dylib from our root (libSystem.B.dylib):

	/usr/lib/libobjc.A.dylib
	/usr/lib/libc++abi.dylib
	/usr/lib/libc++.1.dylib
	/usr/lib/libSystem.B.dylib
        /usr/lib/system/*.dylib

@randall77
Copy link
Contributor

It wouldn't be hard to try adding libresolv.dylib to our imports and renaming our resolver from res_search to res_9_search.

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/180838 mentions this issue: net: skip questions before parsing answers

@rsc
Copy link
Contributor

rsc commented Jun 5, 2019

I can't figure out what CL 166297 intended to fix or why it was important to improve anything in non-cgo-only mode. I have a bunch of fixes for it that make it resolve names successfully, which I will send out, but then I will send a CL deleting it entirely. If at some later point someone wants to bring it back, that's fine, provided they explain why.

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/180843 mentions this issue: net: remove non-cgo macOS resolver code

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/180842 mentions this issue: net: fix non-cgo macOS resolver code

@grantseltzer
Copy link
Contributor

@rsc the idea is so that darwin dns logic can be used instead of the netgo library for when cgo is disabled. There's a lot of overlap but the /etc/resolver files is the particular example.

The problem I see is that when distributing binaries to darwin hosts it's likely most common to disable cgo. This means that before CL 166297 tools like the ones by hashicorp would not support /etc/resolver files. At my company and those of others I talked to in person and on slack this was an issue.

In terms of testing, shamefully I was just doing it manually but will gladly work to write up tests. I'm taking a look at your CLs now, thanks for the help and feedback!

@rsc
Copy link
Contributor

rsc commented Jun 6, 2019

@grantseltzer, I don't believe non-cgo builds of such tools are working at all today; literally all DNS lookups seem to fail, not just ones involving /etc/resolver. Even once Go-side bugs are fixed, libsystem_info's res_search fails at PTR and CNAME queries; it may also not be thread safe; and it appears not to pay any attention to /etc/resolver. Perhaps switching to libresolv will help; perhaps not. If you'd like to reintroduce a revised version of the code for Go 1.14, that's fine. For Go 1.13, though, we'll revert things to the way they were in Go 1.12. Thanks.

gopherbot pushed a commit that referenced this issue Jun 6, 2019
At least one libc call we make
(res_search, which calls _mdns_query and then mdns_item_call)
pushes a 64 kB stack frame onto the stack.
Then it faults on the guard page.

Use the default system stack size, under the assumption
that the C code being called is compatible with that stack size.

For #31705.

Change-Id: I1b0bfc2e54043c49f0709255988ef920ce30ee82
Reviewed-on: https://go-review.googlesource.com/c/go/+/180779
Run-TryBot: Russ Cox <[email protected]>
Reviewed-by: Ian Lance Taylor <[email protected]>
@grantseltzer
Copy link
Contributor

@rsc although your changes in 180842 make a lot of sense to me, i'm quite sure that they at least work enough to honor the /etc/resolver files. The thing I don't understand is that they're only working in non-test files. That behavior is worth exploring.

With that in mind, it is a lot of added complexity for a corner case. I'll investigate to see how else I can get this to work, even if you are removing the code for now. Thanks to you as well.

gopherbot pushed a commit that referenced this issue Jun 6, 2019
This code was added in April in CL 166297, for #12524.
This CL fixes the following problems in the code:

 - The test for failure in the assembly stubs checked for
   64-bit -1 instead of 32-bit -1 to decide to fetch errno.

 - These C routines (res_init and res_search) don't set errno anyway,
   so the Go code using errno to decide success is incorrect.
   (The routines set h_errno, which is a racy global variable
   that can't safely be consulted, storing values in a different
   error space.)

 - The Go call passed res_search a non-NUL-terminated name.

 - The C res_search rejects calls asking for TypeALL as opposed to
   more specific answers like TypeA/TypeAAAA/TypeCNAME,
   breaking cgoLookupHost in all cases and cgoLookupIP
   except with IP-version-specific networks.

 - The DNS response packet was parsed twice, once with msg.Unpack
   (discarded), and once with the lower-level dnsmessage.Parser.
   The Parser loop was missing a call to p.SkipAllQuestions, with the
   result that no DNS response packet would ever parse successfully.

 - The parsing of the DNS response answers, if reached, behaved as if
   that the AResource and AAAAResource record contained textual
   IP addresses, while in fact they contain binary ones. The calls to
   parseIPv4 and parseIPv6 therefore would always returns nil,
   so that no useful result would be returned from the resolver.

With these fixes, cgoLookupIP can correctly resolve google.com
and return both the A and AAAA addresses.

Even after fixing all these things, TestGoLookupIP still fails,
because it is testing that in non-cgo builds the cgo stubs
correctly report "I can't handle the lookup", and as written the
code intentionally violates that expectation.

This CL adds new direct tests of the pseudo-cgo routines.
The direct IP address lookups succeed, but the CNAME query
causes res_search to hang, and the PTR query fails unconditionally
(a trivial C program confirms these behaviors are due to res_search itself).

Traditionally, res_search is only intended for single-threaded use.
It is unclear whether this one is safe for use from multiple goroutines.
If you run net.test under lldb, that causes syslog messages to be
printed to standard error suggesting double-free bugs:

	2019-06-05 19:52:43.505246-0400 net.test[6256:6831076] dnssd_clientstub DNSServiceRefDeallocate called with invalid DNSServiceRef 0x5c000f0 FFFFFFFF DDDDDDDD
	2019-06-05 19:52:43.505274-0400 net.test[6256:6831076] dnssd_clientstub DNSServiceRefDeallocate called with invalid DNSServiceRef 0x5c000f0 FFFFFFFF DDDDDDDD
	2019-06-05 19:52:43.505303-0400 net.test[6256:6831076] dnssd_clientstub DNSServiceRefDeallocate called with invalid DNSServiceRef 0x5c000f0 FFFFFFFF DDDDDDDD
	2019-06-05 19:52:43.505329-0400 net.test[6256:6831076] dnssd_clientstub DNSServiceRefDeallocate called with invalid DNSServiceRef 0x5c000f0 FFFFFFFF DDDDDDDD

This res_search is from libsystem_info; a normal C program would
get res_search (#defined to res_9_search) from libresolv instead.
It is unclear what the relation between the two is.
Issue #12524 was about supporting the /etc/resolver directory tree,
but only libresolv contains code for that; libsystem_info does not.
So this code probably does not enable use of /etc/resolver.

In short:

 - Before this CL, the code clearly had never run successfully.
 - The code appears not to improve upon the usual non-cgo fallback.
 - The code carries with it no tests of improved behavior.
 - The code breaks existing tests.
 - Calling res_search does not work for PTR/CNAME queries,
   so the code breaks existing behavior, even after this CL.
 - It's unclear whether res_search is safe to call from multiple threads.
 - It's unclear whether res_search is used by any other macOS programs.

Given this, it probably makes sense to delete this code rather
than rejigger the test. This CL fixes the code first, so that there
is a working copy to bring back later if we find out that it really
is necessary.

For #31705.

Change-Id: Id2e11e8ade43098b0f90dd4d16a62ca86a7a244a
Reviewed-on: https://go-review.googlesource.com/c/go/+/180842
Run-TryBot: Russ Cox <[email protected]>
Reviewed-by: Keith Randall <[email protected]>
@golang golang locked and limited conversation to collaborators Jun 5, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge help wanted NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-Darwin release-blocker
Projects
None yet
Development

No branches or pull requests

10 participants