Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/go: tests timing out on aix builder #29065

Closed
tklauser opened this issue Dec 2, 2018 · 19 comments
Closed

cmd/go: tests timing out on aix builder #29065

tklauser opened this issue Dec 2, 2018 · 19 comments
Labels
Builders x/build issues (builders, bots, dashboards) FrozenDueToAge GoCommand cmd/go OS-AIX Testing An issue that has been verified to require only test changes, not just a test failure.
Milestone

Comments

@tklauser
Copy link
Member

tklauser commented Dec 2, 2018

The cmd/go tests are timing out on the aix/ppc4 builder:

https://build.golang.org/log/86480370eb2fba22d0f47458ad1fecf8ce9beea7

It looked similar to the Plan 9 failures reported in #29033 at first, but @bcmills suggested to track this in a different issue as the lockedfile implementation for plan9 is different from the other platforms.

@Helflym mentioned in #29033 (comment) that the timeout is only triggered on the builder, but not on a fresh install fetched from git. So there might be some issue with the builder.

Also, it seems the last few commits haven't been built by the aix/ppc64 builder which also indicates some issue with the aix builder.

/cc @bradfitz

@tklauser tklauser added Builders x/build issues (builders, bots, dashboards) OS-AIX labels Dec 2, 2018
@Helflym
Copy link
Contributor

Helflym commented Dec 3, 2018

First, I've opened issue 29078 which might be the problem with cmd/go timeout.

However, currently the builder is stuck in a cmd/compile at:

(gdb) bt
#0  0x0000000100023210 in runtime.(*gcWork).checkPut (w=0x100025598 <runtime.(*mheap).allocManual+56>) at /ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/runtime/mgcwork.go:120
#1  0x00000001000234d8 in runtime.(*gcWork).putBatch (w=0xe8c10088e8e10090, obj=...) at /ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/runtime/mgcwork.go:191
...

(I don't have the rest of the backtrace, as gdb isn't fully working yet)

I'm not sure this is related to the PATH problem. A few crashes that occurred seems linked with a problem inside the runtime:
https://build.golang.org/log/a7041ed8be3d4c3c9f21bf367590a69e52d460de

Morever, is there a way to manually trigger a build ? I could make some experiments to figuring out what's wrong. If not, I'll let the builder stuck until PATH is changed. This will avoid unnecessary fails and provides more tests that everything was related to PATH afterwards.

@bcmills
Copy link
Contributor

bcmills commented Dec 3, 2018

It appears that the aix builder is down, so I also can't get a gomote session to it to investigate further.

I did try running go test cmd/go/internal/lockedfile/... on it last week, and those tests failed because /tmp is apparently mounted in a way that does not support flock-based file locks. I'm not sure whether that's a configuration issue or a fundamental limitation of the OS.

I tried running go test -short -timeout=1m cmd/go, but that ended up hung at this line:

return <-errc

That suggests that either there is a serious hang in cmd/go or a bug in the aix kernel, perhaps regarding the interaction of signals and file-locks?

@bcmills bcmills added the GoCommand cmd/go label Dec 3, 2018
@bcmills bcmills added this to the Go1.12 milestone Dec 3, 2018
@bcmills bcmills added the Testing An issue that has been verified to require only test changes, not just a test failure. label Dec 3, 2018
@Helflym
Copy link
Contributor

Helflym commented Dec 3, 2018

aix builder is in a infinite loop currently. I'd rather kill the hanged process, once the new PATH has been provided. Except if it doesn't need an update of x/cmd/buildlet in this case, I'll release it as soon as you want.

Moreover, the new package cmd/go/internal/lockedfile isn't AIX friendly. AIX flock isn't blocking if it's called twice by the same thread. I'm already working on it.

@bcmills
Copy link
Contributor

bcmills commented Dec 3, 2018

AIX flock isn't blocking if it's called twice by the same thread. I'm already working on it.

That's frustrating, but if you have a POSIX fcntl implementation available, you might be able to just rename filelock_solaris.go and constrain it to //+build aix solaris.

@Helflym
Copy link
Contributor

Helflym commented Dec 3, 2018

Yes, it do have it but filelock_solaris.go seems even more constrained. I'm not sure AIX will need the whole inode stuff. It seems to work with a mutex above every Flock calls. (It needs few adjustments though).
By the way, syscall.Flock is calling fcntl on AIX, cf flock_aix.go.

@bcmills
Copy link
Contributor

bcmills commented Dec 3, 2018

By the way, syscall.Flock is calling fcntl on AIX, cf flock_aix.go.

That seems dangerous: at least on other platforms, flock and fcntl normally release their locks a different points (flock when the first descriptor cloned from the same open call is closed, and fcntl when the first descriptor referring to the same underlying file is closed).

FWIW, we rejected a similar implementation for Solaris (#24684). We should probably remove the one for AIX before it gets released: we can always add it back later if needed. I'll open a separate issue.

@bcmills
Copy link
Contributor

bcmills commented Dec 3, 2018

We should probably remove the one for AIX before it gets released

Filed as #29084.

@Helflym
Copy link
Contributor

Helflym commented Dec 3, 2018

Ok thanks.
By the way, I have tried solaris version of filelock and it does work.

@Helflym
Copy link
Contributor

Helflym commented Dec 3, 2018

By the way, aix builder is released. If you wish to do some experiments, you can. I'll provide the patch to use Posix version of lockedfile on AIX tomorrow.

@bcmills
Copy link
Contributor

bcmills commented Dec 3, 2018

Great, thanks!

Let's rename the file to filelock_fcntl.go. (I had originally named it filelock_posix.go but @rsc found that a bit confusing; see this comment.)

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/152397 mentions this issue: syscall, cmd/go/internal/lockedfile: remove Flock syscall for aix/ppc64

gopherbot pushed a commit that referenced this issue Dec 4, 2018
AIX doesn't provide flock() syscall, it was previously emulated by fcntl
calls. However, there are some differences between a flock() syscall and
a flock() using fcntl. Therefore, it's safer to remove it and just
provide FcntlFlock.

Thus, lockedfile implementation must be moved to use FcntlFlock on aix/ppc64.

Updates #29065.
Fixes #29084.

Change-Id: Ic48fd9f315f24c2acdf09b91d917da131a1f2dd5
Reviewed-on: https://go-review.googlesource.com/c/152397
Reviewed-by: Tobias Klauser <[email protected]>
Reviewed-by: Bryan C. Mills <[email protected]>
@gopherbot
Copy link
Contributor

Change https://golang.org/cl/152457 mentions this issue: cmd/buildlet: add /opt/freeware/bin to PATH on aix builder

@Helflym
Copy link
Contributor

Helflym commented Dec 5, 2018

I've made some investigations and the timeouts seem to come from different issues.

The main problem is the runtime being stuck in an infinite loop during runtime.(*gcWork).checkPut. This was introduced to resolve #27993. This happens mainly during a cmd/link or a cmd/compile launched by go build. Only once in a while though.
I currently have a cmd/link stuck with gcWorkPauseGen = 3 and w.pauseGen = 3:

 (gdb) bt
 #0  0x0000000100022dc0 in runtime.(*gcWork).checkPut (w=0x100025414 <runtime.(*mheap).allocSpanLocked+276>) at /ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/runtime/mgcwork.go:120
 #1  0x0000000100023078 in runtime.(*gcWork).putBatch (w=0xe8c10088e8e10090, obj=...) at /ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/runtime/mgcwork.go:191

(Sadly, I don't have more backtrace as gdb isn't fully working on aix/ppc64...)

I've also found out different crashes which seem to be related to P objects:

I think every think is linked to AIX syscalls that are made by calling 'asmcgocall' and C functions. How does GC know that a thread is in a C syscall ? It's for possible that I've forgotten to add a GOOS="aix" somewhere to handle syscalls correctly, even if I've check several times.

Morever, we are aware that the test machine has a really really slow local disk. That's why we have created a ramdisk to store the build folder. The person who have provided the machine is trying to find a way to fix this slowness. However, it might be interesting to change TMPDIR and GOTMPDIR to /ramdisk0/tmp. This might accelerate the whole process.

I still don't understand why these timeouts occur everytime within the builder but never when I manually launch all.bash... Are there many differences between both ?
Note that during the manual launches, cmd/go takes ~227s and net ~100s. Net is strange because during most of the builds made by the builder it takes only 50s, but sometimes it timeouts too:

If anyone has ideas about the origin of these bugs, you're welcome !

/cc @aclements

gopherbot pushed a commit to golang/build that referenced this issue Dec 5, 2018
Updates golang/go#29065
Fixes golang/go#29078

Change-Id: Ifa9355c9dc988a460b6198913431647ec2c5e6ac
Reviewed-on: https://go-review.googlesource.com/c/152457
Reviewed-by: Brad Fitzpatrick <[email protected]>
@Helflym
Copy link
Contributor

Helflym commented Dec 12, 2018

Tests are still timing out even after the new PATH CL 152457 and the remove of Flock in cmd/go/internal/lockedfile CL 152397. There are still some problems with the last one, but it does trigger an error and not a timeout.

I've been searching for a few days and still don't understand what's wrong with the builder. How far the environment or the execution are different from a classic ./all.bash ?
I've tried with fresh git clones and it never times out in cmd/go tests. It does time out sometimes because of the issue that I've spoken above (with runtime.(*gcWork).checkPut) but it's not that common anyway.

The timeout during cmd/go tests seems to always be related to TestScript/script_wait test. sleep process seems to never be killed... Maybe, there is something wrong with signals on AIX. But it's really strange that it never occurs while running ./all.bash. There is no error handler in script_test.go:interruptProcess(), I'll submit a patch which might give us more information about why sleep isn't killed.

Edit: Is there a way to try a commit only on one builder ? Adding the error handler isn't that simple to implement because of channels. It might be a little too heavy for just a supposition on one GOOS... The patch is ready but if I can only try it on aix/ppc64 that could be better.

@Helflym
Copy link
Contributor

Helflym commented Dec 12, 2018

I've found why sleep process isn't kill at the end of script_wait.
Sleep process is ignoring INT signal. It seems to inherit that from buildlet command which is also ignoring INT signal. Is something done on purpose by cmd/buildlet ?
I've made a simple test and the signal masks inheritance looks like the same on AIX and Linux, by the way. Do Linux (or others OSes) return an error if a signal is sent to a processing ignoring it ? I haven't found anything about that.

@bradfitz
Copy link
Contributor

It seems to inherit that from buildlet command which is also ignoring INT signal. Is something done on purpose by cmd/buildlet ?

I have no memory of doing such a thing on purpose (or accident).

@Helflym
Copy link
Contributor

Helflym commented Dec 13, 2018

Well, it seems that this sigignore came from how we were launching our builder. It should be ok now. We'll see if it works during the next commit !

@aclements
Copy link
Member

The main problem is the runtime being stuck in an infinite loop during runtime.(*gcWork).checkPut.

If some thread is stuck on checkPut, then the real question is what's going on on the other threads. There should be another thread running gcMarkDone, since the condition checkPut is spinning on is only true while gcMarkDone is running. If, as you mention, there are issues with stopping Ps on aix (in stopTheWorld and forEachP), then that's probably the root cause, since gcMarkDone has to stop all of the Ps before it can release the spin loop in checkPut.

@bcmills
Copy link
Contributor

bcmills commented Jan 17, 2019

It should be ok now. We'll see if it works during the next commit !

The only tests I'm seeing timing out lately are for the net package: whatever was affecting cmd/go seems to be resolved.

@bcmills bcmills closed this as completed Jan 17, 2019
@golang golang locked and limited conversation to collaborators Jan 17, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Builders x/build issues (builders, bots, dashboards) FrozenDueToAge GoCommand cmd/go OS-AIX Testing An issue that has been verified to require only test changes, not just a test failure.
Projects
None yet
Development

No branches or pull requests

6 participants