runtime: reducing preemption in suspendG when G is running large nosplit functions #40229

Open
shawndx opened this issue Jul 15, 2020 · 3 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance

@shawndx
Contributor

shawndx commented Jul 15, 2020

What version of Go are you using (go version)?

$ go version
tip version, on arm64

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env


GO111MODULE=""
GOARCH="arm64"
GOBIN=""
GOCACHE="/home/xiaji01/.cache/go-build"
GOENV="/home/xiaji01/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="arm64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/xiaji01/.go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/xiaji01/.go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/home/xiaji01/src/go.gc"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/home/xiaji01/src/go.gc/pkg/tool/linux_arm64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/home/xiaji01/src/go.gc/src/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build944660995=/tmp/go-build -gno-record-gcc-switches"

What did you do?

Live-lock problems show up on arm64 in a couple of micro-benchmarks from text/tabwriter (there are likely more cases like these).
If a goroutine is running time-consuming 'nosplit' functions, such as
bulkBarrierPreWriteSrcOnly
memmove
called from growslice on a large slice, and a GC background worker tries to suspend it in order to scan its stack, the G being preempted can hardly make progress. This is especially so on arm64 machines, which seem to have poor signal handling performance.

pprof data of BenchmarkTable/1x100000/new on an arm64 machine:

(pprof) top
Showing nodes accounting for 12.46s, 95.70% of 13.02s total
Dropped 118 nodes (cum <= 0.07s)
Showing top 10 nodes out of 40
      flat  flat%   sum%        cum   cum%
     5.41s 41.55% 41.55%      5.41s 41.55%  runtime.futex
     3.22s 24.73% 66.28%      3.59s 27.57%  runtime.nanotime (inline)
     0.99s  7.60% 73.89%      0.99s  7.60%  runtime.tgkill
     0.84s  6.45% 80.34%      0.84s  6.45%  runtime.osyield
     0.77s  5.91% 86.25%      0.77s  5.91%  runtime.epollwait
     0.37s  2.84% 89.09%      0.37s  2.84%  runtime.nanotime1
     0.33s  2.53% 91.63%      6.38s 49.00%  runtime.suspendG
     0.30s  2.30% 93.93%      0.30s  2.30%  runtime.getpid
     0.14s  1.08% 95.01%      0.14s  1.08%  runtime.procyield
     0.09s  0.69% 95.70%      1.38s 10.60%  runtime.preemptM (inline)

I'm thinking of introducing a flag in G that indicates whether it is running a time-consuming nosplit function, and having suspendG yield while the flag is set so that the live-lock is avoided. The flag is set manually for known functions. I tried it for bulkBarrierPreWriteSrcOnly and memmove in runtime.growslice and runtime.makeslicecopy when the slice is longer than a threshold (4K for now), and the text/tabwriter package showed significant improvement on arm64:

arm64-1 perf:

name                        old time/op    new time/op    delta
Table/1x10/new-224            6.59µs ± 8%    6.67µs ± 7%     ~     (p=0.841 n=5+5)
Table/1x10/reuse-224          1.94µs ± 1%    1.94µs ± 1%     ~     (p=0.802 n=5+5)
Table/1x1000/new-224           494µs ±13%     381µs ± 3%  -22.85%  (p=0.008 n=5+5)
Table/1x1000/reuse-224         185µs ± 0%     185µs ± 0%   -0.37%  (p=0.016 n=5+4)
Table/1x100000/new-224         2.73s ±73%     0.04s ± 2%  -98.46%  (p=0.008 n=5+5)
Table/1x100000/reuse-224       4.25s ±76%     0.02s ± 1%  -99.55%  (p=0.008 n=5+5)
Table/10x10/new-224           20.7µs ± 6%    20.1µs ± 5%     ~     (p=0.421 n=5+5)
Table/10x10/reuse-224         8.89µs ± 0%    8.91µs ± 0%   +0.21%  (p=0.032 n=5+5)
Table/10x1000/new-224         1.70ms ± 7%    1.59ms ± 6%     ~     (p=0.151 n=5+5)
Table/10x1000/reuse-224        908µs ± 0%     902µs ± 0%   -0.59%  (p=0.032 n=5+5)
Table/10x100000/new-224        2.34s ±62%     0.14s ± 2%  -94.21%  (p=0.008 n=5+5)
Table/10x100000/reuse-224      911ms ±72%     103ms ± 2%  -88.66%  (p=0.008 n=5+5)
Table/100x10/new-224           143µs ±15%     138µs ± 5%     ~     (p=0.841 n=5+5)
Table/100x10/reuse-224        78.1µs ± 0%    78.1µs ± 0%     ~     (p=0.841 n=5+5)
Table/100x1000/new-224        13.1ms ± 6%    11.9ms ± 5%   -9.16%  (p=0.008 n=5+5)
Table/100x1000/reuse-224      8.13ms ± 0%    8.15ms ± 1%     ~     (p=1.000 n=5+5)
Table/100x100000/new-224       1.29s ±17%     1.31s ±10%     ~     (p=0.310 n=5+5)
Table/100x100000/reuse-224     1.23s ± 2%     1.26s ± 5%     ~     (p=0.286 n=4+5)

arm64-2 perf:

name                       old time/op    new time/op    delta
Table/1x10/new-64            4.33µs ± 1%    5.40µs ± 4%  +24.92%  (p=0.004 n=6+5)
Table/1x10/reuse-64          1.77µs ± 0%    1.77µs ± 0%   -0.36%  (p=0.024 n=6+6)
Table/1x1000/new-64           318µs ±10%     379µs ±10%  +19.04%  (p=0.008 n=5+5)
Table/1x1000/reuse-64         172µs ± 0%     171µs ± 0%     ~     (p=0.537 n=6+5)
Table/1x100000/new-64         4.26s ±88%     0.05s ± 4%  -98.77%  (p=0.002 n=6+6)
Table/1x100000/reuse-64       4.79s ±73%     0.02s ± 2%  -99.61%  (p=0.004 n=6+5)
Table/10x10/new-64           14.6µs ± 3%    16.4µs ± 9%  +12.51%  (p=0.004 n=5+6)
Table/10x10/reuse-64         8.76µs ± 0%    8.79µs ± 0%     ~     (p=0.329 n=5+6)
Table/10x1000/new-64         1.18ms ± 3%    1.32ms ± 4%  +12.12%  (p=0.002 n=6+6)
Table/10x1000/reuse-64        890µs ± 0%     897µs ± 0%   +0.84%  (p=0.002 n=6+6)
Table/10x100000/new-64        1.43s ±36%     0.16s ± 3%  -88.56%  (p=0.004 n=5+6)
Table/10x100000/reuse-64      375ms ±55%     138ms ±11%  -63.31%  (p=0.004 n=5+6)
Table/100x10/new-64           103µs ± 0%     105µs ± 4%     ~     (p=0.429 n=5+6)
Table/100x10/reuse-64        79.6µs ± 0%    80.0µs ± 1%     ~     (p=0.329 n=5+6)
Table/100x1000/new-64        10.2ms ± 3%    10.5ms ± 1%     ~     (p=0.052 n=6+5)
Table/100x1000/reuse-64      8.75ms ± 0%    9.23ms ± 6%     ~     (p=0.126 n=5+6)
Table/100x100000/new-64       1.42s ± 6%     1.46s ±11%     ~     (p=0.394 n=6+6)
Table/100x100000/reuse-64     1.40s ± 0%     1.45s ± 6%     ~     (p=0.690 n=5+5)
Pyramid/10-64                15.3µs ±45%    14.5µs ±20%     ~     (p=0.662 n=6+5)
Pyramid/100-64               1.16ms ±13%    0.88ms ±11%  -23.90%  (p=0.002 n=6+6)
Pyramid/1000-64              71.6ms ± 8%    77.8ms ±12%     ~     (p=0.056 n=5+5)
Ragged/10-64                 13.2µs ±17%    14.0µs ± 2%     ~     (p=0.792 n=6+5)
Ragged/100-64                 104µs ± 3%     123µs ± 2%  +18.21%  (p=0.008 n=5+5)
Ragged/1000-64               1.12ms ±12%    1.35ms ± 9%  +20.55%  (p=0.002 n=6+6)
Code-64                      3.51µs ± 1%    3.80µs ± 3%   +8.46%  (p=0.004 n=5+6)

x86 perf:

name                       old time/op    new time/op    delta
Table/1x10/new-32            5.36µs ± 2%    5.58µs ± 3%   +4.03%  (p=0.004 n=6+6)
Table/1x10/reuse-32          1.50µs ± 8%    1.48µs ± 0%     ~     (p=0.433 n=6+5)
Table/1x1000/new-32           318µs ± 2%     355µs ± 2%  +11.43%  (p=0.004 n=5+6)
Table/1x1000/reuse-32         127µs ± 0%     134µs ± 1%   +5.36%  (p=0.004 n=6+5)
Table/1x100000/new-32        49.9ms ± 3%    48.1ms ± 3%   -3.57%  (p=0.026 n=6+6)
Table/1x100000/reuse-32      13.8ms ± 1%    15.2ms ± 1%   +9.87%  (p=0.002 n=6+6)
Table/10x10/new-32           17.4µs ± 3%    17.6µs ± 2%     ~     (p=0.310 n=6+6)
Table/10x10/reuse-32         7.20µs ± 1%    7.21µs ± 0%     ~     (p=0.429 n=5+6)
Table/10x1000/new-32         1.38ms ± 1%    1.43ms ± 2%   +4.16%  (p=0.002 n=6+6)
Table/10x1000/reuse-32        687µs ± 1%     693µs ± 0%     ~     (p=0.052 n=6+5)
Table/10x100000/new-32        131ms ± 2%     133ms ± 5%     ~     (p=0.699 n=6+6)
Table/10x100000/reuse-32     89.2ms ± 2%    90.3ms ± 2%     ~     (p=0.177 n=5+6)
Table/100x10/new-32           122µs ± 1%     122µs ± 1%     ~     (p=0.632 n=6+5)
Table/100x10/reuse-32        62.3µs ± 0%    62.4µs ± 0%     ~     (p=0.429 n=5+6)
Table/100x1000/new-32        12.2ms ± 4%    12.0ms ± 3%     ~     (p=0.180 n=6+6)
Table/100x1000/reuse-32      6.29ms ± 0%    6.32ms ± 0%   +0.54%  (p=0.015 n=6+6)
Table/100x100000/new-32       1.01s ± 2%     1.00s ± 4%     ~     (p=0.429 n=6+5)
Table/100x100000/reuse-32     972ms ±10%     962ms ±18%     ~     (p=1.000 n=6+6)
Pyramid/10-32                14.3µs ± 4%    14.8µs ± 6%     ~     (p=0.240 n=6+6)
Pyramid/100-32                833µs ± 2%     841µs ± 2%     ~     (p=0.394 n=6+6)
Pyramid/1000-32              53.5ms ± 1%    56.2ms ± 3%   +4.98%  (p=0.004 n=5+6)
Ragged/10-32                 15.1µs ± 0%    14.8µs ± 1%   -1.75%  (p=0.008 n=5+5)
Ragged/100-32                 127µs ± 4%     130µs ± 2%     ~     (p=0.180 n=6+6)
Ragged/1000-32               1.27ms ± 5%    1.33ms ± 3%   +3.96%  (p=0.026 n=6+6)
Code-32                      3.75µs ± 1%    3.81µs ± 2%   +1.67%  (p=0.048 n=5+6)

The potential overhead is acquiring the current G in the two slice functions, which seems to be tiny.
I'm working on benchmarking more packages and evaluating the impact on the x86 platform.

Any comment is highly appreciated.

What did you expect to see?

Better performance.

What did you see instead?

@ALTree ALTree added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jul 15, 2020
@ianlancetaylor
Member

CC @aclements @cherrymui

@cherrymui
Member

cherrymui commented Jul 15, 2020

I think the plan is the opposite: make functions like memmove preemptible. See also #36365

@shawndx
Contributor Author

shawndx commented Jul 17, 2020

> I think the plan is the opposite: make functions like memmove preemptible. See also #36365

Thanks @cherrymui, is there a (tentative) release in which this is scheduled to land?

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jul 7, 2022
@mknyszek mknyszek moved this to Triage Backlog in Go Compiler / Runtime Jul 15, 2022
@seankhliao seankhliao added this to the Unplanned milestone Aug 27, 2022
@seankhliao seankhliao changed the title runtime: request for comment: reducing preemption in suspendG when G is running large nosplit functions runtime: reducing preemption in suspendG when G is running large nosplit functions Jan 22, 2024

6 participants