runtime: findrunnableGCWorker consumed much higher CPU under 64c AWS instance (ARM64), comparing to 32c #68399
1.20 isn't a supported release; can we see the results from the latest version?
Yes, reproducing on a newer Go version would make it easier to analyze the problem and perhaps to fix it.

Maybe the GC is running very often, which in turn has other effects that cause the app's increased tail latency. In that case, arranging for the GC to run less often (without changes to Go) might resolve your application's problems. Or, looking more deeply into the tail latency itself might show something about Go. Execution traces (the runtime/trace package) give a more direct view into latency.

One of the behaviors I've seen in the scheduler is that when there are a lot of Ps (higher GOMAXPROCS, since that's the number of Ps), idle threads spend more time searching for work. It's interesting to see that your application ends up with higher latency. As I understand Go's scheduler, running a program with higher GOMAXPROCS would lead to higher CPU usage due to idle threads searching for work rather than going to sleep. But in a single-tenant environment -- with no other (big) programs running on the machine -- I'd expect the effect of that to be limited to keeping the CPUs powered up for longer than necessary: wasting electricity, but still giving good latency in the app.

You mentioned AWS EKS, which makes me wonder if the app is running alongside other Go apps on the same machine/VM. If so, the increase in on-CPU time in the other Go apps (from the effect described above) could be what's causing the latency.

It looks like the flame graph tool is reordering the functions based on their contribution to the on-CPU time total (rather than keeping them alphabetical), which makes the two charts hard to compare, especially with the names being truncated. Sharing the data in a less processed form may be helpful, maybe as the svg/web/png views of the raw profiles.
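As a concrete starting point for the execution-trace suggestion, here is a minimal sketch (not from the thread) using the standard runtime/trace package; the output file name and the fixed capture window are arbitrary choices:

```go
package main

import (
	"log"
	"os"
	"runtime/trace"
	"time"
)

func main() {
	// Write the execution trace to a file; inspect it later with `go tool trace`.
	f, err := os.Create("trace.out")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if err := trace.Start(f); err != nil {
		log.Fatal(err)
	}
	// Capture a short window while the app is under load.
	time.Sleep(5 * time.Second)
	trace.Stop()
}
```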
Sorry, it should be go1.19.13, my bad, and thanks for the rapid response!
Here are the flame graphs (*.proto):
Thanks for sharing the data; those profiles work well. I see that the Build ID for those two profiles is identical (e3ac319ed9aafff97d9ecf080dc2d206dd998382). I see that the amount of on-CPU time they spend in several types of application goroutines is similar between the two profiles -- it looks like they're receiving the same total amount of offered load (so half as much per core on the 64-core machine).
From spot-checking differences in on-CPU time, I see that the 64-core profile shows more time in the runtime.

Another overall takeaway from the CPU profiles is that the 32-core profile shows an average of about 10 cores of on-CPU time, and the 64-core profile shows an average of about 17 cores of on-CPU time. These don't appear (on average, at least) to be starved for CPU time.
With no other pods scheduled on the machine? The change in how the program spends its CPU time is suspicious, but I don't see how that could connect to a delay in processing application work.

I do see a difference in the amount of GC Assist work between the two profiles.

There may be two or more problems here. One relates to GC Assist, which could be a data point for Go's pacer design or could be worked around within your app. Another relates to the high on-CPU time in the scheduler.

Some thoughts on how to make analysis easier: the golang.org/x/exp/trace.FlightRecorder may help to capture an execution trace that shows the problem. It requires Go 1.22.

CC @golang/runtime, and (pacer!? or at least GC worker management) @mknyszek
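For reference, a minimal sketch of the flight-recorder idea (not from the thread), assuming the golang.org/x/exp/trace API of NewFlightRecorder, Start, WriteTo, and Stop; the trigger condition and the output file name are hypothetical, and exact signatures should be checked against the package docs for your version:

```go
package main

import (
	"log"
	"os"
	"time"

	"golang.org/x/exp/trace"
)

func main() {
	// Keep a rolling window of recent execution-trace data in memory.
	fr := trace.NewFlightRecorder()
	fr.Start() // error handling elided for brevity
	defer fr.Stop()

	// Placeholder for real request handling: pretend a slow request was observed.
	slowRequestDetected := func() bool { time.Sleep(time.Second); return true }

	if slowRequestDetected() {
		// Snapshot the recent past to a file for offline analysis.
		f, err := os.Create("flight.trace")
		if err != nil {
			log.Fatal(err)
		}
		defer f.Close()
		if _, err := fr.WriteTo(f); err != nil {
			log.Fatal(err)
		}
	}
}
```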
Appreciate the comprehensive response! Since this issue happened in the production environment, we can't simply change the online app (switch to a go1.22 build). If we are lucky, I should come back with newly captured data in a few days. Thanks again!
However, if that is the case, it is surprising that we don't see other parts of the scheduler show up for the periods when the GC is not running.
Hi all, we ran some tests, and I am back with the results and captured data. Tests we did:
Some things we found in the results:
And we captured flame graphs with go1.22:
Thanks for the update. Are you able to share the execution traces you collected?

Looking inside the zip files for the 32c and 64c experiments: the ./goroutine files make it look like the 32c process had been running for 7 minutes at the time of the snapshot and the 64c process had been running for 11 minutes.

Using the CPU profiles in ./profile as a very rough estimate of which runtime-internal locks are contended, this looks like evidence in favor of @prattmic's suspicion that the program is entering the scheduler extremely frequently. That might be due to the behavior of app-specific code, or of a package the app uses (such as http2/grpc).

The 64c ./mutex profile shows 165e6 contention events taking 3500 seconds total, for an average of 21 µs per event. For 32c, it's 57e6 events and 557 seconds, for an average of 9.8 µs per event.

An execution trace, preferably one that was captured concurrently with an active CPU profile, would give a more complete view of what the app is doing and why it might be having trouble, rather than needing to piece bits together from the other views.

I have an idea for how to address the high CPU time of findRunnableGCWorker.
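For readers who want to reproduce this kind of view, here is a minimal sketch (not from the thread) of collecting a mutex-contention profile in-process. The sampling rate, run duration, and file name are arbitrary; note that on Go 1.22+ the amount of stack detail reported for runtime-internal locks is also affected by the GODEBUG runtimecontentionstacks setting.

```go
package main

import (
	"log"
	"os"
	"runtime"
	"runtime/pprof"
	"time"
)

func main() {
	// Sample roughly 1 in 5 contention events; 0 disables, 1 records everything.
	runtime.SetMutexProfileFraction(5)

	// ... run the workload for a while ...
	time.Sleep(30 * time.Second)

	f, err := os.Create("mutex.pb.gz")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Write the accumulated mutex-contention profile in pprof format.
	if err := pprof.Lookup("mutex").WriteTo(f, 0); err != nil {
		log.Fatal(err)
	}
}
```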
Change https://go.dev/cl/602477 mentions this issue.
During the GC mark phase, one of the first behaviors of findRunnable is to check if it should execute a GC mark worker. Mark workers often run for many milliseconds in a row, so programs that invoke the scheduler more frequently will see that condition trigger only a tiny fraction of the time. Obtaining a mark worker from the gcBgMarkWorkerPool involves a CAS on a single memory location that's shared across the process. When GOMAXPROCS is large, the resulting contention can waste a significant amount of CPU time. But a sufficiently large GOMAXPROCS also means there's no need for fractional mark workers, making it easier to check ahead of time if we need to run a worker.

Check, without committing to a particular worker, whether we would even want to run one.

For #68399

Change-Id: I5d8578c2101ee20a8a4156a029584356095ea118
Reviewed-on: https://go-review.googlesource.com/c/go/+/602477
Reviewed-by: Michael Pratt <[email protected]>
Auto-Submit: Rhys Hiltner <[email protected]>
Reviewed-by: Michael Knyszek <[email protected]>
LUCI-TryBot-Result: Go LUCI <[email protected]>
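To illustrate the contention pattern the commit message describes, here is a simplified standalone model (not the runtime's actual gcBgMarkWorkerPool code): 64 "Ps" hammer a CAS on one shared counter, and a cheap read-only pre-check lets most iterations skip that contended path entirely.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// needWorker models the cheap, read-only pre-check: only attempt to take a
// worker if the GC could actually want one right now.
func needWorker(want *atomic.Int64) bool {
	return want.Load() > 0
}

func main() {
	var (
		available atomic.Int64 // shared slot, standing in for a single pool head
		want      atomic.Int64 // how many workers the GC currently wants
		acquired  atomic.Int64
		wg        sync.WaitGroup
	)
	want.Store(4)
	available.Store(4)

	for p := 0; p < 64; p++ { // 64 "Ps" entering the scheduler very often
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < 100_000; i++ {
				if !needWorker(&want) {
					continue // cheap load; no contended CAS
				}
				// The expensive path: a CAS on a single shared location.
				for {
					n := available.Load()
					if n == 0 {
						break
					}
					if available.CompareAndSwap(n, n-1) {
						acquired.Add(1)
						want.Add(-1)
						break
					}
				}
			}
		}()
	}
	wg.Wait()
	fmt.Println("workers acquired:", acquired.Load())
}
```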
I think that https://go.dev/cl/602477 will fix the headline issue of "findRunnableGCWorker using more CPU on larger machines". It's been merged into the development branch, so it should be part of the Go 1.24 release in about 6 months. If you're able to run your workload with a patched version of Go, it would be helpful to know how it changes the behaviors you've reported.

I don't think it will solve the higher tail latencies you've seen on larger machines (though it is a step towards that). I think the reason you're reporting the increase in findRunnableGCWorker's CPU usage is that your app is 1/ running on a large machine, and 2/ interacting with the scheduler very, very often. The CPU usage of findRunnableGCWorker is the easiest to see (on CPU profiles) and perhaps to fix, but there'll be others, including the various sources of contention on runtime-internal locks.

The app might benefit from app-level or library-level changes to reduce how often it interacts with the scheduler. But I think the scheduler will benefit from continued issue reporting here. Thank you for reporting, and for sharing the detailed data.
Thank you guys a lot!
Hi @rhysh, thank you for your investigation. Here is the trace file for #68399 (comment), captured with go1.19 (the trace covers the yellow section of the test, which corresponds to the first image of that comment): trace file link. If it is helpful, or if you need more data, I can provide more by reproducing it in my environment.
So it could be a common issue. BTW, I also tried to reproduce it in another environment with the same methods, but the runtime shows an even higher cost (almost 5000% CPU with 64c). Here is the profile:
Thanks @nolouch. I'm not so worried about that number on its own. The go1.22 profile you shared today from "another environment" shows a very high level of contention in findRunnableGCWorker, but it also shows a reason for the contention.

Thanks for sharing the execution traces. They're pretty big, and aren't opening quickly on my machine. (They're in the Go 1.19 format; I think that fast loading of large traces should be better with the v2 tracer that shipped with Go 1.22.) I'll give them a more thorough try later. But if you have traces from a Go 1.22+ version of the program, those would probably work even better.
Using go1.23rc2's trace tooling, and averaging across GOMAXPROCS=64, there is a transition into "Running" about every 44 µs on each thread. But the threads aren't all running all the time, so the "Running" events are bunched together more closely while a thread is active, and a typical goroutine execution lasts for less than 44 µs. Spot-checking between 760 ms and 860 ms, I see many goroutines that run for single-digit microseconds. So yes, this is an app that interacts with the scheduler very frequently.

I don't see any CPU profile sample events in the execution trace data stream. During some millisecond-scale time windows, such as between 809.5 ms and 810.5 ms (pictured below), I see many of those very short goroutine executions.

You're looking for something that slows the p99 down by a few milliseconds. This is something occasional and unusual that lasts on the order of a millisecond.

If you're having trouble getting a CPU profile and execution trace to run at the same time, you might try using https://pkg.go.dev/github.com/rhysh/autoprof.
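If autoprof isn't an option, here is a minimal standard-library sketch (not from the thread) that captures a CPU profile and an execution trace over the same window, so the trace contains CPU sample events; the file names and capture duration are arbitrary:

```go
package main

import (
	"log"
	"os"
	"runtime/pprof"
	"runtime/trace"
	"time"
)

func main() {
	cpu, err := os.Create("cpu.pprof")
	if err != nil {
		log.Fatal(err)
	}
	defer cpu.Close()
	tr, err := os.Create("trace.out")
	if err != nil {
		log.Fatal(err)
	}
	defer tr.Close()

	// Start both collectors so the execution trace includes CPU-profile samples.
	if err := pprof.StartCPUProfile(cpu); err != nil {
		log.Fatal(err)
	}
	if err := trace.Start(tr); err != nil {
		log.Fatal(err)
	}

	time.Sleep(10 * time.Second) // run the interesting workload here

	trace.Stop()
	pprof.StopCPUProfile()
}
```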
@rhysh
The patch has some minor improvements, but it's unclear whether the improvement is due to the PR changes or to other changes. Also, the difference in tail latency between 32c and 64c is still significant.
That's great data, thank you! From the bit that I've been able to digest so far, the worst of it looks (at a high level) like a single behavior of your app is causing it to bump into a combination of a few suboptimal behaviors in the runtime.

My read of the autoprof data bundles is that your app is running with GOMAXPROCS of 32 or 64 and is serving a high volume of latency-sensitive requests (which involves frequent allocation), but has a live heap of only 100–150 MB. The GC is running every 120–205 ms, with the mark phase lasting around 15 ms. If there's nothing else running on those machines, there's probably 60 GB or more of idle RAM that the process could use to spread out the time between its garbage collections. I'd start slow, by inviting the runtime to use 10 GB of it for garbage. You can do that by setting a memory limit when launching the app. Another option, strange as it sounds, is to arrange for that from within the app itself.

I didn't see the behaviors that showed up in the other profiles/traces, and I haven't looked into why. I'm concerned that there are several concurrent problems, and that if we try to eliminate only a single one at a time we might conclude that none of them, by itself, fixes your app's tail latency. I'd expect the lfstack usage in the GC mark worker pool to be only one piece of it.

From the runtime side, I see high levels of GC Assist work; that shows up in the CPU profiles. Then, I see that there's a lot of on-CPU time in the lfstack that the GC uses. Slowness during each GC cycle, lasting up to 15 ms and occurring every 150 ms, could explain tail latency beyond the p90. Setting a max heap size of 10 GiB would make GC cycles happen about 100x less often. (I'm surprised that the GOGC=1000 experiment didn't work better.)

I used to have tools (in https://pkg.go.dev/github.com/rhysh/go-tracing-toolbox) to extract examples of slow requests from the execution trace, but I haven't ported them to the new trace parsing API (or added grpc support). Maybe I'll change that over the next few weeks.
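A hedged sketch of the memory-limit suggestion above, assuming a 10 GiB budget is actually safe on these machines; the same effect is available without code changes via the GOMEMLIMIT and GOGC environment variables.

```go
package main

import "runtime/debug"

func init() {
	// Let the heap grow to roughly 10 GiB before the GC is forced to run,
	// instead of collecting every ~150 ms on a ~150 MB live heap.
	debug.SetMemoryLimit(10 << 30) // bytes; equivalent to GOMEMLIMIT=10GiB
	debug.SetGCPercent(-1)         // equivalent to GOGC=off: rely on the limit
}

func main() {
	// ... application ...
}
```

Disabling GOGC entirely and relying only on the limit is the aggressive variant; keeping GOGC at a moderate value alongside GOMEMLIMIT is the gentler one.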
Sorry for the lack of reply here. I tried to read over all the comments, but it's possible I missed something. Apologies in advance if I end up restating something that was already mentioned above. I think Rhys has done a great analysis; below I respond primarily to the GC assist questions, though I think a lot of this is relevant to most of the details of Rhys' most recent comment.
An increase in assist work with increased GOMAXPROCS is interesting. Here are a couple of theories as to why there's additional assist work:

(1) The GC measures its own performance. If we're hitting a scalability bottleneck in the GC (like below), then the GC will measure lower performance, causing it to be more conservative and force more assists to keep up. The GC should still be starting earlier to offset that, but it's possible you're running up against some of the internal limits on how early a GC is allowed to start running.

(2) Does each 64-core instance of your application accept a bigger load? (Like, when you scaled up to 64 cores from 32, did you also start using half as many instances?) Also, what is the difference in memory footprint between the 32-core and 64-core versions? I ask these questions because one way you can end up in this situation is due to how GOGC sets the heap goal relative to the live heap, independent of the allocation rate.

Let me try to illustrate: on 32 cores, 2k QPS results in 500 MiB/s allocated; with a 100 MiB live heap and a 200 MiB total heap footprint, that's 1 GC cycle every 200 ms. On 64 cores at double the load, the allocation rate doubles while the live heap (and so the heap goal) stays the same, so that's 1 GC cycle every 100 ms. Since in this example the live heap is constant, each GC cycle is doing a constant amount of work, but now your GC overhead is 2x! The fix to this would be to increase GOGC (or set a memory limit) so the heap has more room to grow.

Honestly, I suspect there's probably a bit of both (1) and (2) going on. Also, I'd recommend moving to at least Go 1.20, since Go 1.19 had a somewhat more erratic pacer which could lead to more assists in more edge cases. Having a look at the extra STDERR output from your program when running with GODEBUG=gctrace=1 would also help.
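A small sketch of the arithmetic in the illustration above, using the hypothetical numbers from the comment (2k QPS, 500 MiB/s allocated, 100 MiB live heap) rather than real measurements:

```go
package main

import "fmt"

// gcPeriod estimates seconds between GC cycles: with GOGC, the heap goal is
// roughly live * (1 + GOGC/100), and the GC runs once the process has
// allocated the headroom between the goal and the live heap.
func gcPeriod(allocRateMiBps, liveHeapMiB, gogc float64) float64 {
	headroomMiB := liveHeapMiB * gogc / 100
	return headroomMiB / allocRateMiBps
}

func main() {
	// 32-core example: 2k QPS -> 500 MiB/s, 100 MiB live heap, GOGC=100.
	fmt.Printf("32c: GC every %.0f ms\n", gcPeriod(500, 100, 100)*1000) // ~200 ms
	// 64-core at double the load: 1000 MiB/s, same live heap and heap goal.
	fmt.Printf("64c: GC every %.0f ms\n", gcPeriod(1000, 100, 100)*1000) // ~100 ms
}
```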
There are definitely some known scalability bottlenecks in this area.
Some time ago, we encountered a similar problem, and we also located the bottleneck in the same part of the runtime.
Sorry for the delay in my response. GOGC has already been tested, and while it provides some benefits as noted in #68399 (comment), it doesn't perform as well as using 32 cores. I will also try the memory-limit suggestion.
Hi @mknyszek, we were facing performance issues with lfstack on arm64 as well, and while searching for them we stumbled on this thread.
Are there any readings I can refer to on the bottlenecks you mentioned in your previous message?
Hi @YashJainSC, I'm not 100% sure, but as a starting point you could take a look at #21056, including #21056 (comment) and https://go.dev/cl/62971, and, if you're interested, a benchmark you could run from #21056 (comment):
Go version

go1.19.13 linux/arm64

Output of go env in your module/workspace:

GOARCH="arm64"

What did you do?

What did you see happen?

findrunnableGCWorker consumed much higher CPU compared to before. Here are the graphs:

What did you expect to see?

The latency of this server process shouldn't change much when we changed from a 32c instance to a 64c instance.