runtime: system experienced long pauses during GC marking #27410
Comments
The mentioned Google document cannot be accessed. What was the approximate size of the allocated heap?
@ysmolsky We experienced similarly sized pauses during the mark phase. We tested heaps of 8-13 GB and saw a lot of 5-20ms pauses (where a pause is measured as a period in which no network events are recorded in the trace). The larger heap sizes saw a greater performance impact, but it looks like that was simply because the mark phase ran for much longer. In particular, we saw a huge number of Idle-GC slices in the trace. If I were to guess, I would say the scheduler was struggling to schedule all of the Idle-GC goroutines as well as schedule meaningful work. If you request access to the Google doc, I will grant it as soon as I can.
We saw no mark assists in the trace. I specifically looked for them, because that was my first thought. The way we forced the GC to run during the trace makes assists even less likely, because an early GC run is less likely to feel like the mutators are outpacing it and ask for assists (as I understand it). The unusual feature of this test is that the system was running on a machine with 48 cores with GOMAXPROCS unset. It looks like it was trying to use all the cores during GC, but most GC slices are idle.
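For reference, the number of Ps can be capped either through the GOMAXPROCS environment variable before the process starts or at runtime. A minimal sketch of that mitigation, assuming an arbitrary cap of 8 (not a value taken from this report):

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// With GOMAXPROCS unset, the runtime defaults to using every core.
	fmt.Println("NumCPU:", runtime.NumCPU())
	fmt.Println("GOMAXPROCS (default):", runtime.GOMAXPROCS(0))

	// Cap the number of Ps explicitly; equivalent to starting the process
	// with the GOMAXPROCS environment variable set. The value 8 is only an
	// illustrative example.
	prev := runtime.GOMAXPROCS(8)
	fmt.Println("GOMAXPROCS changed from", prev, "to", runtime.GOMAXPROCS(0))
}
```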
The blog post has been published: http://big-elephants.com/2018-09/unexpected-gc-pauses/ If anything written there is incorrect, I am happy to make edits. It should be easier to access than the Google doc.
Crud, I just now saw this.
The particular program where we saw this behaviour was a one-off. We ran it for about a week and it was never run again. It more or less had the whole machine to itself. This experience was somewhat accidental, in that the service tested here was deployed without GOMAXPROCS set, so it naturally just used all the cores available. So we don't experience this generally today, or at least we are not aware of experiencing it. If we see any behaviour like this again I'll be sure to post new information. From my perspective I think this issue should probably be closed, just out of pragmatism. The version of Go is out of date and there's no practical way for us to reproduce this today. I am personally very interested in this behaviour, so we may try to reproduce it in a different system, but if we do then I can post a new issue.
Oh, something I forgot to mention above: I still have all the traces that were used to make that blog post. If someone wants to dig into those I am very happy to send them.
@dr2chase after re-reading your comment I realised I didn't answer one of your questions. The system was running with 42 cores available.
Timed out in state WaitingForInfo. Closing. (I am just a bot, though. Please speak up if this is a mistake or you have the requested information.) |
Please answer these questions before submitting your issue. Thanks!
What version of Go are you using (go version)?
go version go1.10.3 linux/amd64
Does this issue reproduce with the latest release?
We don't know. Unfortunately we can't run this exact experiment again.
What operating system and processor architecture are you using (go env)?
linux/amd64
What did you do?
We were running a tool migrating data between two database systems. It maintains two in-memory caches (implemented as sharded maps) which became very large. As the heap grew we experienced very uneven performance.
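The tool's actual cache code is not public; for readers unfamiliar with the pattern, a sharded map splits one large map into many independently locked maps to reduce lock contention. A minimal sketch of the general idea, where the shard count, hashing, and value types are illustrative assumptions rather than the tool's implementation:

```go
package cache

import (
	"hash/fnv"
	"sync"
)

const shardCount = 256 // assumed; the real tool's shard count is not stated

type shard struct {
	mu sync.RWMutex
	m  map[string][]byte
}

// ShardedMap spreads keys across shardCount independently locked maps.
type ShardedMap struct {
	shards [shardCount]shard
}

func NewShardedMap() *ShardedMap {
	sm := &ShardedMap{}
	for i := range sm.shards {
		sm.shards[i].m = make(map[string][]byte)
	}
	return sm
}

// shardFor picks a shard by hashing the key.
func (sm *ShardedMap) shardFor(key string) *shard {
	h := fnv.New32a()
	h.Write([]byte(key))
	return &sm.shards[h.Sum32()%shardCount]
}

func (sm *ShardedMap) Get(key string) ([]byte, bool) {
	s := sm.shardFor(key)
	s.mu.RLock()
	defer s.mu.RUnlock()
	v, ok := s.m[key]
	return v, ok
}

func (sm *ShardedMap) Set(key string, value []byte) {
	s := sm.shardFor(key)
	s.mu.Lock()
	defer s.mu.Unlock()
	s.m[key] = value
}
```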
We ran a series of experiments capping the size of the caches to compare the system operation at different cache sizes.
We took traces of the running system to compare the behaviour and see if we could get a better understanding of what was causing the performance degradation.
Unfortunately we had to force the garbage collector to run during the trace to ensure that we would actually see it. So the traces contain an artificial GC cycle, which won't work exactly the same as a naturally occurring one.
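Roughly, a trace with a forced GC cycle can be captured along these lines; this is a sketch of the approach rather than the exact code used, and the file name and sleep durations are assumptions:

```go
package main

import (
	"log"
	"os"
	"runtime"
	"runtime/trace"
	"time"
)

func main() {
	f, err := os.Create("trace.out")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if err := trace.Start(f); err != nil {
		log.Fatal(err)
	}
	defer trace.Stop()

	// Let the workload run for a while so the trace contains normal activity.
	time.Sleep(2 * time.Second)

	// Force a GC cycle so the mark phase is guaranteed to appear in the trace.
	// This cycle starts earlier than a natural one would, so its pacing
	// (e.g. mark assists) can differ from an organic cycle.
	runtime.GC()

	time.Sleep(2 * time.Second)
}
```

The resulting file can then be inspected with go tool trace trace.out, which is where the gaps in network events described below are visible.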
What did you expect to see?
What did you see instead?
Repeated and dramatically long gaps during the mark phase, on the order of 10-20 milliseconds. During these periods no network events occurred, effectively pausing the system.
We have written up an account of what we saw here:
https://docs.google.com/document/d/1-DO8jp9q0ONuo9nL1EXVm9unzMHLA0KUkrisAWxjaRo/edit?usp=sharing
My company is reluctant to attach traces directly here, as they contain private code stack traces. But we are happy to send them directly to individuals if they would like to see them.