Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: system experienced long pauses during GC marking #27410

Closed
fmstephe opened this issue Aug 31, 2018 · 11 comments
Closed

runtime: system experienced long pauses during GC marking #27410

fmstephe opened this issue Aug 31, 2018 · 11 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. FrozenDueToAge GarbageCollector NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided.
Milestone

Comments

@fmstephe
Copy link

fmstephe commented Aug 31, 2018

Please answer these questions before submitting your issue. Thanks!

What version of Go are you using (go version)?

go version go1.10.3 linux/amd64

Does this issue reproduce with the latest release?

We don't know. Unfortunately we can't run this exact experiment again.

What operating system and processor architecture are you using (go env)?

linux/amd64

What did you do?

We were running a tool migrating data between two database systems. It maintains two in memory caches (implemented as sharded-maps) which became very large. As the heap grew we experienced very uneven performance.

We ran a series of experiments capping the size of the caches to compare the system operation at different cache sizes.

We took traces of the running system to compare the behaviour and see if we could get a better understanding of what was causing the performance degradation.

Unfortunately we had to force the garbage collector to run during the trace to ensure that we would actually see it. So the traces contain an artificial GC cycle, which won't work exactly the same as a naturally occurring one.

What did you expect to see?

What did you see instead?

Repeated and dramatically long gaps during the mark phase, on the order of 10-20 milliseconds. During these periods no network events occurred, effectively pausing the system.

We have written up an account of what we saw here

https://docs.google.com/document/d/1-DO8jp9q0ONuo9nL1EXVm9unzMHLA0KUkrisAWxjaRo/edit?usp=sharing

My company is reluctant to attach traces directly here, as they contain private code stack traces. But we are happy to send them directly to individuals if they would like to see them.

@ALTree ALTree changed the title System experienced long pauses during GC marking runtime: system experienced long pauses during GC marking Aug 31, 2018
@ALTree ALTree added NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. GarbageCollector labels Aug 31, 2018
@ALTree ALTree added this to the Go1.12 milestone Aug 31, 2018
@josharian
Copy link
Contributor

@aclements @RLH

@ysmolski
Copy link
Member

ysmolski commented Sep 4, 2018

Mentioned google document cannot be accessed.

What was the approx. size of allocated heap?

@fmstephe
Copy link
Author

fmstephe commented Sep 4, 2018

@ysmolsky We experienced similar sized pauses during the mark phase. We tested heaps of size

8-13 GB
20-30 GB
60-70 GB

We saw a lot of 5-20ms pauses (where pauses are measured as a period where no network events are recorded in the trace). The larger heaps sizes saw a greater performance impact, but it looks like that was simply because the mark phase ran for much longer.

In particular we saw huge number of Idle-GC slices in the trace. If were to guess I would say the scheduler was struggling to schedule all of the Idle-GC goroutines as well and schedule meaningful work.

If you request access to the google doc, I will grant it as soon as I can.
I need to work out how to publish that without using google docs, as it makes reading the doc really hard.

@agnivade
Copy link
Contributor

Likely a dupe of #27732. @dr2chase ?

@fmstephe
Copy link
Author

We saw no Mark assists in the trace. I specifically looked for them, because that was my first thought.

With the way we forced the GC to run during the trace makes assists even less likely. Because an early GC run is less likely to feel like the mutators are outpacing it and ask for assists (is my understanding).

The unusual feature of this test that the system was running on a machine with 48 cores with GOMAXPROCS unset. It looks like it was trying to use all the cores during GC, but most GC slices are idle.

@fmstephe
Copy link
Author

fmstephe commented Oct 5, 2018

The blog post has been published

http://big-elephants.com/2018-09/unexpected-gc-pauses/

If anything written there is incorrect I am happy to make edits.

Should be easier to access than the google doc.

@aclements aclements modified the milestones: Go1.12, Go1.13 Jan 8, 2019
@andybons andybons modified the milestones: Go1.13, Go1.14 Jul 8, 2019
@rsc rsc modified the milestones: Go1.14, Backlog Oct 9, 2019
@dr2chase
Copy link
Contributor

Crud, I just now saw this.
Is this still a problem?
And how many processors are there, and is the Go process the only significant load on the box?
(I.e., is it possible that there is competition with other processes?)

@agnivade agnivade added the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Jan 13, 2020
@fmstephe
Copy link
Author

The particular program where we saw this behaviour was a one off. We ran it for about a week it was never run again. It more or less had the whole machine to itself.

This experience was somewhat accidental, in that the service tested here was deployed without a GOMAXPROCS so it naturally just used all the cores available. So we don't experience this generally today, or we are not aware of experiencing it.

If we see any behaviour like this again I'll be sure to post new information.

From my perspective I think that this issue should probably be closed, just out of pragmatism. The version of Go is out of date and there's no practical way for us to reproduce this today.

I am personally very interested about this behaviour so we may try to reproduce it in a different system, but if we do then I can post a new issue.

@fmstephe
Copy link
Author

Oh, something I forgot to mention above. I still have all the traces that were used to make that blog-post. If someone wants to dig into those I am very happy to send them.

@fmstephe
Copy link
Author

@dr2chase after re-reading your comment I realised I didn't answer one of your questions.

The system was running with 42 cores available.

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jul 7, 2022
@mknyszek mknyszek moved this to Triage Backlog in Go Compiler / Runtime Jul 15, 2022
@seankhliao seankhliao added WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. and removed WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. labels Dec 26, 2022
@gopherbot
Copy link
Contributor

Timed out in state WaitingForInfo. Closing.

(I am just a bot, though. Please speak up if this is a mistake or you have the requested information.)

@github-project-automation github-project-automation bot moved this from Triage Backlog to Done in Go Compiler / Runtime Jan 26, 2023
@golang golang locked and limited conversation to collaborators Jan 26, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. FrozenDueToAge GarbageCollector NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided.
Projects
None yet
Development

No branches or pull requests