
GC mark-loop rewrite #45608

Closed · wants to merge 24 commits
Conversation

@d-netto (Member) commented Jun 7, 2022

This PR simplifies the GC mark-loop code. Some benchmark results are shown here: https://hackmd.io/XZDp98lTR3qyYgsWKCTvnQ. I believe these changes could bring some performance benefits as well (particularly if combined with prefetching in the mark loop; cf. Cher 2004).

Previous work

Since #21590, the GC mark-loop has been implemented with two manually managed stacks, one of which held iterator states used to keep track of the object currently being marked. For example, to mark an array we would pop the corresponding iterator state from the stack and iterate over the array until we found an unmarked reference; we would then update the iterator state (to record the index where we left off), re-push it onto the stack, and proceed with marking the reference we just found.
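Roughly, the old scheme looked like the following sketch, with toy types and hypothetical helpers (is_marked, push_iter_state, mark_value stand in for the real gc.c machinery; this is not the actual code):

```c
#include <stddef.h>

/* Iterator state recording how far into an array's slots we have scanned. */
typedef struct {
    void **begin;   /* next slot to scan      */
    void **end;     /* one past the last slot */
} array_iter_state_t;

/* Hypothetical helpers standing in for the real GC machinery. */
int  is_marked(void *obj);
void push_iter_state(array_iter_state_t st);
void mark_value(void *obj);

/* Scan an array until an unmarked reference is found; save the position,
 * re-push the iterator state, and descend into the reference (depth-first). */
static void mark_array_resumable(array_iter_state_t st)
{
    for (; st.begin < st.end; st.begin++) {
        void *ref = *st.begin;
        if (ref != NULL && !is_marked(ref)) {
            st.begin++;            /* remember where to resume */
            push_iter_state(st);   /* "re-push" the updated state */
            mark_value(ref);       /* proceed with marking the new reference */
            return;
        }
    }
}
```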

This PR

This PR eliminates the need to keep iterator states by modifying the object-graph traversal code. We keep a single stack of jl_value_t * objects currently being processed. To mark an object, we first pop it from the stack, push all of its unmarked references onto the stack, and proceed with marking.
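A minimal, self-contained sketch of the new loop, with toy types standing in for jl_value_t and the real mark queue (illustrative only, not the actual gc.c code):

```c
#include <stddef.h>

typedef struct obj {
    int marked;
    size_t nrefs;
    struct obj **refs;
} obj_t;

/* Toy fixed-size stack; a real implementation grows the queue instead of
 * silently dropping work on overflow. */
#define STACK_CAP 1024
static obj_t *stack[STACK_CAP];
static size_t top;

static void push(obj_t *o) { if (top < STACK_CAP) stack[top++] = o; }
static obj_t *pop(void)    { return top ? stack[--top] : NULL; }

/* Single work list, no per-object iterator state: pop an object, mark it,
 * and push every unmarked reference it holds. */
static void mark_from(obj_t *root)
{
    push(root);
    obj_t *o;
    while ((o = pop()) != NULL) {
        if (o->marked)
            continue;
        o->marked = 1;
        for (size_t i = 0; i < o->nrefs; i++) {
            obj_t *ref = o->refs[i];
            if (ref != NULL && !ref->marked)
                push(ref);
        }
    }
}
```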

I believe this doesn't break any invariant of the generational GC. Indeed, the age bits are set after marking (if the object survived one GC cycle, it's moved to the old generation), so this new traversal scheme doesn't change whether an object holds references to old objects or not. Furthermore, we must not update GC metadata for objects in the remset, and we ensure this by calling gc_mark_outrefs in gc_queue_remset with meta_updated set to 1.
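As a toy illustration of that invariant (illustrative types and names; only the flag-handling idea is taken from the description above):

```c
typedef struct {
    int bits;   /* stand-in for the real mark/age metadata */
} toy_meta_t;

/* Mark an object's outgoing references. When the object came from the remset
 * (meta_updated == 1), its own metadata is left untouched; only its outgoing
 * references are scanned and enqueued. */
static void mark_outrefs(toy_meta_t *meta, int meta_updated)
{
    if (!meta_updated) {
        meta->bits |= 1;   /* normal path: update mark/age metadata */
    }
    /* ... enqueue the object's references in either case ... */
}
```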

Additional advantages

  1. There are no recursive function calls in the GC mark-loop code (one of the motivations for #21590, "Implement GC mark loop").
  2. Keeping a single GC queue will greatly simplify work-stealing in the multi-threaded GC we are working on (cf. #45639, "[rfc] parallel marking").
  3. Arrays of references, for example, are now marked with a regular stride, which could help hardware prefetching.
  4. We can easily change the traversal order (to breadth-first, for example) by modifying only the jl_gc_markqueue_t methods (e.g. from LIFO to FIFO) without touching the mark loop itself, which could enable further GC experimentation in the future; see the sketch after this list.
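A rough sketch of the idea in item 4, with an illustrative queue type rather than the actual jl_gc_markqueue_t: switching which end the pop takes from turns the same mark loop from depth-first into breadth-first.

```c
#include <stddef.h>

#define MQ_CAP 1024

typedef struct {
    void  *buf[MQ_CAP];
    size_t head;   /* dequeue index in FIFO mode       */
    size_t tail;   /* one past the last pushed element */
} markqueue_t;

static void mq_push(markqueue_t *q, void *obj)
{
    if (q->tail < MQ_CAP)
        q->buf[q->tail++] = obj;
}

/* LIFO pop: the mark loop traverses depth-first. */
static void *mq_pop_lifo(markqueue_t *q)
{
    return (q->tail > q->head) ? q->buf[--q->tail] : NULL;
}

/* FIFO pop: the same mark loop traverses breadth-first. */
static void *mq_pop_fifo(markqueue_t *q)
{
    return (q->tail > q->head) ? q->buf[q->head++] : NULL;
}
```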

TODO

  • Fix GC debugging infrastructure.
  • Run more benchmarks.
  • Test whether prefetching in the mark loop (cf. Cher, 2004) would improve GC times.

CC: @vchuravy, @chflood, @kpamnany

@vchuravy requested review from vtjnash, chflood, and yuyichao (June 7, 2022 21:42)
@yuyichao (Contributor) commented Jun 7, 2022

To mark an object, we first pop it from the stack, push all unmarked references into the stack and proceed with marking.

So IIUC this changed from a depth-first search (DFS) to a breadth-first search (BFS). The main reason the marking is done the way it currently is was @vtjnash's concern about memory consumption. It might not be trivial to measure, but I assume it mainly affects cases where there are complex objects and large arrays (i.e. things with lots of objects).

@d-netto (Member, Author) commented Jun 7, 2022

If your work-list is a stack, it's still DFS.

But yes, this could increase memory consumption of the GC queues in the case of very large arrays (and a single-threaded collector) that you mentioned above. I'm not sure the memory tradeoff is that obvious for a parallel GC (some work-stealing queues are implemented as circular buffers).

@yuyichao (Contributor) commented Jun 8, 2022

BFS was the wrong name, I guess, but this is the order I was trying to avoid in the current version, and the only reason it's done this way (just collecting all the references is of course the more natural and easier way to do it...). Strictly maintaining the order makes sure this is not going to be any more of an issue than in the version it replaced.

I'm not sure how big the impact would be (it should be larger by the number of references, though of course the memory per object is less), since I don't know the average number of references per object in a typical workload. It is quite important to make sure the single-threaded case is not affected by much, especially regarding any worst cases...

@jpsamaroo added the "GC" (garbage collector) label (Jun 8, 2022)
@chflood (Member) commented Jun 8, 2022

I believe this is a good change. We probably want to add some special handling code for chunking big arrays at some point. I ran this, and on my machine list.jl got significantly worse:
Before:
test = "list.jl"
run time: 4829ms min, 7394ms max, 6081ms median
gc time: 3060ms min, 5584ms max, 4337ms median

After:
test = "list.jl"
run time: 5855ms min, 8779ms max, 7871ms median
gc time: 4022ms min, 6953ms max, 6006ms median

The example builds a linked list, which should behave similarly under either strategy, so I can't really explain the results I'm seeing.

Regardless, I don't think this is a showstopper, and I believe the code simplification from not managing shadow queues is a big win.

@vchuravy (Member) commented Jun 9, 2022

Regardless, I don't think this is a showstopper, and I believe the code simplification from not managing shadow queues is a big win.

We should understand the performance better first.

@d-netto mentioned this pull request (Jun 10, 2022)
@d-netto (Member, Author) commented Jun 24, 2022

Test whether prefetching in the mark loop (cf. Cher, 2004) would improve GC times.

The issue is that you need to read the tag whenever you enqueue an object, so you incur a cache miss there, and prefetching objects when you enqueue them doesn't make sense.

Maybe we could change the order of reading tags so that we could enable prefetching in the future, but I think that would be another PR...
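For reference, a minimal sketch of the Cher (2004)-style prefetch window being discussed, using toy globals rather than anything from gc.c: objects are prefetched when they enter a small FIFO and only have their tag read when they fall out the other end, by which point the prefetch has hopefully completed.

```c
#include <stddef.h>

#define PF_WINDOW 8                 /* prefetch distance */
static void  *pf_buf[PF_WINDOW];
static size_t pf_idx;

/* Insert `obj` into the prefetch window, issue a prefetch for it, and return
 * the object that has been waiting the longest (NULL while still filling). */
static void *prefetch_rotate(void *obj)
{
    __builtin_prefetch(obj);        /* GCC/Clang builtin: hint that the tag will be read soon */
    void *oldest = pf_buf[pf_idx];
    pf_buf[pf_idx] = obj;
    pf_idx = (pf_idx + 1) % PF_WINDOW;
    return oldest;
}
```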

@d-netto force-pushed the dcn/ml branch 2 times, most recently from 7032ab5 to 6f42f17 (July 3, 2022 14:59)
Diogo Netto and others added 4 commits (July 5, 2022 17:32)
@d-netto (Member, Author) commented Jul 5, 2022

The TODO list should be complete now. I updated the link at the top, which contains benchmark results for one of the latest commits.

Interestingly, list.jl showed opposite trends on AMD and Intel machines, though it's not clear to me why.

I also ran this on https://github.com/h2oai/db-benchmark and found that queue sizes didn't go beyond 2^17 there (I read somewhere this is the default capacity in the JVM).

Both traversal modes (master and this PR) have adversarial examples in which the queues can get very large. I believe we should be asking whether these pathological cases are common enough to justify batching/chunking.

@vchuravy (Member) commented Jul 6, 2022

@nanosoldier runtests(ALL, vs = ":master", buildflags=["LLVM_ASSERTIONS=1", "FORCE_ASSERTIONS=1"], vs_buildflags=["LLVM_ASSERTIONS=1", "FORCE_ASSERTIONS=1"])

@nanosoldier (Collaborator) commented

Your package evaluation job has completed - possible new issues were detected. A full report can be found here.

@vchuravy (Member) commented Jul 8, 2022

@nanosoldier runbenchmarks(!"scalar", vs=":master")

@nanosoldier (Collaborator) commented

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.

@d-netto (Member, Author) commented Jul 12, 2022

I haven't seen https://s3.amazonaws.com/julialang-reports/nanosoldier/pkgeval/by_hash/8dae1bb_vs_d6c5cb6/PlantHydraulics.primary.log before

A lot of build/test failures went away after merging master. Perhaps run nanosoldier again?

@maleadt (Member) commented Jul 12, 2022

@nanosoldier runtests(ALL, vs = ":master", buildflags=["LLVM_ASSERTIONS=1", "FORCE_ASSERTIONS=1"], vs_buildflags=["LLVM_ASSERTIONS=1", "FORCE_ASSERTIONS=1"])

@nanosoldier (Collaborator) commented

Your package evaluation job has completed - possible new issues were detected. A full report can be found here.

@d-netto (Member, Author) commented Jul 31, 2022

@yuyichao, it seems that the remset is flushed at the beginning of every collection and reconstructed as the graph is traversed.

Do you know what the motivation behind this is, as opposed to flushing the remset only on full collections?

I tried to flush the remset only on full collections and never re-add objects to it during the traversal (e.g. by commenting out the push_remset calls), but this leads to a segfault. Do you know why this could be the case? (EDIT: I can include a backtrace if it helps.)

My only guess is that there could be old->young references that are somehow not recorded by the write barrier, but this doesn't seem right...

@yuyichao (Contributor) commented Aug 5, 2022

Do you know what the motivation behind this is, as opposed to flushing the remset only on full collections?

I think that would be too inaccurate. Promotion happens on generational (non-full) collections as well, and not resetting the remset means that any newly promoted old object has to wait until the next full collection for the promotion to actually become beneficial.

commenting out the push_remset calls

I don't think this would work directly, at the very least (assuming you've got all the other potential corner cases right). As mentioned above, promotion happens between full collections, and the marking pass has to discover any new old->young references generated because of this.

@d-netto (Member, Author) commented Aug 5, 2022

As mentioned above, promotion happens between full collections, and the marking pass has to discover any new old->young references generated because of this.

I can only see old->young references being created during a small collection if there is a subset of (EDIT: young-generation) objects that is promoted and another that is not. Is that the case?

@yuyichao (Contributor) commented Aug 5, 2022

I can only see old->young references being created during a small collection if there is a subset of objects that is promoted and another that is not. Is that the case?

I assume you are asking whether the GC will promote a non-trivial subset of the live objects to old? Yes, promotion is independent for different objects, and since it takes more than one cycle to promote, the objects won't be in sync. See the state transition diagram in the comment in gc.c.

@d-netto (Member, Author) commented Oct 22, 2022

Superseded by #47292.

@d-netto closed this (Oct 22, 2022)
@d-netto mentioned this pull request (Oct 25, 2022)