
GC mark-loop rewrite #45608

Closed · wants to merge 24 commits
Conversation

@d-netto (Member) commented Jun 7, 2022

This PR simplifies the GC mark-loop code. Some benchmark results are shown here: https://hackmd.io/XZDp98lTR3qyYgsWKCTvnQ. I believe these changes could bring some performance benefits as well (particularly if combined with prefetching in the mark loop; cf. Cher 2004).

Previous work

Since #21590, the GC mark-loop has been implemented with two manually managed stacks, one of which held iterator states used to keep track of the object currently being marked. For example, to mark an array we would pop the corresponding iterator state from the stack and iterate over the array until we found an unmarked reference; we would then update the iterator state (to record the index where we left off), re-push it onto the stack, and proceed with marking the reference we just found.
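Roughly, the old scheme looked like the following sketch, with toy types and hypothetical helpers (is_marked, push_iter_state, mark_value stand in for the real gc.c machinery; this is not the actual code):

```c
#include <stddef.h>

/* Iterator state recording how far into an array's slots we have scanned. */
typedef struct {
    void **begin;   /* next slot to scan      */
    void **end;     /* one past the last slot */
} array_iter_state_t;

/* Hypothetical helpers standing in for the real GC machinery. */
int  is_marked(void *obj);
void push_iter_state(array_iter_state_t st);
void mark_value(void *obj);

/* Scan an array until an unmarked reference is found; save the position,
 * re-push the iterator state, and descend into the reference (depth-first). */
static void mark_array_resumable(array_iter_state_t st)
{
    for (; st.begin < st.end; st.begin++) {
        void *ref = *st.begin;
        if (ref != NULL && !is_marked(ref)) {
            st.begin++;            /* remember where to resume */
            push_iter_state(st);   /* "re-push" the updated state */
            mark_value(ref);       /* proceed with marking the new reference */
            return;
        }
    }
}
```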

This PR

This PR eliminates the need to keep iterator states by modifying the object-graph traversal code. We keep a single stack of jl_value_t * objects currently being processed. To mark an object, we first pop it from the stack, push all of its unmarked references onto the stack, and proceed with marking.
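A minimal, self-contained sketch of the new loop, with toy types standing in for jl_value_t and the real mark queue (illustrative only, not the actual gc.c code):

```c
#include <stddef.h>

typedef struct obj {
    int marked;
    size_t nrefs;
    struct obj **refs;
} obj_t;

/* Toy fixed-size stack; a real implementation grows the queue instead of
 * silently dropping work on overflow. */
#define STACK_CAP 1024
static obj_t *stack[STACK_CAP];
static size_t top;

static void push(obj_t *o) { if (top < STACK_CAP) stack[top++] = o; }
static obj_t *pop(void)    { return top ? stack[--top] : NULL; }

/* Single work list, no per-object iterator state: pop an object, mark it,
 * and push every unmarked reference it holds. */
static void mark_from(obj_t *root)
{
    push(root);
    obj_t *o;
    while ((o = pop()) != NULL) {
        if (o->marked)
            continue;
        o->marked = 1;
        for (size_t i = 0; i < o->nrefs; i++) {
            obj_t *ref = o->refs[i];
            if (ref != NULL && !ref->marked)
                push(ref);
        }
    }
}
```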

I believe this doesn't break any invariant of the generational GC. Indeed, the age bits are set after marking (if the object survived one GC cycle, it's moved to the old generation), so this new traversal scheme doesn't change whether an object holds references to old objects or not. Furthermore, we must not update GC metadata for objects in the remset, and we ensure this by calling gc_mark_outrefs in gc_queue_remset with meta_updated set to 1.
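As a toy illustration of that invariant (illustrative types and names; only the flag-handling idea is taken from the description above):

```c
typedef struct {
    int bits;   /* stand-in for the real mark/age metadata */
} toy_meta_t;

/* Mark an object's outgoing references. When the object came from the remset
 * (meta_updated == 1), its own metadata is left untouched; only its outgoing
 * references are scanned and enqueued. */
static void mark_outrefs(toy_meta_t *meta, int meta_updated)
{
    if (!meta_updated) {
        meta->bits |= 1;   /* normal path: update mark/age metadata */
    }
    /* ... enqueue the object's references in either case ... */
}
```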

Additional advantages

  1. There are no recursive function calls in the GC mark-loop code (one of the motivations for #21590, "Implement GC mark loop").
  2. Keeping a single GC queue will greatly simplify work-stealing in the multi-threaded GC we are working on (cf. #45639, "[rfc] parallel marking").
  3. Arrays of references, for example, are now marked with a regular stride, which could help hardware prefetching.
  4. We can easily change the traversal order (to breadth-first, for example) by modifying only the jl_gc_markqueue_t methods (e.g. from LIFO to FIFO) without touching the mark loop itself, which could enable further GC experimentation in the future; see the sketch after this list.
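A rough sketch of the idea in item 4, with an illustrative queue type rather than the actual jl_gc_markqueue_t: switching which end the pop takes from turns the same mark loop from depth-first into breadth-first.

```c
#include <stddef.h>

#define MQ_CAP 1024

typedef struct {
    void  *buf[MQ_CAP];
    size_t head;   /* dequeue index in FIFO mode       */
    size_t tail;   /* one past the last pushed element */
} markqueue_t;

static void mq_push(markqueue_t *q, void *obj)
{
    if (q->tail < MQ_CAP)
        q->buf[q->tail++] = obj;
}

/* LIFO pop: the mark loop traverses depth-first. */
static void *mq_pop_lifo(markqueue_t *q)
{
    return (q->tail > q->head) ? q->buf[--q->tail] : NULL;
}

/* FIFO pop: the same mark loop traverses breadth-first. */
static void *mq_pop_fifo(markqueue_t *q)
{
    return (q->tail > q->head) ? q->buf[q->head++] : NULL;
}
```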

TODO

  • Fix GC debugging infrastructure.
  • Run more benchmarks.
  • Test whether prefetching in the mark loop (cf. Cher, 2004) would improve GC times.

CC: @vchuravy, @chflood, @kpamnany

@vchuravy requested review from vtjnash, chflood, and yuyichao (June 7, 2022 21:42)
@yuyichao (Contributor) commented Jun 7, 2022

To mark an object, we first pop it from the stack, push all unmarked references into the stack and proceed with marking.

So IIUC this changed from a depth-first search (DFS) to a breadth-first search (BFS). The main reason the marking is done the way it currently is was @vtjnash's concern about memory consumption. It might not be trivial to measure, but I assume it mainly affects cases where there are complex objects and large arrays (i.e. things with lots of objects).

@d-netto (Member, Author) commented Jun 7, 2022

If your work-list is a stack, it's still DFS.

But yes, this could increase memory consumption of the GC queues in the case of very large arrays (and a single-threaded collector) that you mentioned above. I'm not sure the memory tradeoff is that obvious for a parallel GC (some work-stealing queues are implemented as circular buffers).

@yuyichao (Contributor) commented Jun 8, 2022

BFS was the wrong name, I guess, but this is the order I was trying to avoid in the current version, and the only reason it's done this way (just collecting all the references is of course the more natural and easier way to do it...). Strictly maintaining the order makes sure this is not going to be any more of an issue than in the version it replaced.

I'm not sure how big the impact would be (it should be larger by the number of references, though of course the memory per object is less), since I don't know the average number of references per object in a typical workload. It is quite important to make sure the single-threaded case is not affected by much, especially regarding any worst cases...

@jpsamaroo added the "GC" (garbage collector) label (Jun 8, 2022)
@chflood (Member) commented Jun 8, 2022

I believe this is a good change. We probably want to add some special handling code for chunking big arrays at some point. I ran this, and on my machine list.jl got significantly worse:
Before:
test = "list.jl"
run time: 4829ms min, 7394ms max, 6081ms median
gc time: 3060ms min, 5584ms max, 4337ms median

After:
test = "list.jl"
run time: 5855ms min, 8779ms max, 7871ms median
gc time: 4022ms min, 6953ms max, 6006ms median

The example builds a linked list, which should behave similarly under either strategy, so I can't really explain the results I'm seeing.

Regardless, I don't think this is a showstopper, and I believe the code simplification from not managing shadow queues is a big win.

@vchuravy (Member) commented Jun 9, 2022

Regardless, I don't think this is a showstopper, and I believe the code simplification from not managing shadow queues is a big win.

We should understand the performance better first.

@d-netto mentioned this pull request (Jun 10, 2022)
@d-netto (Member, Author) commented Jun 24, 2022

Test whether prefetching in the mark loop (cf. Cher, 2004) would improve GC times.

The issue is that you need to read the tag whenever you enqueue an object, so you incur a cache miss there, and prefetching objects when you enqueue them doesn't make sense.

Maybe we could change the order of reading tags so that we could enable prefetching in the future, but I think that would be another PR...
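For reference, a minimal sketch of the Cher (2004)-style prefetch window being discussed, using toy globals rather than anything from gc.c: objects are prefetched when they enter a small FIFO and only have their tag read when they fall out the other end, by which point the prefetch has hopefully completed.

```c
#include <stddef.h>

#define PF_WINDOW 8                 /* prefetch distance */
static void  *pf_buf[PF_WINDOW];
static size_t pf_idx;

/* Insert `obj` into the prefetch window, issue a prefetch for it, and return
 * the object that has been waiting the longest (NULL while still filling). */
static void *prefetch_rotate(void *obj)
{
    __builtin_prefetch(obj);        /* GCC/Clang builtin: hint that the tag will be read soon */
    void *oldest = pf_buf[pf_idx];
    pf_buf[pf_idx] = obj;
    pf_idx = (pf_idx + 1) % PF_WINDOW;
    return oldest;
}
```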

@d-netto force-pushed the dcn/ml branch 2 times, most recently from 7032ab5 to 6f42f17 (July 3, 2022 14:59)
Diogo Netto and others added 4 commits (July 5, 2022 17:32)
@d-netto (Member, Author) commented Jul 5, 2022

The TODO list should be complete now. I updated the link at the top, which contains benchmark results for one of the latest commits.

Interestingly, list.jl showed opposite trends on AMD and Intel machines, though it's not clear to me why.

I also ran this on https://github.com/h2oai/db-benchmark and found that queue sizes didn't go beyond 2^17 there (I read somewhere this is the default capacity in the JVM).

Both traversal modes (master and this PR) have adversarial examples in which the queues can get very large. I believe we should be asking whether these pathological cases are common enough to justify batching/chunking.

@vchuravy (Member) commented Jul 6, 2022

@nanosoldier runtests(ALL, vs = ":master", buildflags=["LLVM_ASSERTIONS=1", "FORCE_ASSERTIONS=1"], vs_buildflags=["LLVM_ASSERTIONS=1", "FORCE_ASSERTIONS=1"])

@nanosoldier (Collaborator) commented

Your package evaluation job has completed - possible new issues were detected. A full report can be found here.

@vchuravy (Member) commented Jul 8, 2022

@nanosoldier runbenchmarks(!"scalar", vs=":master")

@nanosoldier (Collaborator) commented

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.

@d-netto (Member, Author) commented Jul 12, 2022

I haven't seen https://s3.amazonaws.com/julialang-reports/nanosoldier/pkgeval/by_hash/8dae1bb_vs_d6c5cb6/PlantHydraulics.primary.log before

A lot of build/test failures went away after merging master. Perhaps run nanosoldier again?

@maleadt (Member) commented Jul 12, 2022

@nanosoldier runtests(ALL, vs = ":master", buildflags=["LLVM_ASSERTIONS=1", "FORCE_ASSERTIONS=1"], vs_buildflags=["LLVM_ASSERTIONS=1", "FORCE_ASSERTIONS=1"])

@nanosoldier (Collaborator) commented

Your package evaluation job has completed - possible new issues were detected. A full report can be found here.

@d-netto (Member, Author) commented Jul 31, 2022

@yuyichao, it seems that the remset is flushed at the beginning of every collection and reconstructed as the graph is traversed.

Do you know what the motivation behind this is, as opposed to flushing the remset only on full collections?

I tried to flush the remset only on full collections and never re-add objects to it during the traversal (e.g. by commenting out the push_remset calls), but this leads to a segfault. Do you know why this could be the case? (EDIT: I can include a backtrace if it helps.)

My only guess is that there could be old->young references that are somehow not recorded by the write barrier, but this doesn't seem right...

@yuyichao (Contributor) commented Aug 5, 2022

Do you know what the motivation behind this is, as opposed to flushing the remset only on full collections?

I think that would be too inaccurate. Promotion happens on generational (non-full) collections as well, and not resetting the remset means that any newly promoted old object has to wait until the next full collection for the promotion to actually become beneficial.

commenting out the push_remset calls

I don't think this would work directly, at the very least (assuming you've got all the other potential corner cases right). As mentioned above, promotion happens between full collections, and the marking pass has to discover any new old->young references generated because of this.

@d-netto (Member, Author) commented Aug 5, 2022

As mentioned above, promotion happens between full collections, and the marking pass has to discover any new old->young references generated because of this.

I can only see old->young references being created during a small collection if there is a subset of (EDIT: young-generation) objects that is promoted and another that is not. Is that the case?

@yuyichao (Contributor) commented Aug 5, 2022

I can only see old->young references being created during a small collection if there is a subset of objects that is promoted and another that is not. Is that the case?

I assume you are asking whether the GC will promote a non-trivial subset of the live objects to old? Yes, promotion is independent for different objects, and since it takes more than one cycle to promote, the objects won't be in sync. See the state transition diagram in the comment in gc.c.

@d-netto (Member, Author) commented Oct 22, 2022

Superseded by #47292.

@d-netto closed this (Oct 22, 2022)
@d-netto mentioned this pull request (Oct 25, 2022)