GC mark-loop rewrite #45608
Conversation
So IIUC this changed from using a depth-first search (DFS) to a breadth-first search (BFS). The main reason the marking is done in the current way was @vtjnash's concern about memory consumption. It might not be trivial to measure, but I assume it mainly affects cases where there are complex objects and large arrays (i.e. things with lots of objects).
If your work-list is a stack, it's still DFS. But yes, this could increase memory consumption of the GC queues in the case of very large arrays (and a single-threaded collector) you mentioned above. I'm not sure the memory tradeoff is that obvious for a parallel GC (some work-stealing queues are implemented as circular buffers).
BFS was the wrong name, I guess, but this is the order I was trying to avoid in the current version, and it is the only reason it's done this way (just collecting all the references is of course the more natural and easy way to do it). Strictly maintaining the order makes sure this is not going to be any more of an issue compared to the version it replaced. I'm not sure how large the impact would be (the queue should grow by the number of references, though the memory per object is less), since I don't know the average number of references per object in a typical workload. It is quite important to make sure the single-threaded case is not affected by much, especially regarding any worst cases...
I believe this is a good change. We probably want to add some special handling code for chunking big arrays at some point. I ran this, and on my machine list.jl got significantly worse. The example builds a linked list, which should behave similarly under either strategy, so I can't really explain the results I'm seeing. Regardless, I don't think this is a show stopper, and I believe the code simplification of not managing shadow queues is a big win.
We should understand the performance better first. |
The issue is that you need to read the tag whenever you enqueue an object, so you incur a cache miss there, and prefetching objects when you enqueue them doesn't make sense. Maybe we could change the order of reading tags so that we could enable prefetching in the future, but I think that would be another PR...
Force-pushed from 7032ab5 to 6f42f17.
TODO list should be complete now. I updated the link at the top, which contains benchmark results for one of the latest commits. Interestingly, I also ran this on https://github.com/h2oai/db-benchmark and found that queue sizes didn't go beyond 2^17 there (I read somewhere that this is the default capacity in the JVM). Both modes of traversal (master/PR) have adversarial examples in which the queues can get very large. I believe we should be asking whether these pathological examples would be common enough to justify batching/chunking.
@nanosoldier |
Your package evaluation job has completed - possible new issues were detected. A full report can be found here. |
@nanosoldier |
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. |
A lot of build/test failures went away after merging master. Perhaps run nanosoldier again?
@nanosoldier |
Your package evaluation job has completed - possible new issues were detected. A full report can be found here. |
@yuyichao, it seems that the `remset` is rebuilt on every collection. Do you know what the motivation behind this is, instead of just flushing the `remset`? I tried to only flush the `remset`, and my only guess is that there could be old->young references that are somehow not recorded in the write barrier, but this doesn't seem right...
I think that'll be too inaccurate. Promotion happens on generational collection as well, and not resetting the remset means that any newly promoted old object has to wait until the next full collection for the promotion to actually become beneficial.
I don't think this would work directly, at the least (assuming you've got all other potential corner cases right). As mentioned above, promotion will happen between full collections, and the marking pass has to discover any new old->young references generated because of this.
I can only see old->young references being created during a small collection if there is a subset of objects (EDIT: of the young gen) that is promoted and another that's not. Is that the case?
I assume you are asking whether the GC will promote a non-trivial subset of the live objects to old? Yes, promotion is independent for different objects, and since it takes more than one cycle to promote, the objects won't be sync'd. See the state transition diagram in the comment.
Force-pushed from 7c9b2d2 to eae1754.
Superseded by #47292. |
This PR simplifies the GC mark-loop code. Some benchmark results are shown here: https://hackmd.io/XZDp98lTR3qyYgsWKCTvnQ. I believe these changes could bring some performance benefits as well (particularly if used with prefetching in the mark-loop, c.f. Cher 2004).
Previous work
Since #21590, the GC mark-loop was implemented by keeping two manually managed stacks, one of which contained iterator states used to keep track of the object currently being marked. As an example, to mark arrays, we would pop the corresponding iterator state from the stack and iterate over the array until we found an unmarked reference; if we found one, we would update the iterator state (to record the index where we left off), re-push it onto the stack, and proceed with marking the reference we just found.
This PR
This PR eliminates the need for keeping the iterator states by modifying the object graph traversal code. We keep a single stack of `jl_value_t *` currently being processed. To mark an object, we first pop it from the stack, push all of its unmarked references onto the stack, and proceed with marking.

I believe this doesn't break any invariant from the generational GC. Indeed, the age bits are set after marking (if the object survived one GC cycle it's moved to the old generation), so this new traversal scheme doesn't change whether an object had references to old objects or not. Furthermore, we must not update GC metadata for objects in the `remset`, and we ensure this by calling `gc_mark_outrefs` in `gc_queue_remset` with `meta_updated` set to 1.

Additional advantages
The insertion/removal methods of `jl_gc_markqueue_t` could now be changed (from LIFO to FIFO, for example) without touching the mark-loop itself, which could enable further exploration on the GC in the future.

TODO
CC: @vchuravy, @chflood, @kpamnany