replace SML-level implementation of ABP deque with implementation in C #214

shwestrick · 2025-01-04T18:43:12Z

This makes it possible to use the appropriate memory order annotations, to ensure correct compilation on weakly ordered architectures such as ARM.

AFAICT, this fixes #212. On a MacBook Air M2 (2022) I'm no longer seeing these errors in my tests.

I used memory ordering annotations inspired by the ParlayLib implementation:
https://github.com/cmuparlay/parlaylib/blob/36459f42a84207330eae706c47e6fab712e6a149/include/parlay/internal/work_stealing_deque.h

One difference between our deque and the standard ABP deque is that we use the current top index of the deque to interface between the scheduler and the GC. For example, an empty deque with top=5 indicates that depths 1-4 are out of scope of LGC (because the corresponding tasks have been stolen). So, when our deque becomes empty, rather than reset the top/bottom indices to 0 (as a standard ABP deque would do), we instead keep them set at the current value.

I believe this patch may also fix another subtle bug related to the deque, specifically the tag of the packed tag+idx not being advanced here:

mpl/basis-library/schedulers/shh/queue/DequeABP.sml

Lines 223 to 230 in 70bfe0f

    
                     else if newBot < idx then 
        
                       (* We are racing with a concurrent steal to take this single 
        
                        * element x, but we already lost the race. So we only need to set 
        
                        * the bottom to match the idx, i.e. set the deque to a proper 
        
                        * empty state. *) 
        
                       (* TODO: can we relax the cas to a normal write? 
        
                        * Note that this cas is guaranteed to succeed... *) 
        
                       (cas32 bot (newBot, idx); NONE)

I don't have any evidence of a failing test or otherwise, but it seems like it can cause an ABA problem, if another concurrent steal were in-flight at the same index but then managed to succeed later with the same top tag. This patch correctly advances the top tag in this case:
https://github.com/shwestrick/mpl/blob/f6d02f4025ba5df0c2cac533c85db52e68f3dfba/runtime/gc/abp-deque.c#L72-L80

shwestrick · 2025-01-04T19:04:51Z

Testing is going well. Performance differences are mild; generally ±5%.

Speedup plots, relative to commit 70bfe0f:
vs-mpl-70bfe0f-ng.pdf ("no grain" benchmarks)
vs-mpl-70bfe0f.pdf (manually tuned benchmarks)

I suspect the performance degradations of bignum-add and primes-ng are due to whole-program compilation differences; manually optimized versions of these benchmarks do not see significant impact.

I looked into the sparse-mxv-csr outliers and these appear to be due to a NUMA issue, unrelated to this patch. I see variability in performance, both before and after this patch, when running separate instances of this benchmark back-to-back. This explains the ups and downs in the plots. The variability seems to go away when I use numactl -i all. Disabling pthread affinities also seems to have the same affect. (This makes sense; perhaps the OS is relocating MPL's threads to optimize for NUMA access, but it is not permitted to do so when we use @mpl set-affinity --).

This makes it possible to use the appropriate memory order annotations, to ensure correct compilation on weakly ordered architectures such as ARM. I used memory ordering annotations inspired by the parlaylib implementation: https://github.com/cmuparlay/parlaylib/blob/36459f42a84207330eae706c47e6fab712e6a149/include/parlay/internal/work_stealing_deque.h I believe this patch may also fix another subtle bug related to the deque, specifically the tag of the packed tag+idx not being advanced here: https://github.com/MPLLang/mpl/blob/70bfe0fc4eef5c5ead66eb2af8698b42185585aa/basis-library/schedulers/shh/queue/DequeABP.sml#L223-L230 I don't have any evidence of a failing test or otherwise, but it seems like it can cause an ABA problem, if another concurrent steal were in-flight at the same index but then managed to succeed later with the same top tag.

shwestrick · 2025-01-04T21:13:27Z

Found (I think) an ABA bug involving the "set depth" operation. This was triggering very infrequently in the form of an error like this:

ERROR  [P00|./gc/abp-deque.c:168]: Bug! Attempt to set depth of non-empty deque! tag=76720 top=2 bot=1

My best guess is that a in-flight steal managed to overlap with multiple set_depth operations,
creating an ABA where the steal succeeded but should not have been able to.

In hindsight, it's pretty clear that the top tag needs to advance at set_depth, to invalidate all in-flight steals and force them to fail. The force-push commit (above) fixes that.

AFAICT, the bug now appears to be gone, although it was a heisenbug, so...

shwestrick · 2025-01-05T14:50:42Z

Another useful reference is "Correct and Efficient Work-Stealing for Weak Memory Models" by Lê, Pop, Cohen, and Nardelli: https://fzn.fr/readings/ppopp13.pdf

Their implementation is based on the Chase-Lev deque: https://www.dre.vanderbilt.edu/~schmidt/PDF/work-stealing-dequeue.pdf

In comparison to the ABP deque, the Chase-Lev deque is based on a ring buffer and it avoids the separate packed tag (sequence number) by only ever advancing the top index, and modding to fit the current range. Otherwise, it's conceptually very similar to the ABP deque. This paper suggests that a sequentially consistent fence is needed between the two load-acquires at the beginning of each steal, so I added this to our implementation.

shwestrick · 2025-01-05T18:36:06Z

I've continued to test this pretty thoroughly, and I'm happy with it.

shwestrick force-pushed the fix-abp-deque branch from f6d02f4 to eef6f00 Compare January 4, 2025 21:04

shwestrick merged commit a755a88 into MPLLang:main Jan 5, 2025

shwestrick deleted the fix-abp-deque branch January 5, 2025 18:49

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

replace SML-level implementation of ABP deque with implementation in C #214

replace SML-level implementation of ABP deque with implementation in C #214

shwestrick commented Jan 4, 2025 •

edited

Loading

shwestrick commented Jan 4, 2025 •

edited

Loading

shwestrick commented Jan 4, 2025

shwestrick commented Jan 5, 2025

shwestrick commented Jan 5, 2025 •

edited

Loading

	else if newBot < idx then
	(* We are racing with a concurrent steal to take this single
	* element x, but we already lost the race. So we only need to set
	* the bottom to match the idx, i.e. set the deque to a proper
	* empty state. *)
	(* TODO: can we relax the cas to a normal write?
	* Note that this cas is guaranteed to succeed... *)
	(cas32 bot (newBot, idx); NONE)

replace SML-level implementation of ABP deque with implementation in C #214

replace SML-level implementation of ABP deque with implementation in C #214

Conversation

shwestrick commented Jan 4, 2025 • edited Loading

shwestrick commented Jan 4, 2025 • edited Loading

shwestrick commented Jan 4, 2025

shwestrick commented Jan 5, 2025

shwestrick commented Jan 5, 2025 • edited Loading

shwestrick commented Jan 4, 2025 •

edited

Loading

shwestrick commented Jan 4, 2025 •

edited

Loading

shwestrick commented Jan 5, 2025 •

edited

Loading