Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

replace SML-level implementation of ABP deque with implementation in C #214

Merged
merged 1 commit into from
Jan 5, 2025

Conversation

shwestrick
Copy link
Collaborator

@shwestrick shwestrick commented Jan 4, 2025

This makes it possible to use the appropriate memory order annotations, to ensure correct compilation on weakly ordered architectures such as ARM.

AFAICT, this fixes #212. On a MacBook Air M2 (2022) I'm no longer seeing these errors in my tests.

I used memory ordering annotations inspired by the ParlayLib implementation:
https://github.com/cmuparlay/parlaylib/blob/36459f42a84207330eae706c47e6fab712e6a149/include/parlay/internal/work_stealing_deque.h

One difference between our deque and the standard ABP deque is that we use the current top index of the deque to interface between the scheduler and the GC. For example, an empty deque with top=5 indicates that depths 1-4 are out of scope of LGC (because the corresponding tasks have been stolen). So, when our deque becomes empty, rather than reset the top/bottom indices to 0 (as a standard ABP deque would do), we instead keep them set at the current value.

I believe this patch may also fix another subtle bug related to the deque, specifically the tag of the packed tag+idx not being advanced here:

else if newBot < idx then
(* We are racing with a concurrent steal to take this single
* element x, but we already lost the race. So we only need to set
* the bottom to match the idx, i.e. set the deque to a proper
* empty state. *)
(* TODO: can we relax the cas to a normal write?
* Note that this cas is guaranteed to succeed... *)
(cas32 bot (newBot, idx); NONE)

I don't have any evidence of a failing test or otherwise, but it seems like it can cause an ABA problem, if another concurrent steal were in-flight at the same index but then managed to succeed later with the same top tag. This patch correctly advances the top tag in this case:
https://github.com/shwestrick/mpl/blob/f6d02f4025ba5df0c2cac533c85db52e68f3dfba/runtime/gc/abp-deque.c#L72-L80

@shwestrick
Copy link
Collaborator Author

shwestrick commented Jan 4, 2025

Testing is going well. Performance differences are mild; generally ±5%.

Speedup plots, relative to commit 70bfe0f:
vs-mpl-70bfe0f-ng.pdf ("no grain" benchmarks)
vs-mpl-70bfe0f.pdf (manually tuned benchmarks)

I suspect the performance degradations of bignum-add and primes-ng are due to whole-program compilation differences; manually optimized versions of these benchmarks do not see significant impact.

I looked into the sparse-mxv-csr outliers and these appear to be due to a NUMA issue, unrelated to this patch. I see variability in performance, both before and after this patch, when running separate instances of this benchmark back-to-back. This explains the ups and downs in the plots. The variability seems to go away when I use numactl -i all. Disabling pthread affinities also seems to have the same affect. (This makes sense; perhaps the OS is relocating MPL's threads to optimize for NUMA access, but it is not permitted to do so when we use @mpl set-affinity --).

This makes it possible to use the appropriate memory order annotations,
to ensure correct compilation on weakly ordered architectures such as ARM.

I used memory ordering annotations inspired by the parlaylib implementation:
  https://github.com/cmuparlay/parlaylib/blob/36459f42a84207330eae706c47e6fab712e6a149/include/parlay/internal/work_stealing_deque.h

I believe this patch may also fix another subtle bug related to the deque,
specifically the tag of the packed tag+idx not being advanced here:
  https://github.com/MPLLang/mpl/blob/70bfe0fc4eef5c5ead66eb2af8698b42185585aa/basis-library/schedulers/shh/queue/DequeABP.sml#L223-L230
I don't have any evidence of a failing test or otherwise, but it seems
like it can cause an ABA problem, if another concurrent steal were
in-flight at the same index but then managed to succeed later with the
same top tag.
@shwestrick
Copy link
Collaborator Author

Found (I think) an ABA bug involving the "set depth" operation. This was triggering very infrequently in the form of an error like this:

ERROR  [P00|./gc/abp-deque.c:168]: Bug! Attempt to set depth of non-empty deque! tag=76720 top=2 bot=1

My best guess is that a in-flight steal managed to overlap with multiple set_depth operations,
creating an ABA where the steal succeeded but should not have been able to.

In hindsight, it's pretty clear that the top tag needs to advance at set_depth, to invalidate all in-flight steals and force them to fail. The force-push commit (above) fixes that.

AFAICT, the bug now appears to be gone, although it was a heisenbug, so...

@shwestrick
Copy link
Collaborator Author

Another useful reference is "Correct and Efficient Work-Stealing for Weak Memory Models" by Lê, Pop, Cohen, and Nardelli: https://fzn.fr/readings/ppopp13.pdf

Their implementation is based on the Chase-Lev deque: https://www.dre.vanderbilt.edu/~schmidt/PDF/work-stealing-dequeue.pdf

In comparison to the ABP deque, the Chase-Lev deque is based on a ring buffer and it avoids the separate packed tag (sequence number) by only ever advancing the top index, and modding to fit the current range. Otherwise, it's conceptually very similar to the ABP deque. This paper suggests that a sequentially consistent fence is needed between the two load-acquires at the beginning of each steal, so I added this to our implementation.

@shwestrick
Copy link
Collaborator Author

shwestrick commented Jan 5, 2025

I've continued to test this pretty thoroughly, and I'm happy with it.

@shwestrick shwestrick merged commit a755a88 into MPLLang:main Jan 5, 2025
@shwestrick shwestrick deleted the fix-abp-deque branch January 5, 2025 18:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

scheduler bug on ARM
1 participant