-
-
Notifications
You must be signed in to change notification settings - Fork 419
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix major source of runtime instability on non-x86 based platforms #3871
Conversation
Hi @SeanTAllen, The changelog - fixed label was added to this pull request; all PRs with a changelog label need to have release notes included as part of the PR. If you haven't added release notes already, please do. Release notes are added by creating a uniquely named file in the The basic format of the release notes (using markdown) should be:
Thanks. |
Within our pool and mpmcq implementations, we have code for handling the ABA problem. The code is split into an X86 version and a "not X86" version which at this time means "Arm". The X86 version is heavily tested and works quite well. It has a theoretical problem that needs to be addressed by deeper changes, but overall, it is very solid. The Arm code, on the other hand, isn't working properly. What exactly is wrong isn't something that I've determined. What I can say is that when in use, I am seeing issues with asserts being triggered related to scheduler muting and memory pool usage. These asserts should only trigger if something is very wrong. The assert related problems felt like "an atomics issue" and really, like a problem around the ABA code. I tested the changes in this PR on both 32-bit and 64-bit ARM and all the problems disappeared. In particular, the test was to run the message ubench stress test for approximately 18 hours. With these changes, it ran fine. Without, I would see weird failure and what looked like memory corruption anywhere from immediately to within 30 minutes. Longer term, we need to replace the ABA related code in the Pony runtime with a port of the queue and ABA code from Verona including its epoch implementation. That change is a lot deeper and more complicated. In the meantime, the changes in this PR will bring us very close to Pony being usable on Arm machines. See https://en.wikipedia.org/wiki/ABA_problem if you aren't familiar with the ABA problem and would like more background.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Verified that this also worked on the M1
Within our pool and mpmcq implementations, we have code for handling
the ABA problem. The code is split into an X86 version and a "not X86"
version which at this time means "Arm".
The X86 version is heavily tested and works quite well. It has a theoretical
problem that needs to be addressed by deeper changes, but overall, it
is very solid.
The Arm code, on the other hand, isn't working properly. What exactly is wrong
isn't something that I've determined. What I can say is that when in use,
I am seeing issues with asserts being triggered related to scheduler muting
and memory pool usage. These asserts should only trigger if something is
very wrong.
The assert related problems felt like "an atomics issue" and really, like
a problem around the ABA code. I tested the changes in this PR on both 32-bit
and 64-bit ARM and all the problems disappeared. In particular, the test
was to run the message ubench stress test for approximately 18 hours. With
these changes, it ran fine. Without, I would see weird failure and what
looked like memory corruption anywhere from immediately to within 30 minutes.
Longer term, we need to replace the ABA related code in the Pony runtime with
a port of the queue and ABA code from Verona including its epoch implementation.
That change is a lot deeper and more complicated. In the meantime, the changes
in this PR will bring us very close to Pony being usable on Arm machines.
See https://en.wikipedia.org/wiki/ABA_problem if you aren't familiar
with the ABA problem and would like more background.