Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix major source of runtime instability on non-x86 based platforms #3871

Merged
merged 1 commit into from
Oct 1, 2021

Conversation

SeanTAllen
Copy link
Member

Within our pool and mpmcq implementations, we have code for handling
the ABA problem. The code is split into an X86 version and a "not X86"
version which at this time means "Arm".

The X86 version is heavily tested and works quite well. It has a theoretical
problem that needs to be addressed by deeper changes, but overall, it
is very solid.

The Arm code, on the other hand, isn't working properly. What exactly is wrong
isn't something that I've determined. What I can say is that when in use,
I am seeing issues with asserts being triggered related to scheduler muting
and memory pool usage. These asserts should only trigger if something is
very wrong.

The assert related problems felt like "an atomics issue" and really, like
a problem around the ABA code. I tested the changes in this PR on both 32-bit
and 64-bit ARM and all the problems disappeared. In particular, the test
was to run the message ubench stress test for approximately 18 hours. With
these changes, it ran fine. Without, I would see weird failure and what
looked like memory corruption anywhere from immediately to within 30 minutes.

Longer term, we need to replace the ABA related code in the Pony runtime with
a port of the queue and ABA code from Verona including its epoch implementation.
That change is a lot deeper and more complicated. In the meantime, the changes
in this PR will bring us very close to Pony being usable on Arm machines.

See https://en.wikipedia.org/wiki/ABA_problem if you aren't familiar
with the ABA problem and would like more background.

@SeanTAllen SeanTAllen requested review from ergl and jemc September 30, 2021 21:11
@SeanTAllen SeanTAllen added the changelog - fixed Automatically add "Fixed" CHANGELOG entry on merge label Sep 30, 2021
@ponylang-main
Copy link
Contributor

Hi @SeanTAllen,

The changelog - fixed label was added to this pull request; all PRs with a changelog label need to have release notes included as part of the PR. If you haven't added release notes already, please do.

Release notes are added by creating a uniquely named file in the .release-notes directory. We suggest you call the file 3871.md to match the number of this pull request.

The basic format of the release notes (using markdown) should be:

## Title

End user description of changes, why it's important,
problems it solves etc.

If a breaking change, make sure to include 1 or more
examples what code would look like prior to this change
and how to update it to work after this change.

Thanks.

@SeanTAllen SeanTAllen changed the title Use "x86 ABA" code on all platforms Fix major source of runtime instability on non-x86 based platforms Sep 30, 2021
Within our pool and mpmcq implementations, we have code for handling
the ABA problem. The code is split into an X86 version and a "not X86"
version which at this time means "Arm".

The X86 version is heavily tested and works quite well. It has a theoretical
problem that needs to be addressed by deeper changes, but overall, it
is very solid.

The Arm code, on the other hand, isn't working properly. What exactly is wrong
isn't something that I've determined. What I can say is that when in use,
I am seeing issues with asserts being triggered related to scheduler muting
and memory pool usage. These asserts should only trigger if something is
very wrong.

The assert related problems felt like "an atomics issue" and really, like
a problem around the ABA code. I tested the changes in this PR on both 32-bit
and 64-bit ARM and all the problems disappeared. In particular, the test
was to run the message ubench stress test for approximately 18 hours. With
these changes, it ran fine. Without, I would see weird failure and what
looked like memory corruption anywhere from immediately to within 30 minutes.

Longer term, we need to replace the ABA related code in the Pony runtime with
a port of the queue and ABA code from Verona including its epoch implementation.
That change is a lot deeper and more complicated. In the meantime, the changes
in this PR will bring us very close to Pony being usable on Arm machines.

See https://en.wikipedia.org/wiki/ABA_problem if you aren't familiar
with the ABA problem and would like more background.
Copy link
Member

@ergl ergl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verified that this also worked on the M1

@SeanTAllen SeanTAllen merged commit 2fe2be5 into main Oct 1, 2021
@SeanTAllen SeanTAllen deleted the aba branch October 1, 2021 18:41
github-actions bot pushed a commit that referenced this pull request Oct 1, 2021
github-actions bot pushed a commit that referenced this pull request Oct 1, 2021
SeanTAllen added a commit that referenced this pull request Feb 5, 2022
…eters" (#3995)

The bug that [#3781](#3781) aka "prevent non-opaque structs from being used as behaviour parameters" was an attempt at preventing, as only partially solved by the change.

We've introduced a full fix in [#3993](#3993) that removes the need for [#3871](#3781).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
changelog - fixed Automatically add "Fixed" CHANGELOG entry on merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants