
<semaphore> Implement semaphore - WIP #420

Closed · wants to merge 5 commits

Conversation


@HugGa HugGa commented Jan 15, 2020

Description

This is an implementation of counting_semaphore and binary_semaphore (see #52).

I made a basic implementation of the semaphore header using an atomic_ptrdiff_t instead of the specified ptrdiff_t. My reasoning was that the semaphore specification describes counting_semaphore as performing atomic operations, and an atomic counter was the only non-mutex way I could think of to do atomic operations here. I'm definitely willing to change it to something else if required.

I have also lightly tested this on my home computer, but I have not stress-tested it beyond some casual experimentation.

A question that I have about the implementation itself: what value should be the ultimate hard cap for the max() function? I currently have it just return the value passed in as _LeastMaxValue, but I know there is an actual maximum value out there, so should the function just make sure via static_assert that _LeastMaxValue is non-negative and use it directly, or should there be some sort of hard cap (PTRDIFF_MAX?) instead?
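To make the question concrete, here is a hypothetical sketch of the two options (illustrative only; _Counting_semaphore_sketch is a made-up name, not the code in this PR):

#include <cstddef>
#include <cstdint>

template <std::ptrdiff_t _LeastMaxValue>
class _Counting_semaphore_sketch {
public:
    static_assert(_LeastMaxValue >= 0, "_LeastMaxValue must be non-negative");

    // Option A: just echo the template argument back.
    static constexpr std::ptrdiff_t max() noexcept {
        return _LeastMaxValue;
    }

    // Option B (alternative): return a hard cap such as PTRDIFF_MAX instead.
    // static constexpr std::ptrdiff_t max() noexcept {
    //     return PTRDIFF_MAX;
    // }
};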

Checklist

Be sure you've read README.md and understand the scope of this repo.

If you're unsure about a box, leave it unchecked. A maintainer will help you.

  • Identifiers in product code changes are properly _Ugly as per
    https://eel.is/c++draft/lex.name#3.1 or there are no product code changes.
  • The STL builds successfully and all tests have passed (must be manually
    verified by an STL maintainer before automated testing is enabled on GitHub,
    leave this unchecked for initial submission).
  • These changes introduce no known ABI breaks (adding members, renaming
    members, adding virtual functions, changing whether a type is an aggregate
    or trivially copyable, etc.).
  • These changes were written from scratch using only this repository,
    the C++ Working Draft (including any cited standards), other WG21 papers
    (excluding reference implementations outside of proposed standard wording),
    and LWG issues as reference material. If they were derived from a project
    that's already listed in NOTICE.txt, that's fine, but please mention it.
    If they were derived from any other project (including Boost and libc++,
    which are not yet listed in NOTICE.txt), you must mention it here,
    so we can determine whether the license is compatible and what else needs
    to be done.

@HugGa HugGa requested a review from a team as a code owner January 15, 2020 00:23
@msftclas

msftclas commented Jan 15, 2020

CLA assistant check
All CLA requirements met.

@MikeGitb

Shouldn't semaphores use OS primitives when blocking instead of just busy waiting?

stl/inc/semaphore (outdated review thread, resolved)
@HugGa
Author

HugGa commented Jan 15, 2020

The reason I didn't use the OS primitive is the requirement that the constructor be constexpr, which as far as I can tell completely rules that out, unless there's a LOT more to CreateSemaphore and CreateSemaphoreEx than I know about, or there's some constexpr way of evaluating such a thing.

@sylveon
Contributor

sylveon commented Jan 15, 2020

Welp that's a good way to force every standard library to roll their own instead of using tried and true OS primitives.

@MikeGitb

That's a bummer. Maybe lazy construction would be an option, but busy waiting would, IMHO, make this completely unusable. Many use cases for semaphores can block a thread for quite a long time.

@HugGa
Author

HugGa commented Jan 15, 2020

@MikeGitb would yielding the thread if the semaphore doesn't immediately acquire make this better? Maybe we should ask the committee to add something to the standard so this is consistent across all platforms. I know that's not an ideal solution (and besides, due to my own inter-process requirements I roll my own semaphore class for named semaphores), but it's better than nothing here.

@miscco
Contributor

miscco commented Jan 15, 2020

You should try to follow the STL's conventions more closely.

This concerns the identifiers, which generally do not use double underscores (__foo vs. _Foo). _STL_ASSERT is used rather than assert, and when possible you should const-qualify function arguments in new code.
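For example, a hypothetical snippet following those conventions (illustrative only; _Do_release is a made-up name, and _STL_ASSERT is the STL's internal assertion macro mentioned above):

// Inside an STL header: an _Ugly name (_Update, not __update), _STL_ASSERT rather
// than assert, and a const-qualified argument in new code.
void _Do_release(const ptrdiff_t _Update) {
    _STL_ASSERT(_Update >= 0, "semaphore update count must be non-negative");
    // ...
}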

@MikeGitb

@MikeGitb would yielding the thread if the semaphore doesn't immediately acquire make this better

It would certainly improve the situation. Spinlocks often have a multistage backoff mechanism (first spin + pause instructions, then yield, then sleep), but I'm no expert on spinlocks. That's kind of my point, though: semaphores are supposed to be more intelligent than simple spinlocks and integrated with the OS scheduler. Let's see what the STL maintainers have to say.
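For context, a rough sketch of what such a multistage backoff loop might look like (illustrative only; the stage thresholds are arbitrary and _Backoff_until / _Try_acquire_once are made-up names):

#include <chrono>
#include <immintrin.h> // _mm_pause (x86-specific spin hint)
#include <thread>

// Illustrative multistage backoff: spin with pause, then yield, then sleep.
template <class _TryFn>
void _Backoff_until(_TryFn _Try_acquire_once) {
    for (int _Attempt = 0;; ++_Attempt) {
        if (_Try_acquire_once()) {
            return;
        }
        if (_Attempt < 16) {
            _mm_pause(); // stage 1: busy spin, hinting the CPU that this is a spin-wait
        } else if (_Attempt < 64) {
            std::this_thread::yield(); // stage 2: let other ready threads run
        } else {
            std::this_thread::sleep_for(std::chrono::milliseconds(1)); // stage 3: sleep
        }
    }
}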

    counting_semaphore& operator=(const counting_semaphore&) = delete;

    void release(ptrdiff_t _Update = 1) {
        _STL_ASSERT(_Update >= 0 && _Update <= (max() - _Counter.load()),
            "The semaphore has been released more times than acquired");
Member

I don't think this assert is meaningful / useful for a concurrency primitive due to data races. (That is, just because this is true on line 45 doesn't mean it'll be true on line 46)

Contributor

The standard has an Expects: clause, so I guess this should at least test the input argument.

@BillyONeal (Member) Jan 16, 2020

My point is that one can't meaningfully test it, because the test is stale the instant it's done, as this object is hammered from multiple threads.

stl/inc/semaphore (outdated review thread, resolved)
    template <class _Clock, class _Duration>
    bool try_acquire_until(const chrono::time_point<_Clock, _Duration>& _Abs_time) {
        while (!try_acquire()) {
            if (_Clock::now() > _Abs_time) {
Member

Unfortunately, I don't think we can accept an implementation like this; this proposal really needs the "efficient waiting for concurrent programs" machinery (i.e., the WaitOnAddress-like machinery) before it can be implemented. For example, if a low-priority thread holds the mutex and a high-priority thread gets here, the system probably deadlocks forever (as in #370).

Also this > needs to be >=. The standardese says the timer expires "after", but if it was == when now() was called, it's after that time on the subsequent line.
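Roughly, a sketch of a timed wait built on a non-blocking attempt with the >= fix applied (illustrative only; _Try_acquire_until_sketch and _Try_once are made-up names, and a real implementation would wait/back off instead of spinning):

#include <chrono>

// ">=" rather than ">" so that a deadline equal to now() counts as expired.
template <class _Clock, class _Duration, class _TryFn>
bool _Try_acquire_until_sketch(
    const std::chrono::time_point<_Clock, _Duration>& _Abs_time, _TryFn _Try_once) {
    while (!_Try_once()) {
        if (_Clock::now() >= _Abs_time) {
            return false; // the deadline has been reached or passed
        }
        // A real implementation would block/back off here instead of spinning.
    }
    return true;
}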

@HugGa (Author) Jan 15, 2020

Honestly, upon further reflection, you're right that what holds back a meaningful implementation is the rest of the proposed paper: the atomic_notify family of functions and its machinery are required to make a performant <semaphore>. This is a basic, naïve implementation on my part, and it's nonetheless irrelevant until those semantics are available.

Author

I thought about this some more. As it turns out, even with the rest of P1135R6, as far as I can tell from the N4849 PDF, try_acquire_until and try_acquire_for are at the moment unimplementable without going directly to WaitOnAddress. Even then, you'd first need atomic_ref, because without it, using the normal atomic, doing the WaitOnAddress is completely undefined behavior, despite the fact that it works. I can change the machinery to support WaitOnAddress on my local repository, but the fact of the matter is that, as far as I can tell, the atomic_ref work needs to be implemented first to actually be able to implement the <semaphore> header.

That's not to mention the fact that I see what you're doing with parallel_algorithms.cpp there, telling me that a new .cpp file will have to be made to somehow check for various support and a new WaitOnAddress shim implementation for ptrdiff_t will have to be made.

I prepared locally a change to my entire setup to naïvely use atomic_ref and the basic WaitOnAddress machinery (as well as using system_error and GetLastError). But I have no idea how one would properly implement the source TU area for WaitOnAddress.

Sorry if this looks like rambling, just putting my thoughts on paper for this for now.

Member

using the normal atomic, doing the WaitOnAddress is completely undefined behavior

We own the layout of std::atomic and can reach inside its guts :)

I see what you're doing with parallel_algorithms.cpp there, telling me that a new .cpp file will have to be made to somehow check for various support and a new WaitOnAddress shim implementation

That is similar to the approach that must be taken, but the fallback would need to go in its own satellite DLL here, since, unlike our existing parallel_algorithms.cpp version, we can't assume all the callers of a given structure will be in the same module.

Sorry if this looks like rambling

Nothing to be sorry about :)

Author

We own the layout of std::atomic and can reach inside its guts :)

Unless I'm seeing things incorrectly, putting a pointer inside std::atomic would be an ABI break; and if the ABI changes again such that the first member of std::atomic isn't the stored value, or a vtable is added, just taking the address of the atomic variable wouldn't work, while currently it does. That's what I was concerned about here. Also, since you're using the raw value instead of a pointer, even reaching inside its guts to grab the value wouldn't help, right? Or am I just being an idiot here?

Contributor

I think that the idea here is to grab the address of the internal member variable.

Contributor

Using WaitOnAddress on atomic is also possible even without reaching into its guts: you'd use it on a lock-free atomic, which consists of only the value, with no spinlock or anything else. atomic_ref is also an option, though the standard prohibits using the underlying value while atomic_ref instances exist, so it is the same level of hack as with atomic.
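To make that concrete, a hedged sketch of a WaitOnAddress-based wait on a lock-free atomic counter (Windows 8+ only, illustrative; error handling and the downlevel fallback discussed elsewhere in this thread are omitted, and _Wait_until_positive is a made-up name):

#include <atomic>
#include <cstddef>
#include <windows.h>
// Requires linking with Synchronization.lib.

// Block until the lock-free atomic counter is observed to be positive.
// Passing the atomic's address to WaitOnAddress is the "reach into the guts"
// hack discussed above; it relies on the lock-free atomic being just the value.
inline void _Wait_until_positive(std::atomic<std::ptrdiff_t>& _Counter) {
    static_assert(std::atomic<std::ptrdiff_t>::is_always_lock_free,
        "this sketch assumes a lock-free atomic");
    for (;;) {
        std::ptrdiff_t _Observed = _Counter.load(std::memory_order_acquire);
        if (_Observed > 0) {
            return;
        }
        // Sleeps until the value at &_Counter no longer equals _Observed.
        WaitOnAddress(&_Counter, &_Observed, sizeof(_Observed), INFINITE);
    }
}

// A releaser would pair this with:
//   _Counter.fetch_add(_Update, std::memory_order_release);
//   WakeByAddressSingle(&_Counter); // or WakeByAddressAll if _Update > 1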

stl/inc/semaphore (outdated review thread, resolved)
        _Counter.fetch_add(_Update, memory_order_release);
    }
    void acquire() {
        while (!try_acquire()) {
Member

Ditto I don't think this is acceptable due to busy waiting.

        }
    }
    bool try_acquire() noexcept {
        ptrdiff_t _Old = _Counter.load(memory_order_acquire);
Member

It would be better to assume the value of _Old rather than doing a load first.

Author

How would you assume the value of _Old? I've never heard this terminology before, and a cursory Google search brings up no results in context.

Member

Just pick something like 1. The first time through, the CAS is likely to fail; the CAS failure loads the actual value.

Contributor

I think a CAS is too expensive an operation for a semaphore in general if it is almost certainly doomed to fail, so the first attempt should be a load.

"Just go ahead" strategy, however, would work well for binary semaphore, when there are only 1 or 0. It will actually just be an exchange, since you always leave it to zero.

Author

IIRC, the standard permits try_acquire to fail spuriously, so just using compare_exchange_weak like you did in your PR is fine; using compare_exchange_strong just strengthens the guarantee that it won't go negative, AFAICT. It's been... 6 months since I've touched this PR or really thought about it, and the mindset I had when I created it is therefore lost.

    bool try_acquire() noexcept {
        ptrdiff_t _Old = _Counter.load(memory_order_acquire);
        ptrdiff_t _Desired = _Old - 1;
        if (_Old > 0 && _Counter.compare_exchange_strong(_Old, _Desired, memory_order_acquire)) {
Member

memory_order_acquire is certainly incorrect; this is publishing the reduced counter value to other threads. If you want to use weaker-than-sequentially-consistent atomics, you need to do the writeup proving that the result is correct according to the memory model.

Contributor

memory_order_acquire is fine here, since only a release is said to strongly happen before an acquire; there's no such relationship between different acquires.

Contributor

This piece is trickier for barrier, in arrive/arrive_and_wait, where publishing the reduced value does matter, since one of the threads will call the completion function...

Contributor

For latch too, by the way: count_down is a release decrement, but arrive_and_wait is acq_rel if the decremented value is also used to unblock immediately.
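A rough sketch of that distinction for latch (illustrative only; _Latch_sketch is a made-up name and the actual wait machinery is omitted):

#include <atomic>
#include <cstddef>

struct _Latch_sketch {
    explicit _Latch_sketch(const std::ptrdiff_t _Expected) : _Counter(_Expected) {}

    // count_down only publishes the work done before it: a release decrement suffices.
    void count_down(const std::ptrdiff_t _Update = 1) {
        _Counter.fetch_sub(_Update, std::memory_order_release);
    }

    // arrive_and_wait also consumes the decremented value (to decide whether it can
    // return immediately), so the decrement is acq_rel.
    void arrive_and_wait(const std::ptrdiff_t _Update = 1) {
        if (_Counter.fetch_sub(_Update, std::memory_order_acq_rel) == _Update) {
            return; // this call brought the count to zero
        }
        _Wait_until_zero(); // placeholder: block until the count reaches zero
    }

    void _Wait_until_zero(); // wait machinery not shown
    std::atomic<std::ptrdiff_t> _Counter;
};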

Author

Yeah, I was mostly confused about why memory_order_acquire was said to be wrong here, especially since the semantics of semaphore and the terminology used make explicitly clear that acquire is fine, since you're performing, well, an "acquire" operation. BTW, @AlexGuteniev, your PR for semaphore is remarkably similar to what my "final" design (which I never actually turned into a PR) looks like, and it seems to fix all of the problems I had making it (as well as getting rid of the main problem I encountered, the WaitOnAddress stuff). Thanks for putting in the effort that I was too lazy to go through.

Contributor

OK, this one is more complex than I originally anticipated.

See GH-1133 for the main problem, which is not resolved yet.

Also see https://stackoverflow.com/q/63215461/2945027 for another complication I encountered (though I think it is resolved in the comments there).

    bool try_acquire() noexcept {
        ptrdiff_t _Old = _Counter.load(memory_order_acquire);
        ptrdiff_t _Desired = _Old - 1;
        if (_Old > 0 && _Counter.compare_exchange_strong(_Old, _Desired, memory_order_acquire)) {
Member

You didn't recalculate _Desired if the CAS fails. The usual CAS loop for a subtraction would look like:

ptrdiff_t _Old = 0;
ptrdiff_t _Desired;
do {
    _Desired = _Old - 1;
} while (!_The_atomic.compare_exchange_weak(_Old, _Desired));

though like I said without waiting / backoff this is insufficient. (I haven't thought about getting the semaphore parts connected to this but I'm not working on it right now :) )

Author

The standard requires that if the CAS fails, try_acquire just moves on, so I didn't put a loop there in the first place; if you want blocking semantics, you're supposed to go to try_acquire_for, try_acquire_until, or acquire.

Member

This isn't a "block until the semaphore is available" loop, this is a "update the value only if another CPU didn't update it in the meantime" loop.

Author

But wouldn't this loop still prevent try_acquire() from ever returning false, or am I missing something while going through the atomic_compare_exchange pages?

Member

The loop I wrote here just decrements a number to demonstrate how to use CAS, not how to implement try_acquire. The only time the loop repeats is when some other thread changed the value at the exact same time this thread did; thus it has guaranteed forward progress (that is, the only way this thread doesn't make forward progress is if some other thread made forward progress).

Or is the confusion that you missed that CAS loads the actual value into _Old?

Author

I was confused because I thought it made a proper true/false return from try_acquire impossible. I'm still confused about how the loop allows for this at all, because we need to somehow save the result returned from compare_exchange_strong.
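For what it's worth, a sketch (not from this PR) of how a CAS-based try_acquire can still return a definite true or false: compare_exchange_weak itself returns the bool, and on failure it reloads the current value into _Old, so the loop only repeats while the count is still observed to be positive.

#include <atomic>
#include <cstddef>

inline bool _Try_acquire_sketch(std::atomic<std::ptrdiff_t>& _Counter) noexcept {
    std::ptrdiff_t _Old = _Counter.load(std::memory_order_relaxed);
    while (_Old > 0) {
        // On success we decremented the count: report acquisition.
        // On failure, _Old now holds the freshly loaded value, so re-check it.
        if (_Counter.compare_exchange_weak(_Old, _Old - 1, std::memory_order_acquire)) {
            return true;
        }
    }
    return false; // the count was observed to be zero: nothing to acquire
}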

@BillyONeal
Member

Shouldn't semaphores use OS primitives when blocking instead of just busy waiting?

You would not want to use an NT semaphore because that is a kernel primitive so you pay fairly high costs each time it is acquired or released. The modern way to implement things like this is to build the waiting out of futex/WaitOnAddress-alikes.

Welp that's a good way to force every standard library to roll their own instead of using tried and true OS primitives.

This is not intended to map to any OS primitive. This is no different than std::mutex not being an NT Mutant (what comes out of CreateMutex); it would be silly to use a kernel primitive for something that only needs intra-process usermode behavior.

@HugGa HugGa closed this Jan 15, 2020
@sylveon
Contributor

sylveon commented Jan 15, 2020

This is not intended to map to any OS primitive. This is no different than std::mutex not being an NT Mutant (what comes out of CreateMutex); it would be silly to use a kernel primitive for something that only needs intra-process usermode behavior.

std::mutex currently uses SRWLOCK on Windows 7+, CRITICAL_SECTION on Vista, and the ConCRT critical_section on XP, which (ignoring XP) can be considered OS primitives from the view of a user application. Maybe that's just a bad use of the term on my part.

The only reason it can do that kind of implementation using code already provided by the OS and runtime flexibility is by entirely commenting out the constexpr that should have been on the constructor. If the standard were respected, the available options would be much more limited.

@BillyONeal
Member

The only reason it can do that kind of implementation using code already provided by the OS and runtime flexibility is by entirely commenting out the constexpr

constexpr locks of any form are not possible on XP. When that got put into the standard we thought XP would have a much shorter lifespan than it ended up having.

@MikeGitb

The modern way to implement things like this is to build the waiting out of futex/WaitOnAddress-alikes.

Aren't those OS primitives too? I didn't mean to imply that it has to map to a Windows semaphore, just that it has to use some machinery that ties into the OS (in particular, the scheduler).

BTW, doesn't WaitOnAddress imply no Win7 support (not that I would have a problem with that personally)?

@BillyONeal (Member) left a comment

Aren't those OS primitives too?

I guess I'm just used to fielding questions from folks who are surprised that std::mutex has no relationship with CreateMutex.

BTW, doesn't WaitOnAddress imply no Win7 support (not that I would have a problem with that personally)?

We need a fallback implementation for Vista and 7, probably similar to what we have for the parallel algorithms.

@AlexGuteniev
Contributor

Welp that's a good way to force every standard library to roll their own instead of using tried and true OS primitives.

Requiring constexpr is a countermeasure to the C++ language problem known as the "static initialization order fiasco".

However, construction of a general-purpose inter-thread synchronization object ideally involves no resource allocation, neither during construction nor during synchronization. The reason is to avoid error conditions that are hard to handle in synchronization code. And once resource allocation is removed, adding constexpr on top is not a big deal.

Windows SRWLOCK follows that, so shared_mutex is based on it directly. Windows also has one-time initialization, which is a superset of call_once's features.

latch / barrier / semaphore could have been built on OS objects, if there were suitable ones. Windows has a barrier, but it is a subset of the C++20 barrier. No wonder; Windows developers are not prophets :-)
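To illustrate the constexpr point above: because the constructor can run at compile time, a namespace-scope semaphore is constant-initialized, so dynamic initializers in other translation units can use it regardless of initialization order. A hypothetical sketch (g_pool_slots and _Startup_user are made-up names):

#include <semaphore>

// Constant-initialized at compile time: no dynamic initializer runs for it.
std::counting_semaphore<4> g_pool_slots{4};

// Another object's dynamic initializer can use the semaphore from any TU, in any
// order, without hitting the static initialization order fiasco.
struct _Startup_user {
    _Startup_user() {
        g_pool_slots.acquire();
        // ... startup work ...
        g_pool_slots.release();
    }
} g_startup_user;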
