<semaphore> Implement semaphore - WIP #420
Conversation
Shouldn't semaphores use OS primitives when blocking instead of just busy waiting?
The reason I didn't use the OS primitive is the requirement for the constructor to be constexpr, which as far as I can tell completely rules out that possibility unless there's a LOT more to CreateSemaphore.
Welp, that's a good way to force every standard library to roll their own instead of using tried and true OS primitives.
That's a bummer. Maybe lazy construction would be an option, but busy waiting would make this, imho, completely unusable. Many use cases for semaphores block a thread for quite a long time.
@MikeGitb, would yielding the thread if the semaphore doesn't immediately acquire make this better? Maybe we should send something to the committee and ask them to add something to the standard to make that uniform across all platforms? I know that's not an ideal solution (and besides, due to my inter-process requirements I roll my own semaphore class for named semaphores in my own code), but it's better than nothing here.
You should try to follow the STL's conventions more closely. This concerns the identifiers, which generally do not use double underscores.
It would certainly improve the situation. Spinlocks often have a multistage backoff mechanism (first spin+pause instructions, then yield, then sleep), but I'm no expert on spinlocks. But that's kind of my point: semaphores are supposed to be more intelligent than simple spinlocks and integrated with the OS scheduler. Let's see what the STL maintainers have to say.
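For concreteness, a multistage backoff of the kind mentioned above might look roughly like the following sketch. The thresholds and the helper name are invented for illustration; nothing here is taken from this PR.

#include <chrono>
#include <thread>
#include <intrin.h> // _mm_pause on x86/x64 with MSVC

// Illustrative sketch only: pause, then yield, then sleep.
inline void _Backoff(int& _Attempts) noexcept {
    constexpr int _Spin_limit  = 16; // made-up threshold
    constexpr int _Yield_limit = 64; // made-up threshold
    if (_Attempts < _Spin_limit) {
        _mm_pause(); // stage 1: spin with a pause hint to the CPU
    } else if (_Attempts < _Yield_limit) {
        std::this_thread::yield(); // stage 2: give up the rest of the timeslice
    } else {
        std::this_thread::sleep_for(std::chrono::milliseconds(1)); // stage 3: sleep
    }
    ++_Attempts;
}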
stl/inc/semaphore
Outdated
counting_semaphore& operator=(const counting_semaphore&) = delete;

void release(ptrdiff_t _Update = 1) {
    _STL_ASSERT(_Update >= 0 && _Update <= (max() - _Counter.load()),
        "The semaphore has been released more times than acquired");
I don't think this assert is meaningful / useful for a concurrency primitive due to data races. (That is, just because this is true on line 45 doesn't mean it'll be true on line 46)
The standard has an Expects clause, so I guess this should at least test the input argument.
My point is that one can't meaningfully test it, because the test is stale the instant it is done, as this object is hammered from multiple threads.
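For illustration, a hedged sketch of the narrower check being suggested, in the same member-function fragment shape as the quoted diff: test only the caller-supplied argument, not the racy counter state. This is not the PR's code.

void release(ptrdiff_t _Update = 1) {
    // Check only the Expects condition the caller controls; asserting on the
    // counter's current value would be stale by the time the assertion finishes,
    // because other threads update it concurrently.
    _STL_ASSERT(_Update >= 0, "update count must be non-negative");
    _Counter.fetch_add(_Update, memory_order_release);
}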
stl/inc/semaphore
Outdated
template <class _Clock, class _Duration>
bool try_acquire_until(const chrono::time_point<_Clock, _Duration>& _Abs_time) {
    while (!try_acquire()) {
        if (_Clock::now() > _Abs_time) {
Unfortunately I don't think we can accept an implementation like this; this proposal really needs the "efficient waiting for concurrent programs" (i.e. the WaitOnAddress-like machinery) before it can be implemented. For example, if a low priority thread holds the mutex and a high priority thread gets here, the system probably deadlocks forever (as in #370).
Also, this > needs to be >=. The standardese says the timer expires "after", but if it was == when now() was called, it's after that time on the subsequent line.
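With that comparison fix applied, the polling loop under discussion would read roughly as follows; it is still the same naive busy-wait shape, so the objection above about efficient waiting still applies.

template <class _Clock, class _Duration>
bool try_acquire_until(const chrono::time_point<_Clock, _Duration>& _Abs_time) {
    while (!try_acquire()) {
        // ">=" rather than ">": if now() == _Abs_time when checked, the deadline
        // has already passed by the time the next statement executes.
        if (_Clock::now() >= _Abs_time) {
            return false;
        }
    }
    return true;
}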
Honestly, upon further reflection, you're right that what holds back a meaningful implementation is the rest of the proposed paper: the atomic_notify set of functions and its machinery are required to make a performant <semaphore>. While this is a basic naïve implementation on my part, it's nonetheless irrelevant until those semantics are available.
I thought about this some more. As it turns out, even with the rest of P1135R6, as far as I can tell from the N4849 PDF found here, try_acquire_until and try_acquire_for are unimplementable without going directly to WaitOnAddress. Even then, you'd first need atomic_ref, because without it, using WaitOnAddress on the normal atomic is completely undefined behavior, despite the fact that it works. So I can change the machinery on my local repository to support WaitOnAddress for this, but as far as I can tell the atomic_ref work needs to be implemented first before the <semaphore> header can actually be implemented.
That's not to mention that, from what I can see you're doing with parallel_algorithms.cpp, a new .cpp file will have to be made to check for support somehow, along with a new WaitOnAddress shim implementation for ptrdiff_t.
I prepared a change locally to my entire setup to naïvely use atomic_ref and the basic WaitOnAddress machinery (as well as using system_error and GetLastError), but I have no idea how one would properly implement the source TU area for WaitOnAddress.
Sorry if this looks like rambling, just putting my thoughts on paper for this for now.
> using the normal atomic, doing the WaitOnAddress is completely undefined behavior

We own the layout of std::atomic and can reach inside its guts :)
> I see what you're doing with parallel_algorithms.cpp there, telling me that a new .cpp file will have to be made to somehow check for various support and a new WaitOnAddress shim implementation

That is a similar approach to what must be done, but the fallback would need to go in its own satellite DLL here, since we don't know that all the callers to a given structure will be in the same module, unlike our existing parallel_algorithms.cpp version.

> Sorry if this looks like rambling

Nothing to be sorry about :)
> We own the layout of std::atomic and can reach inside its guts :)

Unless I'm seeing things incorrectly, putting a pointer inside of std::atomic would be an ABI break, and if the ABI breaks again such that the first value of std::atomic isn't the type being stored, or a vtable is added, just taking the address of the atomic variable wouldn't work, while currently it does; that's what I was concerned about here. Also, since you're using the raw value instead of a pointer, even reaching inside its guts to grab the value wouldn't help, right? Or am I just being an idiot here?
I think that the idea here is to grab the address of the internal member variable.
Using WaitOnAddress on atomic is also possible even without reaching the meat. You'll use it on a lock-free atomic, which consists of only meat, no spinlock or anything. atomic_ref is also an option, though the standard prohibits using the underlying value while atomic_ref instances exist, so it is the same level of hack as with atomic.
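A hedged sketch of that approach, not this PR's code: it assumes the lock-free std::atomic object is "only meat" (just the stored value, at the same address as the object), which is exactly the level of hack being discussed. WaitOnAddress requires Windows 8 or later and linking with Synchronization.lib.

#include <atomic>
#include <cstddef>
#include <Windows.h>

void _Wait_while_zero(std::atomic<std::ptrdiff_t>& _Counter) {
    static_assert(std::atomic<std::ptrdiff_t>::is_always_lock_free,
        "this hack only makes sense for a lock-free atomic");
    std::ptrdiff_t _Observed = 0;
    while (_Counter.load(std::memory_order_acquire) == 0) {
        // Blocks until the bytes at &_Counter differ from _Observed (or a spurious wake).
        WaitOnAddress(&_Counter, &_Observed, sizeof(_Observed), INFINITE);
    }
}

void _Wake_one(std::atomic<std::ptrdiff_t>& _Counter) noexcept {
    WakeByAddressSingle(&_Counter); // call after changing the counter
}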
    _Counter.fetch_add(_Update, memory_order_release);
}
void acquire() {
    while (!try_acquire()) {
Ditto I don't think this is acceptable due to busy waiting.
stl/inc/semaphore
Outdated
    }
}
bool try_acquire() noexcept {
    ptrdiff_t _Old = _Counter.load(memory_order_acquire);
It would be better to assume the value of _Old rather than doing a load first.
How would you assume the value of _Old? I've never heard this terminology before, and a cursory google search brings up no results in-context.
Just pick something like 1. The first time through, the CAS is likely to fail; the CAS failure loads the actual value.
I think a CAS is too expensive an operation for a semaphore in general, if it is almost certainly doomed to fail, so the first attempt should be a load.
The "just go ahead" strategy, however, would work well for a binary semaphore, where the only values are 1 and 0. It would actually just be an exchange, since you always leave it at zero.
The standard permits try_acquire to fail spuriously IIRC, so just using compare_exchange_weak like you did in your PR is fine; using compare_exchange_strong just strengthens the guarantee that it won't go negative, AFAICT. It's been... 6 months since I've touched this PR or really thought about it, so the mindset I had when I created it is lost.
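As an aside, the "just go ahead" binary-semaphore strategy mentioned above reduces to a single unconditional exchange. A sketch under that assumption, not this PR's code:

#include <atomic>
#include <cstddef>

// Binary semaphore: the counter is only ever 0 or 1, so writing 0 unconditionally
// is safe; we acquired the semaphore exactly when the previous value was 1.
bool _Try_acquire_binary(std::atomic<std::ptrdiff_t>& _Counter) noexcept {
    return _Counter.exchange(0, std::memory_order_acquire) == 1;
}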
stl/inc/semaphore
Outdated
bool try_acquire() noexcept {
    ptrdiff_t _Old = _Counter.load(memory_order_acquire);
    ptrdiff_t _Desired = _Old - 1;
    if (_Old > 0 && _Counter.compare_exchange_strong(_Old, _Desired, memory_order_acquire)) {
memory_order_acquire is certainly incorrect; this is publishing the reduced counter value to other threads. If you want to use weaker than sequentially consistent atomics you need to do the writeup proving that the result is correct according to the memory model.
memory_order_acquire is fine here, since only release is said to strongly happen before acquire; there's no such relationship between different acquires.
This piece is trickier for barrier, with arrive/arrive_and_wait, where publishing the reduced value indeed matters, since one of them will call the completion function...
For latch too, by the way: count_down is a release decrement, but arrive_and_wait is acq_rel if the value of the decrement is also used to unblock immediately.
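To make that latch point concrete, here is an illustrative sketch; the class name and the blocking helper are invented for this example and are not part of this PR.

#include <atomic>
#include <cstddef>

struct _Latch_sketch {
    explicit _Latch_sketch(std::ptrdiff_t _Expected) : _Counter(_Expected) {}

    void count_down(std::ptrdiff_t _Update = 1) noexcept {
        // Pure release decrement: this thread only publishes its finished work.
        _Counter.fetch_sub(_Update, std::memory_order_release);
    }

    void arrive_and_wait(std::ptrdiff_t _Update = 1) noexcept {
        // acq_rel: if this decrement drives the count to zero, the calling thread
        // unblocks immediately, so it must also acquire the work published by
        // every other thread's count_down.
        if (_Counter.fetch_sub(_Update, std::memory_order_acq_rel) == _Update) {
            return; // last arrival; proceed without blocking
        }
        _Wait_until_zero();
    }

private:
    void _Wait_until_zero() noexcept {
        while (_Counter.load(std::memory_order_acquire) != 0) {
            // a real implementation would block efficiently here; this sketch spins
        }
    }

    std::atomic<std::ptrdiff_t> _Counter;
};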
Yeah, I was mostly confused about why memory_order_acquire here was said to be wrong, especially since the semantics of semaphore and the terminology used make it explicitly clear that acquire is fine, since you're performing, well, an "acquire" operation. BTW, @AlexGuteniev, your PR for semaphore is remarkably similar to what my "final" design (which I never actually turned into a PR) looks like, and it seems to fix all of the problems I had making it (as well as getting rid of the main problem I encountered, the WaitOnAddress stuff). Thanks for putting in the effort that I was too lazy to go through.
Ok, this one is more complex than I originally anticipated.
See GH-1133 for the main problem, which is not resolved yet.
Also see https://stackoverflow.com/q/63215461/2945027 for another complication I encountered (though I think it is resolved in the comments there).
stl/inc/semaphore
Outdated
bool try_acquire() noexcept {
    ptrdiff_t _Old = _Counter.load(memory_order_acquire);
    ptrdiff_t _Desired = _Old - 1;
    if (_Old > 0 && _Counter.compare_exchange_strong(_Old, _Desired, memory_order_acquire)) {
You didn't recalculate _Desired if the CAS fails. The usual CAS loop for a subtraction would look like:

ptrdiff_t _Old = 0;
ptrdiff_t _Desired;
do {
    _Desired = _Old - 1;
} while (!_The_atomic.compare_exchange_weak(_Old, _Desired));
though like I said without waiting / backoff this is insufficient. (I haven't thought about getting the semaphore parts connected to this but I'm not working on it right now :) )
The standard requires that if the CAS fails we just move on for try_acquire, so I didn't put a loop there in the first place; if you want blocking semantics you're supposed to go to try_acquire_for, try_acquire_until, and acquire.
This isn't a "block until the semaphore is available" loop; this is an "update the value only if another CPU didn't update it in the meantime" loop.
But wouldn't this loop still prevent try_acquire() from returning false under any condition, or am I missing something while going through the atomic_compare_exchange pages?
The loop I wrote here just decrements a number to demonstrate how to use CAS, not how to implement try_acquire; the only time the loop repeats is when some other thread changed the value at the exact same time this thread did; thus it has guaranteed forward progress (that is, the only way this thread doesn't make forward progress is if some other thread made forward progress).
Or is the confusion that you missed that CAS loads the actual value into _Old?
I was confused because I thought it made a proper true/false return from try_acquire impossible; I'm still confused about how the loop allows for this at all, because we need to somehow save the result returned from the compare_exchange_strong.
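One way the pieces fit together, as a sketch rather than this PR's final code: the loop bails out with false as soon as the freshly reloaded count is no longer positive, so it never blocks, and the CAS result itself decides the true return.

#include <atomic>
#include <cstddef>

bool _Try_acquire(std::atomic<std::ptrdiff_t>& _Counter) noexcept {
    std::ptrdiff_t _Old = _Counter.load(std::memory_order_acquire);
    while (_Old > 0) {
        if (_Counter.compare_exchange_weak(_Old, _Old - 1, std::memory_order_acquire)) {
            return true; // we observed a positive count and successfully decremented it
        }
        // on failure, compare_exchange_weak reloaded the current value into _Old,
        // so the loop condition re-checks a fresh value
    }
    return false; // count was (or became) zero; report failure instead of blocking
}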
You would not want to use an NT semaphore because that is a kernel primitive so you pay fairly high costs each time it is acquired or released. The modern way to implement things like this is to build the waiting out of
This is not intended to map to any OS primitive. This is no different than
The only reason it can do that kind of implementation using code already provided by the OS and runtime flexibility is by entirely commenting out the
Aren't those OS primitives too? I didn't want to imply that it has to map to a Windows semaphore, just that it has to use some machinery that ties into the OS (in particular the scheduler). Btw, doesn't WaitOnAddress imply no Win7 support (not that I would have a problem with that personally)?
> Aren't those OS primitives too?

I guess I'm just used to fielding questions from folks who are surprised that std::mutex has no relationship with CreateMutex.
> Btw, doesn't WaitOnAddress imply no Win7 support (not that I would have a problem with that personally)?

We need a fallback implementation for Vista and 7, probably similar to what we have for the parallel algorithms.
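A rough sketch of what such runtime detection could look like; the API-set DLL name, the helper name, and the overall shape are assumptions for illustration, not this repository's actual shim.

#include <Windows.h>

using _Wait_on_address_t = BOOL(WINAPI*)(volatile VOID*, PVOID, SIZE_T, DWORD);

inline _Wait_on_address_t _Try_get_wait_on_address() noexcept {
    // WaitOnAddress is available on Windows 8+; on Vista/7 the lookup fails and
    // the caller must fall back to some other waiting strategy.
    static const _Wait_on_address_t _Fn = []() noexcept -> _Wait_on_address_t {
        const HMODULE _Mod = LoadLibraryExW(
            L"api-ms-win-core-synch-l1-2-0.dll", nullptr, LOAD_LIBRARY_SEARCH_SYSTEM32);
        if (_Mod == nullptr) {
            return nullptr;
        }
        return reinterpret_cast<_Wait_on_address_t>(GetProcAddress(_Mod, "WaitOnAddress"));
    }();
    return _Fn;
}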
Requiring However, construction of a general-purpose inter-thread synchronization object is ideally not a resource allocation, neither during construction nor during synchronization. The reason is to avoid error conditions that are hard to handle in synchronization code. And after removing resource allocation, adding Windows
Description
This is an implementation of counting_semaphore and binary_semaphore (see #52).
I made a basic implementation of the semaphore header, using an atomic_ptrdiff_t instead of the given ptrdiff_t; my reasoning was that the semaphore specification currently describes counting_semaphore as performing atomic operations, and this was the only non-mutex way I could think of to do atomic operations here. I'm definitely willing to change it to something else if required.
I have also tested this (slightly) on my home computer, but have not really ultra-stress tested it beyond some brief fooling around.
A question that I have for the implementation itself: what value should be the ultimate hard cap for the max() function? I currently have it set to just copy the value passed in as _LeastMaxValue, but I know that there is a maximum value out there, so should the function just make sure that _LeastMaxValue is above 0 via static_assert and use that, or have some sort of hard cap (PTRDIFF_MAX?)?
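One possible shape for the option described above, as a sketch rather than a decided answer; the class name here is invented for illustration.

#include <cstddef>
#include <cstdint>

// Validate _LeastMaxValue at compile time and report it from max(); PTRDIFF_MAX
// would be the natural hard cap if a fixed upper bound is preferred instead.
template <std::ptrdiff_t _LeastMaxValue = PTRDIFF_MAX>
class counting_semaphore_sketch {
public:
    static_assert(_LeastMaxValue >= 0, "_LeastMaxValue must be non-negative");

    static constexpr std::ptrdiff_t max() noexcept {
        return _LeastMaxValue; // any value >= _LeastMaxValue would also conform
    }
};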
Checklist
Be sure you've read README.md and understand the scope of this repo.
If you're unsure about a box, leave it unchecked. A maintainer will help you.
- Identifiers in product code changes are properly _Ugly as per https://eel.is/c++draft/lex.name#3.1, or there are no product code changes.
- The STL builds successfully and all tests have passed (this must be verified by an STL maintainer before automated testing is enabled on GitHub; leave this unchecked for initial submission).
- These changes introduce no known ABI breaks (adding members, renaming members, adding virtual functions, changing whether a type is an aggregate or trivially copyable, etc.).
- These changes were written from scratch using only this repository, the C++ Working Draft (including any cited standards), other WG21 papers (excluding reference implementations outside of proposed standard wording), and LWG issues as reference material. If they were derived from a project that's already listed in NOTICE.txt, that's fine, but please mention it. If they were derived from any other project (including Boost and libc++, which are not yet listed in NOTICE.txt), you must mention it here, so we can determine whether the license is compatible and what else needs to be done.