
<semaphore> Implement semaphore - WIP #420

Closed · wants to merge 5 commits

Conversation


@HugGa HugGa commented Jan 15, 2020

Description

This is an implementation of counting_semaphore and binary_semaphore (see #52).

I made a basic implementation of the semaphore header using an atomic_ptrdiff_t instead of the specified ptrdiff_t. My reasoning was that the semaphore specification describes counting_semaphore as performing atomic operations, and an atomic counter was the only non-mutex way I could think of to do atomic operations here. I'm definitely willing to change it to something else if required.

I have also lightly tested this on my home computer, but I have not stress-tested it beyond some casual experimentation.

A question that I have about the implementation itself: what value should be the ultimate hard cap for the max() function? I currently have it just return the value passed in as _LeastMaxValue, but I know there is an actual maximum value out there, so should the function just make sure via static_assert that _LeastMaxValue is non-negative and use it directly, or should there be some sort of hard cap (PTRDIFF_MAX?) instead?
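To make the question concrete, here is a hypothetical sketch of the two options (illustrative only; _Counting_semaphore_sketch is a made-up name, not the code in this PR):

#include <cstddef>
#include <cstdint>

template <std::ptrdiff_t _LeastMaxValue>
class _Counting_semaphore_sketch {
public:
    static_assert(_LeastMaxValue >= 0, "_LeastMaxValue must be non-negative");

    // Option A: just echo the template argument back.
    static constexpr std::ptrdiff_t max() noexcept {
        return _LeastMaxValue;
    }

    // Option B (alternative): return a hard cap such as PTRDIFF_MAX instead.
    // static constexpr std::ptrdiff_t max() noexcept {
    //     return PTRDIFF_MAX;
    // }
};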

Checklist

Be sure you've read README.md and understand the scope of this repo.

If you're unsure about a box, leave it unchecked. A maintainer will help you.

  • Identifiers in product code changes are properly _Ugly as per
    https://eel.is/c++draft/lex.name#3.1 or there are no product code changes.
  • The STL builds successfully and all tests have passed (must be manually
    verified by an STL maintainer before automated testing is enabled on GitHub,
    leave this unchecked for initial submission).
  • These changes introduce no known ABI breaks (adding members, renaming
    members, adding virtual functions, changing whether a type is an aggregate
    or trivially copyable, etc.).
  • These changes were written from scratch using only this repository,
    the C++ Working Draft (including any cited standards), other WG21 papers
    (excluding reference implementations outside of proposed standard wording),
    and LWG issues as reference material. If they were derived from a project
    that's already listed in NOTICE.txt, that's fine, but please mention it.
    If they were derived from any other project (including Boost and libc++,
    which are not yet listed in NOTICE.txt), you must mention it here,
    so we can determine whether the license is compatible and what else needs
    to be done.

@HugGa HugGa requested a review from a team as a code owner January 15, 2020 00:23
@msftclas

msftclas commented Jan 15, 2020

CLA assistant check
All CLA requirements met.

@MikeGitb

Shouldn't semaphores use OS primitives when blocking instead of just busy waiting?

stl/inc/semaphore (outdated review thread, resolved)
@HugGa
Author

HugGa commented Jan 15, 2020

The reason I didn't use the OS primitive is the requirement that the constructor be constexpr, which as far as I can tell completely rules that out, unless there's a LOT more to CreateSemaphore and CreateSemaphoreEx than I know about, or there's some constexpr way of evaluating such a thing.

@sylveon
Contributor

sylveon commented Jan 15, 2020

Welp that's a good way to force every standard library to roll their own instead of using tried and true OS primitives.

@MikeGitb

That's a bummer. Maybe lazy construction would be an option, but busy waiting would, IMHO, make this completely unusable. Many use cases for semaphores can block a thread for quite a long time.

@HugGa
Author

HugGa commented Jan 15, 2020

@MikeGitb would yielding the thread if the semaphore doesn't immediately acquire make this better? Maybe we should ask the committee to add something to the standard so this is consistent across all platforms. I know that's not an ideal solution (and besides, due to my own inter-process requirements I roll my own semaphore class for named semaphores), but it's better than nothing here.

@miscco
Contributor

miscco commented Jan 15, 2020

You should try to follow the STL's conventions more closely.

This concerns the identifiers, which generally do not use double underscores (__foo vs. _Foo). _STL_ASSERT is used rather than assert, and when possible you should const-qualify function arguments in new code.
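For example, a hypothetical snippet following those conventions (illustrative only; _Do_release is a made-up name, and _STL_ASSERT is the STL's internal assertion macro mentioned above):

// Inside an STL header: an _Ugly name (_Update, not __update), _STL_ASSERT rather
// than assert, and a const-qualified argument in new code.
void _Do_release(const ptrdiff_t _Update) {
    _STL_ASSERT(_Update >= 0, "semaphore update count must be non-negative");
    // ...
}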

@MikeGitb

@MikeGitb would yielding the thread if the semaphore doesn't immediately acquire make this better

It would certainly improve the situation. Spinlocks often have a multistage backoff mechanism (first spin + pause instructions, then yield, then sleep), but I'm no expert on spinlocks. That's kind of my point, though: semaphores are supposed to be more intelligent than simple spinlocks and integrated with the OS scheduler. Let's see what the STL maintainers have to say.
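For context, a rough sketch of what such a multistage backoff loop might look like (illustrative only; the stage thresholds are arbitrary and _Backoff_until / _Try_acquire_once are made-up names):

#include <chrono>
#include <immintrin.h> // _mm_pause (x86-specific spin hint)
#include <thread>

// Illustrative multistage backoff: spin with pause, then yield, then sleep.
template <class _TryFn>
void _Backoff_until(_TryFn _Try_acquire_once) {
    for (int _Attempt = 0;; ++_Attempt) {
        if (_Try_acquire_once()) {
            return;
        }
        if (_Attempt < 16) {
            _mm_pause(); // stage 1: busy spin, hinting the CPU that this is a spin-wait
        } else if (_Attempt < 64) {
            std::this_thread::yield(); // stage 2: let other ready threads run
        } else {
            std::this_thread::sleep_for(std::chrono::milliseconds(1)); // stage 3: sleep
        }
    }
}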

    counting_semaphore& operator=(const counting_semaphore&) = delete;

    void release(ptrdiff_t _Update = 1) {
        _STL_ASSERT(_Update >= 0 && _Update <= (max() - _Counter.load()),
            "The semaphore has been released more times than acquired");
Member

I don't think this assert is meaningful / useful for a concurrency primitive due to data races. (That is, just because this is true on line 45 doesn't mean it'll be true on line 46)

Contributor

The standard has an Expects: clause, so I guess this should at least test the input argument.

@BillyONeal (Member) Jan 16, 2020

My point is that one can't meaningfully test it, because the test is stale the instant it's done, as this object is hammered from multiple threads.

stl/inc/semaphore (outdated review thread, resolved)
    template <class _Clock, class _Duration>
    bool try_acquire_until(const chrono::time_point<_Clock, _Duration>& _Abs_time) {
        while (!try_acquire()) {
            if (_Clock::now() > _Abs_time) {
Member

Unfortunately, I don't think we can accept an implementation like this; this proposal really needs the "efficient waiting for concurrent programs" machinery (i.e., the WaitOnAddress-like machinery) before it can be implemented. For example, if a low-priority thread holds the mutex and a high-priority thread gets here, the system probably deadlocks forever (as in #370).

Also this > needs to be >=. The standardese says the timer expires "after", but if it was == when now() was called, it's after that time on the subsequent line.
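Roughly, a sketch of a timed wait built on a non-blocking attempt with the >= fix applied (illustrative only; _Try_acquire_until_sketch and _Try_once are made-up names, and a real implementation would wait/back off instead of spinning):

#include <chrono>

// ">=" rather than ">" so that a deadline equal to now() counts as expired.
template <class _Clock, class _Duration, class _TryFn>
bool _Try_acquire_until_sketch(
    const std::chrono::time_point<_Clock, _Duration>& _Abs_time, _TryFn _Try_once) {
    while (!_Try_once()) {
        if (_Clock::now() >= _Abs_time) {
            return false; // the deadline has been reached or passed
        }
        // A real implementation would block/back off here instead of spinning.
    }
    return true;
}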

@HugGa (Author) Jan 15, 2020

Honestly, upon further reflection, you're right that what holds back a meaningful implementation is the rest of the proposed paper: the atomic_notify family of functions and its machinery are required to make a performant <semaphore>. This is a basic, naïve implementation on my part, and it's nonetheless irrelevant until those semantics are available.

Author

I thought about this some more. As it turns out, even with the rest of P1135R6, as far as I can tell from the N4849 PDF, try_acquire_until and try_acquire_for are at the moment unimplementable without going directly to WaitOnAddress. Even then, you'd first need atomic_ref, because without it, using the normal atomic, doing the WaitOnAddress is completely undefined behavior, despite the fact that it works. I can change the machinery to support WaitOnAddress on my local repository, but the fact of the matter is that, as far as I can tell, the atomic_ref work needs to be implemented first to actually be able to implement the <semaphore> header.

That's not to mention the fact that I see what you're doing with parallel_algorithms.cpp there, telling me that a new .cpp file will have to be made to somehow check for various support and a new WaitOnAddress shim implementation for ptrdiff_t will have to be made.

I prepared locally a change to my entire setup to naïvely use atomic_ref and the basic WaitOnAddress machinery (as well as using system_error and GetLastError). But I have no idea how one would properly implement the source TU area for WaitOnAddress.

Sorry if this looks like rambling, just putting my thoughts on paper for this for now.

Member

using the normal atomic, doing the WaitOnAddress is completely undefined behavior

We own the layout of std::atomic and can reach inside its guts :)

I see what you're doing with parallel_algorithms.cpp there, telling me that a new .cpp file will have to be made to somehow check for various support and a new WaitOnAddress shim implementation

That is similar to the approach that must be taken, but the fallback would need to go in its own satellite DLL here, since, unlike our existing parallel_algorithms.cpp version, we can't assume all the callers of a given structure will be in the same module.

Sorry if this looks like rambling

Nothing to be sorry about :)

Author

We own the layout of std::atomic and can reach inside its guts :)

Unless I'm seeing things incorrectly, putting a pointer inside std::atomic would be an ABI break; and if the ABI changes again such that the first member of std::atomic isn't the stored value, or a vtable is added, just taking the address of the atomic variable wouldn't work, while currently it does. That's what I was concerned about here. Also, since you're using the raw value instead of a pointer, even reaching inside its guts to grab the value wouldn't help, right? Or am I just being an idiot here?

Contributor

I think that the idea here is to grab the address of the internal member variable.

Contributor

Using WaitOnAddress on atomic is also possible even without reaching into its guts: you'd use it on a lock-free atomic, which consists of only the value, with no spinlock or anything else. atomic_ref is also an option, though the standard prohibits using the underlying value while atomic_ref instances exist, so it is the same level of hack as with atomic.
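To make that concrete, a hedged sketch of a WaitOnAddress-based wait on a lock-free atomic counter (Windows 8+ only, illustrative; error handling and the downlevel fallback discussed elsewhere in this thread are omitted, and _Wait_until_positive is a made-up name):

#include <atomic>
#include <cstddef>
#include <windows.h>
// Requires linking with Synchronization.lib.

// Block until the lock-free atomic counter is observed to be positive.
// Passing the atomic's address to WaitOnAddress is the "reach into the guts"
// hack discussed above; it relies on the lock-free atomic being just the value.
inline void _Wait_until_positive(std::atomic<std::ptrdiff_t>& _Counter) {
    static_assert(std::atomic<std::ptrdiff_t>::is_always_lock_free,
        "this sketch assumes a lock-free atomic");
    for (;;) {
        std::ptrdiff_t _Observed = _Counter.load(std::memory_order_acquire);
        if (_Observed > 0) {
            return;
        }
        // Sleeps until the value at &_Counter no longer equals _Observed.
        WaitOnAddress(&_Counter, &_Observed, sizeof(_Observed), INFINITE);
    }
}

// A releaser would pair this with:
//   _Counter.fetch_add(_Update, std::memory_order_release);
//   WakeByAddressSingle(&_Counter); // or WakeByAddressAll if _Update > 1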

stl/inc/semaphore (outdated review thread, resolved)
        _Counter.fetch_add(_Update, memory_order_release);
    }
    void acquire() {
        while (!try_acquire()) {
Member

Ditto I don't think this is acceptable due to busy waiting.

        }
    }
    bool try_acquire() noexcept {
        ptrdiff_t _Old = _Counter.load(memory_order_acquire);
Member

It would be better to assume the value of _Old rather than doing a load first.

Author

How would you assume the value of _Old? I've never heard this terminology before, and a cursory Google search brings up no results in context.

Member

Just pick something like 1. The first time through, the CAS is likely to fail; the CAS failure loads the actual value.

Contributor

I think a CAS is too expensive an operation for a semaphore in general if it is almost certainly doomed to fail, so the first attempt should be a load.

"Just go ahead" strategy, however, would work well for binary semaphore, when there are only 1 or 0. It will actually just be an exchange, since you always leave it to zero.

Author

IIRC, the standard permits try_acquire to fail spuriously, so just using compare_exchange_weak like you did in your PR is fine; using compare_exchange_strong just strengthens the guarantee that it won't go negative, AFAICT. It's been... 6 months since I've touched this PR or really thought about it, and the mindset I had when I created it is therefore lost.

    bool try_acquire() noexcept {
        ptrdiff_t _Old = _Counter.load(memory_order_acquire);
        ptrdiff_t _Desired = _Old - 1;
        if (_Old > 0 && _Counter.compare_exchange_strong(_Old, _Desired, memory_order_acquire)) {
Member

memory_order_acquire is certainly incorrect; this is publishing the reduced counter value to other threads. If you want to use weaker-than-sequentially-consistent atomics, you need to do the writeup proving that the result is correct according to the memory model.

Contributor

memory_order_acquire is fine here, since only a release is said to strongly happen before an acquire; there's no such relationship between different acquires.

Contributor

This piece is trickier for barrier, in arrive/arrive_and_wait, where publishing the reduced value does matter, since one of the threads will call the completion function...

Contributor

For latch too, by the way: count_down is a release decrement, but arrive_and_wait is acq_rel if the decremented value is also used to unblock immediately.
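A rough sketch of that distinction for latch (illustrative only; _Latch_sketch is a made-up name and the actual wait machinery is omitted):

#include <atomic>
#include <cstddef>

struct _Latch_sketch {
    explicit _Latch_sketch(const std::ptrdiff_t _Expected) : _Counter(_Expected) {}

    // count_down only publishes the work done before it: a release decrement suffices.
    void count_down(const std::ptrdiff_t _Update = 1) {
        _Counter.fetch_sub(_Update, std::memory_order_release);
    }

    // arrive_and_wait also consumes the decremented value (to decide whether it can
    // return immediately), so the decrement is acq_rel.
    void arrive_and_wait(const std::ptrdiff_t _Update = 1) {
        if (_Counter.fetch_sub(_Update, std::memory_order_acq_rel) == _Update) {
            return; // this call brought the count to zero
        }
        _Wait_until_zero(); // placeholder: block until the count reaches zero
    }

    void _Wait_until_zero(); // wait machinery not shown
    std::atomic<std::ptrdiff_t> _Counter;
};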

Author

Yeah, I was mostly confused about why memory_order_acquire was said to be wrong here, especially since the semantics of semaphore and the terminology used make explicitly clear that acquire is fine, since you're performing, well, an "acquire" operation. BTW, @AlexGuteniev, your PR for semaphore is remarkably similar to what my "final" design (which I never actually turned into a PR) looks like, and it seems to fix all of the problems I had making it (as well as getting rid of the main problem I encountered, the WaitOnAddress stuff). Thanks for putting in the effort that I was too lazy to go through.

Contributor

OK, this one is more complex than I originally anticipated.

See GH-1133 for the main problem, which is not resolved yet.

Also see https://stackoverflow.com/q/63215461/2945027 for another complication I encountered (though I think it is resolved in the comments there).

    bool try_acquire() noexcept {
        ptrdiff_t _Old = _Counter.load(memory_order_acquire);
        ptrdiff_t _Desired = _Old - 1;
        if (_Old > 0 && _Counter.compare_exchange_strong(_Old, _Desired, memory_order_acquire)) {
Member

You didn't recalculate _Desired if the CAS fails. The usual CAS loop for a subtraction would look like:

ptrdiff_t _Old = 0;
ptrdiff_t _Desired;
do {
    _Desired = _Old - 1;
} while (!_The_atomic.compare_exchange_weak(_Old, _Desired));

though like I said without waiting / backoff this is insufficient. (I haven't thought about getting the semaphore parts connected to this but I'm not working on it right now :) )

Author

The standard requires that if the CAS fails, try_acquire just moves on, so I didn't put a loop there in the first place; if you want blocking semantics, you're supposed to go to try_acquire_for, try_acquire_until, or acquire.

Member

This isn't a "block until the semaphore is available" loop, this is a "update the value only if another CPU didn't update it in the meantime" loop.

Author

But wouldn't this loop still prevent try_acquire() from ever returning false, or am I missing something while going through the atomic_compare_exchange pages?

Member

The loop I wrote here just decrements a number to demonstrate how to use CAS, not how to implement try_acquire. The only time the loop repeats is when some other thread changed the value at the exact same time this thread did; thus it has guaranteed forward progress (that is, the only way this thread doesn't make forward progress is if some other thread made forward progress).

Or is the confusion that you missed that CAS loads the actual value into _Old?

Author

I was confused because I thought it made a proper true/false return from try_acquire impossible. I'm still confused about how the loop allows for this at all, because we need to somehow save the result returned from compare_exchange_strong.
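For what it's worth, a sketch (not from this PR) of how a CAS-based try_acquire can still return a definite true or false: compare_exchange_weak itself returns the bool, and on failure it reloads the current value into _Old, so the loop only repeats while the count is still observed to be positive.

#include <atomic>
#include <cstddef>

inline bool _Try_acquire_sketch(std::atomic<std::ptrdiff_t>& _Counter) noexcept {
    std::ptrdiff_t _Old = _Counter.load(std::memory_order_relaxed);
    while (_Old > 0) {
        // On success we decremented the count: report acquisition.
        // On failure, _Old now holds the freshly loaded value, so re-check it.
        if (_Counter.compare_exchange_weak(_Old, _Old - 1, std::memory_order_acquire)) {
            return true;
        }
    }
    return false; // the count was observed to be zero: nothing to acquire
}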

@BillyONeal
Member

Shouldn't semaphores use OS primitives when blocking instead of just busy waiting?

You would not want to use an NT semaphore because that is a kernel primitive so you pay fairly high costs each time it is acquired or released. The modern way to implement things like this is to build the waiting out of futex/WaitOnAddress-alikes.

Welp that's a good way to force every standard library to roll their own instead of using tried and true OS primitives.

This is not intended to map to any OS primitive. This is no different than std::mutex not being an NT Mutant (what comes out of CreateMutex); it would be silly to use a kernel primitive for something that only needs intra-process usermode behavior.

@HugGa HugGa closed this Jan 15, 2020
@sylveon
Contributor

sylveon commented Jan 15, 2020

This is not intended to map to any OS primitive. This is no different than std::mutex not being an NT Mutant (what comes out of CreateMutex); it would be silly to use a kernel primitive for something that only needs intra-process usermode behavior.

std::mutex currently uses SRWLOCK on Windows 7+, CRITICAL_SECTION on Vista, and the ConCRT critical_section on XP, which (ignoring XP) can be considered OS primitives from the view of a user application. Maybe that's just a bad use of the term on my part.

The only reason it can do that kind of implementation using code already provided by the OS and runtime flexibility is by entirely commenting out the constexpr that should have been on the constructor. If the standard were respected, the available options would be much more limited.

@BillyONeal
Member

The only reason it can do that kind of implementation using code already provided by the OS and runtime flexibility is by entirely commenting out the constexpr

constexpr locks of any form are not possible on XP. When that got put into the standard we thought XP would have a much shorter lifespan than it ended up having.

@MikeGitb

The modern way to implement things like this is to build the waiting out of futex/WaitOnAddress-alikes.

Aren't those OS primitives too? I didn't mean to imply that it has to map to a Windows semaphore, just that it has to use some machinery that ties into the OS (in particular, the scheduler).

BTW, doesn't WaitOnAddress imply no Win7 support (not that I would have a problem with that personally)?

@BillyONeal (Member) left a comment

Aren't those OS primitives too?

I guess I'm just used to fielding questions from folks who are surprised that std::mutex has no relationship with CreateMutex.

BTW, doesn't WaitOnAddress imply no Win7 support (not that I would have a problem with that personally)?

We need a fallback implementation for Vista and 7, probably similar to what we have for the parallel algorithms.

@AlexGuteniev
Contributor

Welp that's a good way to force every standard library to roll their own instead of using tried and true OS primitives.

Requiring constexpr is a countermeasure to the C++ language problem known as the "static initialization order fiasco".

However, construction of a general-purpose inter-thread synchronization object ideally involves no resource allocation, neither during construction nor during synchronization. The reason is to avoid error conditions that are hard to handle in synchronization code. And once resource allocation is removed, adding constexpr on top is not a big deal.

Windows SRWLOCK follows that, so shared_mutex is based on it directly. Windows also has one-time initialization, which is a superset of call_once's features.

latch / barrier / semaphore could have been built on OS objects, if there were suitable ones. Windows has a barrier, but it is a subset of the C++20 barrier. No wonder; Windows developers are not prophets :-)
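To illustrate the constexpr point above: because the constructor can run at compile time, a namespace-scope semaphore is constant-initialized, so dynamic initializers in other translation units can use it regardless of initialization order. A hypothetical sketch (g_pool_slots and _Startup_user are made-up names):

#include <semaphore>

// Constant-initialized at compile time: no dynamic initializer runs for it.
std::counting_semaphore<4> g_pool_slots{4};

// Another object's dynamic initializer can use the semaphore from any TU, in any
// order, without hitting the static initialization order fiasco.
struct _Startup_user {
    _Startup_user() {
        g_pool_slots.acquire();
        // ... startup work ...
        g_pool_slots.release();
    }
} g_startup_user;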
