-
Notifications
You must be signed in to change notification settings - Fork 433
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Redesigning the MultiThreadedExecutor #1276
Conversation
Signed-off-by: Audrow <[email protected]>
Signed-off-by: Audrow <[email protected]>
Signed-off-by: Audrow <[email protected]>
Signed-off-by: Audrow Nash <[email protected]> Co-authored-by: Geoffrey Biggs <[email protected]>
15cb8cf
to
e46398f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for triggering the discussion @audrow !
} | ||
|
||
:Advantages: | ||
* no spurious wake ups |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As commented above, this may avoid spurious wake ups.
But for example, it doesn't fix the problem for timers (though, it does fix the problem in all waitables were "taking data" has a meaning).
I still like the idea of somehow "taking the data" early.
I don't think that "taking data" is conceptually correct, more on that in a different comment.
For the entities where "taking data" doesn't' have sense, the problem can be fixed by making them mutually exclusive with themselves (one thread cannot wait or execute the entity if another thread is executing it).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the entities where "taking data" doesn't' have sense, the problem can be fixed by making them mutually exclusive with themselves (one thread cannot wait or execute the entity if another thread is executing it).
Should this be another solution listed? Would this work for all entity types? If so, could you briefly describe its implementation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be another solution listed?
Distinguishing between entities that can be executed concurrently (e.g. you can take data from a subscription from different threads) from the ones that cannot sounds like right to me.
Would this work for all entity types?
In entities like timer
, waiting on it to be ready on one thread while scheduling it for execution in another thread is a problem.
For other entities (e.g. subscriptions), that would only worsen performance.
If so, could you briefly describe its implementation?
We could have a flag or something that differentiates between "concurrent" and "non-concurrent" entities.
We could create a mutually exclusive callback group for each timer instead of putting them in the default callback group.
Currently when an entity on a ME callback group is being executed, all the entities in the group (including the one scheduled for execution) cannot be waited on in another thread neither be executed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could create a mutually exclusive callback group for each timer instead of putting them in the default callback group.
How would that help? It breaks the existing expectation of users which is that callbacks are not called concurrently unless they put them in different callback groups (the same as ROS 1).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How would that help? It breaks the existing expectation of users which is that callbacks are not called concurrently unless they put them in different callback groups (the same as ROS 1).
AFAIU:
- Callbacks from different callback groups can be executed concurrently.
- Callbacks from the same Reentrant Callback Group (RCG) can be executed concurrently.
- Callbacks from the same Mutually Exclusive Callback Group (MECG) cannot be executed concurrently.
For entities that aren't reentrant (e.g. timers) we could do the following:
- By default each timer is added to a new MECG that only contains that timer, and that MECG will not be used for other entities later.
- The user can pass a custom MECG. Passing a RCG is not allowed.
I think that doesn't break user assumptions. If it does break them, I'm probably miss-understanding how callback groups work.
This might not be the ideal way of implementing non-reentrant entities, maybe having an ad-hoc flag is a better idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are timers not reentrant? You should be able to have a timer in a reentrant callback group and have timer callbacks being called concurrently, if that's what you desire.
If you put each timer in its own MECG, then it can be run concurrently with subscription callbacks (assuming the user did not specify a custom CG for either), which is different from ROS 1, where the callbacks are mutually exclusive by default (even between subscriptions and timers).
Putting ROS 1 aside, if you right now create a timer and a subscription, and then spin in a multi-threaded executor they are mutually exclusive, your proposed change would change that behavior, breaking their expectation.
Either way, I don't see how this fixes the problem, but that's likely due to me loosing context on the conversation...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are timers not reentrant? You should be able to have a timer in a reentrant callback group and have timer callbacks being called concurrently, if that's what you desire.
We have a race condition when trying to wait on timers in a thread while another thread has scheduled its execution, and sometimes the timer is executed twice in a row (here a failure in CI, but it's almost impossible to reproduce the failure, it's extremely flaky).
There's a patch in the executor trying to avoid the race condition, but I think the patch suffers from a race condition too as explained here.
The approach proposed in #1241 doesn't solve the problem and I proposed a custom workaround in #1275.
I think that having the concept of entities that cannot be executed reentrantly could also solve the problem, and I think that's particular suitable for timers.
If you put each timer in its own MECG, then it can be run concurrently with subscription callbacks (assuming the user did not specify a custom CG for either), which is different from ROS 1, where the callbacks are mutually exclusive by default (even between subscriptions and timers).
Sorry, I thought that the default callback group was reentrant and not ME (code here). The change I proposed wouldn't be acceptable.
void take_data(std::shared_ptr<void> & data) { | ||
data = buffer->consume_data(); | ||
} | ||
|
||
void execute(std::shared_ptr<void> & data) { | ||
run_any_callback(data_); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My main concern is how data is type erased here.
I don't think that's a good idea for a user facing API.
I think it's a better idea to have something like:
std::function<void()> get_handle() {
return [data=buffer->consume_data()]() {
run_any_callback(data);
};
}
In that way, data is not type erased.
IIRC @wjwwood raised the concern that this approach forces entities to have a way of "handling" the data, and it would be nice in the future to be able to create, for example, a Subscription
where you can directly take the data.
Though that would be nice, all our current API is based on callbacks.
i.e.: you need to actually pass a callback when creating a Subscription
/etc.
If we want to add support to a WaitSet+take approach (which I think it would be nice for some users), we will actually need to split entities e.g. Subscription
and ExecutableSubscription
.
In that approach, I think that waitables actually don't need to have the concept of "taking data" neither.
You only need to be able to check if the waitable is ready, and then you can execute whatever code is needed.
Longer term:
I think we should separate the concept of Waitable/Condition from the concept of Executable.
The current design mixes both, and they aren't necessarily the same.
I would like to write how I think that can be done, but I don't think that's the goal of this discussion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we don't erase the type in the interface, I'm convinced that #1275 is a good idea (each time I read the PR I'm a bit confused, but I think I'm convinced now 😂).
Without that, I still don't like the proposal much (sorry for being a bit hardheaded).
Maybe there's a different way of avoiding erasing the type, instead of my get_handle
idea. I would like to hear of other alternatives!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added the get_handle
method in with an example. Let me know what you think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still convinced #1275 is not the right solution, also sorry for being hardheaded, and I don't have an issue with the type erasure, since the root cause is that we have to deal with type erased data at the the rcl
/rmw
layer anyways.
If we added the "handle" which contains a type erased pointer, which the callback could re-type internally, I really don't see that issue with what was proposed here, w.r.t. the std::shared_ptr<void> & data
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we added the "handle" which contains a type erased pointer, which the callback could re-type internally, I really don't see that issue with what was proposed here, w.r.t. the std::shared_ptr & data.
The difference is that you don't expose to the user a void *
.
|
||
* Add a preparation method to get the event after :code:`take_data` | ||
* Take a lambda that can be executed later | ||
* Store conditions in the waitset (like DDS) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that having the concept of conditions (like DDS) is nice, but I'm not sure what this item means exactly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We spoke about it in our meeting. I'm not sure that I understand it well enough to explain.
It was something like, we could do what DDS does, which is use conditions that can be peaked at or read. Once a condition is read, it is removed from the queue, so they don't come up again. I believe this would avoid spurious wakeups. @wjwwood, is this correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently in a ROS 2 wait set you can add different kind of entities (subscription, service, ...).
In DDS
, you can only add conditions to a WaitSet.
Conditions API is quite simple, you can only check if them are ready or dispatch an attached callback.
There's no concept of "reading" or "peaking" a condition AFAIU.
I think that having something similar to DDS conditions will simplify our custom handling of different cases in wait set and executor code a lot (and the abstractions looks nice too 😃 ).
Signed-off-by: Audrow Nash <[email protected]>
Signed-off-by: Audrow Nash <[email protected]>
Signed-off-by: Audrow Nash <[email protected]>
Signed-off-by: Audrow Nash <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to avoid spurious wake-up is really nice and cost effective. this can also fix the ros2/ros2#1035 and #1399.
I'm going to close this because there hasn't been movement on this in a few years. Feel free to open this if it becomes relevant again. |
Adds a document for discussing how the
MultiThreadedExecutor
should be redesigned to avoid calling execute more than one time on one executable object. This discussion was prompted by #1241 and #1275.