
Additional items for the ROS 2 Galactic Roadmap #901

Open

dejanpan opened this issue Nov 13, 2020 · 19 comments

dejanpan commented Nov 13, 2020

We are currently doing a straight port of one rather large project (160K+ lines of code) from ROS 1 to ROS 2. In the process of doing that we encountered several features not yet available in ROS 2: https://github.com/tier4/AutowareArchitectureProposal/blob/ros2/docs/porting-to-ROS2.md.

I'd like us to add some of these features to the agenda for ROS 2 Galactic, if possible:

  1. ROS 2 Timers
    1. There is already an issue for this: Extend Time implementation into Timer and Rate objects rclcpp#465
  2. Synchronous service clients
  3. topic_tools
    1. This work is already in progress and you can assign it to Apex.AI
      1. https://github.com/ApexAI/topic_tools
      2. https://github.com/ApexAI/rclcpp_generic
  4. IDL type evolution
    1. To be able to reuse old bags recorded with older versions of msg/IDL definitions.
    2. ROS 1 has rosbag/migration
    3. DDS has Extensible and Dynamic Topic Types
  5. Reading multi-type parameters as an array from the yaml file via the ros2 parameters interface. In ROS1, XmlRpc was used.
    1. The specific example can be seen in system_monitor package. Parameters for the hdd monitor accepted a set of parameters for an arbitrary number of disks. These parameters used to be read using XmlRpc.
    2. As a workaround, rather than string parsing, we used manually indexed prefixes in the config file and changed the parameter reading logic accordingly.
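
As a hedged sketch only (parameter names here are invented; the real system_monitor configuration differs), the indexed-prefix workaround looked roughly like this in the YAML file:

```yaml
# Hypothetical sketch: each disk gets a manually numbered prefix, so every
# entry is a flat, individually declared ROS 2 parameter instead of an
# arbitrary XmlRpc structure.
hdd_monitor:
  ros__parameters:
    num_disks: 2
    disk0.name: "/dev/sda"
    disk0.temp_warn: 55.0
    disk1.name: "/dev/sdb"
    disk1.temp_warn: 60.0
```

The reading logic then loops from 0 to num_disks and looks up each prefixed parameter by name.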

@clalancette I believe that this is best directed to you.

clalancette (Contributor) commented:
In general, feel free to open a pull request against https://github.com/ros2/ros2_documentation/blob/master/source/Roadmap.rst and tag me; I'm happy to review it.

I will tell you that I have serious reservations about item number 2 there. I think it was a mistake to introduce the synchronous service client into Python, and I don't think we should replicate that mistake in rclcpp. But we can discuss it on the PR.

nnmm (Contributor) commented Nov 17, 2020

@clalancette If you don't mind me already trying to clarify point 2 (I ran into this problem while porting and passed it on to Dejan): Do you mean the problems mentioned in https://index.ros.org/doc/ros2/Tutorials/Sync-Vs-Async/?

I think a nice asynchronous future-based API would make everyone happy as well.¹ The problem with the existing API is that while it returns a future, you can't really seem to retrieve its result if you're inside a node (rclcpp::spin_until_future_complete() throws, wait() never returns the result). I had assumed this was a known limitation and not a bug since the service example works around this by making the service call outside a node. So this is what's missing: The ability to call a service from e.g. a subscription callback and doing so without cumbersome and undocumented workarounds², only using wait_for().

¹ My impression of a callback API in contrast is that it makes it hard to control timing and to handle errors.
² I'm just mentioning that since I haven't managed to do it at all, but it's possible there is some way to call a service from inside a node. If there is, it's not documented – the service example linked above and this explainer do not describe what to do for that use case at all.

clalancette (Contributor) commented:

@clalancette If you don't mind me already trying to clarify point 2 (I ran into this problem while porting and passed it on to Dejan): Do you mean the problems mentioned in https://index.ros.org/doc/ros2/Tutorials/Sync-Vs-Async/?

Yep, those are exactly the problems I'm hoping to avoid. It's just too easy for users to call a synchronous service call inside of the callbacks and then cause a deadlock.

I think a nice asynchronous future-based API would make everyone happy as well.¹ The problem with the existing API is that while it returns a future, you can't really seem to retrieve its result if you're inside a node (rclcpp::spin_until_future_complete() throws, wait() never returns the result). I had assumed this was a known limitation and not a bug since the service example works around this by making the service call outside a node. So this is what's missing: The ability to call a service from e.g. a subscription callback and doing so without cumbersome and undocumented workarounds², only using wait_for().

It is definitely possible to call a service from inside a node. There's an example of doing just this in https://github.com/ros2/demos/blob/master/demo_nodes_cpp/src/services/add_two_ints_client_async.cpp. That solution uses a callback to get the result when it is ready. The other way to do it is to store the future returned from async_send_request, and then spin off a thread to wait for it to finish (you can also use std::async, which is more or less syntactic sugar over that).

You are definitely correct that our documentation doesn't go over all of this very well. So that's a place where we should improve.

nnmm (Contributor) commented Nov 17, 2020

Yes, the callback works, but I am interested in a version without callbacks. Please correct me if I'm wrong, but I see a callback-based service-client model as basically the same as two pub-sub pairs (request and response), with most of the associated problems. You send off the request, but if e.g. there is no service running, your callback just will never be called, so you can't really handle that kind of error. You can't set a limit on how long the call may take. The order of responses for two service calls becomes unpredictable. Between the request and the callback, the state of your node may have totally changed. You probably know all of this and agree that straight-line code is just much easier to reason about. For me, "straight-line code" doesn't need to be synchronous, it can also mean a future that you wait on for a certain amount of time, and that's what I meant by "nice asynchronous future-based API". Basically this:

auto res = client_->async_send_request(...);
if (res.wait_for(10ms) == std::future_status::ready) {
  ... res.get() ...
}

Does that make sense? So the open questions for me are

  1. Does ROS2 aim to provide a way to write service calls as straight-line code? If not, I'd argue that's a feature many people expect, coming from ROS1 and other RPC frameworks, and I (perhaps ignorantly) don't see how a callback approach adds much to the existing pub/sub mechanism.
  2. If yes, is that currently possible? I haven't managed yet, see the code below.
  3. And also if the answer to 1 is yes, can ROS2 provide a "nice" way of doing that, i.e. not having to do extra threads and so on, ideally just the above wait_for()?

Regarding point 2, you pointed out

store the future as returned from async_send_request, and then spin off a thread to wait for it to finish

I tried the following, and it only printed the first log message. Do you mean some other way of waiting?

auto result_future = client_->async_send_request(request);
AddTwoInts::Response::SharedPtr result;
std::thread wait_thread([this, &result_future, &result](){
  RCLCPP_INFO(this->get_logger(), "In thread");
  auto status = result_future.wait_for(1000ms);
  RCLCPP_INFO(this->get_logger(), "Finished waiting");
});
wait_thread.join();

Or is the issue that you need to return from the function that made the request before you can wait (which means no straight-line code)?

clalancette (Contributor) commented:

Yes, the callback works, but I am interested in a version without callbacks. Please correct me if I'm wrong, but I see a callback-based service-client model as basically the same as two pub-sub pairs (request and response), with most of the associated problems.

In point of fact, that is exactly how it is implemented. I agree that it has some problems, but:

You send off the request, but if e.g. there is no service running, your callback just will never be called, so you can't really handle that kind of error.

We do have service_is_ready, so you can tell if any server is available to take your call. That is subject to a TOCTTOU race, of course.

You can't set a limit on how long the call may take.

You can if you use the future and a timeout on the future. But you are right, you can't do it with the callback. That might be a feature we could add, though I'd have to think about it a bit more.

The order of responses for two service calls becomes unpredictable.

That's true; if I make two calls back-to-back, they both arrive at the destination at about the same time, and the second one completes first, then I'll get the second response first. But that seems like the service call is being misused in that case. In the "straight-line" code, you'd do:

future = client->async_send_request();
future.wait_for(1000ms);
future = client2->async_send_request();
future.wait_for(1000ms);

In the callback case, you'd do:

auto cb2 = [] (SharedFuture future) { if (future.get() == SUCCESS) printf("Success\n"); };
auto cb1 = [] (SharedFuture future) { if (future.get() == SUCCESS) client->async_send_request(cb2); };
client->async_send_request(cb1);

I'm not arguing that the second case is clearer, but I am arguing that you can achieve the same semantics.

Between the request and the callback, the state of your node may have totally changed.

While that is true, that is true of anything that takes a substantial amount of time to complete. So I'm not sure that this is an argument against a callback mechanism.

Or is the issue that you need to return from the function that made the request before you can wait (which means no straight-line code)?

Yes, that's exactly the problem with that code. Fundamentally, in the default case, you have a single threaded executor calling callbacks as events happen. Within those callbacks, you can't do things that wait on further callbacks coming in; you are blocked inside of the callback.

The way around it is of course threading, either by hand (std::thread), or by using the MultiThreadedExecutor. In the latter case, you can actually write the straight-line code you want, since you'll have other threads available to run callbacks for you.

1. Does ROS2 aim to provide a way to write service calls as straight-line code? If not, I'd argue that's a feature many people expect, coming from ROS1 and other RPC frameworks, and I (perhaps ignorantly) don't see how a callback approach adds much to the existing pub/sub mechanism.

Here's the thing about service calls as straight-line code. Let's assume for a second that we can solve the deadlock case, and you could call service calls "straight-line" from callbacks. I still don't think that is a good idea, because now your callbacks can take arbitrary amounts of time, and now you are potentially starving the rest of callbacks in the system while you are waiting for your service result.

nnmm (Contributor) commented Nov 17, 2020

I can answer in more detail tomorrow, but I wanted to say thanks for troubling yourself to answer so quickly but also clearly.

For now, let me just add one comment on this:

Let's assume for a second that we can solve the deadlock case, and you could call service calls "straight-line" from callbacks. I still don't think that is a good idea, because now your callbacks can take arbitrary amounts of time, and now you are potentially starving the rest of callbacks in the system while you are waiting for your service result.

Isn't the question of straight-line/callback style independent from whether timeouts are used? I think we agree that any good solution should discourage waiting indefinitely.

dejanpan added a commit to dejanpan/ros2_documentation that referenced this issue Nov 18, 2020
For full description and discussions see ros2#901.
dejanpan (Author) commented Nov 18, 2020

@clalancette @nnmm I opened a PR for the items above: #914.

I will wait until you converge on item 2 and add it afterwards.

clalancette (Contributor) commented:

Isn't the question of straight-line/callback style independent from whether timeouts are used? I think we agree that any good solution should discourage waiting indefinitely.

Sort of. I was more thinking about the case where you did this in a callback:

future = client->async_send_request()
future.wait_for(1000ms);

If the future took 990ms to return, you would still succeed, but you would have starved the rest of the system for essentially a full second. That number is somewhat contrived, but any timeout you choose to put in there has the potential to add that amount of latency to downstream callbacks.

In general, in this callback-based system, it is best to do as little work as possible in the callbacks and return quickly. That way you keep the latency down on the entire system.

fred-apex-ai commented:

@clalancette Could you provide or point to an example of how to write "straight-line"-ish code with a multi-threaded executor? Would that mean that, in writing a node in a particular way, I need to make assumptions about which executor it is used with? That sounds like it breaks encapsulation.

nnmm (Contributor) commented Nov 19, 2020

We do have service_is_ready, so you can tell if any server is available to take your call. That is subject to a TOCTTOU race, of course.

Yep, and there are other reasons your callback might never run, like an unreliable network.

Between the request and the callback, the state of your node may have totally changed.

While that is true, that is true of anything that takes a substantial amount of time to complete. So I'm not sure that this is an argument against a callback mechanism.

I was thinking about a setup where no other callbacks are being run while the current one is still active, as with the default single-threaded executor (although I don't know if that is fundamentally incompatible with receiving the service call result).

I'd argue it's a general advantage of straight-line code that it tends to have less cognitive complexity: It's more "well-behaved" by default which helps prevent bugs. It potentially also means people need to know/think less to use it, as well as write less code – as an example, imagine a service handler that computes its response by calling two other services in parallel (gist with example code here).

The way around it is of course threading, either by hand (std::thread), or by using the MultiThreadedExecutor. In the latter case, you can actually write the straight-line code you want, since you'll have other threads available to run callbacks for you.

I concur with @fred-apex-ai. While this shouldn't become a how-to discussion: in the example above I was already using a MultiThreadedExecutor, and that blocked forever. A WaitSet approach didn't work either.

In general, in this callback-based system, it is best to do as little work as possible in the callbacks and return quickly. That way you keep the latency down on the entire system.

I see what you mean, users can set a too large timeout and jam their system when the called service does too much work itself, is down, or experiences a latency spike. They already can do that in various other ways though, by computing for too long or with tf2_ros timeout APIs for instance.¹

Despite that, it could be useful to make straight-line service calls from inside a node. One scenario would be service calls that are quick relative to the latency requirements of your messages. There are probably also nodes where queuing up incoming messages is the intended behavior. I'm sure other people have more to add to this.

In an effort to bring this to a conclusion, my summary would be:

  • Straight-line service calls have nice properties for writing correct code
  • Straight-line service calls can easily add latency to a node. So the question is if the valid use cases are worth the risk.
  • In practice, supporting straight-line service calls is difficult. I don't think a works-only-in-specific-setups API like Python's synchronous API would be worth it.
  • Adding timeouts to the existing callback API (with callbacks being called also in the failure case) could help control latency and handle errors.
  • Services definitely should have documentation for everything discussed here, I'd suggest right in the rclcpp docs.

I think you're better equipped to comment on what should be in the roadmap, Chris.

¹ And they should be able to mitigate the problem with a multithreaded executor, at the cost of making things happen behind their backs again, right?

fred-apex-ai commented:

One takeaway for me is to realize the cost associated with any service call, both in terms of latency and algorithmic complexity. That leads me to recommend avoiding service calls when possible, and designing the hot-path functions so that they have all the required input when they start.

clalancette pushed a commit to dejanpan/ros2_documentation that referenced this issue Nov 30, 2020
For full description and discussions see ros2#901.
clalancette (Contributor) commented:

First, apologies for the long delay in response here; I was on vacation all last week, and I'm just catching up. But I also wanted to spend some time to put examples together to show you how you can achieve the straight-line code you are looking for (even if I still don't think it is a good idea, for other reasons). With that said, I'll respond to some of the comments:

@clalancette Could you provide or point to an example how to write "straight-line"ish code with a multi-threaded executor? Would that mean that in writing a node in a particular way, I need to make assumptions about in what executor it is used? That would sound like it breaks encapsulation

https://github.com/clalancette/executor_examples now has a number of examples of calling a service "in-line" in a callback. Each example has a bit of explanatory text explaining why it does or does not work. In short, the examples in there that work are:

(the other examples have explanations on why they don't work for straight-line code)

I see what you mean, users can set a too large timeout and jam their system when the called service does too much work itself, is down, or experiences a latency spike. They already can do that in various other ways though, by computing for too long or with tf2_ros timeout APIs for instance.¹

Right, that's why my general recommendation is to do as little work as possible in the callback, and then do "real" processing elsewhere.

In an effort to bring this to a conclusion, my summary would be:

  • Straight-line service calls have nice properties for writing correct code

Yes, agreed.

  • Straight-line service calls can easily add latency to a node. So the question is if the valid use cases are worth the risk.

Agreed.

  • In practice, supporting straight-line service calls is difficult. I don't think a works-only-in-specific-setups API like Python's synchronous API would be worth it.

Yes, exactly. That was my initial objection.

  • Adding timeouts to the existing callback API (with callbacks being called also in the failure case) could help control latency and handle errors.

Yes, though at the moment I still don't know exactly what that looks like.

  • Services definitely should have documentation for everything discussed here, I'd suggest right in the rclcpp docs.

Agreed that there should be more documentation around this.

I'll also add in one more thought that occurred to me as I was playing around with/thinking about this. One other way to allow more "straight-line" code would be to add Unix select-type functionality to the executors. That is, instead of registering for callbacks, you instead register and get "handles" back, and then you call select to wait for one of those handles to become ready. You then service the handles that are ready. There are a bunch of details in there (and I haven't thought this all the way through), but it is another paradigm that might work here.

I think you're better equipped to comment on what should be in the roadmap, Chris.

I'm trying to keep the roadmap a list of things that could reasonably get done for Galactic. At the moment, Open Robotics doesn't have the resources to implement any of the above. So while I'm happy to add something to the roadmap, I would only do so if someone else commits to it.

clalancette self-assigned this Dec 17, 2020
jdlangs commented Dec 17, 2020

IMO, using the MultiThreadedExecutor is the simplest solution to the problem and should be appropriate for the large majority of use cases. I think the main problem is just the awkwardness of the API, where manual management of callback group objects is required. A small change that could go a long way towards mitigating this is an option to make the default callback group reentrant when the node is constructed (or even to flat-out change the default). Is there any downside to that?

clalancette (Contributor) commented:

A small change that could go a long way towards mitigating this is an option to make the default callback group reentrant when the node is constructed

We could certainly consider adding an option to node construction. Note that this ends up being syntactic sugar, since you can achieve this in a more verbose way by creating your own callback groups.

(or even to flat-out change the default). Is there any downside to that?

Changing the default would be a big change. Right now, any code that is written using the default can use resources shared with other callbacks with impunity (as it is guaranteed only one runs at a time, even with the MultiThreadedExecutor). If we change the default, then suddenly those assumptions no longer hold true and there is a large potential for breakage in very subtle ways. I don't think we can reasonably do that, or at least reasonably do that without a very strong justification.

jdlangs commented Dec 17, 2020

Yes, you can already get the behavior now; I just think there's a lot to be said for the user experience of flipping a bit one time in one location instead of worrying about the callback group object for everything in the node.

I will try to create a draft PR for this over the holidays since I expect it should be a very simple change.

wjwwood (Member) commented Jan 4, 2021

Sorry I haven't chimed in yet, but this conversation (specifically about item number 2) kept getting longer and I could never catch up till now :)


First I want to address what I think is the core issue here for an async/await style service call (what you all have been calling "straight-line service call" I think).

@clalancette said and then @nnmm replied:

Let's assume for a second that we can solve the deadlock case, and you could call service calls "straight-line" from callbacks. I still don't think that is a good idea, because now your callbacks can take arbitrary amounts of time, and now you are potentially starving the rest of callbacks in the system while you are waiting for your service result.

Isn't the question of straight-line/callback style independent from whether timeouts are used? I think we agree that any good solution should discourage waiting indefinitely.

Then @clalancette also said:

Sort of. I was more thinking about the case where you did this in a callback:

future = client->async_send_request()
future.wait_for(1000ms);

If the future took 990ms to return, you would still succeed, but you would have starved the rest of the system for essentially a full second. That number is somewhat contrived, but any timeout you choose put in there has the potential to add that amount of latency to downstream callbacks.

In general, in this callback-based system, it is best to do as little work as possible in the callbacks and return quickly. That way you keep the latency down on the entire system.

I think @clalancette is right here that we don't want people blocking inside of callbacks, because it will prevent other work from being done; this is especially true in a single-threaded executor, but it also ties up resources in a multithreaded executor. And anyway, we don't want the business logic to care about which executor is used.

The obvious way to fix this, in my experience, is what @clalancette suggested elsewhere: you spawn a thread and continue the work there. You can observe this pattern in many frameworks, but it is sometimes very cumbersome. A good example is older JavaScript, where there is no blocking and no async/await syntax, so everything becomes do_this_thing_that_takes_time.then(on_success=<anonymous function>, on_error=<another anonymous function>), and it recurses within those anonymous functions. You can also see this in frameworks like Boost.Asio and RxCpp.

This obviously isn't ideal, but it's what you have to do without language support for async/await.

More modern languages like Python3 and newer versions of Javascript have first-class support for async/await, which allows you to write code that appears to be synchronous (what you all have been calling "straight line code") but is actually asynchronous.


So, at best I think we could:

Would an example that looks something like this be ok with you guys?

void ros_timer_callback_from_spin()
{
  auto future = client_->async_send_request(
    request,
    10s,
    [](auto response) { /* success, could continue logic here */ },
    [](auto error) { /* could be timeout or error */ });
}

My opinion on this is that we cannot do better (than the example just above) because of two reasons:

  • C++ does not give us async/await syntax and support, which is what makes this workable in languages like Python
  • the abstraction of the waiting via DDS's waitset and/or listeners (assuming a DDS implementation) prevents us from integrating into the event loops of other frameworks like RxCpp or Boost.Asio
    • I actually discussed this at length with Misha Shalem at Apex (I don't know his GitHub handle) probably more than a year ago, specifically about reusing an existing framework like Boost.Asio, and he had some interesting ideas on how to do this by implementing our own io_service for Boost.Asio using the DDS wait system, but we never did anything with that.
    • also, this is kind of an issue with how it is implemented in Python (rclpy) at the moment (if I understand the current state of affairs there correctly), because we cannot properly integrate into Python's asyncio system due to needing to wait on handles ourselves rather than delegating that to Python

I think we all agree what we'd like is something like this:

void some_callback_from_spin(...)
{
  // ...
  auto result = await service_client->make_async_request(..., 10s);
  if (result.did_timeout) {
    // error handling
  }
  auto response = result.get();
  print_result(response);
}

But I think the best we can do is something like this:

void some_callback_from_spin(...)
{
  // ...
  service_client->make_async_request(
    ...,
    10s,
    [](auto response) {
      print_result(response);
    },
    [](auto error) {
      // error handling
    }
  );  
}

wjwwood (Member) commented Jan 4, 2021

I should say, too, that even if we want to do what I proposed, the technical details of how to do it are still a bit fuzzy to me, but I think it can be done. It will probably require the use of ROS timers and/or threads, in which case we'd probably want to expose those details somehow, because while convenient, this may cause scaling issues (creating a new thread each time vs. using a thread pool, for instance).

fred-apex-ai commented Jan 7, 2021

Dear @wjwwood, thanks a lot for taking the time to share your thoughts and an explicit proposal. This asynchronous discussion about asynchronous calls really adds a lot of latency (pun intended): two months since we started, and it's getting hard to follow.

Adding an error callback would be a big improvement. If callback style is desired, how about going all in and having a version without std::future? I.e. no return value and only the plain values (or references to them) as callback parameters.

I'm not sure about the type passed to the error callback in the above example. Currently, the service handler can't send errors back to the client (unless modeled by the response data type), right?

Our conclusion

We (i.e. @nnmm and myself) would be very pleased if you could add the items below to the agenda for future work. We would support you by buying you a coffee or beer if that helps :)

add an error/timeout callback(s) to the async_send_request() method

yes, sounds great

preventing recursive spinning with a better error message

yes, please

more documentation

oh yeah, I think this issue contains enough links and knowledge on its own

IMO, using the MultiThreadedExecutor is the simplest solution to the problem and should be appropriate for the large majority of use cases. I think the main problem is just the awkwardness of the API, where manual management of callback group objects is required

If that helps and it can be made a bit more user friendly, that would help, too.

async/await

We aren't async/await experts over here, but isn't async/await a different thing from what was discussed here? The function that uses the "straight-line" call would look and behave like a normal synchronous function from the outside, but a function using await would have to be async itself and couldn't be used as a subscription callback?

clalancette (Contributor) commented:

@fred-apex-ai What I'll do is suggest you open up issues corresponding to each of the features you'd like to see. I can't guarantee that we'll work on them, but they will at least be tracked and can be considered for a future release. It looks like most of these feature requests are for rclcpp, so I'll suggest opening them against https://github.com/ros2/rclcpp.

As for some documentation updates, it would be really helpful if you could look through the existing documentation in this repository and find a place where you feel like what you learned here applies. And then open a PR with some of those changes. I'm happy to review changes as they come in.

Thanks!

clalancette pushed a commit to dejanpan/ros2_documentation that referenced this issue Mar 1, 2021
For full description and discussions see ros2#901.