Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segfault due data race on destroyed listener #717

Closed
mauropasse opened this issue Oct 3, 2023 · 11 comments
Closed

Segfault due data race on destroyed listener #717

mauropasse opened this issue Oct 3, 2023 · 11 comments

Comments

@mauropasse
Copy link
Contributor

mauropasse commented Oct 3, 2023

On ROS2 Iron (or Galactic/Humble with EventsExecutor package), the example below produces a seg-fault due a data race trying to access a deleted listener

custom_event_info->get_listener()->set_on_new_event_callback(

I made a proof of concept fix for this: mauropasse@adf1aa7

Not sure if there are better ways to handle the situation. What do you think?

#include <memory>
#include <thread>
#include <chrono>
#include "rclcpp/rclcpp.hpp"
#include "std_msgs/msg/string.hpp"

class MinimalPublisher : public rclcpp::Node
{
public:
  MinimalPublisher(std::string name)
  : Node(name)
  {
    publisher_ = this->create_publisher<std_msgs::msg::String>("topic", rclcpp::SensorDataQoS());
  }
private:
  rclcpp::Publisher<std_msgs::msg::String>::SharedPtr publisher_;
};

int main(int argc, char * argv[])
{
  rclcpp::init(argc, argv);
  auto executor = std::make_shared<rclcpp::experimental::executors::EventsExecutor>();
  std::thread spinner([&](){executor->spin();});

  for (int i=0; i<1000;i++) {
      auto node = std::make_shared<MinimalPublisher>("Node_"+std::to_string(i));
      executor->add_node(node);
  }

  rclcpp::shutdown();
  spinner.join();
  return 0;
}

The backtrace:

Thread 4 "example_segfault" received signal SIGSEGV, Segmentation fault.

#0  0x..in rmw_fastrtps_shared_cpp::__rmw_event_set_callback (rmw_event=0x5555559ea0b0, callback=0x0, user_data=0x0)
    at /root/iron_ros2/src/ros2/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/rmw_event.cpp:160
#1  0x..in rmw_event_set_callback (rmw_event=0x5555559ea0b0, callback=0x0, user_data=0x0)
    at /root/iron_ros2/src/ros2/rmw_fastrtps/rmw_fastrtps_cpp/src/rmw_event.cpp:59
#2  0x..in rmw_event_set_callback (v3=0x5555559ea0b0, v2=0x0, v1=0x0)
    at /root/iron_ros2/src/ros2/rmw_implementation/rmw_implementation/src/functions.cpp:738
#3  0x..in rcl_event_set_callback (event=0x5555558853b8, callback=0x0, user_data=0x0) at /root/iron_ros2/src/ros2/rcl/rcl/src/rcl/event.c:244
#4  0x..in rclcpp::EventHandlerBase::set_on_new_event_callback (this=0x555555885360, callback=0x0, user_data=0x0)
    at /root/iron_ros2/src/ros2/rclcpp/rclcpp/src/rclcpp/event_handler.cpp:86
#5  0x..in rclcpp::EventHandlerBase::clear_on_ready_callback (this=0x555555885360)
    at /root/iron_ros2/src/ros2/rclcpp/rclcpp/include/rclcpp/event_handler.hpp:219
#6  0x..in rclcpp::EventHandlerBase::~EventHandlerBase (this=0x555555885360, __in_chrg=<optimized out>)
    at /root/iron_ros2/src/ros2/rclcpp/rclcpp/src/rclcpp/event_handler.cpp:47
#7  0x..in rclcpp::EventHandler<std::function<void (rmw_qos_incompatible_event_status_s&)>, std::shared_ptr<rcl_publisher_s> >::~EventHandler() (this=0x555555885360, __in_chrg=<optimized out>) at /root/iron_ros2/src/ros2/rclcpp/rclcpp/include/rclcpp/event_handler.hpp:239
#8  0x..in __gnu_cxx::new_allocator<rclcpp::EventHandler<std::function<void (rmw_qos_incompatible_event_status_s&)>, std::shared_ptr<rcl_publisher_s> > >::destroy<rclcpp::EventHandler<std::function<void (rmw_qos_incompatible_event_status_s&)>, std::shared_ptr<rcl_publisher_s> > >(rclcpp::EventHandler<std::function<void (rmw_qos_incompatible_event_status_s&)>, std::shared_ptr<rcl_publisher_s> >*) (this=0x555555885360, __p=0x555555885360)
    at /usr/include/c++/11/ext/new_allocator.h:168
#9  0x..in std::allocator_traits<std::allocator<rclcpp::EventHandler<std::function<void (rmw_qos_incompatible_event_status_s&)>, std::shared_ptr<rcl_publisher_s> > > >::destroy<rclcpp::EventHandler<std::function<void (rmw_qos_incompatible_event_status_s&)>, std::shared_ptr<rcl_publisher_s> > >(std::allocator<rclcpp::EventHandler<std::function<void (rmw_qos_incompatible_event_status_s&)>, std::shared_ptr<rcl_publisher_s> > >&, rclcpp::EventHandler<std::function<void (rmw_qos_incompatible_event_status_s&)>, std::shared_ptr<rcl_publisher_s> >*) (__a=..., __p=0x555555885360)
    at /usr/include/c++/11/bits/alloc_traits.h:535
#10 0x..in std::_Sp_counted_ptr_inplace<rclcpp::EventHandler<std::function<void (rmw_qos_incompatible_event_status_s&)>, std::shared_ptr<rcl_publisher_s> >, std::allocator<rclcpp::EventHandler<std::function<void (rmw_qos_incompatible_event_status_s&)>, std::shared_ptr<rcl_publisher_s> > >, (__gnu_cxx::_Lock_policy)2>::_M_dispose() (this=0x555555885350) at /usr/include/c++/11/bits/shared_ptr_base.h:528
#11 0x..in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x555555885350)
    at /usr/include/c++/11/bits/shared_ptr_base.h:168
#12 0x..in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x7ffff447e018, __in_chrg=<optimized out>)
    at /usr/include/c++/11/bits/shared_ptr_base.h:705
#13 0x..in std::__shared_ptr<rclcpp::Waitable, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x7ffff447e010, __in_chrg=<optimized out>)
    at /usr/include/c++/11/bits/shared_ptr_base.h:1154
#14 0x..in std::shared_ptr<rclcpp::Waitable>::~shared_ptr (this=0x7ffff447e010, __in_chrg=<optimized out>)
    at /usr/include/c++/11/bits/shared_ptr.h:122
#15 0x..in rclcpp::CallbackGroup::collect_all_ptrs(std::function<void (std::shared_ptr<rclcpp::SubscriptionBase> const&)>, std::function<void (std::shared_ptr<rclcpp::ServiceBase> const&)>, std::function<void (std::shared_ptr<rclcpp::ClientBase> const&)>, std::function<void (std::shared_ptr<rclcpp::TimerBase> const&)>, std::function<void (std::shared_ptr<rclcpp::Waitable> const&)>) const (this=0x5555558845b0, sub_func=..., service_func=..., 
    client_func=..., timer_func=..., waitable_func=...) at /root/iron_ros2/src/ros2/rclcpp/rclcpp/src/rclcpp/callback_group.cpp:111
#16 0x..in rclcpp::executors::build_entities_collection (callback_groups=std::vector of length 1, capacity 1 = {...}, collection=...)
    at /root/iron_ros2/src/ros2/rclcpp/rclcpp/src/rclcpp/executors/executor_entities_collection.cpp:56
#17 0x..in rclcpp::experimental::executors::EventsExecutor::refresh_current_collection_from_callback_groups (this=0x555555635da0)
    at /root/iron_ros2/src/ros2/rclcpp/rclcpp/src/rclcpp/experimental/executors/events_executor/events_executor.cpp:392
#18 0x..in operator() (__closure=0x555555636c70)
    at /root/iron_ros2/src/ros2/rclcpp/rclcpp/src/rclcpp/experimental/executors/events_executor/events_executor.cpp:64
#19 0x..in std::__invoke_impl<void, rclcpp::experimental::executors::EventsExecutor::EventsExecutor(rclcpp::experimental::executors::EventsQueue::UniquePtr, bool, const rclcpp::ExecutorOptions&)::<lambda()>&>(std::__invoke_other, struct {...} &) (__f=...) at /usr/include/c++/11/bits/invoke.h:61
#20 0x..in std::__invoke_r<void, rclcpp::experimental::executors::EventsExecutor::EventsExecutor(rclcpp::experimental::executors::EventsQueue::UniquePtr, bool, const rclcpp::ExecutorOptions&)::<lambda()>&>(struct {...} &) (__fn=...) at /usr/include/c++/11/bits/invoke.h:111
#21 0x..in std::_Function_handler<void(), rclcpp::experimental::executors::EventsExecutor::EventsExecutor(rclcpp::experimental::executors::EventsQueue::UniquePtr, bool, const rclcpp::ExecutorOptions&)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...)
    at /usr/include/c++/11/bits/std_function.h:290
#22 0x..in std::function<void ()>::operator()() const (this=0x555555636c70) at /usr/include/c++/11/bits/std_function.h:590
#23 0x..in rclcpp::executors::ExecutorNotifyWaitable::execute (this=0x555555636c60, data=std::shared_ptr<void> (empty) = {...})
    at /root/iron_ros2/src/ros2/rclcpp/rclcpp/src/rclcpp/executors/executor_notify_waitable.cpp:93
#24 0x..in rclcpp::experimental::executors::EventsExecutor::execute_event (this=0x555555635da0, event=...)
    at /root/iron_ros2/src/ros2/rclcpp/rclcpp/src/rclcpp/experimental/executors/events_executor/events_executor.cpp:328
#25 0x..in rclcpp::experimental::executors::EventsExecutor::spin (this=0x555555635da0)
    at /root/iron_ros2/src/ros2/rclcpp/rclcpp/src/rclcpp/experimental/executors/events_executor/events_executor.cpp:122
@MiguelCompany
Copy link
Collaborator

@mauropasse I am not sure whether your proposed fix is the correct one.

Three lines after setting info->deleted = true; there is a delete info;, so it seems to me the info structure is used after it was destroyed. This would mean there is a call to rmw_event_set_callback after the call to rmw_destroy_publisher.

It might be worth investigating another route: adding a pre-shutdown callback that calls clear_on_ready_callback()

@mauropasse
Copy link
Contributor Author

mauropasse commented Oct 25, 2023

@MiguelCompany yes, the fix is not correct, was just for illustration purpuses.
This is the situation for further analysis, let me know if I'm not correct:

We have in one hand the Publisher object, and in the other, the QOSEventHandler.
They both share the same underlying listener, but can have lifetimes on their own.
Since the QOSEventHandler object can be created (and destroyed) after the publisher is created, when it's destroyed (not necessarily due app shutdown) we must clear the callbacks (otherwise, if an event happens we segfault due the callback is not anymore in scope).

The problem is that, since they share the listener, there is a data race when the publisher is destroyed, and the event wants to access its listener (also destroyed) to unset the callback.

So there should be a mutex or some way to sync this, so the QOSEventHandler doesn't access the deleted listener. In my fix I was trying to add a common thing both accessible by Event and Publisher to avoid the situation.

PD: I'm not sure if I understood correctly your proposal of adding a pre-shutdown callback.

@clalancette
Copy link
Contributor

Can we close this now that ros2/rclcpp#2349 is merged?

@fujitatomoya
Copy link
Collaborator

@mauropasse @clalancette we might want to backport this to Iron? segmentation fault should be avoided, and this is ABI compatible change? it is maintained under experimental, but nice to fix.

@mauropasse
Copy link
Contributor Author

@fujitatomoya it'd be nice to backport this fix both to Humble & Iron. Nevertheless I'm not 100% sure this doesn't cause an ABI break since I'm changing the body of the virtual destructor (base class) and also derived class. This may lead to modifications in the vtable, which can break binary compatibility. Is there any way to test this?

@fujitatomoya
Copy link
Collaborator

This may lead to modifications in the vtable, which can break binary compatibility. Is there any way to test this?

this comes to my interest, i will do some experimental test... my idea is if there is ABI Compatible label is tagged on the PR from github workflow, that would be really nice to save our time... (especially @clalancette 's 😄 ) WDYT?

@clalancette
Copy link
Contributor

@fujitatomoya it'd be nice to backport this fix both to Humble & Iron. Nevertheless I'm not 100% sure this doesn't cause an ABI break since I'm changing the body of the virtual destructor (base class) and also derived class. This may lead to modifications in the vtable, which can break binary compatibility. Is there any way to test this?

You can generally run https://github.com/lvc/abi-compliance-checker for this. That said, I'm pretty sure this is going to break ABI, since we are changing the vtable.

my idea is if there is ABI Compatible label is tagged on the PR from github workflow, that would be really nice to save our time... (especially @clalancette 's 😄 ) WDYT?

Yeah, it would be awesome if we had a workflow that detects ABI compatibility. We'd have to be a bit careful of the UI with it, since ABI breaks are allowed in Rolling, but not in other branches.

@MiguelCompany
Copy link
Collaborator

The destructor of EventHandlerBase was already virtual, so I think the changes will not modify the length of the vtable. Adding a definition of the destructor of EventHandler will just change its implementation from the default to the new behavior.

Let's see what the ABI checker says, but I think this could be ABI compatible.

@mauropasse
Copy link
Contributor Author

Looks like is ABI compatible:

$  abi-compliance-checker -l NAME -old ABI-1.dump -new ABI-2.dump
Preparing, please wait ...
Comparing ABIs ...
Comparing APIs ...
Creating compatibility report ...
Binary compatibility: 100%
Source compatibility: 100%
Total binary compatibility problems: 0, warnings: 0
Total source compatibility problems: 0, warnings: 0
Report: compat_reports/NAME/1_to_2/compat_report.html

@fujitatomoya
Copy link
Collaborator

fujitatomoya commented Dec 12, 2023

@Mergifyio backport iron humble (edit: sorry not here...)

@fujitatomoya fujitatomoya reopened this Dec 12, 2023
@fujitatomoya
Copy link
Collaborator

Iron backport: ros2/rclcpp#2387

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants