Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segfault in MultiThreadedExecutor using rmw_fastrtps_cpp #728

Open
sea-bass opened this issue Oct 25, 2023 · 11 comments
Open

Segfault in MultiThreadedExecutor using rmw_fastrtps_cpp #728

sea-bass opened this issue Oct 25, 2023 · 11 comments
Labels
backlog bug Something isn't working

Comments

@sea-bass
Copy link

sea-bass commented Oct 25, 2023

Bug report

Required Info:

  • Operating System:
    • Ubuntu 22.04 (inside a Docker container)
  • Installation type:
    • Binaries
  • Version or commit hash:
    • The latest available binaries on Humble
  • DDS implementation:
    • rmw_fastrtps_cpp
  • Client library (if applicable):
    • rclcpp

Steps to reproduce issue

We unfortunately don't have a full reproduction case, but we are creating a node with a multi-threaded executor as follows:

rclcpp::executors::MultiThreadedExecutor exec(rclcpp::ExecutorOptions(), 0, false, std::chrono::milliseconds(250));

The segfault we see is:

[objective_server_node_main-8] [ERROR] [1684327668.068245959] [objective_server_node]: Execution of behavior tree for objective `3 Waypoints Pick and Place` did not succeed.
[objective_server_node_main-8] Stack trace (most recent call last) in thread 833:
[objective_server_node_main-8] #8    Object "/usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1", at 0xffffffffffffffff, in 
[objective_server_node_main-8] #7    Object "/usr/lib/aarch64-linux-gnu/libc.so.6", at 0xffff9e5a5d1b, in 
[objective_server_node_main-8] #6    Object "/usr/lib/aarch64-linux-gnu/libc.so.6", at 0xffff9e53d5c7, in 
[objective_server_node_main-8] #5    Object "/usr/lib/aarch64-linux-gnu/libstdc++.so.6.0.30", at 0xffff9e7731fb, in 
[objective_server_node_main-8] [1684327680.555]: Sequence                  IDLE -> RUNNING
[objective_server_node_main-8] [1684327680.555]: ClearSnapshot             IDLE -> RUNNING
[objective_server_node_main-8] #4    Source "./src/rclcpp/executors/multi_threaded_executor.cpp", line 85, in run [0xffff9edbffc7]
[objective_server_node_main-8] #3    Source "./src/rclcpp/executor.cpp", line 912, in get_next_executable [0xffff9edb8b5b]
[objective_server_node_main-8] #2    Source "./src/rclcpp/executor.cpp", line 756, in wait_for_work [0xffff9edb5bdf]
[objective_server_node_main-8] #1    Object "/opt/ros/humble/lib/librcl.so", at 0xffff9e185b97, in rcl_wait
[objective_server_node_main-8] #0    Object "/opt/ros/humble/lib/librmw_fastrtps_shared_cpp.so", at 0xffff9d5f5718, in rmw_fastrtps_shared_cpp::__rmw_wait(char const*, rmw_subscriptions_s*, rmw_guard_conditions_s*, rmw_services_s*, rmw_clients_s*, rmw_events_s*, rmw_wait_set_s*, rmw_time_s const*)
[objective_server_node_main-8] Segmentation fault (Address not mapped to object [0xaaa1b04a5f08])
[ERROR] [objective_server_node_main-8]: process has died [pid 666, exit code -11, cmd '/opt/overlay_ws/install/moveit_studio_agent/lib/moveit_studio_agent/objective_server_node_main --ros-args --log-level error --ros-args -r __node:=objective_server_node --params-file /tmp/launch_params_j4cwbvx_ --params-file /tmp/launch_params_7oyj5lbr --params-file /tmp/launch_params_2aqbg8_q --params-file /tmp/launch_params_12q1oav_ --params-file /tmp/launch_params_h4pxzy2y --params-file /tmp/launch_params_8vn2ch2o --params-file /tmp/launch_params_1c4b41ac --params-file /tmp/launch_params_blie9quv'].

One of our engineers looked into this with a debugger and saw:

Thread 25 "objective_serve" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xffffa535e660 (LWP 74792)]
0x0000fffff6417588 in rmw_fastrtps_shared_cpp::__rmw_wait (identifier=, subscriptions=, guard_conditions=0xaaaaab4e0b30, services=, clients=, events=, wait_set=, wait_timeout=) at /opt/underlay_ws/src/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/rmw_wait.cpp:220
220 if (!condition->get_trigger_value()) {
(gdb) bt
#0 0x0000fffff6417588 in rmw_fastrtps_shared_cpp::__rmw_wait (
identifier=, subscriptions=,
guard_conditions=0xaaaaab4e0b30, services=,
clients=, events=, wait_set=,
wait_timeout=)
at /opt/underlay_ws/src/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/rmw_wait.cpp:220
#1 0x0000fffff6fc5b98 in rcl_wait (wait_set=0xffffffffa110, timeout=250000000)
at ./src/rcl/wait.c:595
#2 0x0000fffff7bf5be0 in rclcpp::Executor::wait_for_work (
this=0xffffffffa060, timeout=...) at /usr/include/c++/11/chrono:521
#3 0x0000fffff7bf8b5c in rclcpp::Executor::get_next_executable (
this=0xffffffffa060, any_executable=..., timeout=...)
at ./src/rclcpp/executor.cpp:912
#4 0x0000fffff7bfffc8 in rclcpp::executors::MultiThreadedExecutor::run (
this=0xffffffffa060, this_thread_number=)
at ./src/rclcpp/executors/multi_threaded_executor.cpp:85
#5 0x0000fffff75b31fc in ?? () from /lib/aarch64-linux-gnu/libstdc++.so.6
#6 0x0000fffff737d5c8 in ?? () from /lib/aarch64-linux-gnu/libc.so.6
#7 0x0000fffff73e5d1c in ?? () from /lib/aarch64-linux-gnu/libc.so.6`

Additional information

This segfault does not occur using Cyclone DDS.

Implementation considerations

@fujitatomoya
Copy link
Collaborator

Segmentation fault (Address not mapped to object [0xaaa1b04a5f08])
0x0000fffff6417588 in rmw_fastrtps_shared_cpp::__rmw_wait (identifier=, subscriptions=, guard_conditions=0xaaaaab4e0b30, services=, clients=, events=, wait_set=, wait_timeout=) at /opt/underlay_ws/src/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/rmw_wait.cpp:220
220 if (!condition->get_trigger_value()) {

if (!condition->get_trigger_value()) {

static_cast<eprosima::fastdds::dds::GuardCondition *>(data) is NULL, that led to the segmentation fault.

rclcpp::executors::MultiThreadedExecutor exec(rclcpp::ExecutorOptions(), 0, false, std::chrono::milliseconds(250));

could you share more conditions when this problem happens so that we can try to reproduce the issue? e.g) do you manage your own rclcpp::GuardCondition?

@fujitatomoya fujitatomoya added the bug Something isn't working label Oct 25, 2023
@fujitatomoya
Copy link
Collaborator

CC: @Barry-Xu-2018 @iuhilnehc-ynos

@sea-bass
Copy link
Author

sea-bass commented Oct 25, 2023

could you share more conditions when this problem happens so that we can try to reproduce the issue? e.g) do you manage your own rclcpp::GuardCondition?

Unfortunately the code is part of a private code base and we haven't been able to come up with a minimal repro case. But our main is fairly simple:

int main(int argc, char* argv[])
{
  rclcpp::init(argc, argv);
  rclcpp::NodeOptions opt;
  opt.automatically_declare_parameters_from_overrides(true);
  auto node = std::make_shared<OurNodeType>(opt);

  // This's a workaround a bug in MultiThreadedExecutor deadlock when spinning without a timeout
  rclcpp::executors::MultiThreadedExecutor exec(rclcpp::ExecutorOptions(), 0, false, std::chrono::milliseconds(250));
  exec.add_node(node->get_node_base_interface());
  exec.spin();
  exec.remove_node(node->get_node_base_interface());

  rclcpp::shutdown();
}

The node itself has an action server in it, and we notice the segfault occurring -- not consistently, but frequently -- when one action is cancelled and we try start a new one.

Another thing to note is that these actions that execute can (and often do) create their own subscribers, publishers, service/action clients, etc. So maybe that also plays into the problem?

@Barry-Xu-2018
Copy link
Contributor

@fujitatomoya

According to current information, the problem is related to
https://github.com/ros2/rclcpp/blob/8709146df87ab860c88f03a83f6df7d0347d0085/rclcpp/include/rclcpp/strategies/allocator_memory_strategy.hpp#L494
Here, shared pointer wasn't used.
It is used in Executor::wait_for_work() without lock.

It can be removed under the below function

  1. Executor::wait_for_work()
  2. Executor::remove_callback_group_from_map()
    called by Executor::remove_callback_group() or Executor::remove_node()
  3. Executor::~Executor()

So, while GuardCondition is used in Executor::wait_for_work(), it is possible that it is removed by Executor::remove_callback_group_from_map().
I don't know why current implementation don't use shared pointer to save GuardCondition.
Besides, remove action should be in Executor::wait_for_work() and Executor::~Executor().

@sea-bass
Copy link
Author

I'm just coming back to this issue and I think adding locks to the functions in AllocatorMemoryStrategy that touch the guard_conditions_ member variable seems promising. I might have time this week to take a look and land some fixes.

If this works, seems like this file hasn't been modified in a while so I'm hopeful we can backport the fix to humble.

@sea-bass
Copy link
Author

sea-bass commented Jan 15, 2024

Update: I tried putting locks in both allocator_memory_strategy.cpp and executor.cpp and the chain of FastDDS issues keeps elevating up the chain with more segfaults.

I think the problem is more pervasive than just stuff at the rclcpp layer, sadly.

[planning_context_manager_node-2] terminate called after throwing an instance of 'std::out_of_range'
[planning_context_manager_node-2]   what():  vector::_M_range_check: __n (which is 2674159792) >= this->size() (which is 142)
[planning_context_manager_node-2] Stack trace (most recent call last) in thread 10468:
[move_group-1] double free or corruption (out)
[move_group-1] Stack trace (most recent call last) in thread 10409:
[waypoint_manager_node-4] corrupted size vs. prev_size
[planning_context_manager_node-2] #22   Object "", at 0xffffffffffffffff, in 
[planning_context_manager_node-2] #21   Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7fc537bdd84f, in 
[planning_context_manager_node-2] #20   Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7fc537b4bac2, in 
[planning_context_manager_node-2] #19   Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7fc537ddc252, in 
[component_container_mt-8] double free or corruption (out)
[planning_context_manager_node-2] #18   Source "/opt/underlay_ws/src/rclcpp/rclcpp/src/rclcpp/signal_handler.cpp", line 271, in deferred_signal_handler [0x7fc53868087a]
[planning_context_manager_node-2]         268:             "deferred_signal_handler(): "
[planning_context_manager_node-2]         269:             "shutting down rclcpp::Context @ %p, because it had shutdown_on_signal == true",
[planning_context_manager_node-2]         270:             static_cast<void *>(context_ptr.get()));
[planning_context_manager_node-2]       > 271:           context_ptr->shutdown("signal handler");
[planning_context_manager_node-2]         272:         }
[planning_context_manager_node-2]         273:       }
[planning_context_manager_node-2]         274:     }
[planning_context_manager_node-2] #17   Source "/opt/underlay_ws/src/rclcpp/rclcpp/src/rclcpp/context.cpp", line 343, in shutdown [0x7fc5385b3ae9]
[planning_context_manager_node-2]         340:     std::lock_guard<std::recursive_mutex> guard(*logging_mutex_);
[planning_context_manager_node-2]         341:     size_t & count = get_logging_reference_count();
[planning_context_manager_node-2]         342:     if (0u == --count) {
[planning_context_manager_node-2]       > 343:       rcl_ret_t rcl_ret = rcl_logging_fini();
[planning_context_manager_node-2]         344:       if (RCL_RET_OK != rcl_ret) {
[planning_context_manager_node-2]         345:         RCUTILS_SAFE_FWRITE_TO_STDERR(
[planning_context_manager_node-2]         346:           RCUTILS_STRINGIFY(__file__) ":"
[planning_context_manager_node-2] #16   Object "/opt/ros/humble/lib/librcl.so", at 0x7fc537f4a1b6, in rcl_logging_fini
[planning_context_manager_node-2] #15   Object "/opt/ros/humble/lib/librcl.so", at 0x7fc537f49dcc, in 
[planning_context_manager_node-2] #14   Object "/opt/ros/humble/lib/librcl.so", at 0x7fc537f49521, in 
[planning_context_manager_node-2] #13   Object "/opt/ros/humble/lib/librmw_fastrtps_shared_cpp.so", at 0x7fc52220f13b, in rmw_fastrtps_shared_cpp::__rmw_destroy_publisher(char const*, rmw_node_s const*, rmw_publisher_s*)
[planning_context_manager_node-2] #12   Object "/opt/ros/humble/lib/librmw_fastrtps_shared_cpp.so", at 0x7fc52220eccb, in rmw_fastrtps_shared_cpp::destroy_publisher(char const*, CustomParticipantInfo*, rmw_publisher_s*)
[planning_context_manager_node-2] #11   Object "/opt/ros/humble/lib/libfastrtps.so.2.6.6", at 0x7fc521ce29fe, in eprosima::fastdds::dds::PublisherImpl::delete_datawriter(eprosima::fastdds::dds::DataWriter const*)
[planning_context_manager_node-2] #10   Object "/opt/ros/humble/lib/libfastrtps.so.2.6.6", at 0x7fc521cc9298, in eprosima::fastdds::dds::DataWriterImpl::~DataWriterImpl()
[planning_context_manager_node-2] #9    Object "/opt/ros/humble/lib/libfastrtps.so.2.6.6", at 0x7fc521b5de42, in 
[planning_context_manager_node-2] #8    Object "/usr/lib/x86_64-linux-gnu/libgcc_s.so.1", at 0x7fc537cf72dc, in _Unwind_Resume
[planning_context_manager_node-2] #7    Object "/usr/lib/x86_64-linux-gnu/libgcc_s.so.1", at 0x7fc537cf6883, in 
[planning_context_manager_node-2] #6    Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7fc537dad958, in __gxx_personality_v0
[planning_context_manager_node-2] #5    Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7fc537dad1e8, in 
[planning_context_manager_node-2] #4    Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7fc537dae20b, in 
[planning_context_manager_node-2] #3    Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7fc537da2b9d, in 
[planning_context_manager_node-2] #2    Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7fc537adf7f2, in abort
[planning_context_manager_node-2] #1    Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7fc537af9475, in raise
[planning_context_manager_node-2] #0    Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7fc537b4d9fc, in pthread_kill
[planning_context_manager_node-2] Aborted (Signal sent by tkill() 10370 1000)

@roncapat
Copy link

roncapat commented Mar 20, 2024

Hi, I'm hitting I think the same bug.

Here's the backtrace:

#5  0x00007ffff7c998fd in operator new(unsigned long) () from /usr/lib/x86_64-linux-gnu/libjemalloc.so.2
#6  0x00007ffff6c7c753 in __gnu_cxx::new_allocator<eprosima::fastdds::dds::Condition*>::allocate (this=0x7fffec8fa6e0, __n=<optimized out>) at /usr/include/c++/11/ext/new_allocator.h:103
#7  std::allocator_traits<std::allocator<eprosima::fastdds::dds::Condition*> >::allocate (__n=<optimized out>, __a=...) at /usr/include/c++/11/bits/alloc_traits.h:464
#8  std::_Vector_base<eprosima::fastdds::dds::Condition*, std::allocator<eprosima::fastdds::dds::Condition*> >::_M_allocate (__n=<optimized out>, this=<optimized out>)
    at /usr/include/c++/11/bits/stl_vector.h:346
#9  std::vector<eprosima::fastdds::dds::Condition*, std::allocator<eprosima::fastdds::dds::Condition*> >::_M_realloc_insert<eprosima::fastdds::dds::Condition*> (this=this@entry=0x7fffec8fa6e0, 
    __position=0x0) at /usr/include/c++/11/bits/vector.tcc:440
#10 0x00007ffff6c7c256 in std::vector<eprosima::fastdds::dds::Condition*, std::allocator<eprosima::fastdds::dds::Condition*> >::emplace_back<eprosima::fastdds::dds::Condition*> (this=0x7fffec8fa6e0)
    at /usr/include/c++/11/bits/vector.tcc:121
#11 std::vector<eprosima::fastdds::dds::Condition*, std::allocator<eprosima::fastdds::dds::Condition*> >::push_back (__x=@0x7fffec8fa700: 0x7fffed1286c0, this=0x7fffec8fa6e0)
    at /usr/include/c++/11/bits/stl_vector.h:1204
#12 rmw_fastrtps_shared_cpp::__rmw_wait (identifier=<optimized out>, subscriptions=0x7ffff6945e08, guard_conditions=<optimized out>, services=0x7ffff6945e50, clients=0x7ffff6945e38, 
    events=0x7ffff6945e68, wait_set=0x7ffff69ef6a0, wait_timeout=0x7fffec8fa840) at /home/user/ros2_iron/src/ros2/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/rmw_wait.cpp:161
#13 0x00007ffff6ce1ffa in rmw_wait (subscriptions=<optimized out>, guard_conditions=<optimized out>, services=<optimized out>, clients=<optimized out>, events=<optimized out>, wait_set=<optimized out>, 
    wait_timeout=0x7fffec8fa840) at /home/user/ros2_iron/src/ros2/rmw_fastrtps/rmw_fastrtps_cpp/src/rmw_wait.cpp:33
#14 0x00007ffff79701a8 in rcl_wait (wait_set=0x7ffff04d0cc0, timeout=-1) at /home/user/ros2_iron/src/ros2/rcl/rcl/src/rcl/wait.c:595
#15 0x00007ffff7789a0b in rclcpp::Executor::wait_for_work (this=0x7ffff04d0c90, timeout=...) at /usr/include/c++/11/chrono:521
#16 0x00007ffff778a073 in rclcpp::Executor::get_next_executable (this=0x7ffff04d0c90, any_executable=..., timeout=std::chrono::duration = { -1ns })
    at /home/user/ros2_iron/src/ros2/rclcpp/rclcpp/src/rclcpp/executor.cpp:965
#17 0x00007ffff779a372 in rclcpp::executors::MultiThreadedExecutor::run (this=0x7ffff04d0c90, this_thread_number=<optimized out>)
    at /home/user/ros2_iron/src/ros2/rclcpp/rclcpp/src/rclcpp/executors/multi_threaded_executor.cpp:92
#18 0x00007ffff72e6793 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#19 0x00007ffff6e94ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#20 0x00007ffff6f26660 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Any updates since January? I can trigger this bug just by spinning at something like 200 Hz an rclcpp::Timer inside a node loaded in a MultiThreadedExecutor.

Update: as per last @Barry-Xu-2018 post in this thread, I believe it's more a problem at rclcpp level because I just tried with CycloneDDS and I hit the same issue.

fujitatomoya added a commit to fujitatomoya/ros2_test_prover that referenced this issue Mar 27, 2024
@fujitatomoya
Copy link
Collaborator

@roncapat

I can trigger this bug just by spinning at something like 200 Hz an rclcpp::Timer inside a node loaded in a MultiThreadedExecutor.

can you provide the reproducible example? i cannot reproduce the issue with fujitatomoya/ros2_test_prover@db4a5ab

i think we need add and remove nodes to&from the executor to make it happen?

I believe it's more a problem at rclcpp level because I just tried with CycloneDDS and I hit the same issue.

thanks for checking, once it is confirmed, i guess we can move this issue to rclcpp.

I don't know why current implementation don't use shared pointer to save GuardCondition.

me neither, this is introduced on ros2/rclcpp#1612. (in particular https://github.com/ros2/rclcpp/compare/b9bec69377dd3850c6db519a7af80a2eb1a9be31..2fc5501a09a17d286f9a68fa1e7bff4f8703cf1e)
before addressing this, i want to confirm the problem 1st with reproducible environment if possible.

@roncapat
Copy link

roncapat commented Mar 28, 2024

@fujitatomoya see linked rclcpp issue: ros2/rclcpp#2455

It all started from there for me. There is attached a minimum example.

@fujitatomoya
Copy link
Collaborator

@sea-bass i believe that would be worth to give it a shot with current rolling branch, that includes ros2/rclcpp#2142.

CC: @mjcarroll

@sea-bass
Copy link
Author

@sea-bass i believe that would be worth to give it a shot with current rolling branch, that includes ros2/rclcpp#2142.

CC: @mjcarroll

Thank you for the update, and for the actual fix! This was a big PR.

@EzraBrooks @MikeWrock Putting this on your radar if you guys are at some point updating MoveIt Pro to 24.04 at some point and would like to try FastDDS.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backlog bug Something isn't working
Projects
None yet
Development

No branches or pull requests

5 participants