Segfault in MultiThreadedExecutor using rmw_fastrtps_cpp
#728
Comments
could you share more details about the conditions under which this problem happens, so that we can try to reproduce the issue? e.g. do you manage your own
Unfortunately the code is part of a private code base and we haven't been able to come up with a minimal repro case. But our main function looks like this:
#include <chrono>
#include <memory>
#include "rclcpp/rclcpp.hpp"

// OurNodeType is our private node class (definition not shown here).
int main(int argc, char* argv[])
{
  rclcpp::init(argc, argv);
  rclcpp::NodeOptions opt;
  opt.automatically_declare_parameters_from_overrides(true);
  auto node = std::make_shared<OurNodeType>(opt);
  // Workaround for a MultiThreadedExecutor deadlock when spinning without a timeout.
  // Arguments: options, number_of_threads (0 = auto), yield_before_execute, timeout.
  rclcpp::executors::MultiThreadedExecutor exec(rclcpp::ExecutorOptions(), 0, false, std::chrono::milliseconds(250));
  exec.add_node(node->get_node_base_interface());
  exec.spin();
  exec.remove_node(node->get_node_base_interface());
  rclcpp::shutdown();
}
The node itself has an action server in it, and we notice the segfault occurring (not consistently, but frequently) when one action is cancelled and we try to start a new one. Another thing to note is that the actions that execute can (and often do) create their own subscribers, publishers, service/action clients, etc. So maybe that also plays into the problem?
According to current information, the problem is related to It can be removed under the below function
So, while the GuardCondition is still in use in Executor::wait_for_work(), it is possible for it to be removed by Executor::remove_callback_group_from_map().
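To make the suspected race easier to picture, here is a minimal, self-contained sketch of one way to keep an object alive while a waiting thread is still using it. It does not use rclcpp or rmw at all: `GuardConditionStub`, `GuardRegistry`, and `stress` are hypothetical names invented for illustration, and this is not the fix that was applied upstream. The idea is that the waiter copies shared ownership under a lock, so a concurrent removal only drops the registry's reference rather than destroying something the waiter still points at.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a guard condition; NOT the rclcpp type.
struct GuardConditionStub {
  std::atomic<int> trigger_count{0};
  void trigger() { ++trigger_count; }
};

// Hypothetical registry shared by a "waiter" thread and a "remover" thread.
class GuardRegistry {
public:
  std::shared_ptr<GuardConditionStub> add() {
    auto gc = std::make_shared<GuardConditionStub>();
    std::lock_guard<std::mutex> lock(mutex_);
    guards_.push_back(gc);
    return gc;
  }

  // Returns true if the guard was registered and has now been removed.
  bool remove(const std::shared_ptr<GuardConditionStub>& gc) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = std::find(guards_.begin(), guards_.end(), gc);
    if (it == guards_.end()) {
      return false;
    }
    guards_.erase(it);
    return true;
  }

  // Waiter side: copy the shared_ptrs under the lock so every guard in the
  // snapshot stays alive even if it is removed right after we return.
  std::vector<std::shared_ptr<GuardConditionStub>> snapshot() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return guards_;
  }

  std::size_t size() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return guards_.size();
  }

private:
  mutable std::mutex mutex_;
  std::vector<std::shared_ptr<GuardConditionStub>> guards_;
};

// Adds and removes guards on the calling thread while a second thread keeps
// touching whatever is currently registered. Returns the final registry size.
std::size_t stress(int iterations) {
  GuardRegistry registry;
  std::atomic<bool> stop{false};

  std::thread waiter([&registry, &stop] {
    while (!stop) {
      auto guards = registry.snapshot();
      for (auto& gc : guards) {
        gc->trigger();  // safe: the snapshot holds shared ownership
      }
      std::this_thread::yield();
    }
  });

  for (int i = 0; i < iterations; ++i) {
    auto gc = registry.add();
    bool removed = registry.remove(gc);
    assert(removed);
    (void)removed;  // avoid an unused-variable warning when NDEBUG is set
  }

  stop = true;
  waiter.join();
  return registry.size();
}
```

Calling `stress(2000)` runs the add/remove loop against the waiter thread. If the waiter instead held raw pointers into the container without the lock, the same loop would be a data race of the kind described above.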
I'm just coming back to this issue and I think adding locks to the functions in If this works, seems like this file hasn't been modified in a while so I'm hopeful we can backport the fix to humble. |
Update: I tried putting locks in both I think the problem is more pervasive than just stuff at the
Hi, I think I'm hitting the same bug. Here's the backtrace:
#5 0x00007ffff7c998fd in operator new(unsigned long) () from /usr/lib/x86_64-linux-gnu/libjemalloc.so.2
#6 0x00007ffff6c7c753 in __gnu_cxx::new_allocator<eprosima::fastdds::dds::Condition*>::allocate (this=0x7fffec8fa6e0, __n=<optimized out>) at /usr/include/c++/11/ext/new_allocator.h:103
#7 std::allocator_traits<std::allocator<eprosima::fastdds::dds::Condition*> >::allocate (__n=<optimized out>, __a=...) at /usr/include/c++/11/bits/alloc_traits.h:464
#8 std::_Vector_base<eprosima::fastdds::dds::Condition*, std::allocator<eprosima::fastdds::dds::Condition*> >::_M_allocate (__n=<optimized out>, this=<optimized out>)
at /usr/include/c++/11/bits/stl_vector.h:346
#9 std::vector<eprosima::fastdds::dds::Condition*, std::allocator<eprosima::fastdds::dds::Condition*> >::_M_realloc_insert<eprosima::fastdds::dds::Condition*> (this=this@entry=0x7fffec8fa6e0,
__position=0x0) at /usr/include/c++/11/bits/vector.tcc:440
#10 0x00007ffff6c7c256 in std::vector<eprosima::fastdds::dds::Condition*, std::allocator<eprosima::fastdds::dds::Condition*> >::emplace_back<eprosima::fastdds::dds::Condition*> (this=0x7fffec8fa6e0)
at /usr/include/c++/11/bits/vector.tcc:121
#11 std::vector<eprosima::fastdds::dds::Condition*, std::allocator<eprosima::fastdds::dds::Condition*> >::push_back (__x=@0x7fffec8fa700: 0x7fffed1286c0, this=0x7fffec8fa6e0)
at /usr/include/c++/11/bits/stl_vector.h:1204
#12 rmw_fastrtps_shared_cpp::__rmw_wait (identifier=<optimized out>, subscriptions=0x7ffff6945e08, guard_conditions=<optimized out>, services=0x7ffff6945e50, clients=0x7ffff6945e38,
events=0x7ffff6945e68, wait_set=0x7ffff69ef6a0, wait_timeout=0x7fffec8fa840) at /home/user/ros2_iron/src/ros2/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/rmw_wait.cpp:161
#13 0x00007ffff6ce1ffa in rmw_wait (subscriptions=<optimized out>, guard_conditions=<optimized out>, services=<optimized out>, clients=<optimized out>, events=<optimized out>, wait_set=<optimized out>,
wait_timeout=0x7fffec8fa840) at /home/user/ros2_iron/src/ros2/rmw_fastrtps/rmw_fastrtps_cpp/src/rmw_wait.cpp:33
#14 0x00007ffff79701a8 in rcl_wait (wait_set=0x7ffff04d0cc0, timeout=-1) at /home/user/ros2_iron/src/ros2/rcl/rcl/src/rcl/wait.c:595
#15 0x00007ffff7789a0b in rclcpp::Executor::wait_for_work (this=0x7ffff04d0c90, timeout=...) at /usr/include/c++/11/chrono:521
#16 0x00007ffff778a073 in rclcpp::Executor::get_next_executable (this=0x7ffff04d0c90, any_executable=..., timeout=std::chrono::duration = { -1ns })
at /home/user/ros2_iron/src/ros2/rclcpp/rclcpp/src/rclcpp/executor.cpp:965
#17 0x00007ffff779a372 in rclcpp::executors::MultiThreadedExecutor::run (this=0x7ffff04d0c90, this_thread_number=<optimized out>)
at /home/user/ros2_iron/src/ros2/rclcpp/rclcpp/src/rclcpp/executors/multi_threaded_executor.cpp:92
#18 0x00007ffff72e6793 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#19 0x00007ffff6e94ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#20 0x00007ffff6f26660 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
Any updates since January? I can trigger this bug just by spinning at something like 200 Hz an
Update: as per the last post by @Barry-Xu-2018 in this thread, I believe it's more a problem at
ros2/rmw_fastrtps#728 Signed-off-by: Tomoya Fujita <[email protected]>
can you provide a reproducible example? i cannot reproduce the issue with fujitatomoya/ros2_test_prover@db4a5ab. i think we need to add and remove nodes to and from the executor to make it happen?
thanks for checking, once it is confirmed, i guess we can move this issue to
me neither, this was introduced in ros2/rclcpp#1612. (in particular https://github.com/ros2/rclcpp/compare/b9bec69377dd3850c6db519a7af80a2eb1a9be31..2fc5501a09a17d286f9a68fa1e7bff4f8703cf1e)
@fujitatomoya see the linked rclcpp issue: ros2/rclcpp#2455. It all started from there for me. A minimal example is attached there.
@sea-bass i believe it would be worth giving it a shot with the current rolling branch, which includes ros2/rclcpp#2142. CC: @mjcarroll
Thank you for the update, and for the actual fix! This was a big PR. @EzraBrooks @MikeWrock Putting this on your radar in case you are at some point updating MoveIt Pro to 24.04 and would like to try FastDDS.
Bug report
Required Info:
rmw_fastrtps_cpp
rclcpp
Steps to reproduce issue
We unfortunately don't have a full reproduction case, but we are creating a node with a multi-threaded executor as follows:
The segfault we see is:
One of our engineers looked into this with a debugger and saw:
Additional information
This segfault does not occur using Cyclone DDS.
Implementation considerations