Controller/planner server segfault on clear local/global costmap #2373
Comments
I'm not sure; it would definitely be worth more investigation. Nothing has changed recently that I'm aware of. The only thing that has really changed since the initial port from ROS1 is 13f518a#diff-15e7b290bebe57765609a46208aa6f20ba2b06ac10a3237f06259265a086cacc, so that's where I'd start investigating. I don't have the resources to address this immediately, so I'd appreciate a PR once you have a good understanding of the problem. The traceback makes it pretty clear to me that it's related to the voxel layer or voxel grid. What's your compute platform?
My testing compute platform is a 10th gen i7 laptop.
In the presented above
In parallel infinite while-loop I've sent @charlielito, did you ever observe the crash in any Gazebo and TB3 simulation?
Hey, thanks for taking the time to look into this! Indeed, the crash happens in our Gazebo environment with a custom robot. It also happened on our real hardware/robot. I'll try the TB3 waffle demo to see if it crashes there and also come back with some ros2bag for you.
Yes, it would be really appreciated if you could provide some context to help catch the problem.
In thread 14, it looks like the usual update map cycle: Are you running SLAM at the same time? That shouldn't make anything break; I'm just looking for clues about why some data location wouldn't be available. None of this has changed since ROS1. Do you see this happen with the 2D obstacle layer at all (if you were willing to try that for us)? The voxel layer uses a ton of the same mechanics as the obstacle layer, so if you saw it there, that would help narrow it down. Are you going "off the map" at all? If you set the max range of your local costmap sensor to below the costmap size, do you still see it? What sensor are you using when you get this error?
@AlexeyMerzlyakov I am encountering this error in Gazebo, but with a custom robot. In the end I didn't try the TB3 waffle because the problem seems to occur when using a pointcloud sensor. I tested my environment without pointcloud sources and never saw any crashes. The problems begin only when I add a pointcloud source to the VoxelLayer. I also tried your strategy of calling the clearing services on both the local and global costmaps, but couldn't make it crash 😞 I am only seeing this crash whilst navigating to a goal. Here you can find a zipped rosbag with all topics. In that case, the robot got stuck and I was setting a new goal when the planner server died. https://drive.google.com/file/d/1j54RIVWt83MbNRpKmendjb7bOVCf7LzN/view?usp=sharing @SteveMacenski I am not running SLAM. I am running the stack with AMCL and a static map. I also tried using only the 2D obstacle layer with a pointcloud source, and it didn't crash at all. The sensor that generates the pointcloud is a Gazebo stereo camera.
Huh, did you build a different PCL version or something? I don't think there’s any action we can take then. Should we close this ticket?
What? I just updated my comment. By PCL I meant pointcloud data; it has nothing to do with the PCL library.
I wonder if this is related to NaNs or infs or something similar.
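One way to test the NaN/inf hypothesis above is to drop non-finite points before they reach any grid indexing code and see whether the crash persists. The following is only a minimal sketch of such a filter: `Point3`, `isFinitePoint`, and `dropNonFinite` are hypothetical names made up for illustration and are not Nav2 or sensor_msgs types.

```cpp
#include <cmath>
#include <vector>

// Hypothetical point type for illustration only (not a Nav2 or sensor_msgs type).
struct Point3
{
  float x;
  float y;
  float z;
};

// Returns true only when all three coordinates are finite (neither NaN nor +/-inf).
inline bool isFinitePoint(const Point3 & p)
{
  return std::isfinite(p.x) && std::isfinite(p.y) && std::isfinite(p.z);
}

// Copies the finite points of `cloud` into a new vector, preserving their order.
inline std::vector<Point3> dropNonFinite(const std::vector<Point3> & cloud)
{
  std::vector<Point3> out;
  out.reserve(cloud.size());
  for (const Point3 & p : cloud) {
    if (isFinitePoint(p)) {
      out.push_back(p);
    }
  }
  return out;
}
```

If the crash disappears with such a filter in front of the layer, non-finite input would be a likely trigger; if not, the indexing arithmetic itself is the better suspect.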
@charlielito, thank you very much for the carefully prepared dataset. This did help me: I've reproduced the same issue on This might be stack corruption, a GCC bug, or something I have no idea about yet (e.g. C++ standard deprecation when passing by reference into an inline function). Anyway, it looks like the following patch cured the problem for me:
After a one-week break (I'll be on vacation) I will return to this problem and try to work out a more detailed explanation of what is happening here.
That was a brand new feature added recently to have the minimum / maximum raytracing and marking ranges, so I don't think it's out of the question that it's a legit bug. I'd make sure to check the
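To make the minimum / maximum range idea mentioned above concrete, here is a minimal 2D sketch of clipping a sensor ray to a `[min_range, max_range]` interval before tracing it. It is not the Nav2 implementation; `Segment2` and `clipRayToRange` are hypothetical names, and the policy of rejecting hits closer than `min_range` is an assumption chosen for the example.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical result type: the clipped segment plus a validity flag.
struct Segment2
{
  double x0;
  double y0;
  double x1;
  double y1;
  bool valid;
};

// Clips the ray from the sensor origin (ox, oy) to the hit point (px, py) so that
// only the part between min_range and max_range from the origin remains.
inline Segment2 clipRayToRange(
  double ox, double oy, double px, double py,
  double min_range, double max_range)
{
  const double dx = px - ox;
  const double dy = py - oy;
  const double dist = std::hypot(dx, dy);

  // Reject non-finite or zero-length rays, inconsistent ranges, and hits that
  // lie closer than min_range (assumed policy for this sketch).
  if (!std::isfinite(dist) || dist <= 0.0 || min_range < 0.0 ||
    max_range < min_range || dist < min_range)
  {
    return {0.0, 0.0, 0.0, 0.0, false};
  }

  const double ux = dx / dist;  // unit direction of the ray
  const double uy = dy / dist;
  const double end = std::min(dist, max_range);
  return {ox + ux * min_range, oy + uy * min_range, ox + ux * end, oy + uy * end, true};
}
```

Any cell indices derived from the clipped start point still need to be validated against the grid size, since a start point shifted by `min_range` is no longer guaranteed to coincide with the sensor origin cell.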
It looks like I've finally found what is happening in More debugging showed that we are running out of the space of VoxelGrid So, regarding the root of the problem, it should be related to this code in
When we are setting minimal The bug was introduced with 13f518a, as @SteveMacenski initially noted, which is why it was not observed on Foxy. This needs more technical analysis of how it should be correctly fixed. One of the options is to add boundary checks to
Update: The out-of-bounds array accesses appeared due to incorrectly calculated Additionally, the shifts for
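As a generic illustration of the boundary-check option raised in this thread (not the change made in #2460), the sketch below shows a small voxel grid that validates every index before touching its storage. `CheckedVoxelGrid` is a hypothetical class written for this example; it stores one 32-bit column per (x, y) cell with one bit per z level, which is an assumption of the sketch and not a description of the nav2_voxel_grid API.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical bounds-checked voxel grid: one 32-bit column per (x, y) cell,
// one bit per z level (at most 32 levels).
class CheckedVoxelGrid
{
public:
  CheckedVoxelGrid(int size_x, int size_y, int size_z)
  : size_x_(size_x > 0 ? size_x : 0),
    size_y_(size_y > 0 ? size_y : 0),
    size_z_(size_z < 0 ? 0 : (size_z > 32 ? 32 : size_z)),
    data_(static_cast<std::size_t>(size_x_) * static_cast<std::size_t>(size_y_), 0u)
  {
  }

  // True when (x, y, z) addresses a voxel inside the grid.
  bool inBounds(int x, int y, int z) const
  {
    return x >= 0 && x < size_x_ && y >= 0 && y < size_y_ && z >= 0 && z < size_z_;
  }

  // Marks a voxel; returns false and leaves the data untouched when out of bounds.
  bool mark(int x, int y, int z)
  {
    if (!inBounds(x, y, z)) {
      return false;
    }
    data_[index(x, y)] |= (std::uint32_t{1} << z);
    return true;
  }

  // Returns true only for in-bounds voxels that have been marked.
  bool isMarked(int x, int y, int z) const
  {
    return inBounds(x, y, z) && ((data_[index(x, y)] >> z) & 1u) != 0u;
  }

private:
  // Row-major index of the (x, y) column; callers must check bounds first.
  std::size_t index(int x, int y) const
  {
    return static_cast<std::size_t>(y) * static_cast<std::size_t>(size_x_) +
           static_cast<std::size_t>(x);
  }

  int size_x_;
  int size_y_;
  int size_z_;
  std::vector<std::uint32_t> data_;
};
```

Checks like these turn a segfault into a rejected write, which helps with debugging, but they do not replace fixing the computation that produced the out-of-range index in the first place.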
@charlielito Can you verify this goes away using #2460? @AlexeyMerzlyakov can you verify that you tested this, that the crash went away, and that it is working properly?
I think I need to add some regression test cases here for the cases fixed in this issue.
I made the changes proposed in #2460 to my galactic branch and tested it. It seems that this solved the planner server segfault, since I ran the stack several times and never encountered that segfault. Nonetheless, the controller server still crashes. Here is the traceback of the controller server crashing: I also recorded another rosbag when the controller crashed: Moreover, I tested this with TB3 in simulation, and it also happens there after adding the realsense pointcloud source to the yaml file. This way you could reproduce it more easily on your end. Tested with this yaml file (it has a .txt extension because GitHub does not support .yaml files): To reproduce the error, send goals near the walls.
According to the traceback, the cause of the crash is still out-of-voxel-data array accessing in
@charlielito, Alexey has come up with a new solution. Can you test that out and verify whether it fixes the underlying issue? #2460
Just tested #2460 locally and it seems to be working; it hasn't crashed at any time. Thanks!
Sweet! Anything further you'd want to see before we merge / close the ticket?
Bug report
Steps to reproduce issue
I am running Gazebo with a custom map and custom robot. I am able to navigate, and while doing so, the controller server sometimes randomly dies when requested to clear the local costmap. The planner server also sometimes dies when requested to clear the global costmap. I have the same setup as in Foxy; I only changed nav2_params.yaml slightly to meet the new API. I am also using the same default BT file as in the foxy branch, navigate_w_replanning_and_recovery.xml. In Foxy everything runs well and it never dies.
nav2_params.yaml
Expected behavior
No node should crash.
Actual behavior
Sometimes the controller server crashes, sometimes the planner server.
Additional information
Here are some backtraces from when the nodes die:
controller_server_segfault.log
planner_server_segfaul.log