-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rclcpp_action goal handle timeout logic clearing long-running goals #3765
Comments
The Problem3 is being resolved by removing second
|
The Problem1 and Problem2 are under investigation, I'll write the details as they will be arrived. |
Analysis shown that Problem2 appears in case when So this problem, I consider, appear because of we do not clean BT node statuses between two consecutive The following patch seems to work for me:
@SteveMacenski, doesn't this approach break whole BT logic, I hope? |
The Problem1 root cause is still unclear. Some observations on it: From BT side, when the robot reached its goal,
which should trigger However, from WaypointFollower side, the
From first glance, it might seem that the issue to be in the RCLCPP. However, this problem appears every time when recovery |
Oh, that's a really good find Alexey! Yes, resetting the blackboard for the error codes is sensible at the end of a task and/or when halted due to cancellation of the navigation action.
You should be able to skip this, that blackboard is part of the factory construction so it should always exist at this point.
That is odd. The best that I can think of is that the WPF is sending the Nav2Pose to the BT which itself then sends the appropriate commands to the task servers. When we enter recovery, the task servers are halted when we execute recoveries and then re-tasked once we finish and start navigating again. So the commands to planner/controller/etc are re-issued and thus have a new UUID / goal handle. But we should still be in the same goal handle for Nav2Pose since all that happens internally to the BT which the Nav2Pose action server doesn't track. Perhaps this is another error code issue or something we're we're propagating the error code from the halted task servers instead of the final, complete end result code? That's the only server-client changes I can see happening. Note these shenanigans we have to do with checking that the goal ID is right https://github.com/ros-planning/navigation2/blob/main/nav2_waypoint_follower/src/waypoint_follower.cpp#L307-L312 Try changing the debug to an info in the simple action server to make sure that the current handle is actually active / we're returning success on it?
|
Problem1 bugfix: I've added the checks and got that the
So, this made me think that this is related to RCLCPP and check it. The problem did solved by increasing the
The value of (Sure, this timeout might be set as ROS-parameter for Also, it was found that due to short distances of navigation, the robot operates much more stable on
So, Finally, I'd like to propose to treat |
Oh what the hell... How does this not cause our end-users problems as well for the WPF, the controller server, and the BT task in general? Those are frequently running tasks longer than 10 seconds. Any time limit for how long a task runs in the action server is too low since we could have navigation requests (or WPF, or controller, etc) that could take hours to complete. Lets set this to absolute max anywhere we send actions within Nav2 and inform users about this problem publicly // comment on that ticket for a default to be maxed out or disabled. |
I think that logic needs to be updated so its not based on a timeout but a timeout after an event (e.g. after the goal has been completed or aborted) so that we don't delete currently running goals. Can you file a ticket on that and tag me and @clalancette ? |
New PR #3785 with fix of the most of issues described in this ticket. The problem with RCLCPP action timeout to be handled in next separate step, I'd suppose. |
Got it, @pepisg messaged me yesterday on Slack regarding re-adding the 15 minute timeout on the WPF server specifically as a temporary stop gap. Since the controller updates goals each time its preempted and planning occurs quickly, I think its only an issue with the WPF internally to Nav2. ... but is definitely still an issue for users of Nav2's base action servers for Nav2Pose and NavThroughPoses, which bears some thought about a permanent solution in rclcpp. Is that something you're open to working on @AlexeyMerzlyakov ? Thanks for your diligent work on this and uncovering an rclcpp issue! |
The params have been merged in, so that should be a good medium term solution until a more permanent change can be made in rclcpp itself. As such, I renamed the ticket |
@SteveMacenski @AlexeyMerzlyakov the fix is ready ros2/rcl#1121, it would be appreciated if you can try and review! thanks in advance. CC: @Barry-Xu-2018 |
So much thanks @Barry-Xu-2018! |
@Barry-Xu-2018 can this be closed with your recent PR? |
ros2/rcl#1121 isn't merged by now. |
@SteveMacenski just FYI, ros2/rcl#1121 has been merged in rolling. |
@fujitatomoya are we safe to remove |
Closing as complete, leaving the result timeout for legacy but I'll file a ticket for removing this at Kilted |
This issue describes the flaky issues appeared with
test_waypoint_follower
system testBug report
Required Info:
Steps to reproduce issue
test_waypoint_follower
manually, or using the attached below stress-test script:run_stress.sh.txt
Expected behavior
Test passes 30 times in a row.
Actual behavior
The following problems were found during the test:
Problem1: No action
When the robot is moving near the obstacle, it could stuck. Sometimes, this causes recovery behaviors that being triggered.
If this recovery was triggered (moreover, it was detected that only "Received request to clear entirely the local_costmap" event is enough for that), and robot successfully reached its goal, but the
NavigateToPoseNavigator
action returnsUNKNOWN
status instead ofSUCCEEDED
, which causes mainWaypointFollower
cycle to hang:2023-08-15.15-33-02.no_action_failure_case_only.mp4
Problem2: Incorrect out-of-map status
Sometimes, when running
set waypoint outside of map
test scenario,missed_waypoint
containsINVALID_PATH=103
status instead of expectedComputePathToPose.Goal().GOAL_OUTSIDE_MAP=204
which gives the following error messages:Problem3: Robot is getting lost after setting initial pose second time
In the
![Screenshot_2023-08-16_09-41-43_cr](https://private-user-images.githubusercontent.com/60094858/261645656-3837278c-a4ad-445e-ba74-7d8c98fb03ca.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkzNzM2NTIsIm5iZiI6MTczOTM3MzM1MiwicGF0aCI6Ii82MDA5NDg1OC8yNjE2NDU2NTYtMzgzNzI3OGMtYTRhZC00NDVlLWJhNzQtN2Q4Yzk4ZmIwM2NhLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTIlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjEyVDE1MTU1MlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTczZDUxNTZiN2U5YmVjNzhlOWNlMGIxYmFmZDQ0MmJiNTIzZTgwNzhlZTRmNDk1ZjY0ODc5NWUzYWEyMjJmZjYmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.GhDxBsctHmH6QbJMyGZWPSv9yD37JndbuaW2bGPJN6U)
![Screenshot_2023-08-16_09-41-50_cr](https://private-user-images.githubusercontent.com/60094858/261645666-e30f4ae6-4d80-4139-8214-a60ef15d32ea.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkzNzM2NTIsIm5iZiI6MTczOTM3MzM1MiwicGF0aCI6Ii82MDA5NDg1OC8yNjE2NDU2NjYtZTMwZjRhZTYtNGQ4MC00MTM5LTgyMTQtYTYwZWYxNWQzMmVhLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTIlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjEyVDE1MTU1MlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWM3MzY1YWYzMGE1OWNhMzgzMDExNTI1M2JiZmNiNGM5Y2ZmOTczMDZkZWVmMGYwNTczYWM3NWY2ZTAyOGRiNjMmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.QHtU5GL7UCP6PNEDRvuyD-WvMgtJOAHHVFMjC9ZPv5Y)
stop on failure test with bogous waypoint
test scenario there istest.setInitialPose(starting_pose)
called second time while robot is not placed on its initial pose. This causes robot to lose its position. In most cases, robot still operating and reaching its goal; but sometimes since its position is getting lost, robot continues its uncontrolled movement between pillars causing testcase to fail by timeout:The text was updated successfully, but these errors were encountered: