Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

navigator: fix edge case with valid idle setpoint #14134

Merged
merged 3 commits into from
Mar 18, 2020
Merged

Conversation

julianoes
Copy link
Contributor

This is an attempt to fix an edge case in the triplet publication which can lead to crashes on autopilots with slow SD cards.

The sequence of events before this patch is:

  1. Switch to POSCTL when disarmed. At this point current valid with setpoint idle is published.
  2. Arm, takeoff, and fly using joystick/RC.
  3. Switch to RTL (or trigger RTL using RC loss). At this point the setpoint is valid but still idle and the motors will shut off.
  4. Once navigator has published the new setpoint (which can take up to 1.5 seconds on slow SD cards) we will hopefully recover.

With this patch we omit this edge case, so we never publish this idle setpoint when landed. The assumption is that this idle setpoint is no longer required with the current flight task code, however, that needs to be further verified.

@dagar
Copy link
Member

dagar commented Feb 11, 2020

Where are you seeing 1.5 seconds spent in navigator? RTL safe points or geofence?

In vehicle_status we carry the timestamp of last nav_state change. What about using this downstream in flight tasks to prevent processing of a stale position setpoint triplet?

MaEtUgR
MaEtUgR previously approved these changes Feb 11, 2020
Copy link
Member

@MaEtUgR MaEtUgR left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did some SITL testing on this and didn't see anything break but it's really hard to verify since I can't find any way to guarantee what's inside the triplet also to see if there are other such cases.

Note that the IDLE setpoint type is presumably deprecated since there's takeoff logic in the position controller and when the drone is armed and landed it stays idle until the flight task is requesting/allowing a takeoff. In mission the condition is that the output of the jerk optimized trajectory generator has a velocity setpoint that tries to go up with more than 0.3m/s: https://github.com/PX4/Firmware/blob/0d36e5094b4655c960846d481f62361d40cf1d9f/src/lib/flight_tasks/tasks/AutoLineSmoothVel/FlightTaskAutoLineSmoothVel.cpp#L315 This should work fine according to my understanding but we need proper outdoor testing before merging.

@MaEtUgR
Copy link
Member

MaEtUgR commented Feb 11, 2020

In vehicle_status we carry the timestamp of last nav_state change. What about using this downstream in flight tasks to prevent processing of a stale position setpoint triplet?

There was a proposition for that workaround before but IMHO it overcomplicates things even more. Also it doesn't solve the real problem we're experiencing in this case where Naviator is publishing valid triplet setpoints while there's no navigator mode running. I would like to make sure Navigator publishes what it wants to do when a navigator mode is running and leave the triplet invalid whenever it's not considered valid anymore which also was the point of the original fix #13534.

@dagar
Copy link
Member

dagar commented Feb 11, 2020

Also it doesn't solve the real problem we're experiencing in this case where Naviator is publishing valid triplet setpoints while there's no navigator mode running

It would be in addition to that, and as far as I know it's the only thing that actually enables us to begin addressing the race condition between navigator and flight_tasks. There are other ways to do this, but they'd be quite a lot more rework. For example splitting out flight_tasks from mc_pos_control and running both flight_tasks and navigator in the same work queue with a simple priority mechanism so that navigator runs first on vehicle_status change.

@dagar dagar requested a review from a team February 11, 2020 13:52
@dannyfpv
Copy link

Tested on pixhawk4 v5 f-450
Indoor Flight Test
Modes Tested
Stabilized Mode: Good.

Procedure
Arm and Take off in stabilized mode.

Notes
No issues noted, good flight in general.

Log
https://review.px4.io/plot_app?log=23fdae44-6000-42ec-9e49-aa3dac8a744e

@julianoes
Copy link
Contributor Author

Where are you seeing 1.5 seconds spent in navigator? RTL safe points or geofence?

That's just the worst case numbers I remember from doing dm_read on cheap SD cards.

@jorge789
Copy link

Tested on PixRacer V4

Indoor Flight Test
Modes Tested
Stabilized Mode: Good.

Procedure
Arm and Take off in stabilized mode.

Notes
No issues noted, good flight in general.

Log https://review.px4.io/plot_app?log=cd74017f-76b0-4bd9-a628-120f0b3f0563

Tested on CUAV nano V5

Indoor Flight Test
Modes Tested
Stabilized Mode: Good.

Procedure
Arm and Take off in stabilized mode.

Notes
No issues noted, good flight in general.

Log https://review.px4.io/plot_app?log=f6955de2-c20e-4aa5-a8b9-dfc2c7f73166

@julianoes
Copy link
Contributor Author

@PX4/testflights is there a chance that you could also test this with fixedwing? We would especially be interested around the behaviour right after arming. Best would be a takeoff in manual mode but also an auto-takeoff.

@Tony3dr
Copy link

Tony3dr commented Feb 18, 2020

Tested on Fixed Wing Pixhawk 1 V2
Modes Tested
Stabilized Mode:
Mission Mode:

Procedure
Did the preflight check, made sure all the commands were responsive.

Notes
Take off in stabilized successfully after 30-sec, the vehicle did an aggressive maneuver to the right without answering commands. The vehicle crashed, we had no control of the vehicle.

Log: https://review.px4.io/plot_app?log=bf04da6f-e692-4d86-b0ac-c60ec159823d

@julianoes we don't know if the crash has any relation to the PR. We will replace the frame, let us know what you see. Thanks.

@julianoes
Copy link
Contributor Author

julianoes commented Feb 18, 2020

@Tony3dr sorry about the crash! Have you been flying this airframe regularly with master?

From the log I can see that there was a software hardfault. I will investigate it.

Edit: It also looks like it was correctly following the setpoints and a turn was commanded by RC.

@julianoes
Copy link
Contributor Author

julianoes commented Feb 18, 2020

@dagar this looks like the likely cause:

[load_mon] wq:att_pos_ctrl low on stack! (0 bytes left)

And here is some symbols that were in the hardfault log:

Reading symbols from /home/julianoes/Downloads/px4_fmu-v3_default.elf...done.
(gdb) info line *0x08007681
Line 115 of "semaphore/sem_wait.c" starts at address 0x8007680 <nxsem_wait+44> and ends at 0x8007682 <nxsem_wait+46>.
(gdb) info line *0x2002fff8
No line number information available for address 0x2002fff8
(gdb) info line *0x08017e52
Line 97 of "mm_heap/mm_mallinfo.c" starts at address 0x8017e52 <mm_mallinfo+78> and ends at 0x8017e5a <mm_mallinfo+86>.
(gdb) info line *0x8d32b500
No line number information available for address 0x8d32b500
(gdb) info line *0x0819d4ac
No line number information available for address 0x819d4ac <__orb_task_stack_info>
(gdb) info line *0x0819d4ac
No line number information available for address 0x819d4ac <__orb_task_stack_info>
(gdb) info line *0x08142490
No line number information available for address 0x8142490 <_ZTVN4uORB10DeviceNodeE+8>
(gdb) info line *0x081554ee
No line number information available for address 0x81554ee
(gdb) info line *0x081967cc
No line number information available for address 0x81967cc <__orb_cpuload>
(gdb) info line *0x0819d4ac
No line number information available for address 0x819d4ac <__orb_task_stack_info>
(gdb) info line *0x0812725b
Line 46 of "../../platforms/common/px4_work_queue/ScheduledWorkItem.cpp" starts at address 0x812725a <px4::ScheduledWorkItem::schedule_trampoline(void*)>
   and ends at 0x812725c <px4::ScheduledWorkItem::schedule_trampoline(void*)+2>.
(gdb) info line 0x081418ee
Function "0x081418ee" not defined.
(gdb) info line *0x081418ee
No line number information available for address 0x81418ee
(gdb) info line *0x08155540
No line number information available for address 0x8155540 <_ZTVN8load_mon7LoadMonE+68>
(gdb) info line *0x08155528
No line number information available for address 0x8155528 <_ZTVN8load_mon7LoadMonE+44>
(gdb) info line *0x08155504
No line number information available for address 0x8155504 <_ZTVN8load_mon7LoadMonE+8>
(gdb) info line *0x08018317
Line 215 of "pthread/pthread_create.c" starts at address 0x8018316 <pthread_start+78> and ends at 0x801831a <pthread_start+82>.
(gdb) info line *0x08154d7c
No line number information available for address 0x8154d7c <_ZN3px417wq_configurationsL10lp_defaultE>
(gdb) info line *0x080087f0
Line 108 of "semaphore/sem_timedwait.c" starts at address 0x80087f0 <nxsem_timedwait+8> and ends at 0x80087f2 <nxsem_timedwait+10>.
(gdb) info line *0x08127861
Line 216 of "../../platforms/common/px4_work_queue/WorkQueueManager.cpp" starts at address 0x8127860 <px4::WorkQueueRunner(void*)+44>
   and ends at 0x8127862 <px4::WorkQueueRunner(void*)+46>.
(gdb) info line *0x081277cd
Line 162 of "../../platforms/common/px4_work_queue/WorkQueue.cpp" starts at address 0x81277c4 <px4::WorkQueue::Run()+62> and ends at 0x81277ce <px4::WorkQueue::Run()+72>.
(gdb) info line *0x08085771
Line 163 of "../../src/modules/load_mon/load_mon.cpp" starts at address 0x8085770 <load_mon::LoadMon::Run()+8> and ends at 0x8085776 <load_mon::LoadMon::Run()+14>.
(gdb) info line *0x0808547d
Line 193 of "../../src/modules/load_mon/load_mon.cpp" starts at address 0x8085478 <load_mon::LoadMon::_cpuload()+136> and ends at 0x8085480 <load_mon::LoadMon::_cpuload()+144>.
(gdb) info line *0x080853d5
Line 215 of "../../src/modules/load_mon/load_mon.cpp" starts at address 0x80853d4 <load_mon::LoadMon::_ram_used()+8> and ends at 0x80853e4 <load_mon::LoadMon::_ram_used()+24>.
(gdb) info line *0x08017c91
Line 67 of "umm_heap/umm_mallinfo.c" starts at address 0x8017c90 <mallinfo+16> and ends at 0x8017c9a <mallinfo+26>.
(gdb) info line *0x08017e3b
Line 87 of "mm_heap/mm_mallinfo.c" starts at address 0x8017e3a <mm_mallinfo+54> and ends at 0x8017e3e <mm_mallinfo+58>.
(gdb) info line *0x08017e52
Line 97 of "mm_heap/mm_mallinfo.c" starts at address 0x8017e52 <mm_mallinfo+78> and ends at 0x8017e5a <mm_mallinfo+86>.
(gdb) info line *0x08007681
Line 115 of "semaphore/sem_wait.c" starts at address 0x8007680 <nxsem_wait+44> and ends at 0x8007682 <nxsem_wait+46>.
(gdb) info line *0x0800fe13
Line 1118 of "stdio/lib_libvsprintf.c" starts at address 0x800fe0c <lib_vsprintf+1420> and ends at 0x800fe1a <lib_vsprintf+1434>.
(gdb) info line *0x0801008b
Line 66 of "stdio/lib_ultoa_invert.c" starts at address 0x801008a <__ultoa_invert+30> and ends at 0x8010092 <__ultoa_invert+38>.
(gdb) info line *0x0801a3e5
No line number information available for address 0x801a3e5 <__aeabi_uldivmod+36>
(gdb) info line *0x0800f8b9
Line 209 of "stdio/lib_libvsprintf.c" starts at address 0x800f8b0 <lib_vsprintf+48> and ends at 0x800f8c0 <lib_vsprintf+64>.

@julianoes
Copy link
Contributor Author

@Tony3dr ok it was a stackoverflow. Please don't fly fixedwing until #14179 is merged.

@Tony3dr
Copy link

Tony3dr commented Feb 18, 2020

@Tony3dr ok it was a stackoverflow. Please don't fly fixedwing until #14179 is merged.

@julianoes, yes we fly the vehicle on and off and no issues in the past. We are replacing the frame and should have the vehicle ready in the next few days.

@julianoes
Copy link
Contributor Author

@julianoes, yes we fly the vehicle on and off and no issues in the past.

@Tony3dr ok that's good. The issue came because "sideslip fusion failure" was triggered somehow, and that lead to the stack overflow, so it really was a special case which accidentally happened in this flight. The failure did not happen because of this PR here but it would still be good to test this one again.

This is an attempt to fix an edge case in the triplet publication which
can lead to crashes on autopilots with slow SD cards.

The sequence of events before this patch is:
1. Switch to POSCTL when disarmed. At this point current valid with
   setpoint idle is published.
2. Arm, takeoff, and fly using joystick/RC.
3. Switch to RTL (or trigger RTL using RC loss). At this point the
   setpoint is valid but still idle and the motors will shut off.
4. Once navigator has published the new setpoint (which can take up to
   1.5 seconds on slow SD cards) we will hopefully recover.

With this patch we omit this edge case, so we never publish this idle
setpoint when landed. The assumption is that this idle setpoint is no
longer required with the current flight task code, however, that needs
to be further verified.
@julianoes
Copy link
Contributor Author

@Tony3dr this is rebased on master now and can be tested right after you have successfully flown master.

Copy link
Member

@MaEtUgR MaEtUgR left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still agreeing to remove overwriting with a valid IDLE in every other mode than takeoff and mission, it clearly was added before we had useful takeoff logic based on if the resulting commands want to go up and is dangerous.

@julianoes
Copy link
Contributor Author

@PX4/testflights please test this one with fixedwing in stabilized and in a mission, thanks. Make sure first without propeller if auto-takeoff and things like that work as expected.

@julianoes
Copy link
Contributor Author

@Tony3dr do you think you'll be able to test it this week?

@Tony3dr
Copy link

Tony3dr commented Mar 12, 2020

@Tony3dr do you think you'll be able to test it this week?

@julianoes, unfortunately, we have rain all week.
https://www.windy.com/32.370/-117.060?32.304,-117.060,11,m:ey6acT3

I'm guessing Monday we can get that tested.

@julianoes
Copy link
Contributor Author

@Tony3dr how's the weather? :)

@Tony3dr
Copy link

Tony3dr commented Mar 17, 2020

@Tony3dr how's the weather? :)

The weather continues to limit us from Test Flying. We are monitoring closely and hopefully, we get to squeeze this PR and others we have pending.
https://www.windy.com/32.370/-117.060?32.304,-117.060,11,m:ey6acT3
@julianoes

@dagar dagar merged commit 3233e07 into master Mar 18, 2020
@dagar dagar deleted the pr-fix-idle-sp branch March 18, 2020 14:50
@julianoes
Copy link
Contributor Author

@Tony3dr this is now merged but it would still be good to do the test with master, thx!

@dannyfpv
Copy link

dannyfpv commented Mar 20, 2020

Tested on Pixhawk 1 v3 fixed-wing

Flight modes:
take off: no issues
mission: no issues
stabilize: no issues

Master daily
log:
https://review.px4.io/plot_app?log=9ccc2004-43b7-4c12-a29c-abc74226af2f

Notes:
Manual takeoff, in Stabilized Mode, we noticed a slight nose down takeoff. switched to Mission mode the vehicle followed all the waypoints and didn't notice any issues in Mission mode. Switched back to Stabilized mode after all the WP were completed, landed the vehicle manually. Overall the flight was good, smooth.

@MaEtUgR
Copy link
Member

MaEtUgR commented Mar 21, 2020

I see the only possible failure cases without an idle setpoint to be arming in Hold mode on the ground or automatic takeoff and landing but I have honestly no idea how that works with fixed-wing. I'm always hand-launching my plane in stabilized mode to avoid this uncertainty.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants