-
Notifications
You must be signed in to change notification settings - Fork 13.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Logger: add watchdog to boost its priority in case of a busy-looping task #9404
Conversation
…ked for >1s This is to guard against busy-looping tasks, so that we at least have a log. In case the watchdog is triggered, it also logs CPU load of all tasks.
In case a log ends abruptly, we will know that we have everything up to the last second. A test showed that CPU load and the amount of logging drops are unaffected by this.
Helps debugging busy-looping tasks.
avoids compile error on OSX, trying to cast pthread_t to pid_t
perf_name = "perf_top_preflight"; | ||
break; | ||
case PrintLoadReason::Postflight: | ||
perf_name = "perf_top_postflight"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great!
I'm curious to see how often this is happens in a regular "good" flight. |
Is anything else in the system intentionally scheduled at the max priority? What about restoring the original priority after a relatively long period? |
Well, it should not! :)
Not that I know of.
I want to keep the logic as simple as possible, as it's something that is supposed to never happen. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bkueng This looks good.
I would be concerned about using non ISR safe system resources off the call back.
Also comment about the intentional lack of restore would be helpful.
static int counter = 0; | ||
|
||
if (++counter > 300) { | ||
PX4_ERR("max time in ready: %i ms", (int)max_time / 1000); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is printf called off of this in interrupt context?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. I thought this was safe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My recollection is that there is a possibility of a wait on semaphore in the code path of PX4_ERR. Is that not the case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll check
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, printf
calls lib_sem_initialize
which uses a semaphore. What are you using for such cases?
In any case, thanks for the hint. I removed them now.
PX4_ERR calls printf, which calls lib_sem_initialize, which calls sem_wait
Regarding more frequent
|
@bkueng what do you attribute the dropouts to? |
I enabled the high-rate logging profile, which considerably increases the logging data rate (around 110kb/s vs around 40kb/s). Plus I used a cheap SD card. |
Does this approach 0 with a high quality card at an adequate rate or is there something we need to have deeper look at? |
Yes, I have much less drops with a Sandisk Extreme U3, even with the high-rate logging. |
@PX4/testflights can you test this please? And check that logging works as usual. |
couple flights on a Pixracer (V4) using a typical class 4 SanDisk microSD card Only issue i encountered was that in the first flight, there was a sudden pause after engaging Position Mode at minute 1:46 Second flight seemed normal, no issues were observed |
Flights with pixhawk 2.1 (V3) using SdCard sandisk edge, 8gb, C4. |
couple flights on a Pixhawkmini (V3) I used a Kingston SD card, class 4, 8GB. |
Couple flights on a Pixracer (V3) Pixhawk 2 using a no brand 8GB CLASS 4 SD CARD https://logs.px4.io/plot_app?log=df7cc843-6302-4790-9903-619cab15a079 https://logs.px4.io/plot_app?log=7661a01f-3397-4eac-8543-774d6d0d888e No Issues during flight. |
@bkueng Please review before merging. |
I inspected all logs, the watchdog never got triggered.
At 1:46, there was a very short commanded (RC stick input change) switch into mission mode, and then back to position. This explains the pause. |
@LorenzMeier good to merge? |
Thanks, I reviewed the full state and its great, thanks! |
In case a task with higher prio than the log file writer starts to busy-loop, the log just stops. And since the writer runs at very low prio, most tasks can cause it.
So this PR adds a watchdog that checks from a timer interrupt if the log writer task has been in ready state but has not been scheduled over the last second. This means that someone else is hogging the CPU and we boost the priority of the logger and writer to maximum.
In case the wachdog triggers, logger will also log the CPU load of all running tasks.
The watchdog timeout is fairly low at 1 second. It will only give a false positive if there is a long-running computation of another task without interruption (which we should avoid anyway). But even if it triggers, the system will continue to work fine.
Testing
This might help with debugging #9271, since the log stops in-air. There might be different causes for the different reports, and in some cases a busy-looping task could be the cause.
Careful review required.