PX4 v1.10.1 SITL Semaphore Destroy stops #16234
Comments
@lberridge1992 Which version of AirSim are you running? Only certain versions of PX4 work with certain versions of AirSim. Could you also cross-post this on the AirSim repo for visibility on their side?
I'm using the latest master branch of AirSim, which I believe should be compatible with v1.10.1 of PX4 SITL. My HITL setup works completely fine, which suggests the sim is publishing all of the correct sensor data (although I realise that might be comparing apples and oranges!). I've had success running SITL with AirSim using the Unreal simulation environment rather than the Unity one. Even then, however, I still see intermittent lockstep timeouts and resets: PX4 stops publishing actuator control messages because the sensor uORB topic is never published, which in turn is because execution stalls on the px4_sem_destroy line described above.
Hi. I have a similar issue to the one lberridge1992 describes: PX4 stops sending hil_actuator_controls to AirSim. My investigation also points to px4_poll getting stuck. I was using the master branches of both PX4 and AirSim from a few weeks ago. In my case the stuck call is px4_poll(&fds_actuator_outputs[0], 1, 100); in Simulator::send() in simulator_mavlink.cpp. However, I'm not sure the cause is the one you mention, stalling on px4_sem_destroy. I did some debugging on the signal that should wake up px4_poll in my case (I think it's _outputs_pub.publish(actuator_outputs); in MixingOutput::setAndPublishActuatorOutputs in mixer_module.cpp). It seems to keep firing normally and regularly after the hang. I'm still not sure whether the problem lies on the triggering side or the wake-up side. Hi lberridge1992, what method did you use to confirm that it stalls on px4_sem_destroy?
Also, in my case another px4_poll call gets stuck at almost the same time. I see no such stalling if I run PX4 on macOS (keeping AirSim on Windows 10).
@kemen209 It is not expected for PX4 master to work with AirSim master: microsoft/AirSim#2477
@kemen209 As Jaeyoung-Lim pointed out, it is probably worth using the recommended version of PX4 for AirSim, which is v1.10.1; that might stabilise some of the other issues you are seeing. The debugging I did to figure out where the poll was failing was crude, but I think it makes sense. I added some PX4_INFO messages indicating which point execution gets to, and in the log, when my issue occurs, the info message after the destroy line is never printed (see image). The if check on the timeout is only there so that I wasn't getting log messages for every poll call, as it was logging so much that it caused the system to struggle; I changed the timeout variable from 50 to 49 in the sensor.cpp poll call to identify it uniquely. I'm interested in the fact that you aren't seeing this issue when running PX4 SITL on macOS. It further suggests that something odd might be going on with the Cygwin Toolchain on Windows when using AirSim, which would explain why macOS works and why HITL seems to function fine. The next thing I'm going to try is running PX4 SITL on a Linux VM with AirSim on Windows; I will let you know how it goes.
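The checkpoint logging described in this comment can be sketched roughly as below. This is a standalone illustration, not the actual patch: `traced_poll`, `checkpoint`, `g_trace` and `kTracedTimeoutMs` are hypothetical names, `std::fprintf` stands in for PX4_INFO, and the wait and destroy steps are left as comments.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical sentinel: the one call site to trace passes 49 instead of 50,
// so only that caller produces log output.
constexpr int kTracedTimeoutMs = 49;

// Checkpoints reached by traced calls, in order (kept so they can be inspected).
std::vector<std::string> g_trace;

// Record and print a checkpoint, but only for the traced call site.
void checkpoint(int timeout_ms, const char *label)
{
    if (timeout_ms != kTracedTimeoutMs) {
        return;
    }
    g_trace.emplace_back(label);
    std::fprintf(stderr, "poll checkpoint: %s\n", label);
}

// Stand-in for a poll routine instrumented around the wait and destroy steps.
int traced_poll(int timeout_ms)
{
    checkpoint(timeout_ms, "before wait");
    // ... timed wait on the semaphore would happen here ...
    checkpoint(timeout_ms, "before destroy");
    // ... semaphore destroy would happen here ...
    checkpoint(timeout_ms, "after destroy");
    return 0;
}
```

If the last checkpoint in the log is "before destroy" and "after destroy" never appears, the call in between did not return, which is the reasoning used above. Note that the extra logging changes timing, so it can also mask a race.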
@Jaeyoung-Lim Thanks for pointing that out. Before going with the master branch of both, I had already tried the combination of AirSim (v1.3.1) and PX4 (v1.10.2). I deployed it to multiple machines (desktop and laptop, Win10), and on all but one desktop UE4Editor.exe crashes in the ntdll.dll module when I try to launch it with the Block environment in debug mode. I spent days trying a lot of things: different versions of Windows, reinstalling NVIDIA drivers, packaging a stand-alone Block application and so on. I still have no idea why it crashed, or why that one desktop was the exception. I had to give up and try newer versions. Both recent version: The crash of UE4Editor.exe never came back, and I can use AirSim (with PX4 in SITL mode) to take off, land and circle until I stumble into the px4_poll hang.
@kemen209 As you can see in microsoft/AirSim#2477, v1.3.1 was NOT working with v1.10.2 or v1.10.1, and while it was fixed, there have been no new Linux releases from the AirSim side since. So the only way is to hunt down the fix by looking at the conversation in that issue.
Yesterday, a PR was merged (microsoft/AirSim#3156) to solve the communication error with the latest PX4 version. So, please try with the latest AirSim commit (a5115256257d75a52d2b2a8641eb0c6a81089bfe).
@lberridge1992 Your idea about px4_sem_destroy is a fresh angle for me. I ran a lot of tests on the notify side (part of the topic publish) - I thought the signal never reached the waiting side, or that something was wrong in the semaphore mechanism so that px4_sem_timedwait never woke up. I will run some tests on the px4_sem_destroy idea, and also test the commit mentioned by @jonyMarino.
@Jaeyoung-Lim Thanks, I'm aware of the issues in microsoft/AirSim#2477, such as the sensor timeout. But I made some changes to AirSim to make it more lock-stepped (using a SteppableClock, plus some other changes). For me, the only remaining problem is that PX4 stops sending hil_actuator_controls to AirSim even though AirSim keeps sending sensor updates to PX4 (confirmed with some debugging prints/logs). This usually happens a few minutes after the two connect (during takeoff or circling).
As an update to this, I've tested running PX4 SITL on a Linux VM hosted in VirtualBox and don't get the behaviour I was experiencing with the Cygwin toolchain when using the Unity version of AirSim. I'm not sure I completely understand what was causing PX4 SITL to stop executing at random, but this gives me a good alternative setup.
@lberridge1992 I have re-tested your px4_sem_destroy idea in my environment, but that is not my case. PX4 gets stuck on px4_sem_timedwait in px4_poll() (no luck even with the change mentioned by @jonyMarino) rather than on px4_sem_destroy as in yours. In one of my tests I even saw a weird warning like this:
It is printed as a warning, but the error message says: No error! I'm starting to wonder whether something is wrong with Cygwin. I also found another issue that may be related to the Cygwin toolchain: Fail to shutdown PX4 in SITL mode with Airsim #16253.
I don't think this issue is entirely related to AirSim in the picture. My team has seen the issue with jMAVSim in Windows using v1.10.2.
@Ankur1014 You said you found a px4_poll issue similar to mine. Do you mean this: Module A keeps publishing data (ORB_ID(actuator_outputs)), but Module B is stuck in px4_poll?
My test patch is similar to this:
@kemen209, yes. I replaced the
I've been tracing an issue when using PX4 SITL (v1.10.1) with AirSim and have narrowed it down to what I believe is a problem with destroying the semaphore when polling for sensor data. I'm running SITL on Windows 10 with the Cygwin Toolchain in lockstep mode. I've tried disabling lockstep and observe the same behaviour.
Randomly during flight (anywhere between 1 and 15+ minutes), the local position, velocity and global position uORB topics stop being published, which causes the data to become stale and the drone to fall out of the sky in the simulator. I added some additional logging to the code to see exactly which condition was being triggered, just to confirm that these topics had stopped being published:
These topics are published by the EKF2 module, which appears to stop running and therefore doesn't publish them. I realised that the EKF2 work item only runs when the sensor combined topic is published, and this eventually led me to investigate why line 508 in sensor.cpp is never being called. It appears that execution of the Run function pauses or stalls on line 460. (Apologies for the crude debugging logs. I never see the log of the number 3 at the time the issue occurs and the uORB topics stop being published, which is how I confirmed that execution stalls at this point.)
Debugging in the px4_poll function in cdev_platform.cpp suggests that the function executes correctly up until line 426
px4_sem_destroy(&sem);
at which point it gets stuck, so px4_poll never returns and eventually the sensor topic is no longer published. I'm not seeing any errors or exceptions in the terminal to suggest the code has crashed, unless these aren't being output for some reason.
Does anyone have any ideas on what could be causing this issue? What makes it more curious is that adding lots of logging statements seems to stop the issue from happening: I've been able to fly for around 50 minutes without any problems. But when I click in the PX4 console, which selects a specific log message, the issue occurs immediately. This led me to believe it might be an issue with the Cygwin toolchain?
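For readers unfamiliar with the pattern under discussion, here is a minimal sketch of a per-call semaphore lifecycle (init, timed wait, destroy) using plain POSIX semaphores. It is an assumption-laden illustration, not the px4_poll source: `wait_with_timeout` and its `post_first` flag are hypothetical, and the PX4 `px4_sem_*` wrappers may behave differently, particularly under Cygwin.

```cpp
#include <semaphore.h>
#include <time.h>
#include <cerrno>

// Returns 1 if the semaphore was posted before the deadline,
// 0 on timeout, and -1 on any other error.
int wait_with_timeout(int timeout_ms, bool post_first)
{
    sem_t sem;
    if (sem_init(&sem, 0, 0) != 0) {
        return -1;
    }

    if (post_first) {
        sem_post(&sem);  // stands in for a publisher signalling new data
    }

    // sem_timedwait takes an absolute CLOCK_REALTIME deadline.
    timespec deadline{};
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_ms / 1000;
    deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec += 1;
        deadline.tv_nsec -= 1000000000L;
    }

    // Retry if the wait is interrupted by a signal.
    int rc;
    do {
        rc = sem_timedwait(&sem, &deadline);
    } while (rc != 0 && errno == EINTR);

    const int result = (rc == 0) ? 1 : (errno == ETIMEDOUT ? 0 : -1);

    // The equivalent of this step is where the report above says execution stops.
    sem_destroy(&sem);
    return result;
}
```

Calling `wait_with_timeout(20, true)` returns as soon as the post is consumed, while `wait_with_timeout(20, false)` returns after roughly 20 ms. In both cases the destroy step runs on every exit path, so a hang there would block the caller regardless of whether data arrived.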