PX4 v1.10.1 SITL Semaphore Destroy stops #16234

Open
lberridge1992 opened this issue Nov 19, 2020 · 16 comments
Labels
Sim: airsim Sim: SITL software in the loop simulation

Comments

@lberridge1992

lberridge1992 commented Nov 19, 2020

I've been tracing an issue when using PX4 SITL (v1.10.1) with Airsim and have narrowed it down to what I believe is a problem with destroying the semaphore when polling for sensor data. I'm running SITL on Windows 10 with the Cygwin toolchain in lockstep mode. I've tried disabling lockstep and observe the same behaviour.

Randomly during flight (anywhere between 1 and 15+ minutes) the local position, velocity and global position uORB topics stop being published, which causes the data to become stale and the drone to fall out of the sky in the simulator. I added some extra logging to see exactly which condition was being triggered, just to confirm that these topics have stopped being published:
image

These topics are published by the EKF2 module, which appears to stop running. I realised that the EKF2 work item only runs when the sensor combined topic is published, and this eventually led me to investigate why line 508 in sensor.cpp is never called. Execution of the Run function appears to stall on line 460. (Apologies for the crude debugging logs. When the issue occurs and the uORB topics stop being published, I never see the log of the number 3, which is how I confirmed that execution stalls at this point.)

image

Debugging in the px4_poll function in cdev_platform.cpp suggests that the function executes correctly up until line 426, px4_sem_destroy(&sem);, at which point it gets stuck. As a result px4_poll never returns and the sensor topic is eventually not published. I'm not seeing any errors or exceptions in the terminal to suggest the code has crashed, unless these aren't being output for some reason.
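
For readers who don't know the pattern under discussion, here is a minimal sketch of one poll-style cycle on a plain POSIX semaphore: initialise, wait with a timeout, destroy. It is not the PX4 implementation. The helper names wait_with_timeout and poll_once are hypothetical, and the sketch assumes a platform with unnamed POSIX semaphores. POSIX leaves the result of destroying a semaphore undefined while another thread is still blocked on it, so the sketch only destroys once no other thread can touch it.

```cpp
#include <cerrno>
#include <ctime>
#include <semaphore.h>

// Hypothetical helper (not PX4 code): wait on `sem` for up to `timeout_ms`.
// Returns 1 if the semaphore was posted, 0 on timeout, -1 on any other error.
int wait_with_timeout(sem_t *sem, int timeout_ms)
{
    timespec ts{};

    // sem_timedwait takes an absolute CLOCK_REALTIME deadline.
    if (clock_gettime(CLOCK_REALTIME, &ts) != 0) {
        return -1;
    }

    ts.tv_sec += timeout_ms / 1000;
    ts.tv_nsec += (timeout_ms % 1000) * 1000000L;

    if (ts.tv_nsec >= 1000000000L) {
        ts.tv_sec += 1;
        ts.tv_nsec -= 1000000000L;
    }

    int ret;

    do {
        ret = sem_timedwait(sem, &ts);
    } while (ret != 0 && errno == EINTR); // retry if interrupted by a signal

    if (ret == 0) {
        return 1;
    }

    return (errno == ETIMEDOUT) ? 0 : -1;
}

// Hypothetical helper: one complete init / wait / destroy cycle.
// `post_first` simulates a publisher that has already signalled new data.
int poll_once(bool post_first, int timeout_ms)
{
    sem_t sem;

    if (sem_init(&sem, 0, 0) != 0) {
        return -1;
    }

    if (post_first) {
        sem_post(&sem);
    }

    const int result = wait_with_timeout(&sem, timeout_ms);

    // Only this thread uses `sem`, so destroying it here is safe.
    if (sem_destroy(&sem) != 0) {
        return -1;
    }

    return result;
}
```

Calling poll_once(true, 10) takes the "data available" path and poll_once(false, 10) takes the timeout path. Checking the return value of the destroy call as well as the wait is one cheap way to find out whether the cleanup step itself reports a problem.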

Does anyone have any idea what could be causing this? What makes it more curious is that adding lots of logging statements seems to stop the issue happening, and I've been able to fly for around 50 minutes without any problems. But when I click in the PX4 console, which selects a specific log message, the issue occurs immediately. This led me to believe it might be an issue with the Cygwin toolchain?

@Jaeyoung-Lim
Member

@lberridge1992 Which version of Airsim are you running? Only certain versions of PX4 work with certain versions of Airsim.

Could you also cross-post this on the Airsim repo for visibility on their side?
@jonyMarino FYI

@lberridge1992
Author

I'm using the latest master branch of Airsim which should be compatible with v1.10.1 of PX4 SITL I believe. My HITL setup works completely fine which suggests the sim is publishing all of the correct sensor data (although I realise that might be comparing apples and oranges!).

I've had success running SITL with Airsim using the Unreal simulation environment rather than the Unity one. Even then, however, I still see intermittent lockstep timeouts and resets: PX4 stops publishing actuator control messages because the sensor uORB topic is never published, which in turn is due to execution stalling on the px4_sem_destroy line described above.

@Jaeyoung-Lim Jaeyoung-Lim added Sim: airsim Sim: SITL software in the loop simulation labels Nov 19, 2020
@kemen209

kemen209 commented Nov 20, 2020

Hi. I'm seeing an issue similar to the one lberridge1992 describes: PX4 stops sending hil_actuator_controls to Airsim. My investigation also showed px4_poll getting stuck.

I used the master branches of both PX4 and Airsim from a few weeks ago.

In my case it is px4_poll(&fds_actuator_outputs[0], 1, 100); in Simulator::send() in simulator_mavlink.cpp.

But I'm not sure the cause is the one you mention, stalling on px4_sem_destroy.

I did some debugging on the signal that should wake up px4_poll in my case (I think it's _outputs_pub.publish(actuator_outputs); in MixingOutput::setAndPublishActuatorOutputs in mixer_module.cpp). It seems to keep firing normally and regularly after the hang. I'm still not sure whether the problem lies on the triggering side or on the wake-up side.

Hi lberridge1992, what method did you use to confirm that it stalls on px4_sem_destroy?

@kemen209

Also, in my case another px4_poll gets stuck at almost the same time.
It's int pret = px4_poll(fds, 1, 20); in Logger::run() in logger.cpp, and its trigger source is _ekf2_timestamps_pub.publish(ekf2_timestamps); in EKF2::Run() in EKF2.cpp.

No such stall occurs if I run PX4 on macOS (keeping Airsim on Windows 10).

@Jaeyoung-Lim
Member

@kemen209 It is not expected for PX4 master to work with Airsim master: microsoft/AirSim#2477

@lberridge1992
Author

lberridge1992 commented Nov 20, 2020

@kemen209 As Jaeyoung-Lim pointed out, it is probably worth using the recommended version of PX4 for Airsim, which is v1.10.1. That might stabilise some of the other issues you are seeing.

The debugging I did to figure out where the poll was failing was crude, but I think it makes sense. I added some PX4_INFO messages indicating which point execution reaches, and in the log when the issue occurs the info message after the destroy line is never printed (see image). The if check on timeout is only there so that I wasn't getting log messages for every poll call; it was logging so much that the system struggled. I changed the timeout from 50 to 49 in the sensor.cpp poll call to uniquely identify it.
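
The tagging trick described above can be sketched roughly as follows. This is an illustration only, not the code from the screenshot. The names trace_step, traced_poll, kTaggedTimeoutMs and g_last_step are hypothetical, and the sketch assumes that no other caller happens to use a 49 ms timeout.

```cpp
#include <cstdio>

// Assumption: only the call site under investigation passes this value.
constexpr int kTaggedTimeoutMs = 49;

// Last step reached by the tagged call, kept so it can be inspected later.
static int g_last_step = 0;

// Log a numbered checkpoint, but only for the tagged call site.
// Returns true if the checkpoint was logged.
bool trace_step(int timeout_ms, int step)
{
    if (timeout_ms != kTaggedTimeoutMs) {
        return false; // every other caller stays silent
    }

    g_last_step = step;
    std::fprintf(stderr, "[poll trace] reached step %d\n", step);
    std::fflush(stderr); // flush so the line survives a later hang
    return true;
}

// Stand-in for a poll function with checkpoints around each phase.
int traced_poll(int timeout_ms)
{
    trace_step(timeout_ms, 1); // before the wait
    // ... the timed wait would go here ...
    trace_step(timeout_ms, 2); // after the wait, before cleanup
    // ... the semaphore cleanup would go here ...
    trace_step(timeout_ms, 3); // after cleanup
    return 0;
}
```

If the last line printed is step 2, the stall is somewhere in the cleanup phase. One caveat: extra logging changes timing, and this issue seems to disappear when a lot of logging is added, so a missing line is a strong hint rather than proof.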

image

I'm interested that you aren't seeing this issue when running PX4 SITL on macOS. It further suggests that something odd is going on with the Cygwin toolchain on Windows when using Airsim, which would explain why macOS works and why HITL seems to function fine. The next thing I'm going to try is running PX4 SITL on a Linux VM with Airsim on Windows. I will let you know how it goes.

@kemen209

@Jaeyoung-Lim Thanks for pointing that out. Before going with the master branch of both, I had already tried the combination of Airsim (v1.3.1) and PX4 (v1.10.2). I deployed it to multiple Windows 10 machines (desktop and laptop). On all but one desktop, UE4Editor.exe crashed in the ntdll.dll module when I tried to launch it with the Block environment in debug mode.

I spent days trying a lot of things: different versions of Windows, reinstalling the NVIDIA drivers, packaging a stand-alone Block application, and so on. I still have no idea why it crashed, or why that one desktop was the exception. I had to give up and try newer versions.

Recent versions of both:
PX4 (commit: d5245a2) with Airsim (commit: d59ceb7f63878f5e087ea802d603ba0fd282ff56).

The UE4Editor.exe crash never came back, and I can use Airsim (with PX4 in SITL mode) to take off, land and circle until I run into the px4_poll hang.

@Jaeyoung-Lim
Member

@kemen209 As you can see in microsoft/AirSim#2477, v1.3.1 was NOT working with v1.10.2 or v1.10.1, and while it was fixed, there have been no new Linux releases from the Airsim side since. So the only way is to hunt down the fix by reading the conversation in that issue.

@jonyMarino

Yesterday, a PR was merged ( microsoft/AirSim#3156 ) to solve the communication error with the latest PX4 version. So, please try with the latest AirSim commit (a5115256257d75a52d2b2a8641eb0c6a81089bfe)

@kemen209

@lberridge1992 Your idea about px4_sem_destroy is a fresh angle for me.

I ran a lot of tests on the notify side (part of the topic publish). I thought either the signal never reached the waiting side, or something was wrong with the semaphore mechanism and px4_sem_timedwait never woke up.

I will run some tests on the px4_sem_destroy idea, and also test the commit mentioned by @jonyMarino.

@kemen209

@Jaeyoung-Lim Thanks, I'm aware of the problems described in microsoft/AirSim#2477, such as the sensor timeout. But I made some changes to Airsim to make it more lock-stepped (such as using a SteppableClock, plus some other changes).

For me, the only remaining problem is that PX4 stops sending hil_actuator_controls to Airsim even though Airsim keeps sending sensor updates to PX4 (confirmed with some debugging prints/logs). This usually happens a few minutes after the two connect (during takeoff or circling).

@lberridge1992
Author

As an update to this, I've tested running PX4 SITL on a Linux VM hosted in VirtualBox and don't get the behaviour I was experiencing with the Cygwin toolchain when using the Unity version of Airsim. I'm not sure I completely understand what was causing PX4 SITL to stop executing randomly, but this gives me a good alternative setup.

@kemen209

kemen209 commented Nov 24, 2020

@lberridge1992 I re-tested your px4_sem_destroy idea in my environment, but that is not what happens in my case. PX4 gets stuck on px4_sem_timedwait in px4_poll() (no luck even with the change mentioned by @jonyMarino), rather than on px4_sem_destroy as in yours.

In one of my tests I even saw some odd warnings like these:

WARN  [cdev] logger: px4_poll() sem error: No error
WARN  [cdev] navigator: px4_poll() sem error: No error
WARN  [cdev] sim_send: px4_poll() sem error: No error

They are printed as warnings, but the error message says: No error!
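
One possible reading of such a message, assuming the warning text is built from strerror(errno) after a failed wait: the wait call reported failure, but errno was 0 by the time the message was formatted, and strerror(0) yields a "no error" style string (the exact wording depends on the platform). The sketch below is not PX4 code and describe_wait_failure is a hypothetical helper. It separates that case from a real timeout and from a real error code.

```cpp
#include <cerrno>
#include <cstring>
#include <string>

// Hypothetical helper: classify the outcome of a timed wait.
// `ret` is the wait call's return value; `saved_errno` is errno captured
// immediately after the call, before any other library call can change it.
std::string describe_wait_failure(int ret, int saved_errno)
{
    if (ret == 0) {
        return "ok";
    }

    if (saved_errno == ETIMEDOUT) {
        return "timeout"; // the expected outcome when no data arrives
    }

    if (saved_errno == 0) {
        // Failure reported without an error code: errno was never set,
        // or it was overwritten before it was read.
        return "failed, but errno is 0";
    }

    return std::string("sem error: ") + std::strerror(saved_errno);
}
```

Logging the raw return value and the numeric errno next to the strerror text would show whether the wait really failed without setting an error code on this toolchain.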

I'm starting to wonder whether something is wrong with Cygwin. I also found another issue that may be related to the Cygwin toolchain: Fail to shutdown PX4 in SITL mode with Airsim #16253.

@ghost

ghost commented Nov 26, 2020

I don't think this issue is specific to AirSim. My team has seen it with jMAVSim on Windows using v1.10.2.
Our observations are similar to what both @lberridge1992 and @kemen209 have reported.

  • EKF2 stops after some time causing no new data to be published for vehicle_local_position and vehicle_global_position.

  • Our debugging pointed us to the px4_poll code in simulator_mavlink.cpp as @kemen209 mentioned.

  • One of my teammates dug further into the lockstep scheduling, and I believe his observations about px4_sem were similar to those of @lberridge1992.

We reported #15437 and #15446 for the issues we observed.

@kemen209

@Ankur1014 You said you found a px4_poll issue similar to mine. Do you mean this:

Module A keeps publishing data (ORB_ID(actuator_outputs)), but Module B is stuck in px4_poll?

Module A:
// src/lib/mixer_module/mixer_module.cpp
void
MixingOutput::setAndPublishActuatorOutputs(unsigned num_outputs, actuator_outputs_s &actuator_outputs)
{
	// ...
	_outputs_pub.publish(actuator_outputs);
}

Module B:
// src/modules/simulator/simulator_mavlink.cpp
void Simulator::send()
{
	// ...
	int pret = px4_poll(&fds_actuator_outputs[0], 1, 100);
	// ...
}

And my test patch is similar to this:

diff --git a/src/lib/mixer_module/mixer_module.cpp b/src/lib/mixer_module/mixer_module.cpp
index 153f7a51fa..d693440c62 100644
--- a/src/lib/mixer_module/mixer_module.cpp
+++ b/src/lib/mixer_module/mixer_module.cpp
@@ -435,7 +435,10 @@ MixingOutput::setAndPublishActuatorOutputs(unsigned num_outputs, actuator_output
 	}
 
 	actuator_outputs.timestamp = hrt_absolute_time();
-	_outputs_pub.publish(actuator_outputs);
+
+	PX4_INFO("mix");
+	bool t = _outputs_pub.publish(actuator_outputs);
+	PX4_INFO("mix done: %d", t);
 }
 
 void
diff --git a/src/modules/simulator/simulator_mavlink.cpp b/src/modules/simulator/simulator_mavlink.cpp
index 7951b1a8a5..3910673c98 100644
--- a/src/modules/simulator/simulator_mavlink.cpp
+++ b/src/modules/simulator/simulator_mavlink.cpp
@@ -189,6 +189,7 @@ void Simulator::send_controls()
 
 		PX4_DEBUG("sending controls t=%ld (%ld)", _actuator_outputs.timestamp, hil_act_control.time_usec);
 
+		PX4_INFO("sending");
 		send_mavlink_message(message);
 	}
 }
@@ -610,6 +611,7 @@ void Simulator::send()
 	while (true) {
 
 		// Wait for up to 100ms for data.
+		PX4_INFO("polling");
 		int pret = px4_poll(&fds_actuator_outputs[0], 1, 100);
 
 		if (pret == 0) {
@@ -622,6 +624,7 @@ void Simulator::send()
 			continue;
 		}
 
+		PX4_INFO("polled");
 		if (fds_actuator_outputs[0].revents & POLLIN) {
 			// Got new data to read, update all topics.
 			parameters_update(false);

@ghost

ghost commented Nov 26, 2020

@kemen209 , yes.

I replaced the px4_poll in simulator_mavlink.cpp with usleep and I see better performance
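
A sleep-based wait of the kind mentioned above might look roughly like the sketch below. It is an illustration under stated assumptions, not the actual change. Topic and wait_for_update are hypothetical names, the generation counter stands in for "new data was published", and std::this_thread::sleep_for stands in for usleep.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Hypothetical stand-in for a published topic: the publisher bumps
// `generation` every time it writes new data.
struct Topic {
    std::atomic<unsigned> generation{0};
};

// Check for new data at a fixed interval instead of blocking on a semaphore.
// Returns true (and updates `last_seen`) if new data arrived before the
// timeout, false otherwise.
bool wait_for_update(const Topic &topic, unsigned &last_seen,
                     std::chrono::milliseconds timeout,
                     std::chrono::microseconds interval = std::chrono::microseconds(500))
{
    const auto deadline = std::chrono::steady_clock::now() + timeout;

    for (;;) {
        const unsigned current = topic.generation.load(std::memory_order_acquire);

        if (current != last_seen) {
            last_seen = current;
            return true;
        }

        if (std::chrono::steady_clock::now() >= deadline) {
            return false;
        }

        std::this_thread::sleep_for(interval); // plays the role of usleep
    }
}
```

The trade-off is latency against CPU use. The consumer notices new data at most one interval late and wakes up even when nothing has changed, but it no longer relies on the semaphore wake-up and cleanup path that the comments above suspect.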
