Module is disconnected from edgeHub after some hours #731

Closed
DennisK92 opened this issue Jan 18, 2019 · 11 comments

Comments

@DennisK92

Expected Behavior

The edgeHub should establish a new connection when it detects that a module is not connected.

Current Behavior

The edgeHub detects that a module is not connected but doesn't establish a new connection.

Steps to Reproduce

  1. Create 3 modules: module A sends messages (at a 2-second interval) to module B, module B forwards those messages to module C, and module C sends them upstream. A => B => C => IoT Hub
  2. Deploy on RaspberryPi 3 and let it run for some hours, while everything runs as expected.
  3. At some point module B won't receive messages anymore, because it is not connected.
  4. Messages are not transferred anymore, while all modules are up and running.

Context (Environment)

Device (Host) Operating System

Device: RaspberryPi 3
OS: Raspbian Stretch

Architecture

arm32

Container Operating System

Linux containers

Runtime Versions

iotedged

iotedge 1.0.5

Edge Agent

Agent 1.0.5.19141174

Edge Hub

Hub 1.0.5.19141174

Docker

Docker 3.0.2

Logs

Logs: https://gist.github.com/DennisK92/8e75c3698b03c3ba7abbeb6987a6534a

Additional Information

The dataFeeder creates a message every 2 seconds and sends it to anomalySimulator. That module forwards the received messages to edgeMetering, which adds some additional parameters to each message and sends it upstream to the IoT Hub. However, this only works correctly for a few hours. After that, the anomalySimulator module is no longer connected and doesn't reconnect, so messages are no longer sent to the IoT Hub.

@MPapst

MPapst commented Jan 18, 2019

Sounds a bit like the issue I was having (explained in #673).
Upgrading the SDK to the latest version actually helped for my C# modules. I think the root cause was fixed here: Azure/azure-iot-sdk-csharp#558 (only applies if your modules are C#)

@DennisK92
Author

If I understand correctly, the latest C# SDK should be pulled automatically when building the modules, or is a manual SDK update required? I rebuilt all of my modules yesterday and my module still gets disconnected after some hours. All the disconnected module does is receive a message and forward it to the next module.

@MPapst

MPapst commented Jan 22, 2019

The SDK is just a referenced NuGet package; you need to update it.

@DennisK92
Author

Okay, I have now set the PackageReference to explicitly use version 1.19.0 and rebuilt all modules. I will report back later today or tomorrow on whether the module messaging chain keeps the connection alive.

@varunpuranik
Contributor

@DennisK92 - yes, SDK update to 1.19.0 should resolve this issue. Please note that you might have to update the module image version and do another deployment to make sure the updated module is picked up by the Edge runtime.

Also, please note that connecting/reconnecting is done by the client (your module) and not by the server (EdgeHub). So this is not a bug in EdgeHub, but in the SDK you are using. Version 1.19.0 has the fix for it.
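
To illustrate the point that reconnecting is the client's job, here is a minimal sketch of a client-side retry loop with exponential backoff. It is written in Python for brevity and does not use the C# SDK discussed in this thread; `connect_with_backoff` and the `connect` callable are hypothetical names, and the real SDK's reconnect logic may work quite differently.

```python
import time
from typing import Callable


def connect_with_backoff(
    connect: Callable[[], None],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call `connect` until it succeeds and return the number of attempts used.

    `connect` is a hypothetical callable that raises ConnectionError on failure.
    The last ConnectionError is re-raised if every attempt fails.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            connect()
            return attempt
        except ConnectionError:
            if attempt == max_attempts:
                raise
            # Double the wait after each failure, capped at max_delay.
            sleep(min(base_delay * 2 ** (attempt - 1), max_delay))
    raise RuntimeError("unreachable")
```

The `sleep` parameter is injectable only so the loop can be exercised without real waiting; a module would normally leave it at the default.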

@DennisK92
Author

DennisK92 commented Jan 23, 2019

Using the 1.19.0 SDK explicitly, rebuilding all modules, and deploying them again solved the connectivity issue.
However, the edgeHub and my dataFeeder module have now crashed after some hours.
dataFeeder  failed  Failed (132)  16 hours ago
edgeHub     failed  Failed (132)  16 hours ago
The dataFeeder module crashed because of an unhandled TimeoutException, which is understandable in itself, but the module wasn't restarted by the edgeAgent even though restartPolicy is set to "always". In addition, the edgeHub crashed shortly afterwards because of a bad_alloc and was not restarted either.
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc

Because this is a resource-constrained device, "OptimizeForPerformance" is set to false for edgeHub in the deployment.
"edgeHub": {
  "type": "docker",
  "settings": {
    "image": "mcr.microsoft.com/azureiotedge-hub:1.0",
    "createOptions": "{\"HostConfig\":{\"PortBindings\":{\"5671/tcp\":[{\"HostPort\":\"5671\"}],\"8883/tcp\":[{\"HostPort\":\"8883\"}],\"443/tcp\":[{\"HostPort\":\"443\"}]}}}"
  },
  "env": {
    "OptimizeForPerformance": {
      "value": "false"
    }
  },
  "status": "running",
  "restartPolicy": "always"
}

Please see the following logs of edgeHub, edgeAgent and my dataFeeder module.
https://gist.github.com/DennisK92/9fa8c4d7c92affc52ea2db819bb7acde
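
The createOptions value in the edgeHub entry above is itself a JSON document stored as an escaped string, which makes it hard to read. The following sketch (Python, standard library only) shows one way to decode it in two steps and list the port bindings. The function name `host_port_bindings` is made up for illustration, and the fragment is trimmed to the `settings` object and wrapped in braces so that it parses on its own.

```python
import json

# The "settings" part of the edgeHub entry, wrapped in braces so it is valid JSON.
# A raw string keeps the backslash escapes exactly as they appear in the deployment.
fragment = r'''
{
  "settings": {
    "image": "mcr.microsoft.com/azureiotedge-hub:1.0",
    "createOptions": "{\"HostConfig\":{\"PortBindings\":{\"5671/tcp\":[{\"HostPort\":\"5671\"}],\"8883/tcp\":[{\"HostPort\":\"8883\"}],\"443/tcp\":[{\"HostPort\":\"443\"}]}}}"
  }
}
'''


def host_port_bindings(deployment_fragment: str) -> dict:
    """Map each container port to the host ports bound to it."""
    settings = json.loads(deployment_fragment)["settings"]
    # createOptions is a string, so it needs a second json.loads call.
    create_options = json.loads(settings["createOptions"])
    bindings = create_options["HostConfig"]["PortBindings"]
    return {
        container_port: [entry["HostPort"] for entry in entries]
        for container_port, entries in bindings.items()
    }


print(host_port_bindings(fragment))
```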

@varunpuranik
Contributor

Interesting, edgeAgent should have restarted the modules. I am not able to access the logs in the link above. Can you please check access on the link?

@DennisK92
Author

DennisK92 commented Jan 24, 2019

@varunpuranik Sorry about the link; I edited my post. The link is correct now. https://gist.github.com/DennisK92/9fa8c4d7c92affc52ea2db819bb7acde
Today all modules are still running and connected, so this problem doesn't seem to happen every time. Since a timeout occurred while sending a message, could this have something to do with an unstable network connection? However, that doesn't explain why the modules weren't restarted by the edgeAgent while the edgeAgent itself was running.

Edit: The edgeHub crashed again and then once more, as did my module. However, this time the edgeAgent successfully restarted the modules. The first crash was again caused by
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
The second crash was a different error:
terminate called after throwing an instance of 'std::system_error'
  what():  Resource temporarily unavailable
edgeHub Log: https://gist.github.com/DennisK92/1c16f2d76f9f5f6b262f03a9505b36dc

This seems to be a problem with the system's memory. Do module logs just keep on writing until the device has no more memory left, or are there mechanisms to prevent that situation? Checking the Pi with free -m gave the following values:
        total   used   free   shared   buff/cache   available
Mem:      927    348     38       52          540         467
Swap:      99     32     67
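
When reading these numbers, it may help to remember that the "free" column of `free` excludes memory used for buffers and page cache, much of which the kernel can reclaim, while "available" is the kernel's estimate of memory that new applications could use. The sketch below (Python; the helper name `parse_free_mem` is hypothetical) parses the Mem row reported above and computes the available share. It only describes this one snapshot and says nothing about memory use at the moment of the crashes.

```python
def parse_free_mem(line: str) -> dict:
    """Parse the `Mem:` row of `free -m` output into a dict of MiB values."""
    columns = ["total", "used", "free", "shared", "buff_cache", "available"]
    label, *values = line.split()
    if label != "Mem:" or len(values) != len(columns):
        raise ValueError(f"unexpected free output: {line!r}")
    return dict(zip(columns, map(int, values)))


# Values taken from the Mem row reported in this comment.
mem = parse_free_mem("Mem: 927 348 38 52 540 467")
available_pct = round(100 * mem["available"] / mem["total"])
print(f"free: {mem['free']} MiB, available: {mem['available']} MiB ({available_pct}% of total)")
```

In this snapshot only 38 MiB is strictly free, but about half of the total is still reported as available.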

@myagley
Contributor

myagley commented Feb 19, 2019

It's possible that the Edge Agent didn't restart the Edge Hub because of a deadlock in the dotnet core runtime. This was fixed in 1.0.6. Is it possible to upgrade?

@DennisK92
Author

Yes, I will upgrade to version 1.0.6 on every Pi in use. In the meantime, the problem of edgeHub not being restarted has actually occurred multiple times on version 1.0.5, so this may well fix it.

@DennisK92
Author

I'm not encountering this issue anymore. It is solved.
