-
Notifications
You must be signed in to change notification settings - Fork 463
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Modules all restarted automatically, causing modules to no longer communicate. #65
Comments
Some additional info: |
@varunpuranik Could you please take a look? @Hammatt do you have iotedged logs from before timestamp 22:55:53 |
@dsajanice I sure do, here you go:
I've got logs for days and days before this, lots of it seems to be the same though so I've just included a snippet close to the event here. let me know if you need more. |
Quick note too. I haven't started to implement #60 yet so it's not related to that issue. |
Hi @Hammatt, I tried to repro your scenario, and i am not able to reproduce it - the modules start and are able to send messages successfully. Since you seem were able to repro it on a couple of boxes, can you try to set the log level to debug. Maybe that will tell us more. You can set the environment variable RuntimeLogLevel to debug on the edgeAgent module to enable debug logs (for both EdgeAgent and EdgeHub). Also, are you only seeing this when the system restarts after several days? What happens if you restart the system manually by doing a $>sudo systemctl restart iotedge? Are the modules able to come up and start send messages successfully? |
Hi @varunpuranik, It's not so much that I can reproduce it on multiple devices, but more that it just randomly happened! I don't think that I would be able to easily make this happen again. This is the only time I've seen this issue. one machine was up for 7 days and the other was up for 6 days, both had all of their modules randomly restart without me doing anything at the exact same time and when they came back up, the modules would not talk to eachother. I have restarted one of the machines and all of the modules start up as expected and can message eachother, but in prod we're not really going to have access to these to restart them ourselves like that, so it would be good to understand why. |
Yes, I understand. Okay, so another question - in that state (after restarting and when no messages were being passed), were your modules trying to send messages, and failing? From the logs of one of the modules that you shared, I can see that it initialized successfully, but it is not clear what happened after that. If they were getting any errors from the server, those might be useful. If they weren't even trying to send messages, we can look at why. |
Logs of the server module sending some messages: 24-07-2018 20:32:57.23 Information: Edge Message Sent event logged. gatewayId: DevDevice. module: servermodule. which means I know The cloud upload module logs are very boring for this whole time period and just puts out heartbeats. Here are the logs from the edgeHub from this time:
This seems to me like the module sending messages is managing to correctly send them, but the module receiving the messages is not receiving them. Any idea what might cause this to happen after a module restart when it usually would work perfectly? perhaps some kind of race condition with connecting to the edgeHub? |
What transport protocols are the sending and receiving modules using? |
Both sending and receiving modules have the following transport settings: private static ITransportSettings[] PreInitialize()
{
// Cert verification is not yet fully functional when using Windows OS for the container
bool myBypassCertVerification = true;// RuntimeInformation.IsOSPlatform(OSPlatform.Windows);
if (!myBypassCertVerification)
{
CertUtils.InstallCert();
}
var myAmqpSetting = new AmqpTransportSettings(TransportType.Amqp_Tcp_Only);
// During dev you might want to bypass the cert verification. It is highly recommended to verify certs systematically in production
if (myBypassCertVerification)
{
myAmqpSetting.RemoteCertificateValidationCallback =
(inSender, inCertificate, inChain, inSslPolicyErrors) => true;
}
ITransportSettings[] mySettings = { myAmqpSetting };
return mySettings;
} where PreInitialize is usedlike this: ITransportSettings[] mySettings = PreInitialize();
ModuleClient myModuleClient = await ModuleClient.CreateFromEnvironmentAsync(mySettings); You might notice that they're always set to bypass certs, this is because the |
I believe with the information we have it is hard to say why the messages are not going through to the other modules. Can you try turning on debug logs? That might tell us something. However, note that the EdgeHub logs get quite large, so you might want to set docker to rotate the logs. As for the EdgeModuleCACertificateFile, that is now completely handled by the SDK via the CreateFromEnvironmentAsync call. So if this is a module, you don't need to worry about checking for the cert and bypassing the cert validation, etc. (Note that the story is a bit different for downstream leaf devices connecting to EdgeHub in a Gateway scenario) |
Hi @varunpuranik
Surprisingly enough, the messages are still not flowing through, even though the edgeHub module has been restarted. I still haven't restarted the clouduploadmodule. I guess from these it's pretty obvious that for some reason the cloudupload module is not actually connected to the messaging service for some reason. The follow method was ran in the cloud upload module and didn't return any errors, and all the logs showed up as expected: private static async Task InitAsync(IModuleClient inDeviceClient)
{
// Open a connection to the Edge runtime
await inDeviceClient.OpenAsync().ConfigureAwait(false);
await deviceLogger.WriteGenericLogAsync($"...IoT Hub module client initialized");
//Initialise the dicom private dictionary.
PrivateDictionary.Initialize();
await deviceLogger.WriteGenericLogAsync($"...dicom private dictionary initialized");
// Attach callback for Twin desired properties updates
await inDeviceClient.SetDesiredPropertyUpdateCallbackAsync(OnDesiredPropertiesUpdateAsync,
inDeviceClient).ConfigureAwait(false);
await deviceLogger.WriteGenericLogAsync($"...desired property update callback initialized");
// Register callback to be called when a message is received by the module
await inDeviceClient.SetInputMessageHandlerAsync("FileAvailableToUpload", UploadFileAsync,
inDeviceClient).ConfigureAwait(false);
await deviceLogger.WriteGenericLogAsync($"...edge local message handler initialized");
} Also, thanks for the FYI about the certs! |
Yes, looks like the module is not connected. But this seems to be because the module is not even aware that the connection has dropped (I have seen that happen sometimes, seems like a bug in the SDK). Can you restart the clouduploadmodule, that should start the messages. |
Thanks, I've issued a restart to that module and it has fired up again, it's somewhat of concern that this wasn't self-healing. I guess that we could do this remotely by creating a blank manifest and deploying that, and then issuing the old manifest back, making the modules restart. I'll leave debug logging on incase anything happens again so that we can check. |
Hi @varunpuranik, I've been messing about trying to see if I can find a scenario where I can recreate this behavior and I may have found something with the help of somebody else here. We've found that if you manually restart the edgeHub module after the other modules are up, it does not correctly route messages. steps:
I wonder if this is what happened to me, and one of the modules simply came up before the edgeHub module was ready? |
So, typically what seems to happen (and more with clients using MQTT than AMQP) is that the client does not connect successfully with EdgeHub (it either times out, or fails to authenticate). But if the clients do connect, then messages should flow properly. Can you please check EdgeHub logs to see if the modules between which you expected messages to flow were indeed connected to the EdgeHub? (Or if you copy the logs here, I can take a look as well) |
Hi, sorry for the delay. I couldn't remote into the dev machine from home for some reason. it appears that when the edgeHub is forcefully restarted, the consuming module does not manage to re-connect to it by itself.
Is it the responsibility of the custom module (or maybe the sdk?) to know that the hub module has restarted somehow and to re-connect? or should the hub module reconnect itself somehow? |
Yes, it is indeed the responsibility of the client to reconnect... so in this case it would be the custom module. And yes, if you are using the SDK, I would expect it to do that. Feel free to log a github issue against the SDK if you feel that is the case. |
I'm going to close this issue due to inactivity, but please feel free to reopen. Thanks. |
Expected Behavior
I would expect that modules should be able to stay up indefinitely, with no intervention, and be-self healing in that they should properly be able to stand back up if they stop for some reason.
Current Behavior
Last night, all running modules on a test device (including edgeAgent and edgeHub module) restarted at the same time for some reason which is unclear to me. Since restarting, the modules have stopped sending messages to eachother over the local edge infrastructure using
ModuleClient.SendEventAsync
.Steps to Reproduce
I wish I knew. Our modules had been all running for 7 days, mostly under heavy load / soak testing.
Context (Environment)
two basic modules, a "server module" and a "cloud upload module", the server module accepts a proprietary type of connection (layer over TCP) and then saves a file which is streamed to it. this module then sends a message to the cloud upload module telling it the location of the file that it has saved (on a shared volume) and the file is then uploaded to the cloud for processing. This has been working well and this is the first time I've seen an issue like this after running this setup for a number of months (albeit normally restarting the devices more frequently than this!)
Device (Host) Operating System
Ubuntu 18.04 Server.
Architecture
amd64
Container Operating System
Linux containers
Runtime Versions
iotedged
iotedge 1.0.0 (52ef77d)
Edge Agent
1.0.0.14487025 (52ef77d)
Edge Hub
1.0.0.14487025 (52ef77d)
Docker
Logs
I believe that the thing that caused the modules to restart may have been the daemon. here are the logs of that for the time period:
(I've taken out the genid because its being used to encrypt/decrypt/sign so I thought it might be counted as a secret, but let me know if it is actually needed to debug this).
We can see the stop requests coming through to the modules:
edgeAgent:
edgeHub:
And one of the modules:
Additional Information
All of the modules started back up, but none of them seem to be communicating to eachother. the servermodule is still receiving connections from the outside (from TCP connections coming in from outside the host system), and thinks that it is sending messages accross the edge infrastructure, but they're never showing up in the other module.
also of note:
There are two separate issues here in my eyes,
I will try to leave this device in a broken state for some time in case there are any other logs that we might need.
The text was updated successfully, but these errors were encountered: