ModuleClient with MQTT TCP does not re-connect if the connection is dropped by the server #558
Comments
Thanks @varunpuranik for reporting this! |
This should be considered as a blocker since this is already in GA. Not having long running tests for iot "solution" is nothing but ridiculous. |
I was able to repro this using the MQTT fault injection tests which appear to have been disabled for a while (meaning that this is probably not working for DeviceClient to IoT Hub either). |
@CIPop If the module died and was restarted automatically by EdgeAgent, this might have been acceptable. Unfortunately, this is not the case. The Module simply hangs after disconnecting, doing nothing. A user needs to notice that this has happened, and then restart the module. I doubt if that will be acceptable to many customers. |
I'd say absolutely not. Isn't that the point of the agent and the settings in the portal: to restart modules based on those settings (healthy, etc.)? The point of the edge hub is to keep the channels open and alive.
If we have to watch the health of our module apps, and babysit the pipeline to the hub - why edge at all, then?
|
I've started investigating and found that most of the tests for MQTT and AMQP recovery were disabled a long time ago. After re-enabling them, I do see that they fail to reconnect after various types of faults. |
We've already escalated a ticket because of that problem since this is business critical and is blocking our release. |
We believe this issue has been fixed in DotNetty here: Azure/DotNetty#413. We need to cut a new release of DotNetty and update the SDK. |
Version 1.18.0 of the SDK has been released to nuget.org. Please give it a try. |
This works now with SDK version 1.18.0. Closing this issue. |
Reopening as we still have TODO items in our E2E tests that need to be investigated. |
So far, the tests that were disabled were failing because of issues on the service-side fault injection support. I'm working on fixing all test flavors but so far all connections seem to recover without any SDK changes. |
I was able to re-enable all Telemetry send/receive fault recovery tests in our CI system but not the Command and Twin. |
Is there any update on this? I have a customer stating they are running into this, and it sounds here like it is still expected to happen on Edge. |
My team is also being blocked by this issue. Updates are appreciated. |
…ure#558) Add functionality along with tests in the Edge Hub to obtain the trust bundle either from the iotedged or an input file to facilitate development. The trust bundle is not really used here rather this is being staged for additional upcoming features.
@CIPop should this issue now be fully fixed with the referenced PR - and Microsoft.Azure.Devices.Client 1.19.0? |
Can you give it a try and let us know if you still see issues? |
I've already updated and am testing this. I just wanted to check if this is actually something that should have been fully solved - since the PR mentions that more fixes/changes are to come. |
Our current release includes a large set of fixes. But yes, we still have a few more coming soon for AMQP. We are working on redesigning AMQP to address a lot of issues on that front as well. In the meantime, can you let us know whether our 1.19.0 release resolved your issues or whether you still see problems? |
I finally had time to test this, and it does not look to me as if this has been fixed for MQTT. C# SDK: 1.19.0. My setup:
When all is running, I manually restart the edgeHub container (iotedge restart edgeHub). Both modules detect the connection change.
The MQTT module, however, does not reconnect. It does run into the Retry-Expired timeout:
To work around this, I am exiting the module when the retry_expired happens. Then the edgeAgent restarts the module and it connects fine again. |
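The workaround described in the comment above (exit the process once retries have expired, and let edgeAgent restart the module) can be sketched roughly as follows. This is a minimal, language-neutral illustration in Python, not the C# SDK's API: the `Status` and `Reason` enums and `make_status_handler` are hypothetical stand-ins, and only the `retry_expired` name is taken from the comment. It also assumes, as the comment reports, that edgeAgent restarts a module whose process has exited.

```python
import sys
from enum import Enum


class Status(Enum):
    # Hypothetical connection states, for illustration only.
    CONNECTED = "connected"
    DISCONNECTED_RETRYING = "disconnected_retrying"
    DISCONNECTED = "disconnected"


class Reason(Enum):
    # Hypothetical reasons; RETRY_EXPIRED mirrors the "retry_expired"
    # condition mentioned in the comment above.
    CONNECTION_OK = "connection_ok"
    COMMUNICATION_ERROR = "communication_error"
    RETRY_EXPIRED = "retry_expired"


def make_status_handler(exit_fn=sys.exit, exit_code=1):
    """Build a status-change callback that exits once retries are exhausted.

    exit_fn is injectable so the handler can be exercised without
    actually terminating the process.
    """
    def on_status_change(status, reason):
        if reason is Reason.RETRY_EXPIRED:
            # The client has given up reconnecting: exit with a non-zero
            # code and rely on the supervisor to restart the module.
            exit_fn(exit_code)
            return True
        # Any other transition is left to the client's own retry logic.
        return False

    return on_status_change
```

A module would register the returned callback with whatever connection-status hook its client library provides. Exiting with a non-zero code is a design choice here, so that the supervisor can tell the exit apart from a clean shutdown.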
I tried this and cannot repro it anymore. Closing due to inactivity. Please reopen if you see any issues. |
OS: Ubuntu 16.04, x64
.Net Core 2.1
Device: Azure VM
SDK Version: 1.17.0
Description of the issue:
When a module uses the ModuleClient with an MQTT TCP connection to connect to the EdgeHub, the server (EdgeHub) drops the connection when it needs a new token. The expectation is that the ModuleClient will re-open this connection to the server and continue operating.
However, every so often, the ModuleClient never re-opens the connection to the EdgeHub, which leaves a stranded module that does nothing. This happens more frequently when the module is only receiving messages.
The ModuleClient does not even throw any exceptions, it simply loses the connection to the EdgeHub and stands still, not doing anything.
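The expected behavior described above (re-open the connection after the server drops it, and surface an error rather than stalling silently) can be illustrated with a generic retry loop. This is only a sketch under stated assumptions, not how the SDK or DotNetty implements recovery: `reconnect_with_backoff`, its parameters, and the exponential backoff policy are all hypothetical.

```python
import time


def reconnect_with_backoff(connect, max_attempts=5, base_delay=1.0,
                           max_delay=30.0, sleep=time.sleep):
    """Call connect() until it succeeds or max_attempts is reached.

    connect is any callable that returns a connection object or raises
    ConnectionError. The delay doubles after each failure, capped at
    max_delay. The final failure is re-raised so the caller sees an
    error instead of a module that hangs.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts:
                # Out of retries: propagate the error to the caller.
                raise
            sleep(delay)
            delay = min(delay * 2, max_delay)
```

Re-raising on the last attempt is the relevant point for this issue: a caller can then log the failure, exit, or restart, whereas a client that swallows the failure leaves the module stranded.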
Repro steps -