-
Notifications
You must be signed in to change notification settings - Fork 463
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
edgeHub race condition on startup - modules cannot send messages for 5 minutes after device deployment #5398
Comments
This error can also be seen when deploying and deleting a deployment with no modules. When deleting the deployment,
n.b.
[0] Desired properties patch according to // baseline:
{"routes":{},"schemaVersion":"1.1","storeAndForwardConfiguration":{"timeToLiveSecs":7200},"$version":24}
// patch:
{"schemaVersion":null,"storeAndForwardConfiguration":null,"$version":25}
// combined:
{"routes":{}, "$version":25} |
@onalante-msft - thank you for the follow up. I am glad to know that the issue has been confirmed. Please post an update when a solution is in the works, especially once target release information becomes available. |
@gri6507 Would it be possible for you to post debug logs (with sensitive information redacted)? You can request debug logs by setting (as part of the deployment) Thank you for your patience. |
@onalante-msft - Sorry for the detailed response. Below are the logs with Bad case
Good case
|
@gri6507 This issue is probably part of a larger class of edge cases (no pun intended) related to module twins. However, I think we may have a workaround for now. Could you try this image for your edgeHub? You should be able to pull without a password: |
@onalante-msft - I'm afraid, the problem is still there. Here's what
Bad case
Good case
|
Understood. I will generate a new image for you with another change. Just to be transparent with you, here is what we (myself and a colleague) are thinking about this issue. Edit: see subsequent comment for caveat. In your case, it is most likely as you described: a race condition wherein DPS has created the device in the Hub, but the Hub has not assigned the deployment to the device. However, the iotedge daemon still has a cached module identity, and starts up with this identity. When it tries to then merge the upstream configuration with this cached identity, we have the error you see here. The reason why my first comment's scenario elicits the same error is likely for a similar reason in that the upstream module identity is deleted while the daemon has a valid identity. While the upstream IoT Hub is always the source of truth for identities, we do have to allow for local identity caching to enable offline operation of edge devices. At the same time, we think that edgeHub should not start in case it cannot find an appropriate module identity. So, we have to distinguish between "cannot connect to the IoT Hub" and "can connect, but cannot find the identity." I am tweaking some of the error-handling code to see which places are appropriate to make this distinction. |
Alright, could you retry this image: The whole situation is complicated by the fact that IoT Hub seems to explicitly set desired properties to Edit: source branch is |
@onalante-msft - The very first run with the new $ iotedge list | grep edgeHub; docker images | grep edgehub
edgeHub running Up 2 minutes onalantecr.azurecr.io/edgehub:latest
onalantecr.azurecr.io/edgehub latest 26a7305d02fa 29 hours ago 168MB Bad case
If I may, perhaps this problem can be tackled in a different way. Let's say that the twin resolution is truly impossible in the current architecture and the default route cannot be established right away. What if this was not considered a fatal problem for telemetry messages? What if, |
When the twin cannot be resolved, (by my understanding) it does proceed as you have described. The problem is that IoT Hub Could you try I am sorry this is taking such a long time. Once again, thank you for your patience. |
@onalante-msft - the latest image results in $ iotedge list | grep edgeHub; docker images --digests | grep edgehub
edgeHub failed Failed (139) 2 minutes ago onalantecr.azurecr.io/edgehub:latest
onalantecr.azurecr.io/edgehub latest sha256:e4fd00a1f1cb690db91db4307139bff94fc6152e135314cbbb02684880f25765 ae694bb7efd8 7 hours ago 168MB
UPDATE - it looks like after failing to starting back to back for quite a few times, eventually |
Yes; unfortunately, the fact is that edgeHub should not start unless it gets a valid configuration from one of its sources. Looking through the logs you have submitted so far, it actually looks like the daemon is not initializing with a cached configuration, and so is completely dependent on receiving a valid configuration from IoT Hub [0]. If it cannot do this, starting up without a valid configuration (and thus activating the protocol heads) would mean that modules could connect, but would not have any message routes configured. As a result, exiting the edgeHub process and making edgeAgent restart it until the identity is provisioned (or the retry limit is reached) seems like the only course of action without also getting involved in IoT Hub's twin synchronization code. In this way, modules should block until edgeHub starts successfully and activates the protocol heads. Also, just to clarify, while those errors you are seeing are from more-or-less the same place in edgeHub that we had discussed earlier, one of the changes in my branch is that the desired properties are not set by upstream unless they parse correctly. So, we do not run the risk of corrupting local state like we would previously. [0] On that note, my apologies about talking about the identity cache for so long in my earlier comments. Clearly, I still have some more work to do on log examination. |
Thank for for continuing to investigate this. I wanted to take this opportunity to summarize the overall expected behavior for the edge device. What happens in |
It looks like in the C SDK, the response typedef void (*IOTHUB_CLIENT_EVENT_CONFIRMATION_CALLBACK)(IOTHUB_CLIENT_CONFIRMATION_RESULT result, void* userContextCallback);
[0] The precise circumstances under which |
@onalante-msft - I actually have a callback registered before sending the telemetry message. The problem is that when |
There is no timeout for messages and events by default, but you can specify one with |
@onalante-msft - thank you for pointing that option out. It seems that |
Oh, my mistake. I missed that deprecation notice. However, I do not understand one thing: it looks like A useful document: https://github.com/Azure/azure-iot-sdk-c/blob/master/doc/Iothub_sdk_options.md#mqtt-amqp-and-http-specific-protocol-options. Is there anything else I can help with? I will work on getting this fix into the repository in the meantime. |
My module is definitely using MQTT. The module code is written using IoT C SDK which has a critical limitation with AMQP (see Azure/azure-iot-sdk-c#1397 (comment)). The logs that I have been posting are for
It appears you are correct. I was misreading the comment. However, this also means that my original comment about my module not having any way to know when
Can you please clarify which fix you are working towards putting into the repo? Which version of the repo (1.1.x or 1.2.x)? |
The fix that would be put into the repository is to prevent edgeHub from starting without a valid configuration. The plan is to include this in 1.1, 1.2, and the main branch. Does e.g. |
I am currently using the C SDK default |
As far as I can tell, the maximum is 0 by default. |
This issue is being marked as stale because it has been open for 30 days with no activity. |
…tion (Azure#5515) Currently, EdgeHub 1.1 can start without a valid configuration. This PR backports logic introduced in later releases to check that a configuration was pulled successfully, and modifies the configuration pulling task to fail when appropriate. Fixes Azure#5398.
Expected Behavior
When an IoT Edge device is deployed, the modules specified in the deployment are started. If module docker images are already downloaded, the starting of those modules is very quick. This means that
$edgeHub
is started almost immediately after the edge device is deployed because"startupOrder" : 0
is specified foredgeHub
in the deployment. Also per deployment specification, my custom module is started after$edgeHub
(no explicitstartupOrder
in deployment). Shortly after my custom module starts, it sends a telemetry channel message using the C IoT SDKIoTHubModuleClient_LL_SendEventAsync()
function. The module checks the return value from this function call to verify that it isIOTHUB_CLIENT_OK
. So, from my custom module perspective, all should be OK and the message should be received a short time later in the cloud IoT Hub.Current Behavior
About half of the time, the cloud IoT Hub never receives that first message. I ensured that all routing from the hub is turned off and no one is consuming the message. I am using Azure IoT Explorer to listen to all telemetry messages on the hub to see if the message is received in the cloud or not. Whether the message is received in the cloud depends on whether
edgeHub
module had a startup error preventing it from setting up a default route immediately on startup.The return value for
IoTHubModuleClient_LL_SendEventAsync()
function is always the same, indicating success. There does not appear to be any other C IoT SDK API to find out if theedgeHub
has created the default route to know if it's OK to send telemetry messages.Steps to Reproduce
Provide a detailed set of steps to reproduce the bug.
"startupOrder" : 0
), and a custom module which sends telemetry messages shortly after startup.iotedge
startsiotedge
process and delete this device from IoTHubiotedge
process which will re-create the device in IoT Hub. The deployment will follow soon after. Once the edge device has the deployment, modules will start VERY quickly since docker images are already on the deviceedgeHub
log to know whether default route was created on startup or not. You can see the effect of that in IoT Hub in the cloud which will either receive the telemetry message or not.Context (Environment)
Output of
iotedge check
Device Information
Runtime Versions
Logs
Good case
Bad case
Additional Information
Even in the "bad case" where the default route is not created right away, when the 5 minute delay expires and the default route is eventually created, further module telemetry messages are conveyed to the cloud without problems.
The text was updated successfully, but these errors were encountered: