Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Edge Hub removes module connection #4678

Open
fbarresi opened this issue Mar 24, 2021 · 13 comments
Open

Edge Hub removes module connection #4678

fbarresi opened this issue Mar 24, 2021 · 13 comments
Assignees
Labels
1.0.10 Targeted for 1.0.10 area:connectivity Anything related to connectivity, networking, and related. customer-reported iotedge no-issue-activity

Comments

@fbarresi
Copy link

Preamble

I have many Edge devices working on production environment.
Every device has many modules that send data to the upstream and one module (named apiconnector) that sends the upstream to another API.

The Problem

Since an upgrade to the version 1.0.10, I get sporadically errors regarding the apiconnector module.
The module is running but the input stream get cut from the edgeHub.
I saw in the logfile from edgeHub that the problem may be related to an error while updating the identity token.
Otherwise there is no retry procedure for such an error even if the edgeHub writes a warning every minutes that a module is disconnected.

Expected Behavior

Running Edge modules receive the input stream in a reliable way. A transient update error for identity tokens doesn't effect the behavior of the modules.
After an error occurs the edgeHub should try to restore the an healthy state or report the error (for example to the edgeAgent)

Current Behavior

An Edge module get disconnected from its input stream, but keep be running.

Steps to Reproduce

Unable to reproduce... 😢

Context (Environment)

(sorry, I cannot fill in all these Information, but I will give an update)

Device Information

  • Host OS [e.g. Ubuntu 18.04, Windows Server IoT 2019]: Windows 10 1803 Build 17763
  • Architecture [e.g. amd64, arm32, arm64]: amd64
  • Container OS [e.g. Linux containers, Windows containers]: Windows containers

Runtime Versions

  • aziot-edged [run iotedge version]:
  • Edge Agent [image tag (e.g. 1.0.0)]: 1.0.10.4
  • Edge Hub [image tag (e.g. 1.0.0)]: 1.0.10.4
  • Docker/Moby [run docker version]: ?

Note: when using Windows containers on Windows, run docker -H npipe:////./pipe/iotedge_moby_engine version instead

Logs

edge-hub logs

<6> 2021-03-22 13:23:01.039 +01:00 [INF] - Request starting HTTP/1.1 GET http://172.22.185.99:9600/metrics  
<6> 2021-03-22 13:23:01.048 +01:00 [INF] - Request finished in 8.5752ms 200 text/plain; version=0.0.4; charset=utf-8
<6> 2021-03-22 13:26:07.889 +01:00 [INF] - Starting compaction of stores
<6> 2021-03-22 13:26:07.889 +01:00 [INF] - Starting compaction of store MetadataStore
<6> 2021-03-22 13:26:07.889 +01:00 [INF] - Starting compaction of store twins
<6> 2021-03-22 13:26:07.889 +01:00 [INF] - Starting compaction of store checkpoints
<6> 2021-03-22 13:26:07.890 +01:00 [INF] - Starting compaction of store GP731/apiconnector/input
<6> 2021-03-22 13:26:07.890 +01:00 [INF] - Starting compaction of store default
<6> 2021-03-22 13:26:07.890 +01:00 [INF] - Starting compaction of store iothub
<6> 2021-03-22 13:26:07.890 +01:00 [INF] - Starting compaction of store messages
<6> 2021-03-22 13:26:07.890 +01:00 [INF] - Starting compaction of store EdgeTwin
<6> 2021-03-22 13:26:07.890 +01:00 [INF] - Starting compaction of store sessions
<6> 2021-03-22 13:26:07.890 +01:00 [INF] - Starting compaction of store DeviceScopeCache
<6> 2021-03-22 13:26:09.150 +01:00 [INF] - Entering periodic task to reauthenticate connected clients
<6> 2021-03-22 13:27:27.981 +01:00 [INF] - Starting periodic operation Get EdgeHub config...
<6> 2021-03-22 13:27:28.107 +01:00 [INF] - Obtained edge hub config from module twin
<6> 2021-03-22 13:27:28.124 +01:00 [INF] - Set the following 2 route(s) in edge hub
<6> 2021-03-22 13:27:28.125 +01:00 [INF] - upstream: FROM /messages/modules/* INTO $upstream
<6> 2021-03-22 13:27:28.125 +01:00 [INF] - apiconnector: FROM /messages/modules/* INTO BrokeredEndpoint(\"/modules/apiconnector/inputs/input\")
<6> 2021-03-22 13:27:28.125 +01:00 [INF] - Updated message store TTL to 86400 seconds
<6> 2021-03-22 13:27:28.125 +01:00 [INF] - Updated the edge hub store and forward configuration
<6> 2021-03-22 13:27:28.125 +01:00 [INF] - Successfully completed periodic operation Get EdgeHub config
<6> 2021-03-22 13:27:28.282 +01:00 [INF] - Updated reported properties for GP731/$edgeHub
<6> 2021-03-22 13:31:09.142 +01:00 [INF] - Entering periodic task to reauthenticate connected clients
<6> 2021-03-22 13:36:09.158 +01:00 [INF] - Entering periodic task to reauthenticate connected clients
<6> 2021-03-22 13:38:57.088 +01:00 [INF] - New token received on the Cbs link
<6> 2021-03-22 13:38:57.088 +01:00 [INF] - Token updated for GP731/strippingribbedhose
<6> 2021-03-22 13:39:10.071 +01:00 [INF] - New token received on the Cbs link
<6> 2021-03-22 13:39:10.072 +01:00 [INF] - Token updated for GP731/localapi
<6> 2021-03-22 13:39:25.100 +01:00 [INF] - New token received on the Cbs link
<6> 2021-03-22 13:39:25.100 +01:00 [INF] - Token updated for GP731/coilmachine
<6> 2021-03-22 13:39:45.066 +01:00 [INF] - New token received on the Cbs link
<6> 2021-03-22 13:39:45.067 +01:00 [INF] - Token updated for GP731/weld
<6> 2021-03-22 13:40:07.042 +01:00 [INF] - New token requested by client GP731/strippingribbedhose, but using existing token as it is usable.
<6> 2021-03-22 13:40:20.054 +01:00 [INF] - New token requested by client GP731/localapi, but using existing token as it is usable.
<6> 2021-03-22 13:40:55.039 +01:00 [INF] - New token requested by client GP731/weld, but using existing token as it is usable.
<6> 2021-03-22 13:41:09.151 +01:00 [INF] - Entering periodic task to reauthenticate connected clients
<6> 2021-03-22 13:41:21.089 +01:00 [INF] - New token received on the Cbs link
<6> 2021-03-22 13:41:21.089 +01:00 [INF] - Token updated for GP731/newprogram
<6> 2021-03-22 13:41:33.072 +01:00 [INF] - New token received on the Cbs link
<6> 2021-03-22 13:41:33.073 +01:00 [INF] - Token updated for GP731/systemstate
<6> 2021-03-22 13:42:51.370 +01:00 [INF] - Started task to cleanup processed and stale messages for endpoint GP731/apiconnector/input
<6> 2021-03-22 13:42:53.090 +01:00 [INF] - Cleaned up 5472 messages from queue for endpoint GP731/apiconnector/input and 104 messages from message store.
<6> 2021-03-22 13:43:23.103 +01:00 [INF] - Started task to cleanup processed and stale messages for endpoint iothub
<6> 2021-03-22 13:43:25.116 +01:00 [INF] - Cleaned up 5478 messages from queue for endpoint iothub and 5368 messages from message store.
<6> 2021-03-22 13:46:09.153 +01:00 [INF] - Entering periodic task to reauthenticate connected clients
<6> 2021-03-22 13:46:46.013 +01:00 [INF] - Existing token not found for GP731/strippingribbedhose. Getting new token from the client...
<6> 2021-03-22 13:46:46.013 +01:00 [INF] - Connection status for GP731/strippingribbedhose changed to TokenNearExpiry
<6> 2021-03-22 13:46:46.014 +01:00 [INF] - Obtained new token for client GP731/strippingribbedhose that expires in 00:52:10.9859210
<6> 2021-03-22 13:46:59.013 +01:00 [INF] - Existing token not found for GP731/localapi. Getting new token from the client...
<6> 2021-03-22 13:46:59.013 +01:00 [INF] - Connection status for GP731/localapi changed to TokenNearExpiry
<6> 2021-03-22 13:46:59.016 +01:00 [INF] - Obtained new token for client GP731/localapi that expires in 00:52:10.9838802
<6> 2021-03-22 13:47:34.012 +01:00 [INF] - Existing token not found for GP731/weld. Getting new token from the client...
<6> 2021-03-22 13:47:34.013 +01:00 [INF] - Connection status for GP731/weld changed to TokenNearExpiry
<6> 2021-03-22 13:47:34.013 +01:00 [INF] - Obtained new token for client GP731/weld that expires in 00:52:10.9868077
<6> 2021-03-22 13:51:09.172 +01:00 [INF] - Entering periodic task to reauthenticate connected clients
<6> 2021-03-22 13:54:11.263 +01:00 [INF] - Starting refresh of device scope identities cache
<6> 2021-03-22 13:55:24.091 +01:00 [INF] - New token received on the Cbs link
<6> 2021-03-22 13:55:24.091 +01:00 [INF] - Token updated for GP731/apiconnector
<6> 2021-03-22 13:56:09.177 +01:00 [INF] - Entering periodic task to reauthenticate connected clients
<4> 2021-03-22 13:59:12.436 +01:00 [WRN] - Error while processing the service identity for GP731/apiconnector
System.TimeoutException: Operation timed out
   at Microsoft.Azure.Devices.Edge.Util.TaskEx.TimeoutAfter[T](Task`1 task, TimeSpan timeout) in C:\\agent\\_work\\9\\s\\edge-util\\src\\Microsoft.Azure.Devices.Edge.Util\\TaskEx.cs:line 128
   at Microsoft.Azure.Devices.Edge.Util.Edged.WorkloadClientVersioned.Execute[T](Func`1 func, String operation) in C:\\agent\\_work\\9\\s\\edge-util\\src\\Microsoft.Azure.Devices.Edge.Util\\edged\\WorkloadClientVersioned.cs:line 79
   at Microsoft.Azure.Devices.Edge.Util.Edged.Version_2019_01_30.WorkloadClient.HandleException(Exception ex, String operation) in C:\\agent\\_work\\9\\s\\edge-util\\src\\Microsoft.Azure.Devices.Edge.Util\\edged\\version_2019_01_30\\WorkloadClient.cs:line 122
   at Microsoft.Azure.Devices.Edge.Util.Edged.WorkloadClientVersioned.Execute[T](Func`1 func, String operation) in C:\\agent\\_work\\9\\s\\edge-util\\src\\Microsoft.Azure.Devices.Edge.Util\\edged\\WorkloadClientVersioned.cs:line 79
   at Microsoft.Azure.Devices.Edge.Util.Edged.Version_2019_01_30.WorkloadClient.EncryptAsync(String initializationVector, String plainText) in C:\\agent\\_work\\9\\s\\edge-util\\src\\Microsoft.Azure.Devices.Edge.Util\\edged\\version_2019_01_30\\WorkloadClient.cs:line 67
   at Microsoft.Azure.Devices.Edge.Storage.EncryptedStore`2.Put(TK key, TV value, CancellationToken cancellationToken) in C:\\agent\\_work\\9\\s\\edge-util\\src\\Microsoft.Azure.Devices.Edge.Storage\\EncryptedStore.cs:line 42
   at Microsoft.Azure.Devices.Edge.Hub.Core.DeviceScopeIdentitiesCache.SaveServiceIdentityToStore(String id, StoredServiceIdentity storedServiceIdentity) in C:\\agent\\_work\\9\\s\\edge-hub\\src\\Microsoft.Azure.Devices.Edge.Hub.Core\\DeviceScopeIdentitiesCache.cs:line 262
   at Microsoft.Azure.Devices.Edge.Hub.Core.DeviceScopeIdentitiesCache.HandleNewServiceIdentity(ServiceIdentity serviceIdentity) in C:\\agent\\_work\\9\\s\\edge-hub\\src\\Microsoft.Azure.Devices.Edge.Hub.Core\\DeviceScopeIdentitiesCache.cs:line 256
   at Microsoft.Azure.Devices.Edge.Hub.Core.DeviceScopeIdentitiesCache.RefreshCache() in C:\\agent\\_work\\9\\s\\edge-hub\\src\\Microsoft.Azure.Devices.Edge.Hub.Core\\DeviceScopeIdentitiesCache.cs:line 167
   <6> 2021-03-22 13:59:12.560 +01:00 [INF] - Service identity for GP731/apiconnector was removed from device scope, dropping client connection.
<6> 2021-03-22 13:59:12.560 +01:00 [INF] - Removing device connection for device GP731/apiconnector with removeCloudConnection flag 'True'.
<6> 2021-03-22 13:59:12.560 +01:00 [INF] - Closing AMQP device proxy for GP731/apiconnector because no handler was registered.
Microsoft.Azure.Devices.Edge.Hub.Core.EdgeHubConnectionException: Connection closed for device GP731/apiconnector.
<6> 2021-03-22 13:59:12.566 +01:00 [INF] - Closing link Events for GP731/apiconnector
<6> 2021-03-22 13:59:12.566 +01:00 [INF] - Closing link Events for GP731/apiconnector
<6> 2021-03-22 13:59:12.574 +01:00 [INF] - Closing link TwinSending for GP731/apiconnector
<6> 2021-03-22 13:59:12.575 +01:00 [INF] - Closing link TwinSending for GP731/apiconnector
<6> 2021-03-22 13:59:12.575 +01:00 [INF] - Closing link TwinReceiving for GP731/apiconnector
<6> 2021-03-22 13:59:12.575 +01:00 [INF] - Closing link TwinReceiving for GP731/apiconnector
<6> 2021-03-22 13:59:12.575 +01:00 [INF] - Closing link ModuleMessages for GP731/apiconnector
<6> 2021-03-22 13:59:12.575 +01:00 [INF] - Closing link ModuleMessages for GP731/apiconnector
<6> 2021-03-22 13:59:12.576 +01:00 [INF] - Closing receiver in cloud proxy fa703a1a-3f8f-47b3-af7e-8e196f01f6eb for GP731/apiconnector
<6> 2021-03-22 13:59:12.576 +01:00 [INF] - Closed cloud proxy fa703a1a-3f8f-47b3-af7e-8e196f01f6eb for GP731/apiconnector
<6> 2021-03-22 13:59:12.576 +01:00 [INF] - Device connection removed for device GP731/apiconnector
<6> 2021-03-22 13:59:12.693 +01:00 [INF] - New token received on the Cbs link
<6> 2021-03-22 13:59:12.693 +01:00 [INF] - Token updated for GP731/apiconnector
<6> 2021-03-22 13:59:12.695 +01:00 [INF] - Opened link Events for GP731/apiconnector
<6> 2021-03-22 13:59:12.699 +01:00 [INF] - Processing pending subscriptions for GP731/apiconnector
<6> 2021-03-22 13:59:12.699 +01:00 [INF] - Opened link TwinSending for GP731/apiconnector
<6> 2021-03-22 13:59:12.699 +01:00 [INF] - Opened link TwinReceiving for GP731/apiconnector
<6> 2021-03-22 13:59:12.700 +01:00 [INF] - Opened link ModuleMessages for GP731/apiconnector
<6> 2021-03-22 13:59:12.700 +01:00 [INF] - Attempting to connect to IoT Hub for client GP731/apiconnector via AMQP...
<6> 2021-03-22 13:59:12.731 +01:00 [INF] - New token requested by client GP731/apiconnector, but using existing token as it is usable.
<6> 2021-03-22 13:59:12.803 +01:00 [INF] - Cloud connection for GP731/apiconnector is True
<6> 2021-03-22 13:59:12.804 +01:00 [INF] - Connection status for GP731/apiconnector changed to ConnectionEstablished
<6> 2021-03-22 13:59:12.804 +01:00 [INF] - Client GP731/apiconnector connected to cloud, processing existing subscriptions.
<6> 2021-03-22 13:59:12.804 +01:00 [INF] - Created cloud proxy for client GP731/apiconnector via AMQP, with client operation timeout 20 seconds.
<6> 2021-03-22 13:59:12.805 +01:00 [INF] - Initialized cloud proxy daefa46d-2ffc-47f0-b229-8c20f4e816a4 for GP731/apiconnector
<6> 2021-03-22 13:59:12.805 +01:00 [INF] - Created cloud connection for client GP731/apiconnector
<6> 2021-03-22 13:59:12.826 +01:00 [INF] - Updated reported properties for GP731/$edgeHub
<6> 2021-03-22 13:59:12.846 +01:00 [INF] - Processing pending subscriptions for GP731/apiconnector
<6> 2021-03-22 13:59:12.846 +01:00 [INF] - Processing pending subscriptions for GP731/apiconnector
<4> 2021-03-22 13:59:42.507 +01:00 [WRN] - Did not receive ack for message d71be943-9bd7-4670-984c-14e4e4f67870 from device/module GP731/apiconnector
<4> 2021-03-22 13:59:42.513 +01:00 [WRN] - Error sending messages to module GP731/apiconnector
System.TimeoutException: Message completion response not received
   at Microsoft.Azure.Devices.Edge.Hub.Core.Device.DeviceMessageHandler.SendMessageAsync(IMessage message, String input) in C:\\agent\\_work\\9\\s\\edge-hub\\src\\Microsoft.Azure.Devices.Edge.Hub.Core\\device\\DeviceMessageHandler.cs:line 450
   at Microsoft.Azure.Devices.Edge.Hub.Core.Routing.ModuleEndpoint.ModuleMessageProcessor.ProcessAsync(ICollection`1 routingMessages, IDeviceProxy dp, CancellationToken token) in C:\\agent\\_work\\9\\s\\edge-hub\\src\\Microsoft.Azure.Devices.Edge.Hub.Core\\routing\\ModuleEndpoint.cs:line 167
   <4> 2021-03-22 13:59:43.527 +01:00 [WRN] - Module GP731/apiconnector is not connected
<4> 2021-03-22 13:59:45.721 +01:00 [WRN] - Module GP731/apiconnector is not connected
<4> 2021-03-22 13:59:50.028 +01:00 [WRN] - Module GP731/apiconnector is not connected
<4> 2021-03-22 13:59:59.394 +01:00 [WRN] - Module GP731/apiconnector is not connected
<4> 2021-03-22 14:00:17.781 +01:00 [WRN] - Module GP731/apiconnector is not connected
<4> 2021-03-22 14:00:51.961 +01:00 [WRN] - Module GP731/apiconnector is not connected
<6> 2021-03-22 14:01:09.179 +01:00 [INF] - Entering periodic task to reauthenticate connected clients
<4> 2021-03-22 14:01:51.964 +01:00 [WRN] - Module GP731/apiconnector is not connected
<4> 2021-03-22 14:02:51.978 +01:00 [WRN] - Module GP731/apiconnector is not connected
<4> 2021-03-22 14:03:51.981 +01:00 [WRN] - Module GP731/apiconnector is not connected
<4> 2021-03-22 14:04:51.981 +01:00 [WRN] - Module GP731/apiconnector is not connected
<4> 2021-03-22 14:05:51.995 +01:00 [WRN] - Module GP731/apiconnector is not connected
<6> 2021-03-22 14:06:09.170 +01:00 [INF] - Entering periodic task to reauthenticate connected clients
<4> 2021-03-22 14:06:52.012 +01:00 [WRN] - Module GP731/apiconnector is not connected
<4> 2021-03-22 14:07:52.013 +01:00 [WRN] - Module GP731/apiconnector is not connected
<4> 2021-03-22 14:08:52.028 +01:00 [WRN] - Module GP731/apiconnector is not connected
<4> 2021-03-22 14:09:52.029 +01:00 [WRN] - Module GP731/apiconnector is not connected
<4> 2021-03-22 14:10:52.032 +01:00 [WRN] - Module GP731/apiconnector is not connected


Additional Information

I never had such a situation with version 1.0.8.4 .

@vipeller
Copy link
Contributor

Hi @fbarresi,

Can you send us the logs of edge daemon from the same time period? edgeHub made a call there which was not answered, liked to see the daemon side. you can get the daemon logs as described at

https://docs.microsoft.com/en-us/azure/iot-edge/troubleshoot?view=iotedge-2018-06

if you use support bundle, I will need the content from iotedged.txt

@fbarresi
Copy link
Author

Hi @vipeller , Thank you for your reply.

Hier the Logfile. There is a previous exception, but at the time the problem starts (about 14:00) there is nothing.

edge-Agent logs
<6> 2021-03-22 12:22:38.591 +01:00 [INF] - Starting periodic operation refresh twin config...
<6> 2021-03-22 12:22:38.634 +01:00 [INF] - Obtained Edge agent twin from IoTHub with desired properties version 6 and reported properties version 482.
<6> 2021-03-22 12:22:38.637 +01:00 [INF] - Successfully completed periodic operation refresh twin config
<6> 2021-03-22 12:23:00.797 +01:00 [INF] - Starting periodic operation Metrics Scrape...
<6> 2021-03-22 12:23:00.797 +01:00 [INF] - Scraping Metrics
<6> 2021-03-22 12:23:00.797 +01:00 [INF] - Scraping endpoint http://edgeHub:9600/metrics
<6> 2021-03-22 12:23:00.845 +01:00 [INF] - Scraping endpoint http://edgeAgent:9600/metrics
<6> 2021-03-22 12:23:00.900 +01:00 [INF] - Storing Metrics
<6> 2021-03-22 12:23:00.914 +01:00 [INF] - Scraped and Stored Metrics
<6> 2021-03-22 12:23:00.914 +01:00 [INF] - Successfully completed periodic operation Metrics Scrape
<4> 2021-03-22 12:55:51.172 +01:00 [WRN] - Reconcile failed because of the an exception
System.TimeoutException: Operation timed out
   at Microsoft.Azure.Devices.Edge.Util.TaskEx.TimeoutAfter[T](Task`1 task, TimeSpan timeout) in C:\\agent\\_work\\9\\s\\edge-util\\src\\Microsoft.Azure.Devices.Edge.Util\\TaskEx.cs:line 128
   at Microsoft.Azure.Devices.Edge.Agent.Edgelet.Versioning.ModuleManagementHttpClientVersioned.Execute[T](Func`1 func, String operation) in C:\\agent\\_work\\9\\s\\edge-agent\\src\\Microsoft.Azure.Devices.Edge.Agent.Edgelet\\versioning\\ModuleManagementHttpClientVersioned.cs:line 146
   at Microsoft.Azure.Devices.Edge.Agent.Edgelet.Versioning.ModuleManagementHttpClientVersioned.Execute[T](Func`1 func, String operation) in C:\\agent\\_work\\9\\s\\edge-agent\\src\\Microsoft.Azure.Devices.Edge.Agent.Edgelet\\versioning\\ModuleManagementHttpClientVersioned.cs:line 146
   at Microsoft.Azure.Devices.Edge.Agent.Edgelet.Version_2020_07_07.ModuleManagementHttpClient.GetModules[T](CancellationToken cancellationToken) in C:\\agent\\_work\\9\\s\\edge-agent\\src\\Microsoft.Azure.Devices.Edge.Agent.Edgelet\\version_2020_07_07\\ModuleManagementHttpClient.cs:line 171
   at Microsoft.Azure.Devices.Edge.Agent.Docker.DockerEnvironment.GetModulesAsync(CancellationToken token) in C:\\agent\\_work\\9\\s\\edge-agent\\src\\Microsoft.Azure.Devices.Edge.Agent.Docker\\DockerEnvironment.cs:line 51
   at Microsoft.Azure.Devices.Edge.Agent.Core.Agent.GetCurrentModuleSetAsync(CancellationToken token) in C:\\agent\\_work\\9\\s\\edge-agent\\src\\Microsoft.Azure.Devices.Edge.Agent.Core\\Agent.cs:line 251
   at Microsoft.Azure.Devices.Edge.Agent.Core.Agent.ReconcileAsync(CancellationToken token) in C:\\agent\\_work\\9\\s\\edge-agent\\src\\Microsoft.Azure.Devices.Edge.Agent.Core\\Agent.cs:line 203
   <6> 2021-03-22 12:55:51.479 +01:00 [INF] - Updated reported properties
<6> 2021-03-22 12:55:56.688 +01:00 [INF] - Updated reported properties
<6> 2021-03-22 13:22:21.444 +01:00 [INF] - Starting compaction of stores
<6> 2021-03-22 13:22:21.444 +01:00 [INF] - Starting compaction of store Metrics
<6> 2021-03-22 13:22:21.444 +01:00 [INF] - Starting compaction of store moduleState
<6> 2021-03-22 13:22:21.444 +01:00 [INF] - Starting compaction of store deploymentConfig
<6> 2021-03-22 13:22:21.444 +01:00 [INF] - Starting compaction of store default
<6> 2021-03-22 13:22:38.710 +01:00 [INF] - Starting periodic operation refresh twin config...
<6> 2021-03-22 13:22:38.752 +01:00 [INF] - Obtained Edge agent twin from IoTHub with desired properties version 6 and reported properties version 484.
<6> 2021-03-22 13:22:38.753 +01:00 [INF] - Successfully completed periodic operation refresh twin config
<6> 2021-03-22 13:23:00.980 +01:00 [INF] - Starting periodic operation Metrics Scrape...
<6> 2021-03-22 13:23:00.981 +01:00 [INF] - Scraping Metrics
<6> 2021-03-22 13:23:00.981 +01:00 [INF] - Scraping endpoint http://edgeHub:9600/metrics
<6> 2021-03-22 13:23:01.035 +01:00 [INF] - Scraping endpoint http://edgeAgent:9600/metrics
<6> 2021-03-22 13:23:01.090 +01:00 [INF] - Storing Metrics
<6> 2021-03-22 13:23:01.110 +01:00 [INF] - Scraped and Stored Metrics
<6> 2021-03-22 13:23:01.110 +01:00 [INF] - Successfully completed periodic operation Metrics Scrape
<6> 2021-03-22 14:22:38.807 +01:00 [INF] - Starting periodic operation refresh twin config...
<6> 2021-03-22 14:22:38.857 +01:00 [INF] - Obtained Edge agent twin from IoTHub with desired properties version 6 and reported properties version 484.
<6> 2021-03-22 14:22:38.861 +01:00 [INF] - Successfully completed periodic operation refresh twin config
<6> 2021-03-22 14:23:01.164 +01:00 [INF] - Starting periodic operation Metrics Scrape...
<6> 2021-03-22 14:23:01.164 +01:00 [INF] - Scraping Metrics
<6> 2021-03-22 14:23:01.164 +01:00 [INF] - Scraping endpoint http://edgeHub:9600/metrics
<6> 2021-03-22 14:23:01.233 +01:00 [INF] - Scraping endpoint http://edgeAgent:9600/metrics
<6> 2021-03-22 14:23:01.306 +01:00 [INF] - Storing Metrics
<6> 2021-03-22 14:23:01.320 +01:00 [INF] - Scraped and Stored Metrics
<6> 2021-03-22 14:23:01.320 +01:00 [INF] - Successfully completed periodic operation Metrics Scrape
<6> 2021-03-22 15:22:21.518 +01:00 [INF] - Starting compaction of stores
<6> 2021-03-22 15:22:21.518 +01:00 [INF] - Starting compaction of store Metrics
<6> 2021-03-22 15:22:21.518 +01:00 [INF] - Starting compaction of store moduleState
<6> 2021-03-22 15:22:21.518 +01:00 [INF] - Starting compaction of store deploymentConfig
<6> 2021-03-22 15:22:21.518 +01:00 [INF] - Starting compaction of store default
<6> 2021-03-22 15:22:38.903 +01:00 [INF] - Starting periodic operation refresh twin config...
<6> 2021-03-22 15:22:38.949 +01:00 [INF] - Obtained Edge agent twin from IoTHub with desired properties version 6 and reported properties version 484.
<6> 2021-03-22 15:22:38.953 +01:00 [INF] - Successfully completed periodic operation refresh twin config
<6> 2021-03-22 15:23:01.374 +01:00 [INF] - Starting periodic operation Metrics Scrape...
<6> 2021-03-22 15:23:01.375 +01:00 [INF] - Scraping Metrics
<6> 2021-03-22 15:23:01.375 +01:00 [INF] - Scraping endpoint http://edgeHub:9600/metrics
<6> 2021-03-22 15:23:01.419 +01:00 [INF] - Scraping endpoint http://edgeAgent:9600/metrics
<6> 2021-03-22 15:23:01.503 +01:00 [INF] - Storing Metrics
<6> 2021-03-22 15:23:01.516 +01:00 [INF] - Scraped and Stored Metrics
<6> 2021-03-22 15:23:01.516 +01:00 [INF] - Successfully completed periodic operation Metrics Scrape

@vipeller
Copy link
Contributor

@fbarresi sorry, but those are the logs of edgeAgent. What I need is edge daemon. I know that this is confusing with those many components.

executing the command:

sudo iotedge support-bundle

it will create a compressed file. opening that file, there will be a file with name iotedged.txt. I will need the content of that file in the time period when the problem happened

@vipeller
Copy link
Contributor

Hi @fbarresi,

I am reading the latest update from Soumitra, and just to be sure, I repeat what the problem is:

Some context: the modules uses sas authentication, which for during connection they generate a token which works for an hour. Before the hour expires, clients need to provide a new token. When it happens, edgeHub stores the new token. For storing it calls edge daemon to encrypt the data.

From the logs it seems that time to time edgeHub cannot do the encryption using edger daemon, because that calls fail with timeout. When this happens, edgeHub disconnects that client (as it does not have a valid token anymore)

The expected behavior in this case that clients detect that they got disconnected, they connect back and with the new connection they provide a new authentication token. Now edgeHub either can or cannot encrypt it (considering the timeout-error mentioned above). If it fails again, the just described disconnection/connection loop start again, otherwise everything is supposed to start working.

So far I focused on why the timeout exception happened (that is why I wanted the daemon logs here and from Soumitra), but according to the latest feedback, you are less worried about the timeout error (because other modules seems to recover after that), and the question is that why the apiconnector module cannot do the same.

From the log piece you provided above, there is a repeating "Module GP731/apiconnector is not connected" which happens when edgeHub wants to route a message to a module but that is not connected.

If you are using an SDK like the C# sdk or C sdk in your modules, that is supposed to automatically connect back when they get disconnected. My first question would be that what do you use in apiconnector? (I could not see it from its log)

@fbarresi
Copy link
Author

fbarresi commented Mar 30, 2021

Dear @vipeller,

Thank you for your reply. Thank you also for have information exchange with Soumitra.

As I told Soumitra the edge deamon logs were empty for this time span. So we cannot get more help there.
Yes, the main problem is: the module apiconnector cannot recover the connection, but the output (upstream) for the module (and generally for every module) don't get affected from the identity validation problem. That makes impossible to realize the problem, because the module will keep send messages to the upstream even in case of problems, while the whole input stream get dropped.

Of course I'm interested in a solution for the timeout errors, but these errors are a transient state where the module communication doesn't stop to work. And it is right: at the moment I'm more worried about keeping the system resilient and stable with an auto-reconnect or a module restart because the data-recovery procedure in such a case is very expensive have to be done as quick as possible before other systems in the production will miss those data.

We use the C# SDK and it is the same code basis for all our modules, also for the apiconnector.

@jsucco-growlink
Copy link

also having the same issue in iotedge 1.0.10.2. Is affecting production deployment. Is there a solution coming for this?

@jsucco-growlink
Copy link

Here are the edge daemon logs for my case... The edgehub logs file is called "EdgeHub-logs-ModuleConnectionLoss.txt" and you can reference/coordinate when exception occurs with the daemon logs..
iotedged.txt

@jsucco-growlink
Copy link

forgot to attach the edgehub log file sorry..
EdgeHub-logs-ModuleConnectionLoss.txt

@fbarresi
Copy link
Author

Hi @jsucco-growlink , thank you for join to this issue.

We had a meeting with @vipeller regarding this issue. He wanted to setup a simulation in his own in order to reproduce and better understand the error.

@vipeller
Copy link
Contributor

@jsucco-growlink @fbarresi we were able to repro the issue at our environment and we have a better understanding where the problem can be - we could not find the exact problem yet, though. We are working on it.

@fbarresi
Copy link
Author

FYI: for a better analysis you can activate debug messages for edgeAgent and edgeHub

@github-actions
Copy link

This issue is being marked as stale because it has been open for 30 days with no activity.

@vipeller
Copy link
Contributor

A fix is going to be released with 1.1.3

@pmzara pmzara added 1.0.10 Targeted for 1.0.10 area:connectivity Anything related to connectivity, networking, and related. labels Nov 10, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
1.0.10 Targeted for 1.0.10 area:connectivity Anything related to connectivity, networking, and related. customer-reported iotedge no-issue-activity
Projects
None yet
Development

No branches or pull requests

4 participants