ConnectionFailedError(None) caused by gaierror(-3, 'Temporary failure in name resolution') #473

Closed
MiguelCHR opened this issue Mar 2, 2020 · 34 comments

Comments

@MiguelCHR

MiguelCHR commented Mar 2, 2020

Hi, I am having this problem with azure-iot-device 2.1.0.

Any ideas? Thanks in advance.

ERROR:azure.iot.device.common.pipeline.pipeline_stages_mqtt:transport.connect raised error
02.03.20 14:34:40 (-0700)  core  ERROR:azure.iot.device.common.pipeline.pipeline_stages_mqtt:Traceback (most recent call last):
02.03.20 14:34:40 (-0700)  core    File "/usr/local/lib/python3.7/dist-packages/azure/iot/device/common/mqtt_transport.py", line 367, in connect
02.03.20 14:34:40 (-0700)  core      host=self._hostname, port=8883, keepalive=DEFAULT_KEEPALIVE
02.03.20 14:34:40 (-0700)  core    File "/usr/local/lib/python3.7/dist-packages/paho/mqtt/client.py", line 937, in connect
02.03.20 14:34:40 (-0700)  core      return self.reconnect()
02.03.20 14:34:40 (-0700)  core    File "/usr/local/lib/python3.7/dist-packages/paho/mqtt/client.py", line 1071, in reconnect
02.03.20 14:34:40 (-0700)  core      sock = self._create_socket_connection()
02.03.20 14:34:40 (-0700)  core    File "/usr/local/lib/python3.7/dist-packages/paho/mqtt/client.py", line 3522, in _create_socket_connection
02.03.20 14:34:40 (-0700)  core      return socket.create_connection(addr, source_address=source, timeout=self._keepalive)
02.03.20 14:34:40 (-0700)  core    File "/usr/lib/python3.7/socket.py", line 707, in create_connection
02.03.20 14:34:40 (-0700)  core      for res in getaddrinfo(host, port, 0, SOCK_STREAM):
02.03.20 14:34:40 (-0700)  core    File "/usr/lib/python3.7/socket.py", line 748, in getaddrinfo
02.03.20 14:34:40 (-0700)  core      for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
02.03.20 14:34:40 (-0700)  core  socket.gaierror: [Errno -3] Temporary failure in name resolution
02.03.20 14:34:40 (-0700)  core  
02.03.20 14:34:40 (-0700)  core  The above exception was the direct cause of the following exception:
02.03.20 14:34:40 (-0700)  core  
02.03.20 14:34:40 (-0700)  core  Traceback (most recent call last):
02.03.20 14:34:40 (-0700)  core    File "/usr/local/lib/python3.7/dist-packages/azure/iot/device/common/pipeline/pipeline_stages_mqtt.py", line 117, in _run_op
02.03.20 14:34:40 (-0700)  core      self.transport.connect(password=self.sas_token)
02.03.20 14:34:40 (-0700)  core    File "/usr/local/lib/python3.7/dist-packages/azure/iot/device/common/mqtt_transport.py", line 387, in connect
02.03.20 14:34:40 (-0700)  core      raise exceptions.ConnectionFailedError(cause=e)
02.03.20 14:34:40 (-0700)  core  azure.iot.device.common.transport_exceptions.ConnectionFailedError: ConnectionFailedError(None) caused by gaierror(-3, 'Temporary failure in name resolution')
02.03.20 14:34:40 (-0700)  core  
02.03.20 14:34:40 (-0700)  core  ERROR:azure.iot.device.common.pipeline.pipeline_ops_base:ConnectOperation: completing with error ConnectionFailedError(None) caused by gaierror(-3, 'Temporary failure in name resolution')
02.03.20 14:34:40 (-0700)  core  ERROR:azure.iot.device.common.pipeline.pipeline_stages_base:ConnectionLockStage(ConnectOperation): op failed.  Unblocking queue with error: ConnectionFailedError(None) caused by gaierror(-3, 'Temporary failure in name resolution')
02.03.20 14:34:40 (-0700)  core  ERROR:azure.iot.device.common.pipeline.pipeline_ops_base:ConnectOperation: completing with error ConnectionFailedError(None) caused by gaierror(-3, 'Temporary failure in name resolution')
02.03.20 14:34:40 (-0700)  core  ERROR:azure.iot.device.common.async_adapter:Callback completed with error ConnectionFailedError(None) caused by gaierror(-3, 'Temporary failure in name resolution')
02.03.20 14:34:40 (-0700)  core  ERROR:azure.iot.device.common.async_adapter:["azure.iot.device.common.transport_exceptions.ConnectionFailedError: ConnectionFailedError(None) caused by gaierror(-3, 'Temporary failure in name resolution')\n"]

AB#7366699

@BertKleewein
Member

@MiguelCHR - can you tell us a little more about your environment? Are you running inside a docker container? If so, what is the base image? If not, what OS are you using? Are you trying to connect to IotHub or IotEdge? If you have a connection string, what happens when you try to ping or tracert the HostName from your connection string?
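
Where `ping` or `tracert` is not available on the device, a rough equivalent of that check can be run from Python. The following is only a sketch: `parse_hostname` and `can_resolve` are hypothetical helper names, not part of azure-iot-device, and the sketch assumes the usual `Key=Value;Key=Value` connection string layout. It asks the OS resolver for the address using `socket.getaddrinfo`, the same call that fails in the traceback above.

```python
import socket


def parse_hostname(connection_string):
    """Return the HostName field of a 'Key=Value;Key=Value' connection string."""
    fields = dict(
        part.split("=", 1) for part in connection_string.split(";") if "=" in part
    )
    return fields["HostName"]


def can_resolve(hostname, port=8883):
    """Return True if the OS resolver can turn hostname into an address."""
    try:
        socket.getaddrinfo(hostname, port, 0, socket.SOCK_STREAM)
        return True
    except socket.gaierror as err:
        # Same exception type as in the log: gaierror(-3, ...) or gaierror(-2, ...)
        print("resolution failed:", err)
        return False
```

Calling `can_resolve(parse_hostname(conn_str))` periodically on an affected device, and logging the result, might help show whether name resolution is already failing before the SDK tries to reconnect.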

@MiguelCHR
Author

MiguelCHR commented Mar 4, 2020

Hi:
Are you running inside a docker container?
Yes

If so, what is the base image?
hub.docker.com/r/balenalib/fincm3-python
github.com/balena-io-library/base-images/tree/master/balena-base-images/python/fincm3

Are you trying to connect to IotHub or IotEdge?
IoTHub

If you have a connection string, what happens when you try to ping or tracert the HostName from your connection string?
Yes, we have a connection string.
We can't run ping or tracert, because the error happens sporadically in the field (production environment).

Thanks in advance

@BertKleewein
Member

Can you be more specific on your base image? I see images in there built on Alpine, Debian, Fedora, and Ubuntu. I'm asking because I've seen intermittent failures from the DNS resolver that Alpine uses and I've never been impressed by it. More specifically, there are issues with the nslookup from busybox 1.28, which is used in Alpine 3.9 (moby/libnetwork#2371). I can't say for sure that this is your issue, but it could be.

@MiguelCHR
Author

We are currently using Debian, thanks.

@MiguelCHR
Author

Hi, any update on this error? We currently have a lot of devices in the field with this issue. Thanks in advance.

@BertKleewein
Member

Our library should only be considering this a fatal error if this is the first connection to IoTHub for the current run of the executable. It assumes that a failed connection is a configuration issue until it can connect once. Once the library connects successfully, it assumes that the configuration is valid and that the failure is transient. If the error is fatal (connecting for the first time), it fails immediately. If the error is transient (because it has previously connected), it retries until a connection can be established.
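
As a rough illustration of the rule described above (this is not the library's actual code; `ConnectPolicy` and its method names are made up for this sketch), the decision can be reduced to a single flag that records whether any connect has succeeded during the current run:

```python
class ConnectPolicy:
    """Illustrative only: treat failures as fatal until one connect succeeds."""

    def __init__(self):
        self.has_connected = False

    def on_connect_success(self):
        # From now on, connection failures are assumed to be transient.
        self.has_connected = True

    def on_connect_failure(self, error):
        # Before the first success, assume a configuration problem and give up.
        # After it, assume a transient network problem and retry.
        return "retry" if self.has_connected else "fail"
```

Under this rule, a DNS failure on the very first connect of a process would be reported as fatal, while the same failure after a successful connect would be retried.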

We are working to improve this behavior.

Just out of curiosity, is this causing an API to raise an exception, or are you observing this by some other means (e.g. API timeout or observation of logging output)?

@MiguelCHR
Author

We fix this error by restarting the device. The error never happens on the first connection; it happens when the device has been online and reporting for some time, at least in our case.

Yes, the exception is raised by an API.

Thanks!

@BertKleewein
Member

This may be fixed in 2.1.1, but I'm going to review this specific issue a little further before I pronounce it fixed. I've made some fixes, but there are more extreme fixes I could make. Please let me know if this is resolved.

@MiguelCHR
Author

Thanks Bert, I will be updating all our devices to 2.1.1. I will let you guys know if anything else comes up.

@mikechari

Hi, we are still getting this error on 2.1.1:

13.05.20 08:38:34 (-0600)  core  Traceback (most recent call last):
13.05.20 08:38:34 (-0600)  core    File "/usr/local/lib/python3.7/dist-packages/azure/iot/device/common/pipeline/pipeline_stages_mqtt.py", line 168, in _run_op
13.05.20 08:38:34 (-0600)  core      self.transport.connect(password=self.sas_token)
13.05.20 08:38:34 (-0600)  core    File "/usr/local/lib/python3.7/dist-packages/azure/iot/device/common/mqtt_transport.py", line 398, in connect
13.05.20 08:38:34 (-0600)  core      raise exceptions.ConnectionFailedError(cause=e)
13.05.20 08:38:34 (-0600)  core  azure.iot.device.common.transport_exceptions.ConnectionFailedError: ConnectionFailedError(None) caused by gaierror(-3, 'Temporary failure in name resolution')
13.05.20 08:38:34 (-0600)  core  
13.05.20 08:38:34 (-0600)  core  ERROR:azure.iot.device.common.pipeline.pipeline_ops_base:ConnectOperation: completing with error ConnectionFailedError(None) caused by gaierror(-3, 'Temporary failure in name resolution')
13.05.20 08:38:34 (-0600)  core  ERROR:azure.iot.device.common.pipeline.pipeline_stages_base:ConnectionLockStage(ConnectOperation): op failed.  Unblocking queue with error: ConnectionFailedError(None) caused by gaierror(-3, 'Temporary failure in name resolution')
13.05.20 08:38:34 (-0600)  core  ERROR:azure.iot.device.common.pipeline.pipeline_ops_base:ConnectOperation: completing with error ConnectionFailedError(None) caused by gaierror(-3, 'Temporary failure in name resolution')
13.05.20 08:38:34 (-0600)  core  ERROR:azure.iot.device.common.async_adapter:Callback completed with error ConnectionFailedError(None) caused by gaierror(-3, 'Temporary failure in name resolution')
13.05.20 08:38:34 (-0600)  core  ERROR:azure.iot.device.common.async_adapter:["azure.iot.device.common.transport_exceptions.ConnectionFailedError: ConnectionFailedError(None) caused by gaierror(-3, 'Temporary failure in name resolution')\n"]

It seems to be the same error that was reported before: the devices are offline in the front end and online in the back end.

We still need to update to 2.1.2. Might that fix this?

Thanks

@MiguelCHR
Author

Also, not sure if this is the place to ask, but we haven't found any documentation on turning off the logs. How can we remove all the logs from the azure.iot.device library?

Thanks in advance

@dschenzer
Member

@BertKleewein my customer has the same issue:

Repeating the same test scenario (changing from network O2 to Telekom)

May 27 13:16:13: ERROR:transport.connect raised error
May 27 13:16:13: socket.gaierror: [Errno -3] Temporary failure in name resolution
May 27 13:16:13: azure.iot.device.common.transport_exceptions.ConnectionFailedError: ConnectionFailedError(None) caused by gaierror(-3, 'Temporary failure in name resolution')
May 27 13:16:13: ERROR:ConnectOperation: completing with error ConnectionFailedError(None) caused by gaierror(-3, 'Temporary failure in name resolution')
May 27 13:16:13: ERROR:ConnectionLockStage(ConnectOperation): op failed. Unblocking queue with error: ConnectionFailedError(None) caused by gaierror(-3, 'Temporary failure in name resolution')
May 27 13:16:44: ERROR:ConnectOperation: completing with error OperationCancelled('Transport timeout on connection operation')
May 27 13:16:44: ERROR:ConnectionLockStage(ConnectOperation): op failed. Unblocking queue with error: ConnectionFailedError(None) caused by gaierror

Do you have a fix for this issue?

@BertKleewein
Member

@MiguelCHR - you can remove almost all of the logging by calling logging.getLogger("azure.iot.device").setLevel(level=logging.ERROR).
If that is still too much, you can pass level=logging.CRITICAL.
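
Written out as a runnable snippet, the suggestion above might look like the following sketch. It uses only the standard `logging` module; the variable name `sdk_logger` is arbitrary. Child loggers such as `azure.iot.device.common.pipeline.pipeline_ops_base` have no level of their own by default, so they pick up the level set on the parent.

```python
import logging

# Show only ERROR and above from the SDK's loggers; other loggers are unaffected.
sdk_logger = logging.getLogger("azure.iot.device")
sdk_logger.setLevel(logging.ERROR)

# If that is still too noisy, raise the threshold further:
# sdk_logger.setLevel(logging.CRITICAL)
```

Note that the entries quoted earlier in this issue are logged at ERROR level, so they would still appear with `logging.ERROR` and would only be hidden with `logging.CRITICAL`.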

@BertKleewein
Member

@MiguelCHR - azure-iot-device 2.1.3 has been released to pypi. 2.1.1, 2.1.2, and 2.1.3 all contain various stability and reliability fixes. Any of them could fix problems that have this symptom.

Can you please try to reproduce this with the latest version? If you can make it happen, what is the device client API that failed?

@MiguelCHR
Author

Thanks for the heads up, Bert. I will keep an eye on the devices running 2.1.3 to see if we can catch any errors, and I will let you know ASAP.

@nishad1092

Hi @BertKleewein ,

I'm facing these errors now. I made some changes to the Python script in my modules, and it's giving:


socket.gaierror: [Errno -2] Name or service not known

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/azure/iot/device/common/pipeline/pipeline_stages_mqtt.py", line 168, in _run_op
    self.transport.connect(password=self.sas_token)
  File "/usr/local/lib/python3.7/site-packages/azure/iot/device/common/mqtt_transport.py", line 405, in connect
    raise exceptions.ConnectionFailedError(cause=e)
azure.iot.device.common.transport_exceptions.ConnectionFailedError: ConnectionFailedError(None) caused by gaierror(-2, 'Name or service not known')

ConnectOperation: completing with error ConnectionFailedError(None) caused by gaierror(-2, 'Name or service not known')
ConnectionLockStage(ConnectOperation): op failed.  Unblocking queue with error: ConnectionFailedError(None) caused by gaierror(-2, 'Name or service not known')
ConnectOperation: completing with error ConnectionFailedError(None) caused by gaierror(-2, 'Name or service not known')
Callback completed with error ConnectionFailedError(None) caused by gaierror(-2, 'Name or service not known')
["azure.iot.device.common.transport_exceptions.ConnectionFailedError: ConnectionFailedError(None) caused by gaierror(-2, 'Name or service not known')\n"]
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/azure/iot/device/common/mqtt_transport.py", line 383, in connect

What is causing these connection errors? I made all the changes to my Python script while I was inside the container, and I made sure it was working fine before I deployed to my device. I'm using azure-iot-device 2.1.1 and I also tried 2.1.4, but I still get the error.

@BertKleewein
Member

@nishad1092, can you tell me what client API is failing please?

@nishad1092

Hi @BertKleewein ,
I'm using IoTHubModuleClient and azure-iot-device 2.1.1.

@nishad1092

Every time I deploy it through VS Code to my edge device, it fails with this error. This never happened before; it started after I updated my Python script, but I haven't changed or added any libraries.

@BertKleewein
Member

But what is "it"? What is failing? I'm asking because having errors show up via logger.error() is not the same as APIs failing.

Our code currently logs error messages even when we handle the error and I'm trying to discriminate between "an error was logged but was successfully handled" and "an error caused a client API to fail".

@nishad1092

Sure, Sir. I really couldn't figure out what went wrong. Please tell me what you want to know.

I have two modules: one gets its message from a cloud invocation, and the other gets that message as input from the first module. Basically, it is inter-module communication. So far it has worked fine. Whenever I want to make changes, I go into the container, make the changes, and then commit them. Recently I made changes, but nothing major to any library, only very small changes to my Python script.

My Python script also uses MQTT and a topic. Just now I ran the working version of my module, went inside the container, and made all the changes I need. It works fine inside the container as an edge module, but when I push those changes through my deployment manifest file, it fails continuously.

@nishad1092

Both modules give:

Connected with result code 0
transport.connect raised error
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/azure/iot/device/common/mqtt_transport.py", line 383, in connect
    host=self._hostname, port=8883, keepalive=DEFAULT_KEEPALIVE
  File "/usr/local/lib/python3.7/site-packages/paho/mqtt/client.py", line 937, in connect
    return self.reconnect()
  File "/usr/local/lib/python3.7/site-packages/paho/mqtt/client.py", line 1071, in reconnect
    sock = self._create_socket_connection()
  File "/usr/local/lib/python3.7/site-packages/paho/mqtt/client.py", line 3522, in _create_socket_connection
    return socket.create_connection(addr, source_address=source, timeout=self._keepalive)
  File "/usr/local/lib/python3.7/socket.py", line 707, in create_connection
    for res in getaddrinfo(host, port, 0, SOCK_STREAM):
  File "/usr/local/lib/python3.7/socket.py", line 752, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/azure/iot/device/common/pipeline/pipeline_stages_mqtt.py", line 168, in _run_op
    self.transport.connect(password=self.sas_token)
  File "/usr/local/lib/python3.7/site-packages/azure/iot/device/common/mqtt_transport.py", line 405, in connect
    raise exceptions.ConnectionFailedError(cause=e)
azure.iot.device.common.transport_exceptions.ConnectionFailedError: ConnectionFailedError(None) caused by gaierror(-2, 'Name or service not known')

@nishad1092

@BertKleewein basically, when I look at edgeHub, I have three modules, each module being a client, and these two clients are not getting connected.

@BertKleewein
Member

Please tell me if I understand this correctly:

You're calling IoTHubModuleClient.connect() and IoTHubModuleClient.connect() is raising an exception that you are able to catch from your client app.

But, your app only catches this exception if you use a deployment manifest to deploy. If you change your code by editing a live container based on an older build, but still on the same machine with the same install of IoTEdge, then IoTHubModuleClient.connect() does not fail.

Is this correct?

@nishad1092

@BertKleewein you are correct. This is exactly what I have been facing for two days.

@nishad1092

@BertKleewein, any idea what might be causing this issue?

@BertKleewein
Member

I'm at a loss. The error you're reporting (gaierror(-2, 'Name or service not known')) is an error from the underlying network stack saying that it can't get the address of the machine it's trying to connect to. Nothing that we changed should affect this.
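
For readers comparing the two errors seen in this thread, the sketch below shows one way to tell them apart in code. It assumes Linux with glibc, where `socket.EAI_AGAIN` is -3 ("Temporary failure in name resolution") and `socket.EAI_NONAME` is -2 ("Name or service not known"); the numeric values differ on other platforms, which is why the constants are used rather than the numbers. `classify_gaierror` is a hypothetical helper name.

```python
import socket


def classify_gaierror(err):
    """Map a socket.gaierror to a rough category based on its errno."""
    if err.errno == socket.EAI_AGAIN:
        # The resolver could not get an answer right now; retrying may help.
        return "temporary"
    if err.errno == socket.EAI_NONAME:
        # The resolver answered that the name is not known; retrying alone
        # is unlikely to help unless the network configuration changes.
        return "unknown-name"
    return "other"
```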

Since you're connecting to edgeHub, it means it can't find the address of the edgeHub machine. It might be worth trying to ping your edgeHub machine from inside the container to see if it can resolve.

(horton) bertk@homework:~$ hostname
homework
(horton) bertk@homework:~$ docker exec -it testMod /bin/sh
/wrapper # ping homework
PING homework (172.18.0.3): 56 data bytes
64 bytes from 172.18.0.3: seq=0 ttl=64 time=0.093 ms
64 bytes from 172.18.0.3: seq=1 ttl=64 time=0.060 ms

Since you're manually calling IoTHubModuleClient.connect and it's failing, another option is to sleep for a few seconds and try calling connect again. I don't necessarily like this option, but this might be easier than fixing your network configuration.
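
A minimal sketch of that sleep-and-retry idea follows. `connect_with_retry` is a hypothetical helper, not part of azure-iot-device; it takes any zero-argument callable, so with the synchronous client one would presumably pass the client's `connect` method, and the exception types worth retrying on are left to the caller. With the asynchronous client the same loop would need `await` and `asyncio.sleep` instead.

```python
import time


def connect_with_retry(connect, retry_on=(Exception,), attempts=5, delay=2.0,
                       sleep=time.sleep):
    """Call connect() until it succeeds or the attempts run out.

    connect   -- zero-argument callable, e.g. a client's connect method
    retry_on  -- tuple of exception types that should trigger another attempt
    """
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except retry_on as err:
            if attempt == attempts:
                raise  # give up and let the caller see the last error
            print(f"connect attempt {attempt} failed: {err!r}; retrying in {delay}s")
            sleep(delay)
            delay *= 2  # back off so a flaky resolver is not hammered
```

Doubling the delay is a design choice for this sketch; a fixed sleep, as suggested above, would work the same way with the `delay *= 2` line removed.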

@nishad1092

nishad1092 commented Jul 7, 2020

@BertKleewein Sure, let me check and I'll get back to you. But since my module has failed, I can't get inside the container. I'll try to get inside the container with my previous successful build.

@nishad1092

nishad1092 commented Jul 7, 2020

Hi @BertKleewein, I'm not able to ping the hostname of docker from inside the container, and I get the same error when I ping the hostname from outside the container: Name or service not known.
These are the outcomes:
1. I can ping my device IP from inside the container.
2. I can also ping my device IP from outside the container.
3. I can't ping the hostname from outside the container.
4. I can't ping the hostname from inside the container.
image

@nishad1092

I added a 30-second sleep before await module_client.connect(), but the same error showed up after that.

@nishad1092

nishad1092 commented Jul 7, 2020

Hi @BertKleewein ,

I had these particular parameters in my manifest file, which were causing these errors:
"createOptions": {
  "NetworkingConfig": {
    "EndpointsConfig": {
      "host": {}
    }
  },
  "HostConfig": {
    "NetworkMode": "host",

I saw this config in another Azure post, which is why I had it in the first place.

Now all modules are running fine. Thank you so much for your support

@BertKleewein
Member

@nishad1092 - that explains it. If the OS can't resolve the hostname to an IP address, then we won't be able to connect. It looks like you broke your network configuration, maybe with that change to your manifest, and it looks like this problem isn't related to the azure-iot-device python library at all.

@nishad1092

Hi @BertKleewein, sorry for the late reply. Yes, I thought it was related to the azure-iot-device Python library, hence I posted here.

It was working before with this manifest config, and I don't know what suddenly went wrong. Anyway, all good now. Thank you, Bert.

@az-iot-builder-01
Collaborator

@BertKleewein, @MiguelCHR, @mikechari, @dschenzer, @nishad1092, thank you for your contribution to our open-sourced project! Please help us improve by filling out this 2-minute customer satisfaction survey
