One of containers intermittently can NOT access another container in same overlay network #1946

zxkane · 2017-09-20T08:52:20Z

I'm using Docker Swarm (17.06 CE) to orchestrate my microservices. The swarm cluster has 3 managers and 1 worker.

I have an Nginx service running globally on the swarm managers (3 container instances). I also have a Java-based microservice with 2 replicas in the same overlay network.

I found that one of the Nginx containers can NOT access the microservice. The other two Nginx containers can access it without problems.

### there are three nginx containers in swarm  
➜  ~ docker service ps pilipa-prod-nginx 
ID                  NAME                             IMAGE                                             NODE                DESIRED STATE       CURRENT STATE           ERROR               PORTS 
qufld0uu8tk9        pilipa-prod-nginx.4r2p0t892qn55n4uewoymxbp0   registry.i-counting.cn/pilipa/prod/nginx:latest   node02              Running             Running 21 hours ago 
bwjw9c9dm8e1        pilipa-prod-nginx.ixw4urfkdcnkm326vgkw92x8n   registry.i-counting.cn/pilipa/prod/nginx:latest   node01              Running             Running 21 hours ago 
2w2gg83xt6g4        pilipa-prod-nginx.5t63dl8dcj603iyw5l5vv0xvx   registry.i-counting.cn/pilipa/prod/nginx:latest   node03              Running             Running 21 hours ago

### log in to the working Nginx container; it can access the microservice without problems
➜  ~ docker exec --interactive --tty pilipa-prod-nginx.4r2p0t892qn55n4uewoymxbp0.qufld0uu8tk9ieubcimed8fgw sh
/ # ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
10901: eth0@if10902: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1450 qdisc noqueue state UP
    link/ether 02:42:0a:00:00:2c brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.44/24 scope global eth0
       valid_lft forever preferred_lft forever
    inet 10.0.0.11/32 scope global eth0
       valid_lft forever preferred_lft forever
10903: eth1@if10904: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue state UP
    link/ether 02:42:ac:13:00:09 brd ff:ff:ff:ff:ff:ff
    inet 172.19.0.9/16 scope global eth1
       valid_lft forever preferred_lft forever
/ # wget 10.0.0.71:8080
Connecting to 10.0.0.71:8080 (10.0.0.71:8080)
wget: server returned error: HTTP/1.1 401 Unauthorized

### log in to the problematic Nginx container, which can ping the microservice host but can NOT access the service
➜  ~ docker exec --interactive --tty pilipa-prod-nginx.ixw4urfkdcnkm326vgkw92x8n.bwjw9c9dm8e1qlx64z5sniw7h sh
/ #
/ #
/ # wget 10.0.0.71:8080
Connecting to 10.0.0.71:8080 (10.0.0.71:8080)
wget: can't connect to remote host (10.0.0.71): Connection refused
/ # ping 10.0.0.71
PING 10.0.0.71 (10.0.0.71): 56 data bytes
64 bytes from 10.0.0.71: seq=0 ttl=64 time=0.066 ms
64 bytes from 10.0.0.71: seq=1 ttl=64 time=0.076 ms
64 bytes from 10.0.0.71: seq=2 ttl=64 time=0.073 ms
^C
--- 10.0.0.71 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.066/0.071/0.076 ms

I also used tcpdump to capture traffic in the microservice container. Traffic from the working Nginx container was captured for both ping 10.0.0.71 and wget 10.0.0.71:8080. However, no traffic from the problematic Nginx container was captured for either ping or wget!

The issue is intermittent. It may go away after the cluster has been running for a few hours, and then come back a few hours later!
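Because the failure comes and goes, timestamps of when the TCP connect result changes may help correlate it with other events in the cluster. Below is a minimal sketch of such a probe, not part of the original report. It assumes Python 3 is available in the affected container (the Nginx image here may not ship it), and the names probe_tcp and watch are hypothetical helpers introduced only for illustration.

```python
import errno
import socket
import time


def probe_tcp(host, port, timeout=3.0):
    """Attempt a single TCP connect and classify the outcome as a string."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # Something answered with a reset, as in the wget output above.
        return "refused"
    except socket.timeout:
        # No answer at all within the timeout.
        return "timeout"
    except OSError as exc:
        # Any other socket error, labelled by its errno name when known.
        return "error:" + errno.errorcode.get(exc.errno, str(exc.errno))


def watch(host, port, interval=10.0, rounds=3, timeout=3.0, sleep=time.sleep):
    """Probe repeatedly; yield (unix_time, state) each time the state changes."""
    last = None
    for i in range(rounds):
        state = probe_tcp(host, port, timeout=timeout)
        if state != last:
            yield (time.time(), state)
            last = state
        if i < rounds - 1:
            sleep(interval)
```

For example, iterating over watch("10.0.0.71", 8080, interval=10.0, rounds=360) from inside the problematic container would report state changes over roughly an hour. The sketch only records what the client sees; it does not show where the connection is being refused.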

Below is the gist with the output of running support.sh:
support-output.txt

Let me know if you need more information for analyzing it.


Nossnevs · 2017-09-29T10:06:04Z

Might be the same issue as moby/moby#32195 or #1934.
Hopefully those will be fixed by #1935.

fcrisciani · 2017-10-21T16:36:43Z

@zxkane please try upgrading to 17.10; there were several fixes in the overlay area.

fcrisciani · 2017-10-21T16:44:41Z

Closing this one for now; it sounds like #1934. Feel free to reopen it if you see the same behavior on 17.10.

fcrisciani closed this as completed Oct 21, 2017
