DNS issues #34

Closed
Dbzman opened this issue Oct 17, 2017 · 7 comments

Comments

@Dbzman

Dbzman commented Oct 17, 2017

Hey there,

I've been inspired by your work of using your ansible playbooks to provision a K8S cluster with 4 RPis. I tried to get a cluster up and running as well using your scripts. (with the example config and without wifi)

The problem is that I cannot reach other pods or external servers from within a pod. (wanted to put the gitlab runner on there)
Using nslookup kubernetes.default on hypriot/rpi-alpine:3.6 gives the following:

nslookup: can't resolve '(null)': Name does not resolve

Name:      kubernetes.default
Address 1: 10.96.0.1 kubernetes.default.svc.cluster.local

/etc/resolv.conf looks like this:

nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local routerbfb8b0.com
options ndots:5
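
For reference, the search and ndots lines above decide which names a pod actually sends to kube-dns. A minimal sketch of that expansion, assuming glibc-style stub resolver behaviour (musl on Alpine and busybox may differ); candidate_queries is a hypothetical helper written for illustration, not part of any tool used here:

```python
def candidate_queries(name, search, ndots):
    """Return the lookup order a glibc-style stub resolver would try.

    Assumption: names with fewer than `ndots` dots go through the search
    list first; names with at least `ndots` dots are tried as-is first.
    """
    if name.endswith("."):
        # A trailing dot marks the name as absolute: no search list at all.
        return [name.rstrip(".")]
    expanded = [f"{name}.{domain}" for domain in search]
    if name.count(".") >= ndots:
        return [name] + expanded
    return expanded + [name]


# Search list copied from the resolv.conf shown above.
SEARCH = [
    "default.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
    "routerbfb8b0.com",
]

for query in candidate_queries("kubernetes.default", SEARCH, ndots=5):
    print(query)
```

With ndots:5, a short external name such as www.heise.de also goes through every search domain before it is tried as-is, so the upstream nameserver sees several lookups for names that do not exist.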

I found out that there's a known issue with alpine up to version 3.3 but I don't use any of these old versions. I tried it with hypriot/rpi-alpine:3.6 and resin/rpi-raspbian:jessie and busybox.
https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#known-issues

I also tried an upgraded weave (2.0.5), but that did not help either. I couldn't try with flannel since your scripts are not 100% finished there. The kube-dns logs do not show any errors.
Do you have any suggestions? I don't know where else to look.

Thank you very much!

EDIT:
I found out that internal names can be resolved. So I assume kube-dns is basically working but I cannot get external names to be resolved.

EDIT 2:
Seems like I cannot access the internet at all with the following images:

  • hypriot/rpi-alpine:3.6
  • resin/rpi-raspbian:jessie

busybox seems to be the only image that works.
I can work around this "limitation" by specifying hostNetwork: true, but that is not a solution I'd prefer. The pod then gets the node IP and is able to go through my router. :/ Also, with that setting I can no longer resolve K8S-related services.
Any ideas how to get around this setting?

@Dbzman
Author

Dbzman commented Oct 20, 2017

Hey again,
I got it to work by downgrading Docker to 1.12.
Seems like I ran into the issue which is also described in the K8S 1.8 release notes:

Docker 1.13.1 and 17.03.2

Shared PID namespace, live-restore, and overlay2 were validated.

Known issues

The default iptables FORWARD policy was changed from ACCEPT to DROP, which causes outbound container traffic to stop working by default. See #40182 for the workaround.

The support for the v1 registries was removed.

The issue linked there, kubernetes/kubernetes#40182, looks like it should've been fixed already, since it's from January and the pod networks should've had enough time to work around it.
Looking further at that issue, I see that weave addressed this (or at least something related to it) in weaveworks/weave#2758, which was merged several days ago. Sadly it's not released yet.
What do you think? Should we downgrade Docker to 1.12 for now, wait for weave, or add iptables rules ourselves?
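
To check whether a node is affected by the FORWARD policy change quoted above, something like the following could be used. It is only a sketch: forward_policy and read_forward_rules are hypothetical helpers, and the SAMPLE text is illustrative rather than captured from a real node.

```python
import subprocess


def forward_policy(rules_text):
    """Extract the default policy from `iptables -S FORWARD` output."""
    for line in rules_text.splitlines():
        parts = line.split()
        # The policy line has the form: -P FORWARD <ACCEPT|DROP>
        if len(parts) >= 3 and parts[:2] == ["-P", "FORWARD"]:
            return parts[2]
    return None


def read_forward_rules():
    """Read the live rules; needs root on the node, so it is not called here."""
    result = subprocess.run(
        ["iptables", "-S", "FORWARD"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


# Illustrative sample only (assumption), not output from a real node.
SAMPLE = "-P FORWARD DROP\n-A FORWARD -j DOCKER-ISOLATION\n"
print(forward_policy(SAMPLE))
```

If this reports DROP on a node, outbound pod traffic is likely being dropped by default, which is the situation the release note describes.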

Also: Are you able to reproduce that? It can be reproduced simply by running one of the above-mentioned images on a worker node and calling "nslookup" with an external domain. It should then already return the wrong IP.

EDIT:
I noticed this strange DNS issue again after working with the cluster the whole day. Restarting the cluster solved it partly.

@rhuss
Collaborator

rhuss commented Oct 23, 2017

Hmm, for me DNS also seems to work with Docker 17.03.2 without changing iptables (and using weave).

I get e.g. this on rpi-alpine:3.6

k exec -it test ash
/ # nslookup www.heise.de
nslookup: can't resolve '(null)': Name does not resolve

Name:      www.heise.de
Address 1: 193.99.144.85 www.heise.de
Address 2: 2a02:2e0:3fe:1001:7777:772e:2:85 www.heise.de

/ # ping www.heise.de
PING www.heise.de (193.99.144.85): 56 data bytes
64 bytes from 193.99.144.85: seq=0 ttl=248 time=13.606 ms
64 bytes from 193.99.144.85: seq=1 ttl=248 time=34.437 ms

which looks good to me. I don't know where the "can't resolve '(null)'" message comes from.

I used this pod for testing:

apiVersion: v1
kind: Pod
metadata:
  name: test
spec:
  containers:
  - image: "hypriot/rpi-alpine:3.6"
    name: sleep
    args:
    - "sleep"
    - "3600"

@rhuss
Collaborator

rhuss commented Oct 23, 2017

Your PR for the flannel support is still pending, which includes updating the iptables rules.

I'm going to review the PR and try to move the iptables change up so that it is applied globally.

@rhuss
Collaborator

rhuss commented Oct 23, 2017

I just updated the scripts and added the iptables rules for accepting forwards to the top-level kubernetes role.

I tested it with flannel and it worked without issues:

  • I could access other services internally by name
  • I could resolve external names via ping

I suspect you might have a different setup and might not have proper NAT routing set up on your desktop, so that the cluster node can reach the outside (and return packets are routed properly back to the node). For Mac OS X I had to set up the proper forwarding rules with https://github.com/Project31/ansible-kubernetes-openshift-pi3/blob/master/tools/setup_nat_on_osx.sh

@Dbzman
Author

Dbzman commented Oct 24, 2017

Thanks for checking!
I think I found the issue. A colleague pointed me to it yesterday.

So, what I experienced so far:

  • DNS sometimes suddenly stops working for most of my containers (I only get "62.138.239.45" for many hosts)
  • This IP address is the DNS error page of Telekom
  • Restarting the cluster helps solve the problem

I thought that DNS did not work because I always got that IP back. In reality it worked properly but my internet provider returned that specific IP on failed DNS requests.

What happens in Kubernetes here is:

  • A container queries kube-dns to resolve a DNS name
  • kube-dns calls the nameserver and gets the above-mentioned IP back (instead of a real error)
  • It's a valid IP, so kube-dns caches this IP in dnsmasq
  • This wrong IP is now registered for a very long time for the requested host name
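
A quick way to tell whether the upstream resolver rewrites failed lookups like this is to resolve a name that should not exist and see whether an address comes back. A minimal sketch, assuming a random label under .com is unregistered; probe_name and hijacked_address are hypothetical helpers, and the two stub resolvers only stand in for a real lookup so the example runs without network access:

```python
import socket
import uuid


def probe_name(suffix="com"):
    """Build a random hostname that is assumed not to exist."""
    return f"nx-{uuid.uuid4().hex}.{suffix}"


def hijacked_address(resolve=socket.gethostbyname, name=None):
    """Return the address given for a nonexistent name, or None.

    None means the lookup failed as it should. Any address means the
    resolver answered a failed lookup with a substitute IP.
    """
    name = name or probe_name()
    try:
        return resolve(name)
    except (socket.gaierror, socket.herror):
        return None


def redirecting_resolver(name):
    """Stub that mimics a provider answering every failure with one IP."""
    return "62.138.239.45"


def honest_resolver(name):
    """Stub that mimics a resolver reporting the failure properly."""
    raise socket.gaierror(-2, "Name or service not known")


print(hijacked_address(redirecting_resolver))
print(hijacked_address(honest_resolver))
```

Calling hijacked_address() with no arguments uses the system resolver, so it can be run both on a node and inside a pod to compare results.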

https://www.heise.de/newsticker/meldung/Telekom-leitet-DNS-Fehlermeldungen-um-213726.html
Luckily this behaviour can be turned off.
I did that and it works pretty well so I'll upgrade docker again.

Not sure if we really need those global iptables rules. It looks like my issues were completely related to the behaviour above.

Maybe we should mention somewhere that some providers can cause DNS problems since people will probably use this RaspberryPi setup also at home.

Also: Was it intentional that you switched the default to flannel again? ^^

One last thing: Thank you so much for these scripts. It's so convenient and easy to provision a RaspberryPi cluster with that. You did an awesome job here! :)

@Dbzman Dbzman closed this as completed Oct 24, 2017
@rhuss
Collaborator

rhuss commented Oct 24, 2017

Thanks for the compliments, glad you like the playbooks ;-) I hope I can continue to work on them soon, to add an ingress controller (traefik) and rook for persistent volumes.

wrt the CNI plugin, I have no strong bias, so occasionally I switch back and forth ;-). In this case, I want to see whether flannel runs stably, too. Let's keep the iptables forward rules top-level; I don't think they do any harm (and they are required for flannel anyway).

BTW, some time ago I had the very same issue with the Deutsche Telekom DNS server. IMO this behaviour is totally bogus, and since DNS issues are always hard to debug, this makes it even harder. I will add a FAQ section to the README and add this information.

thanks ...

@Dbzman
Author

Dbzman commented Oct 24, 2017

I got ingress working on my setup already. Maybe I'll create a pull request in the next few days, since I want to reprovision anyway for the Docker update. :)
Persistent volumes would be really nice. Currently I mount the host path which works but is not ideal.

Okay, I don't have a preference for the CNI either. ;) At least it's always good to have the option to switch if one of them has certain bugs or so.

I'm really happy that I got this solved now. I also think that behaviour is totally wrong, as they completely violate the DNS protocol with that. Interesting that you had that issue as well before. :D

Thank you again!
