
DNS resolution of memcached service fails #1591

Closed
dilshad18 opened this issue Dec 7, 2018 · 6 comments

@dilshad18

The following error is happening in a Flux installation inside a minikube instance:

component=memcached err="error updating memcache servers: lookup 172-17-0-3.flux-memcached.flux.svc.cluster.local. on 10.96.0.10:53: no such host"

This is a rather fresh installation and only a few files have been applied. We update a single file, Flux tries to apply that update, and it fails.

@squaremo
Member

It looks like it either can't reach the DNS server, or doesn't get a result back. The hostname there is a bit odd -- should it include the hyphenated IP address like that? I would expect just flux-memcached.flux.svc.cluster.local.

@squaremo changed the title from "Getting following error" to "DNS resolution of memcached service fails" on Dec 11, 2018
@johnraz
Contributor

johnraz commented Dec 13, 2018

For what it's worth, I experience the exact same behavior on minikube with Flux deployed via the Helm chart.

kubectl version
Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.3", GitCommit:"2bba0127d85d5a46ab4b778548be28623b32d0b0", GitTreeState:"clean", BuildDate:"2018-05-21T09:17:39Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.7", GitCommit:"0c38c362511b20a098d7cd855f1314dad92c2780", GitTreeState:"clean", BuildDate:"2018-08-20T09:56:31Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

Flux component versions:

quay.io/weaveworks/flux:1.8.1
quay.io/weaveworks/helm-operator:0.5.1
memcached:1.4.25

All pods are green and running.

The logs:

ts=2018-12-13T16:16:24.981315793Z caller=memcached.go:153 component=memcached err="error updating memcache servers: lookup 3735393134393938.flux-memcached.flux.svc.cluster.local. on 10.96.0.10:53: no such host"
ts=2018-12-13T16:17:24.981598994Z caller=memcached.go:153 component=memcached err="error updating memcache servers: lookup 172-17-0-16.flux-memcached.flux.svc.cluster.local. on 10.96.0.10:53: no such host"

I'll try to read a bit more about how the service discovery works in memcache and see if I can come up with an explanation...

@johnraz
Contributor

johnraz commented Dec 13, 2018

I can reach the memcache container from the fluxd container by hitting the hostname passed to fluxd:

ps aux gives (truncated by me):

fluxd --ssh-keygen-dir=/var/fluxd/keygen --k8s-secret-name=flux-git-deploy --memcached-hostname=flux-memcached ...

Telnet session from the fluxd container gives (again truncated by me):

/home/flux # telnet flux-memcached 11211
stats
STAT pid 1
STAT uptime 63144
STAT time 1544719316
...

@johnraz
Contributor

johnraz commented Dec 13, 2018

Digging some more shows that the cache seems to be used:

From a memcache telnet session, dumping an item gives:

stats cachedump 32 0
ITEM registryrepov3|quay.io/weaveworks/helm-operator [98187 b; 1545931042 s]
END

So I would say the memcache server list is properly provisioned with the valid hostname and some "ghost" hostnames are trying to get in and are rejected because they can't resolve...

It most likely fails here:
https://github.com/weaveworks/flux/blob/113e1280a27a4cec80465d1d0d0c69b696839f80/registry/cache/memcached/memcached.go#L164

Or there:
https://github.com/bradfitz/gomemcache/blob/1952afaa557dc08e8e0d89eafab110fb501c1a2b/memcache/selector.go#L59-L90

How they get there is a mystery to me so far...

@squaremo do you have any clue how those weird records could get there?

Should we add a note to the FAQ to let people know that this is "ok" and doesn't break the cache?

@dilshad18 could you check in your own setup whether you have something in memcache? (If you are not used to memcache, I followed this blog post and it helped me get used to it.)

@squaremo
Member

So I would say the memcache server list is properly provisioned with the valid hostname and some "ghost" hostnames are trying to get in and are rejected because they can't resolve...

Yes, that sounds like a good diagnosis. Filling in some details, after looking in the Kubernetes docs re service discovery:

The way it's set up in the example deployment (and chart, and Weave Cloud config, ...) is that memcached has a headless service. According to https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#services (also see https://github.com/kubernetes/dns/blob/master/docs/specification.md), in DNS there will be:

  • An A record for memcached.namespace.svc.cluster.local with the IP of each ready pod;
  • An A record for <pod-hostname>.memcached.namespace.svc.cluster.local for each pod, where <pod-hostname> is a generated name;
  • An SRV record, with the auto-generated hostname and the port, for each pod in the service.

With the arguments given in the example deployment of fluxd, it'll query for the SRV records, then the memcache client code (linked above) will query the IP of each host mentioned. So where it's failing is in that second bit -- it can't resolve some or all of the hosts it got from the SRV records.
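
A rough Go sketch of that two-step lookup (the service and port names are assumptions based on this thread; this is not the actual fluxd code):

package main

import (
	"fmt"
	"net"
)

func main() {
	// Step 1: SRV lookup against the headless service. The service/port
	// names here are assumptions; this resolves something like
	// _memcached._tcp.flux-memcached.flux.svc.cluster.local.
	_, srvs, err := net.LookupSRV("memcached", "tcp", "flux-memcached.flux.svc.cluster.local")
	if err != nil {
		fmt.Println("SRV lookup failed:", err)
		return
	}

	// Step 2: resolve each per-pod target from the SRV records to an IP.
	// An SRV record whose A record has already gone away fails here with
	// "no such host" -- the error seen in the logs above.
	for _, srv := range srvs {
		ips, err := net.LookupHost(srv.Target)
		if err != nil {
			fmt.Printf("lookup %s failed: %v\n", srv.Target, err)
			continue
		}
		fmt.Printf("%s:%d -> %v\n", srv.Target, srv.Port, ips)
	}
}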

But the question remains: how did those unresolvable endpoints get there in the first place? That I don't know :-(

@squaremo
Member

BTW it is entirely fine to give the memcached service a clusterIP (i.e., don't set it to None), and not supply the memcached-service argument -- this will make it just resolve the service address, rather than go through SRV records etc.
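
In that case the client only needs a plain A-record lookup of the service name, roughly (service name assumed from this thread; illustrative Go, not the actual client code):

package main

import (
	"fmt"
	"net"
)

func main() {
	// With a normal (non-headless) service there is a single A record for
	// the service name, pointing at the cluster IP; kube-proxy then load
	// balances to the memcached pods, so per-pod hostnames never come up.
	ips, err := net.LookupHost("flux-memcached.flux.svc.cluster.local")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Println("flux-memcached resolves to:", ips)
}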
