Missing endpoints in localtargets.* A records #62

Closed
ytsarev opened this issue Mar 13, 2020 · 2 comments · Fixed by #68
Comments

ytsarev commented Mar 13, 2020

Steps to reproduce

  • Deploy the local setup with two cross-communicating ohmyglb instances
    $ make deploy-full-local-setup
  • Check the generated localtargets.* dnsendpoint spec:
$ kubectl -n test-gslb get dnsendpoints test-gslb -o yaml
...
spec:
  endpoints:
  - dnsName: localtargets.app3.cloud.example.com
    recordTTL: 30
    recordType: A
    targets:
    - 172.17.0.2
    - 172.17.0.3
    - 172.17.0.4
...
  • Check that coredns returns matching A records:
$ dig +short @localhost localtargets.app3.cloud.example.com
172.17.0.2
172.17.0.4
172.17.0.3

This is the expected result. After some time, however, localtargets.* can 'lose' one of the records in the following way:

  • The localtargets.* dnsendpoint spec remains consistent:
$ kubectl -n test-gslb get dnsendpoints test-gslb -o yaml
...
spec:
  endpoints:
  - dnsName: localtargets.app3.cloud.example.com
    recordTTL: 30
    recordType: A
    targets:
    - 172.17.0.2
    - 172.17.0.3
    - 172.17.0.4
...
  • Meanwhile the actual DNS response can lose one of the A records:
$ dig +short @localhost localtargets.app3.cloud.example.com
172.17.0.2
172.17.0.4

The issue is not deterministic in its behaviour, but we have faced it several times over multiple deployments.
In a 2-cluster setup only a single cluster is affected, which effectively means coredns exposes only 5 out of 6 k8s workers.
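Since the failure is intermittent, a polling loop along these lines (record name and expected count are taken from the reproduction above) can catch the moment a record drops:

# 3 = number of worker IPs this record should resolve to (see the spec above)
$ while true; do
    count=$(dig +short @localhost localtargets.app3.cloud.example.com | wc -l)
    if [ "$count" -lt 3 ]; then
      echo "degraded at $(date): only $count A records"
      dig +short @localhost localtargets.app3.cloud.example.com
      break
    fi
    sleep 5
  done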

DNSEndpoint CR generation always looks correct, so the problem is somewhere in the coredns etcd backend area.

`make debug-test-etcd` can help with debugging this issue at runtime.
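For a quick one-shot check, comparing the targets declared in the CR against the answer coredns actually serves makes any divergence obvious (resource and record names are the ones used above):

# targets recorded in the DNSEndpoint CR
$ kubectl -n test-gslb get dnsendpoints test-gslb \
    -o jsonpath='{.spec.endpoints[?(@.dnsName=="localtargets.app3.cloud.example.com")].targets}'
# A records coredns actually answers with
$ dig +short @localhost localtargets.app3.cloud.example.com | sort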

@ytsarev ytsarev added this to the 0.6 milestone Mar 13, 2020
@ytsarev ytsarev added the bug Something isn't working label Mar 14, 2020
ytsarev added a commit that referenced this issue Mar 15, 2020
* While documenting the reproduction steps for #62 I realized how
  painful the workflow is: many manual or semi-automated steps,
  all very error-prone
* This PR introduces a fully automated setup of 2 kind clusters
  with cross-communicating ohmyglb deployments on top
* Test app deployment included
* Make targets are still granular, so we can deploy the thing
  step by step if we want
* Useful for reproducing complex issues like #62 and for overall
  local e2e testing of features that involve cross-cluster communication
* In the future it can be reused in e2e pipelines for environment creation
* All with the single command `$ make deploy-full-local-setup`
donovanmuller pushed a commit that referenced this issue Mar 16, 2020
ytsarev commented Mar 18, 2020

An example of the degraded localtargets state from within etcd:

/ # etcdctl get '' --from-key
/skydns/com/example/cloud/failover/16fccd7b
{"host":"172.17.0.5","ttl":30,"targetstrip":1}
/skydns/com/example/cloud/failover/24f6061f
{"host":"172.17.0.3","ttl":30,"targetstrip":1}
/skydns/com/example/cloud/failover/6c871674
{"host":"172.17.0.5","ttl":30,"targetstrip":1}
/skydns/com/example/cloud/failover/localtargets/13ee1210
{"host":"172.17.0.2","ttl":30,"targetstrip":1}
/skydns/com/example/cloud/failover/localtargets/2485c7ba
{"host":"172.17.0.5","ttl":30,"targetstrip":1}
/skydns/com/example/cloud/failover/localtargets/315f5958
{"host":"172.17.0.5","ttl":30,"targetstrip":1}

Notice the unexpected duplication of 172.17.0.5.
At the same time the related DNSEndpoint is fully OK:

$ kubectl -n test-gslb get dnsendpoints -o yaml
....
    - dnsName: failover.cloud.example.com
      recordTTL: 30
      recordType: A
      targets:
      - 172.17.0.2
      - 172.17.0.3
      - 172.17.0.5

So the problem seems to be fully isolated to the upstream external-dns + etcd components: coredns reads whatever it has in the etcd backend, so it is behaving correctly.
Digging into the crd_source -> external-dns -> etcd propagation code ...
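To quantify the corruption straight from etcd, counting how many keys point at each host under the record prefix makes duplicates and missing IPs obvious (prefix matches the dump above):

/ # etcdctl get --prefix /skydns/com/example/cloud/failover/ --print-value-only \
      | grep -o '"host":"[^"]*"' | sort | uniq -c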

ytsarev commented Mar 19, 2020

Ran `$ etcdctl del '' --from-key`, effectively wiping out the local etcd database.

Then gslb reconciliation kicked in.

External-dns logs during the event:

time="2020-03-19T21:32:52Z" level=info msg="Add/set key /skydns/com/example/cloud/failover/4088e0a8 to Host=172.17.0.2, Text=, TTL=30"
time="2020-03-19T21:32:52Z" level=info msg="Add/set key /skydns/com/example/cloud/failover/461535c9 to Host=172.17.0.3, Text=, TTL=30"
time="2020-03-19T21:32:52Z" level=info msg="Add/set key /skydns/com/example/cloud/failover/68ec81be to Host=172.17.0.5, Text=, TTL=30"
time="2020-03-19T21:32:52Z" level=info msg="Add/set key /skydns/com/example/cloud/failover/localtargets/57c491cc to Host=172.17.0.2, Text=, TTL=30"
time="2020-03-19T21:32:52Z" level=info msg="Add/set key /skydns/com/example/cloud/failover/localtargets/5afed140 to Host=172.17.0.3, Text=, TTL=30"
time="2020-03-19T21:32:52Z" level=info msg="Add/set key /skydns/com/example/cloud/failover/localtargets/6a0a107c to Host=172.17.0.5, Text=, TTL=30"
time="2020-03-19T21:33:52Z" level=info msg="Add/set key /skydns/com/example/cloud/failover/68ec81be to Host=172.17.0.2, Text=, TTL=30"
time="2020-03-19T21:33:52Z" level=info msg="Add/set key /skydns/com/example/cloud/failover/68ec81be to Host=172.17.0.3, Text=, TTL=30"
time="2020-03-19T21:33:52Z" level=info msg="Add/set key /skydns/com/example/cloud/failover/68ec81be to Host=172.17.0.5, Text=, TTL=30"
time="2020-03-19T21:33:52Z" level=info msg="Add/set key /skydns/com/example/cloud/failover/localtargets/6a0a107c to Host=172.17.0.2, Text=, TTL=30"
time="2020-03-19T21:33:52Z" level=info msg="Add/set key /skydns/com/example/cloud/failover/localtargets/6a0a107c to Host=172.17.0.3, Text=, TTL=30"
time="2020-03-19T21:33:52Z" level=info msg="Add/set key /skydns/com/example/cloud/failover/localtargets/6a0a107c to Host=172.17.0.5, Text=, TTL=30"

The first chunk of the log looks totally correct: each entry is written to a different path, with a different hash at the end. In the second chunk, however, we observe the weird behaviour: different values end up at the same hash, so each write overwrites the previous one.
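One way to spot the key reuse directly is to count how many times each etcd key appears in the external-dns "Add/set key" log lines (namespace and deployment name below are illustrative, adjust to the actual install):

# keys with a count > 1 mean several targets share one key and clobber each other
$ kubectl -n ohmyglb logs deploy/external-dns \
    | grep -o 'Add/set key [^ ]*' | awk '{print $3}' | sort | uniq -c | sort -rn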
Surprisingly, after the wipe-out and reconcile it did fix itself:

# etcdctl get '' --from-key
/skydns/com/example/cloud/failover/4088e0a8
{"host":"172.17.0.2","ttl":30,"targetstrip":1}
/skydns/com/example/cloud/failover/461535c9
{"host":"172.17.0.3","ttl":30,"targetstrip":1}
/skydns/com/example/cloud/failover/68ec81be
{"host":"172.17.0.5","ttl":30,"targetstrip":1}
/skydns/com/example/cloud/failover/localtargets/57c491cc
{"host":"172.17.0.2","ttl":30,"targetstrip":1}
/skydns/com/example/cloud/failover/localtargets/5afed140
{"host":"172.17.0.3","ttl":30,"targetstrip":1}
/skydns/com/example/cloud/failover/localtargets/6a0a107c
{"host":"172.17.0.5","ttl":30,"targetstrip":1}

No dups.
Something really weird is happening here.
