This repository has been archived by the owner on Jul 28, 2021. It is now read-only.

Intermittent DNS failures when running Alpine containers in user-defined docker-compose network #303

Open
Iristyle opened this issue May 3, 2019 · 0 comments

Iristyle commented May 3, 2019

This is a cross-post from moby/libnetwork#2371 as I don't know where the bug lies.

In my environment, I am able to reproduce DNS resolution failures minimally with the following compose file when running LCOW.

version: '3'

services:
  foo:
    image: alpine:latest
    dns_search: internal
    entrypoint: sh -c "while true; do nslookup bar.internal && sleep 1s; done"
    networks:
      default:
        aliases:
         - foo.internal

  bar:
    image: alpine:latest
    dns_search: internal
    entrypoint: sh -c "while true; do nslookup foo.internal && sleep 1s; done"
    networks:
      default:
        aliases:
         - bar.internal

Running docker-compose up yields output like the following. Failures such as bar_1 | nslookup: can't resolve 'foo.internal': Name does not resolve and foo_1 | nslookup: can't resolve 'bar.internal': Name does not resolve are mixed in with successful resolutions:

PS C:\source\alpine-test> docker-compose -f .\docker-compose-bad.yml up
Creating network "alpine-test_default" with the default driver
Creating alpine-test_bar_1 ... done
Creating alpine-test_foo_1 ... done
Attaching to alpine-test_foo_1, alpine-test_bar_1
foo_1  |
foo_1  | nslookup: can't resolve '(null)': Name does not resolve
foo_1  | nslookup: can't resolve 'bar.internal': Name does not resolve
bar_1  |
bar_1  | nslookup: can't resolve '(null)': Name does not resolve
bar_1  | Name:      foo.internal
bar_1  | Address 1: 172.18.67.25 alpine-test_foo_1.alpine-test_default
foo_1  | nslookup: can't resolve '(null)': Name does not resolve
foo_1  |
foo_1  | Name:      bar.internal
foo_1  | Address 1: 172.18.76.19
bar_1  | nslookup: can't resolve '(null)': Name does not resolve
bar_1  |
bar_1  | Name:      foo.internal
bar_1  | Address 1: 172.18.67.25 alpine-test_foo_1.alpine-test_default
foo_1  | nslookup: can't resolve '(null)': Name does not resolve
foo_1  |
foo_1  | Name:      bar.internal
foo_1  | Address 1: 172.18.76.19 alpine-test_bar_1.alpine-test_default
bar_1  | nslookup: can't resolve '(null)': Name does not resolve
bar_1  |
bar_1  | Name:      foo.internal
bar_1  | Address 1: 172.18.67.25 alpine-test_foo_1.alpine-test_default
foo_1  |
foo_1  | nslookup: can't resolve '(null)': Name does not resolve
foo_1  | Name:      bar.internal
foo_1  | Address 1: 172.18.76.19 alpine-test_bar_1.alpine-test_default
bar_1  | nslookup: can't resolve '(null)': Name does not resolve
bar_1  |
bar_1  | nslookup: can't resolve 'foo.internal': Name does not resolve
foo_1  |
foo_1  | nslookup: can't resolve '(null)': Name does not resolve
foo_1  | Name:      bar.internal
foo_1  | Address 1: 172.18.76.19 alpine-test_bar_1.alpine-test_default
bar_1  | nslookup: can't resolve '(null)': Name does not resolve
bar_1  |
bar_1  | Name:      foo.internal
bar_1  | Address 1: 172.18.67.25 alpine-test_foo_1.alpine-test_default
foo_1  | nslookup: can't resolve '(null)': Name does not resolve
foo_1  |
foo_1  | Name:      bar.internal
foo_1  | Address 1: 172.18.76.19 alpine-test_bar_1.alpine-test_default
bar_1  |
bar_1  | nslookup: can't resolve '(null)': Name does not resolve
bar_1  | Name:      foo.internal
bar_1  | Address 1: 172.18.67.25 alpine-test_foo_1.alpine-test_default
foo_1  |
foo_1  | nslookup: can't resolve '(null)': Name does not resolve
foo_1  | Name:      bar.internal
foo_1  | Address 1: 172.18.76.19 alpine-test_bar_1.alpine-test_default
bar_1  |
bar_1  | nslookup: can't resolve '(null)': Name does not resolve
bar_1  | Name:      foo.internal
bar_1  | Address 1: 172.18.67.25
foo_1  | nslookup: can't resolve '(null)': Name does not resolve
foo_1  |
foo_1  | Name:      bar.internal
foo_1  | Address 1: 172.18.76.19 alpine-test_bar_1.alpine-test_default
bar_1  | nslookup: can't resolve '(null)': Name does not resolve
bar_1  | nslookup: can't resolve 'foo.internal': Name does not resolve
bar_1  |
foo_1  |
foo_1  | nslookup: can't resolve '(null)': Name does not resolve
foo_1  | nslookup: can't resolve 'bar.internal': Name does not resolve
bar_1  | nslookup: can't resolve '(null)': Name does not resolve
bar_1  |
bar_1  | Name:      foo.internal
bar_1  | Address 1: 172.18.67.25 alpine-test_foo_1.alpine-test_default
foo_1  | nslookup: can't resolve '(null)': Name does not resolve
foo_1  |
foo_1  | Name:      bar.internal
foo_1  | Address 1: 172.18.76.19 alpine-test_bar_1.alpine-test_default
bar_1  |
bar_1  | nslookup: can't resolve '(null)': Name does not resolve
bar_1  | Name:      foo.internal
bar_1  | Address 1: 172.18.67.25 alpine-test_foo_1.alpine-test_default
foo_1  | nslookup: can't resolve '(null)': Name does not resolve
foo_1  | nslookup: can't resolve 'bar.internal': Name does not resolve
foo_1  |
bar_1  |
bar_1  | nslookup: can't resolve '(null)': Name does not resolve
bar_1  | Name:      foo.internal
bar_1  | Address 1: 172.18.67.25 alpine-test_foo_1.alpine-test_default
foo_1  |
foo_1  | nslookup: can't resolve '(null)': Name does not resolve
foo_1  | nslookup: can't resolve 'bar.internal': Name does not resolve
bar_1  |
bar_1  | nslookup: can't resolve '(null)': Name does not resolve
bar_1  | Name:      foo.internal
bar_1  | Address 1: 172.18.67.25
foo_1  |
foo_1  | nslookup: can't resolve '(null)': Name does not resolve
foo_1  | Name:      bar.internal
foo_1  | Address 1: 172.18.76.19 alpine-test_bar_1.alpine-test_default
bar_1  |
bar_1  | Name:      foo.internal
bar_1  | Address 1: 172.18.67.25
bar_1  | nslookup: can't resolve '(null)': Name does not resolve
Gracefully stopping... (press Ctrl+C again to force)
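
To put a number on how often lookups fail in output like the above, the log can be tallied with a small filter. This is only a sketch: `tally` is a hypothetical helper name, it assumes the log format shown above, it skips the '(null)' lines because they appear on every lookup, and it counts each "Name:" line as one successful resolution.

```shell
#!/bin/sh
# Hypothetical helper: summarize nslookup results in docker-compose output.
# Reads the log on stdin and prints one line: ok=<count> fail=<count>.
tally() {
    awk '
        # The "(null)" lines appear on every lookup, so they are skipped.
        /can.t resolve .\(null\)/ { next }
        # Any other "can.t resolve" line is a failed lookup of a real name.
        /can.t resolve/ { fail++ }
        # A "Name:" line marks a successful lookup.
        /Name:/ { ok++ }
        END { printf "ok=%d fail=%d\n", ok, fail }
    '
}
```

One way to use it, assuming the stack has already run for a while, is `docker-compose logs --no-color | tally`.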

I can run this compose stack on OSX and it does not fail. If I switch from the Alpine container to an Ubuntu container, the resolutions don't fail either.

I can at least work around the problem somewhat by modifying the compose file to first perform a dig against the host, like this:

version: '3'

services:
  foo:
    image: alpine:latest
    dns_search: internal
    entrypoint: sh -c "apk add bind-tools; dig bar.internal; while true; do nslookup bar.internal; sleep 2s; done"
    networks:
      default:
        aliases:
         - foo.internal

  bar:
    image: alpine:latest
    dns_search: internal
    entrypoint: sh -c "apk add bind-tools; dig foo.internal; while true; do nslookup foo.internal; sleep 2s; done"
    networks:
      default:
        aliases:
         - bar.internal

The nslookup: can't resolve '(null)': Name does not resolve in the original case is reported to be unnecessary per gliderlabs/docker-alpine#476 (comment), but after performing a dig that message changes and resolutions look like:

bar_1  | Server:                172.25.128.1
bar_1  | Address:       172.25.128.1#53
bar_1  |
bar_1  | Non-authoritative answer:
bar_1  | Name:  foo.internal
bar_1  | Address: 172.25.139.149
bar_1  |

My host is as follows:

Client:
 Debug Mode: false
 Plugins:
  app: Docker Application (Docker Inc., v0.8.0-beta2)
  buildx: Build with BuildKit (Docker Inc., v0.2.0-6-g509c4b6-tp)

Server:
 Containers: 2
  Running: 0
  Paused: 0
  Stopped: 2
 Images: 138
 Server Version: master-dockerproject-2019-04-28
 Storage Driver: windowsfilter (windows) lcow (linux)
  Windows:
  LCOW:
 Logging Driver: json-file
 Plugins:
  Volume: local
  Network: ics l2bridge l2tunnel nat null overlay transparent
  Log: awslogs etwlogs fluentd gcplogs gelf json-file local logentries splunk syslog
 Swarm: inactive
 Default Isolation: hyperv
 Kernel Version: 10.0 17763 (17763.1.amd64fre.rs5_release.180914-1434)
 Operating System: Windows 10 Enterprise Version 1809 (OS Build 17763.437)
 OSType: windows
 Architecture: x86_64
 CPUs: 2
 Total Memory: 16GiB
 Name: ci-lcow-prod-1
 ID: 0ac02c9d-aaba-42f4-8749-5a64af3068d8
 Docker Root Dir: C:\ProgramData\docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: true
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

The LCOW image is built from linuxkit/lcow@d5dfdbc - I tried the latest merged PR, but it didn't launch containers and I had to revert (more info in linuxkit/lcow#45 (comment))

There are some further details in the original issue I filed at moby/libnetwork#2371

underscorgan pushed a commit to underscorgan/pupperware that referenced this issue May 3, 2019
When testing with the `puppet/puppet-agent-alpine` image on windows
systems with LCOW we had intermittent failures in DNS resolution that
occurred fairly regularly. It seems to be specifically interaction
between the base alpine (3.8 and 3.9) images with windows/LCOW.

Two issues related to this problem are
moby/libnetwork#2371 and
microsoft/opengcs#303
underscorgan pushed a commit to underscorgan/puppetserver that referenced this issue May 3, 2019
This updates the puppetserver tests to use `docker-compose` instead of
`docker-run`. This also updates the tests to use the shared testing
gem from github.com/puppetlabs/pupperware.

This also includes a move from the puppet-agent-alpine to
puppet-agent-ubuntu for testing. We were seeing a lot of intermittent
network failures with the alpine container on windows (LCOW).
moby/libnetwork#2371 and
microsoft/opengcs#303 have more information on
this issue. This should hopefully clear up the intermittent name
resolution failures we were seeing.
Iristyle added a commit to Iristyle/pupperware that referenced this issue May 3, 2019
 - Remove the domain introspection / setting of AZURE_DOMAIN env var
   as this does not work as originally thought.

   Instead, hardcode the DNS suffix `.internal` to each service in the
   compose stack, and make sure that `dns_search` for `internal` will
   use the Docker DNS resolver when dealing with these hosts. Note that
   these compose file settings only affect the configuration of the
   DNS resolver, *not* resolv.conf. This is different from the
   docker run behavior, which *does* modify resolv.conf. Also note,
   config file locations vary depending on whether or not systemd is
   running in the container.

   It's not "safe" to refer to services in the cluster by only their
   short service names like `puppet`, `puppetdb` or `postgres` as they
   can conflict with hosts on the external network with these names
   when `resolv.conf` appends DNS search suffixes.

   When docker compose creates the user defined network, it copies the
   DNS settings from the host to the `resolv.conf` in each of the
   containers. This often takes search domains from the outside network
   and applies them to containers.

   When network resolutions happen, any default search suffix will be
   applied to short names when the dns option for ndots is not set to 0.
   So for instance, given a `resolv.conf` that contains:

   search delivery.puppetlabs.net

   A DNS request for `puppet` becomes `puppet.delivery.puppetlabs.net`
   which will fail to resolve in the Docker DNS resolver, then be sent
   to the next DNS server in the `nameserver` list, which may resolve it
   to a different host in the external network. This happens
   because `resolv.conf` also sets secondary DNS servers from the host.

   While it is possible to try and service requests for an external
   domain like `delivery.puppetlabs.net` with the embedded Docker DNS
   resolver, it's better to instead choose a domain suffix to use inside
   the cluster.

   There are some good details on how various network types configure:
   docker/for-linux#488 (comment)

 - Note that the .internal domain is typically not recommended for
   production given the only IANA reserved domains are .example, .test,
   .invalid or .localhost. However, given the DNS resolver is set to
   own the resolution of .internal, this is a compromise.

   In production it's recommended to use a subdomain of a domain that
   you own, but that's not yet configurable in this compose file. A
   future commit will make this configurable.

 - Another workaround for this problem would be to set the ndots option
   in resolv.conf to 0 per the documentation at
   http://man7.org/linux/man-pages/man5/resolv.conf.5.html

   However that can't be done for two reasons:

   - docker-compose schema doesn't actually support setting DNS options
     docker/cli#1557

   - k8s sets ndots to 5 by default, so we don't want to be at odds

 - A further, but implausible workaround would be to modify the host DNS
   settings to remove any search suffixes.

 - The original FQDN change being reverted in this commit was introduced
   in 2549f19

   "
   Lastly, the Windows specific docker-compose.windows.yml sets up a
   custom alias in the "default" network so that an extra DNS name for
   puppetserver can be set based on the FQDN that Facter determines.
   Without this additional DNS reservation, the `puppetserver ca`
   command will be unable to connect to the REST endpoint.

   A better long-term solution is making sure puppetserver is setup to
   point to `puppet` as the host instead of an FQDN.
   "

   With the PUPPETSERVER_HOSTNAME value set on the puppetserver
   container, both certname and server are set to puppet.internal,
   preventing a need to synchronize a domain name.

 - Note that at this time there is also a discrepancy in how Facter 3
   behaves vs Facter 2.

   The Facter 2 gem is being used by the `puppetserver ca` gem based
   application, and may return a different value for
   Facter.value('domain') than calling `facter domain` at the command
   line.  Such is the case inside the puppet network, where Facter 2
   returns `ops.puppetlabs.net` while Facter 3 returns the value
   `delivery.puppetlabs.net`

   This discrepancy makes it so that the `puppetserver ca` application
   cannot find the client side cert on disk and fails outright.

   Facter 2 should not be included in the puppetserver packages, and
   changes have been made to packaging for future releases.

   For now, setting PUPPETSERVER_HOSTNAME configuration value in the
   puppetserver container will set the `puppet.conf` values explicitly
   to the desired DNS name to work around this problem.

 - Resolution of `postgres.internal` seems to rely on having the
   `hostname` value explicitly defined in the docker-compose file, even
   though hostname values supposedly don't interact with DNS in docker

 - This PR is also made possible by switching over to using the Ubuntu
   based container from the Alpine container (performed in a prior
   commit), due to DNS resolution problems with Alpine inside LCOW:

   moby/libnetwork#2371
   microsoft/opengcs#303

 - Another avenue that was investigated to resolve the DNS problem in
   Alpine was to feed host:ip mappings in through --add-host, but it
   turns out that Windows doesn't yet support that feature per

   docker/for-win#1455

 - Finally, these changes are also made in preparation of switching the
   pupperware-commercial repo over to a private builder
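
The search-suffix behavior described in the commit message above can be illustrated with a short sketch. It assumes the ordering documented in resolv.conf(5): a name with fewer dots than ndots is tried with each search domain first and as-is last, while a name with at least ndots dots is tried as-is first. The function name `candidates` is hypothetical, and other resolvers (such as musl in Alpine) may order or limit these attempts differently.

```shell
#!/bin/sh
# Sketch: list the names a stub resolver would try, in order, for a query.
# Usage: candidates NAME NDOTS [SEARCH_DOMAIN...]
candidates() {
    name=$1
    ndots=$2
    shift 2

    # A trailing dot marks the name as absolute: no search list is applied.
    case $name in
        *.) printf '%s\n' "$name"; return 0 ;;
    esac

    # Count the dots by stripping one dotted label per iteration.
    rest=$name
    dots=0
    while case $rest in *.*) true ;; *) false ;; esac; do
        rest=${rest#*.}
        dots=$((dots + 1))
    done

    if [ "$dots" -ge "$ndots" ]; then
        # Enough dots: try the name as given, then fall back to the search list.
        printf '%s\n' "$name"
        for domain in "$@"; do printf '%s.%s\n' "$name" "$domain"; done
    else
        # Short name: the search list is applied first, the bare name last.
        for domain in "$@"; do printf '%s.%s\n' "$name" "$domain"; done
        printf '%s\n' "$name"
    fi
}
```

With these assumptions, `candidates puppet 1 delivery.puppetlabs.net` lists `puppet.delivery.puppetlabs.net` before `puppet`, whereas `candidates puppet.internal 1 delivery.puppetlabs.net` lists `puppet.internal` first, which is the reason for giving each service an explicit `.internal` suffix.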
Iristyle added a commit to Iristyle/pupperware that referenced this issue May 4, 2019
Iristyle added a commit to Iristyle/puppetdb that referenced this issue May 4, 2019
 - Unexpectedly, a Travis failure was also encountered where 30 seconds
   of running `host postgres.internal` failed, but the immediately
   subsequent call to `dig postgres.internal` succeeded.

   Running dig seems to prime a local cache, so perform a dig prior
   to host in an effort to help fix this problem, given the PDB
   container is based on Alpine

   microsoft/opengcs#303
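
The "dig before host" approach described in the commit message above might look roughly like the following. This is a sketch only, not the code from that commit: `wait_for_host` is a hypothetical helper, the 30 second default mirrors the timeout mentioned above, and it assumes `dig` and `host` are installed in the container (for example via `apk add bind-tools`).

```shell
#!/bin/sh
# Hypothetical helper: wait until NAME resolves, issuing one dig first.
# Usage: wait_for_host NAME [TIMEOUT_SECONDS]
wait_for_host() {
    name=$1
    timeout=${2:-30}

    # One dig up front; its result is deliberately ignored.
    dig "$name" >/dev/null 2>&1 || true

    elapsed=0
    until host "$name" >/dev/null 2>&1; do
        if [ "$elapsed" -ge "$timeout" ]; then
            echo "gave up on $name after ${timeout}s" >&2
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}
```

A container entrypoint could then call `wait_for_host postgres.internal || exit 1` before starting the service.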
Iristyle added a commit to puppetlabs/puppetdb that referenced this issue May 4, 2019
Iristyle added a commit to Iristyle/pupperware that referenced this issue May 6, 2019
 - Remove the domain introspection / setting of AZURE_DOMAIN env var
   as this does not work as originally thought.

   Instead, hardcode the DNS suffix `.internal` to each service in the
   compose stack, and make sure that `dns_search` for `internal` will
   use the Docker DNS resolver when dealing with these hosts. Note that
   these compose file settings only affect the configuration of the
   DNS resolver, *not* resolv.conf. This is different from the
   docker run behavior, which *does* modify resolv.conf. Also note,
   config file locations vary depending on whether or not systemd is
   running in the container.

   It's not "safe" to refer to services in the cluster by only their
   short service names like `puppet`, `puppetdb` or `postgres` as they
   can conflict with hosts on the external network with these names
   when `resolv.conf` appends DNS search suffixes.

   When docker compose creates the user defined network, it copies the
   DNS settings from the host to the `resolv.conf` in each of the
   containers. This often takes search domains from the outside network
   and applies them to containers.

   When network resolutions happen, any default search suffix will be
   applied to short names when the dns option for ndots is not set to 0.
   So for instance, given a `resolv.conf` that contains:

   search delivery.puppetlabs.net

   A DNS request for `puppet` becomes `puppet.delivery.puppetlabs.net`
   which will fail to resolve in the Docker DNS resolver, then be sent
   to the next DNS server in the `nameserver` list, which may resolve it
   to a different host in the external network. This happens
   because `resolv.conf` also sets secondary DNS servers from the host.

   While it is possible to try and service requests for an external
   domain like `delivery.puppetlabs.net` with the embedded Docker DNS
   resolver, it's better to instead choose a domain suffix to use inside
   the cluster.

   There are some good details on how various network types configure:
   docker/for-linux#488 (comment)

 - Note that the .internal domain is typically not recommended for
   production given the only IANA reserved domains are .example, .test,
   .invalid or .localhost. However, given the DNS resolver is set to
   own the resolution of .internal, this is a compromise.

   In production it's recommended to use a subdomain of a domain that
   you own, but that's not yet configurable in this compose file. A
   future commit will make this configurable.

 - Another workaround for this problem would be to set the ndots option
   in resolv.conf to 0 per the documentation at
   http://man7.org/linux/man-pages/man5/resolv.conf.5.html

   However that can't be done for two reasons:

   - docker-compose schema doesn't actually support setting DNS options
     docker/cli#1557

   - k8s sets ndots to 5 by default, so we don't want to be at odds

 - A further, but implausible workaround would be to modify the host DNS
   settings to remove any search suffixes.

 - The original FQDN change being reverted in this commit was introduced
   in 2549f19

   "
   Lastly, the Windows specific docker-compose.windows.yml sets up a
   custom alias in the "default" network so that an extra DNS name for
   puppetserver can be set based on the FQDN that Facter determines.
   Without this additional DNS reservation, the `puppetserver ca`
   command will be unable to connect to the REST endpoint.

   A better long-term solution is making sure puppetserver is setup to
   point to `puppet` as the host instead of an FQDN.
   "

   With the PUPPETSERVER_HOSTNAME value set on the puppetserver
   container, both certname and server are set to puppet.internal,
   inside of puppet.conf, preventing a need to inject a domain name as
   was done previously.

   This is necessary because of a discrepancy in how Facter 3 behaves vs
   Facter 2, which creates a mismatch between how the host cert is
   initially generated (using Facter 3) and how `puppetserver ca`
   finds the files on disk (using Facter 2), that setting
   PUPPETSERVER_HOSTNAME will explicitly work around.

   Specifically, Facter 2 may return a different Facter.value('domain')
   than calling `facter domain` using Facter 3 at the command line.
   Such is the case inside the puppet network, where Facter 2 returns
   `ops.puppetlabs.net` while Facter 3 returns `delivery.puppetlabs.net`

   Without explicitly setting PUPPETSERVER_HOSTNAME, this makes cert
   files on disk get written as *.delivery.puppetlabs.net, yet the
   `puppetserver ca` application looks for the client certs on disk as
   *.ops.puppetlabs.net, which causes `puppetserver ca` to fail.

 - Facter 2 should not be included in the puppetserver packages, and
   changes have been made to packaging for future releases, which may
   remove the need for the above.

 - This PR is also made possible by switching over to using the Ubuntu
   based container from the Alpine container (performed in a prior
   commit), due to DNS resolution problems with Alpine inside LCOW:

   moby/libnetwork#2371
   microsoft/opengcs#303

 - Another avenue that was investigated to resolve the DNS problem in
   Alpine was to feed host:ip mappings in through --add-host, but it
   turns out that Windows doesn't yet support that feature per

   docker/for-win#1455

 - Finally, these changes are also made in preparation of switching the
   pupperware-commercial repo over to a private builder
Iristyle added a commit to Iristyle/pupperware that referenced this issue May 6, 2019
 - Remove the domain introspection / setting of AZURE_DOMAIN env var
   as this does not work as originally thought.

   Instead, hardcode the DNS suffix `.internal` to each service in the
   compose stack, and make sure that `dns_search` for `internal` will
   use the Docker DNS resolver when dealing with these hosts. Note that
   these compose file settings only affect the configuration of the
   DNS resolver, *not* resolv.conf. This is different from the
   docker run behavior, which *does* modify resolv.conf. Also note,
   config file locations vary depending on whether or not systemd is
   running in the container.

   It's not "safe" to refer to services in the cluster by only their
   short service names like `puppet`, `puppetdb` or `postgres` as they
   can conflict with hosts on the external network with these names
   when `resolv.conf` appends DNS search suffixes.

   When docker compose creates the user defined network, it copies the
   DNS settings from the host to the `resolv.conf` in each of the
   containers. This often takes search domains from the outside network
   and applies them to containers.

   When network resolutions happen, any default search suffix will be
   applied to short names when the dns option for ndots is not set to 0.
   So for instance, given a `resolv.conf` that contains:

   search delivery.puppetlabs.net

   A DNS request for `puppet` becomes `puppet.delivery.puppetlabs.net`
   which will fail to resolve in the Docker DNS resolver, then be sent
   to the next DNS server in the `nameserver` list, which may resolve it
   to a different host in the external network. It behaves this way
   because `resolv.conf` also sets secondary DNS servers from the host.

   While it is possible to try and service requests for an external
   domain like `delivery.puppetlabs.net` with the embedded Docker DNS
   resolver, it's better to instead choose a domain suffix to use inside
   the cluster.

   There are some good details on how various network types configure DNS:
   docker/for-linux#488 (comment)

 - Note that the .internal domain is typically not recommended for
   production given the only IANA reserved domains are .example, .test,
   .invalid or .localhost. However, given the DNS resolver is set to
   own the resolution of .internal, this is a compromise.

   In production it's recommended to use a subdomain of a domain that
   you own, but that's not yet configurable in this compose file. A
   future commit will make this configurable.

 - Another workaround for this problem would be to set the ndots option
   in resolv.conf to 0 per the documentation at
   http://man7.org/linux/man-pages/man5/resolv.conf.5.html

   However that can't be done for two reasons:

   - docker-compose schema doesn't actually support setting DNS options
     docker/cli#1557

   - k8s sets ndots to 5 by default, so we don't want to be at odds

 - A further, but implausible workaround would be to modify the host DNS
   settings to remove any search suffixes.

 - The original FQDN change being reverted in this commit was introduced
   in 2549f19

   "
   Lastly, the Windows specific docker-compose.windows.yml sets up a
   custom alias in the "default" network so that an extra DNS name for
   puppetserver can be set based on the FQDN that Facter determines.
   Without this additional DNS reservation, the `puppetserver ca`
   command will be unable to connect to the REST endpoint.

   A better long-term solution is making sure puppetserver is setup to
   point to `puppet` as the host instead of an FQDN.
   "

   With the PUPPETSERVER_HOSTNAME value set on the puppetserver
   container, both certname and server are set to puppet.internal,
   inside of puppet.conf, preventing a need to inject a domain name as
   was done previously.

   This is necessary because of a discrepancy in how Facter 3 behaves vs
   Facter 2, which creates a mismatch between how the host cert is
   initially generated (using Facter 3) and how `puppetserver ca`
   finds the files on disk (using Facter 2), that setting
   PUPPETSERVER_HOSTNAME will explicitly work around.

   Specifically, Facter 2 may return a different Facter.value('domain')
   than calling `facter domain` using Facter 3 at the command line.
   Such is the case inside the puppet network, where Facter 2 returns
   `ops.puppetlabs.net` while Facter 3 returns `delivery.puppetlabs.net`

   Without explicitly setting PUPPETSERVER_HOSTNAME, this makes cert
   files on disk get written as *.delivery.puppetlabs.net, yet the
   `puppetserver ca` application looks for the client certs on disk as
   *.ops.puppetlabs.net, which causes `puppetserver ca` to fail.

 - Facter 2 should not be included in the puppetserver packages, and
   changes have been made to packaging for future releases, which may
   remove the need for the above.

 - This PR is also made possible by switching over to using the Ubuntu
   based container from the Alpine container (performed in a prior
   commit), due to DNS resolution problems with Alpine inside LCOW:

   moby/libnetwork#2371
   microsoft/opengcs#303

 - Another avenue that was investigated to resolve the DNS problem in
   Alpine was to feed host:ip mappings in through --add-host, but it
   turns out that Windows doesn't yet support that feature per

   docker/for-win#1455

 - Finally, these changes are also made in preparation of switching the
   pupperware-commercial repo over to a private builder

 - Additionally update k8s / Bolt specs to be consistent with updated
   naming
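The search-suffix behavior described in the commit message above can be illustrated with a small sketch. This is a simplified model of the lookup order documented in resolv.conf(5), not the code of any actual resolver; the function name `candidate_names` and its parameters are hypothetical, and real resolvers (glibc, musl) differ in details such as whether later candidates are tried after a failure.

```python
def candidate_names(name, search, ndots=1):
    """Return the order in which names would be queried.

    Simplified model of resolv.conf(5): a name with fewer than
    `ndots` dots is tried with each search suffix first, then as-is.
    A name with at least `ndots` dots is tried as-is first.
    """
    # A trailing dot marks the name as fully qualified: no suffixes apply.
    if name.endswith("."):
        return [name.rstrip(".")]

    suffixed = [f"{name}.{domain}" for domain in search]

    if name.count(".") >= ndots:
        # Enough dots: the absolute query goes first.
        return [name] + suffixed
    # Too few dots: the search list goes first.
    return suffixed + [name]


if __name__ == "__main__":
    search = ["delivery.puppetlabs.net"]
    print(candidate_names("puppet", search))            # default ndots=1
    print(candidate_names("puppet", search, ndots=0))   # the ndots workaround
    print(candidate_names("puppet.internal", search))   # suffixed service name
```

Under this model, the short name `puppet` is first queried as `puppet.delivery.puppetlabs.net`, which is the query that may escape to an external DNS server. A dotted name such as `puppet.internal` is queried as-is first, which is why the commit appends an explicit suffix to each service.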
Iristyle added a commit to Iristyle/puppetdb that referenced this issue Oct 28, 2019
 - Unexpectedly, a Travis failure was also encountered where 30 seconds
   of running `host postgres.internal` failed, but the immediately
   subsequent call to `dig postgres.internal` succeeded.

   Running dig seems to prime a local cache, so perform a dig prior
   to host in an effort to help fix this problem, given the PDB
   container is based on Alpine

   microsoft/opengcs#303
Iristyle pushed a commit to puppetlabs/pupperware that referenced this issue Apr 9, 2021
When testing with the `puppet/puppet-agent-alpine` image on Windows
systems with LCOW we had intermittent failures in DNS resolution that
occurred fairly regularly. It seems to be specifically an interaction
between the base Alpine (3.8 and 3.9) images and Windows/LCOW.

Two issues related to this problem are
moby/libnetwork#2371 and
microsoft/opengcs#303
Iristyle added a commit to puppetlabs/pupperware that referenced this issue Apr 9, 2021
 - Remove the domain introspection / setting of AZURE_DOMAIN env var
   as this does not work as originally thought.

   Instead, hardcode the DNS suffix `.internal` to each service in the
   compose stack, and make sure that `dns_search` for `internal` will
   use the Docker DNS resolver when dealing with these hosts. Note that
   these compose file settings only affect the configuration of the
   DNS resolver, *not* resolv.conf. This is different from the
   docker run behavior, which *does* modify resolv.conf. Also note,
   config file locations vary depending on whether or not systemd is
   running in the container.

   It's not "safe" to refer to services in the cluster by only their
   short service names like `puppet`, `puppetdb` or `postgres` as they
   can conflict with hosts on the external network with these names
   when `resolv.conf` appends DNS search suffixes.

   When docker compose creates the user defined network, it copies the
   DNS settings from the host to the `resolv.conf` in each of the
   containers. This often takes search domains from the outside network
   and applies them to containers.

   When network resolutions happen, any default search suffix will be
   applied to short names when the dns option for ndots is not set to 0.
   So for instance, given a `resolv.conf` that contains:

   search delivery.puppetlabs.net

   A DNS request for `puppet` becomes `puppet.delivery.puppetlabs.net`
   which will fail to resolve in the Docker DNS resolver, then be sent
   to the next DNS server in the `nameserver` list, which may resolve it
   to a different host in the external network. It behaves this way
   because `resolv.conf` also sets secondary DNS servers from the host.

   While it is possible to try and service requests for an external
   domain like `delivery.puppetlabs.net` with the embedded Docker DNS
   resolver, it's better to instead choose a domain suffix to use inside
   the cluster.

   There are some good details on how various network types configure DNS:
   docker/for-linux#488 (comment)

 - Note that the .internal domain is typically not recommended for
   production given the only IANA reserved domains are .example, .test,
   .invalid or .localhost. However, given the DNS resolver is set to
   own the resolution of .internal, this is a compromise.

   In production it's recommended to use a subdomain of a domain that
   you own, but that's not yet configurable in this compose file. A
   future commit will make this configurable.

 - Another workaround for this problem would be to set the ndots option
   in resolv.conf to 0 per the documentation at
   http://man7.org/linux/man-pages/man5/resolv.conf.5.html

   However that can't be done for two reasons:

   - docker-compose schema doesn't actually support setting DNS options
     docker/cli#1557

   - k8s sets ndots to 5 by default, so we don't want to be at odds

 - A further, but implausible workaround would be to modify the host DNS
   settings to remove any search suffixes.

 - The original FQDN change being reverted in this commit was introduced
   in 8b38620

   "
   Lastly, the Windows specific docker-compose.windows.yml sets up a
   custom alias in the "default" network so that an extra DNS name for
   puppetserver can be set based on the FQDN that Facter determines.
   Without this additional DNS reservation, the `puppetserver ca`
   command will be unable to connect to the REST endpoint.

   A better long-term solution is making sure puppetserver is setup to
   point to `puppet` as the host instead of an FQDN.
   "

   With the PUPPETSERVER_HOSTNAME value set on the puppetserver
   container, both certname and server are set to puppet.internal,
   inside of puppet.conf, preventing a need to inject a domain name as
   was done previously.

   This is necessary because of a discrepancy in how Facter 3 behaves vs
   Facter 2, which creates a mismatch between how the host cert is
   initially generated (using Facter 3) and how `puppetserver ca`
   finds the files on disk (using Facter 2), that setting
   PUPPETSERVER_HOSTNAME will explicitly work around.

   Specifically, Facter 2 may return a different Facter.value('domain')
   than calling `facter domain` using Facter 3 at the command line.
   Such is the case inside the puppet network, where Facter 2 returns
   `ops.puppetlabs.net` while Facter 3 returns `delivery.puppetlabs.net`

   Without explicitly setting PUPPETSERVER_HOSTNAME, this makes cert
   files on disk get written as *.delivery.puppetlabs.net, yet the
   `puppetserver ca` application looks for the client certs on disk as
   *.ops.puppetlabs.net, which causes `puppetserver ca` to fail.

 - Facter 2 should not be included in the puppetserver packages, and
   changes have been made to packaging for future releases, which may
   remove the need for the above.

 - This PR is also made possible by switching over to using the Ubuntu
   based container from the Alpine container (performed in a prior
   commit), due to DNS resolution problems with Alpine inside LCOW:

   moby/libnetwork#2371
   microsoft/opengcs#303

 - Another avenue that was investigated to resolve the DNS problem in
   Alpine was to feed host:ip mappings in through --add-host, but it
   turns out that Windows doesn't yet support that feature per

   docker/for-win#1455

 - Finally, these changes are also made in preparation of switching the
   pupperware-commercial repo over to a private builder

 - Additionally update k8s / Bolt specs to be consistent with updated
   naming