LCOW: Intermittent DNS resolution failures with Alpine containers #2371
I just verified that I'm not seeing the same behavior with Alpine 3.9 on my Mac Docker (I get the reverse pointer fail of Not sure who the right MS folks are to contact - @jhowardmsft or @jterry75? This might end up being a ticket to file in opengcs project. docker info
This updates the puppetserver tests to use `docker-compose` instead of `docker-run`. This also updates the tests to use the shared testing gem from github.com/puppetlabs/pupperware. This also includes a move from the puppet-agent-alpine to puppet-agent-ubuntu for testing. We were seeing a lot of intermittent network failures with the alpine container on windows (LCOW). See moby/libnetwork#2371 for the bug report. This should hopefully clear up the intermittent failures we were seeing.
When testing with the `puppet/puppet-agent-alpine` image on Windows systems with LCOW, we had intermittent failures in DNS resolution that occurred fairly regularly. It seems to be specifically an interaction between the base Alpine images (3.8 and 3.9) and Windows/LCOW. Two related issues are moby/libnetwork#2371 and microsoft/opengcs#303.
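To put a number on "intermittent", one option is to resolve the same name repeatedly from inside the affected container and count the failures. The following is a minimal sketch, not the repro attached to the ticket; the function name `count_resolution_failures` and its parameters are assumptions made for illustration, and the only library call it relies on is `socket.gethostbyname` from the Python standard library.

```python
import socket
from typing import Callable


def count_resolution_failures(
    hostname: str,
    attempts: int = 20,
    resolve: Callable[[str], object] = socket.gethostbyname,
) -> int:
    """Resolve `hostname` `attempts` times and return how many lookups failed.

    `resolve` is injectable so the counting logic can be exercised
    without a real (or flaky) DNS server.
    """
    failures = 0
    for _ in range(attempts):
        try:
            resolve(hostname)
        except OSError:  # socket.gaierror is a subclass of OSError
            failures += 1
    return failures
```

Calling `count_resolution_failures("puppet.internal", attempts=100)` inside a container would report how many of 100 lookups failed. The name `puppet.internal` is taken from the commit messages in this thread, so substitute whatever name is failing in your environment.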
This updates the puppetserver tests to use `docker-compose` instead of `docker-run`. This also updates the tests to use the shared testing gem from github.com/puppetlabs/pupperware. This also includes a move from the puppet-agent-alpine to puppet-agent-ubuntu for testing. We were seeing a lot of intermittent network failures with the alpine container on windows (LCOW). moby/libnetwork#2371 and microsoft/opengcs#303 have more information on this issue. This should hopefully clear up the intermittent name resolution failures we were seeing.
- Remove the domain introspection / setting of AZURE_DOMAIN env var as this does not work as originally thought. Instead, hardcode the DNS suffix `.internal` to each service in the compose stack, and make sure that `dns_search` for `internal` will use the Docker DNS resolver when dealing with these hosts.

  Note that these compose file settings only affect the configuration of the DNS resolver, *not* resolv.conf. This is different from the docker run behavior, which *does* modify resolv.conf. Also note, config file locations vary depending on whether or not systemd is running in the container.

  It's not "safe" to refer to services in the cluster by only their short service names like `puppet`, `puppetdb` or `postgres` as they can conflict with hosts on the external network with these names when `resolv.conf` appends DNS search suffixes.

  When docker compose creates the user defined network, it copies the DNS settings from the host to the `resolv.conf` in each of the containers. This often takes search domains from the outside network and applies them to containers. When network resolutions happen, any default search suffix will be applied to short names when the dns option for ndots is not set to 0.

  So for instance, given a `resolv.conf` that contains:

      search delivery.puppetlabs.net

  A DNS request for `puppet` becomes `puppet.delivery.puppetlabs.net` which will fail to resolve in the Docker DNS resolver, then be sent to the next DNS server in the `nameserver` list, which may resolve it to a different host in the external network. It behaves this way because `resolv.conf` also sets secondary DNS servers from the host.

  While it is possible to try and service requests for an external domain like `delivery.puppetlabs.net` with the embedded Docker DNS resolver, it's better to instead choose a domain suffix to use inside the cluster.

  There are some good details on how various network types configure DNS in docker/for-linux#488 (comment)

- Note that the .internal domain is typically not recommended for production given the only IANA reserved domains are .example, .test, .invalid or .localhost. However, given the DNS resolver is set to own the resolution of .internal, this is a compromise. In production it's recommended to use a subdomain of a domain that you own, but that's not yet configurable in this compose file. A future commit will make this configurable.

- Another workaround for this problem would be to set the ndots option in resolv.conf to 0 per the documentation at http://man7.org/linux/man-pages/man5/resolv.conf.5.html

  However that can't be done for two reasons:

  - docker-compose schema doesn't actually support setting DNS options docker/cli#1557
  - k8s sets ndots to 5 by default, so we don't want to be at odds

- A further, but implausible workaround would be to modify the host DNS settings to remove any search suffixes.

- The original FQDN change being reverted in this commit was introduced in 2549f19

  " Lastly, the Windows specific docker-compose.windows.yml sets up a custom alias in the "default" network so that an extra DNS name for puppetserver can be set based on the FQDN that Facter determines. Without this additional DNS reservation, the `puppetserver ca` command will be unable to connect to the REST endpoint. A better long-term solution is making sure puppetserver is setup to point to `puppet` as the host instead of an FQDN. "

  With the PUPPETSERVER_HOSTNAME value set on the puppetserver container, both certname and server are set to puppet.internal, preventing a need to synchronize a domain name.

- Note that at this time there is also a discrepancy in how Facter 3 behaves vs Facter 2. The Facter 2 gem is being used by the `puppetserver ca` gem based application, and may return a different value for Facter.value('domain') than calling `facter domain` at the command line.

  Such is the case inside the puppet network, where Facter 2 returns `ops.puppetlabs.net` while Facter 3 returns the value `delivery.puppetlabs.net`. This discrepancy makes it so that the `puppetserver ca` application cannot find the client side cert on disk and fails outright.

  Facter 2 should not be included in the puppetserver packages, and changes have been made to packaging for future releases.

  For now, setting the PUPPETSERVER_HOSTNAME configuration value in the puppetserver container will set the `puppet.conf` values explicitly to the desired DNS name to work around this problem.

- Resolution of `postgres.internal` seems to rely on having the `hostname` value explicitly defined in the docker-compose file, even though hostname values supposedly don't interact with DNS in docker

- This PR is also made possible by switching over to using the Ubuntu based container from the Alpine container (performed in a prior commit), due to DNS resolution problems with Alpine inside LCOW:

  moby/libnetwork#2371
  microsoft/opengcs#303

- Another avenue that was investigated to resolve the DNS problem in Alpine was to feed host:ip mappings in through --add-host, but it turns out that Windows doesn't yet support that feature per docker/for-win#1455

- Finally, these changes are also made in preparation of switching the pupperware-commercial repo over to a private builder
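The search-suffix behavior described in the commit message above can be sketched in a few lines. This is a simplified model of the ordering described in resolv.conf(5) for glibc-style resolvers, written for illustration only; it is not Docker's or musl's actual implementation, and the function name `candidate_names` is an assumption.

```python
from typing import List


def candidate_names(name: str, search: List[str], ndots: int = 1) -> List[str]:
    """Return the names a stub resolver would try, in order.

    Simplified model of resolv.conf(5): a name with fewer than `ndots`
    dots is tried with each search suffix first and as-is last; a name
    with at least `ndots` dots is tried as-is first.
    """
    if name.endswith("."):
        # A trailing dot marks the name as fully qualified: no suffixes.
        return [name.rstrip(".")]
    with_suffixes = [f"{name}.{domain}" for domain in search]
    if name.count(".") >= ndots:
        return [name] + with_suffixes
    return with_suffixes + [name]
```

Under this model, `candidate_names("puppet", ["delivery.puppetlabs.net"])` tries `puppet.delivery.puppetlabs.net` before `puppet`, which is the conflict the commit describes. With `ndots=0` the bare name is tried first, and with the default `ndots=1` a name such as `puppet.internal` is also tried as-is first.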
- Remove the domain introspection / setting of AZURE_DOMAIN env var as this does not work as originally thought. Instead, hardcode the DNS suffix `.internal` to each service in the compose stack, and make sure that `dns_search` for `internal` will use the Docker DNS resolver when dealing with these hosts.

  Note that these compose file settings only affect the configuration of the DNS resolver, *not* resolv.conf. This is different from the docker run behavior, which *does* modify resolv.conf. Also note, config file locations vary depending on whether or not systemd is running in the container.

  It's not "safe" to refer to services in the cluster by only their short service names like `puppet`, `puppetdb` or `postgres` as they can conflict with hosts on the external network with these names when `resolv.conf` appends DNS search suffixes.

  When docker compose creates the user defined network, it copies the DNS settings from the host to the `resolv.conf` in each of the containers. This often takes search domains from the outside network and applies them to containers. When network resolutions happen, any default search suffix will be applied to short names when the dns option for ndots is not set to 0.

  So for instance, given a `resolv.conf` that contains:

      search delivery.puppetlabs.net

  A DNS request for `puppet` becomes `puppet.delivery.puppetlabs.net` which will fail to resolve in the Docker DNS resolver, then be sent to the next DNS server in the `nameserver` list, which may resolve it to a different host in the external network. It behaves this way because `resolv.conf` also sets secondary DNS servers from the host.

  While it is possible to try and service requests for an external domain like `delivery.puppetlabs.net` with the embedded Docker DNS resolver, it's better to instead choose a domain suffix to use inside the cluster.

  There are some good details on how various network types configure DNS in docker/for-linux#488 (comment)

- Note that the .internal domain is typically not recommended for production given the only IANA reserved domains are .example, .test, .invalid or .localhost. However, given the DNS resolver is set to own the resolution of .internal, this is a compromise. In production it's recommended to use a subdomain of a domain that you own, but that's not yet configurable in this compose file. A future commit will make this configurable.

- Another workaround for this problem would be to set the ndots option in resolv.conf to 0 per the documentation at http://man7.org/linux/man-pages/man5/resolv.conf.5.html

  However that can't be done for two reasons:

  - docker-compose schema doesn't actually support setting DNS options docker/cli#1557
  - k8s sets ndots to 5 by default, so we don't want to be at odds

- A further, but implausible workaround would be to modify the host DNS settings to remove any search suffixes.

- The original FQDN change being reverted in this commit was introduced in 2549f19

  " Lastly, the Windows specific docker-compose.windows.yml sets up a custom alias in the "default" network so that an extra DNS name for puppetserver can be set based on the FQDN that Facter determines. Without this additional DNS reservation, the `puppetserver ca` command will be unable to connect to the REST endpoint. A better long-term solution is making sure puppetserver is setup to point to `puppet` as the host instead of an FQDN. "

  With the PUPPETSERVER_HOSTNAME value set on the puppetserver container, both certname and server are set to puppet.internal inside of puppet.conf, preventing a need to inject a domain name as was done previously.

  This is necessary because of a discrepancy in how Facter 3 behaves vs Facter 2, which creates a mismatch between how the host cert is initially generated (using Facter 3) and how `puppetserver ca` finds the files on disk (using Facter 2). Setting PUPPETSERVER_HOSTNAME explicitly works around that mismatch.

  Specifically, Facter 2 may return a different Facter.value('domain') than calling `facter domain` using Facter 3 at the command line. Such is the case inside the puppet network, where Facter 2 returns `ops.puppetlabs.net` while Facter 3 returns `delivery.puppetlabs.net`.

  Without explicitly setting PUPPETSERVER_HOSTNAME, this makes cert files on disk get written as *.delivery.puppetlabs.net, yet the `puppetserver ca` application looks for the client certs on disk as *.ops.puppetlabs.net, which causes `puppetserver ca` to fail.

- Facter 2 should not be included in the puppetserver packages, and changes have been made to packaging for future releases, which may remove the need for the above.

- This PR is also made possible by switching over to using the Ubuntu based container from the Alpine container (performed in a prior commit), due to DNS resolution problems with Alpine inside LCOW:

  moby/libnetwork#2371
  microsoft/opengcs#303

- Another avenue that was investigated to resolve the DNS problem in Alpine was to feed host:ip mappings in through --add-host, but it turns out that Windows doesn't yet support that feature per docker/for-win#1455

- Finally, these changes are also made in preparation of switching the pupperware-commercial repo over to a private builder

- Additionally update k8s / Bolt specs to be consistent with updated naming
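The cert-name mismatch described above can be illustrated with a toy model. This is only a sketch of the reasoning, not puppetserver or Facter code: the function `certname`, its parameters, and the short host name `puppet` are hypothetical, and the two domains are the ones quoted in the commit message.

```python
from typing import Optional


def certname(
    short_host: str,
    detected_domain: str,
    explicit_hostname: Optional[str] = None,
) -> str:
    """Hypothetical model: an explicit hostname wins, otherwise the
    name is built from the short host plus whatever domain was detected."""
    if explicit_hostname:
        return explicit_hostname
    return f"{short_host}.{detected_domain}"


# Domain values quoted in the commit message for the two Facter versions.
written_as = certname("puppet", "delivery.puppetlabs.net")  # cert generation
looked_up_as = certname("puppet", "ops.puppetlabs.net")     # `puppetserver ca`
print(written_as == looked_up_as)  # the two names differ, so the lookup misses

# With an explicit hostname, both sides agree regardless of detected domain.
fixed_written = certname("puppet", "delivery.puppetlabs.net", "puppet.internal")
fixed_lookup = certname("puppet", "ops.puppetlabs.net", "puppet.internal")
print(fixed_written == fixed_lookup)
```

The point of the sketch is that an explicit value removes the dependency on domain detection entirely, which is why setting PUPPETSERVER_HOSTNAME sidesteps the discrepancy rather than fixing it.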
- Alpine seems to still be having issues with DNS resolutions inside an LCOW environment. In an effort to reduce these transient problems, switch the base container to a non-Alpine platform.

  A ticket has been filed with a repro at: moby/libnetwork#2371

- While this may increase the image size a bit, the goal here is reliability and robustness

- Ubuntu 18.04 shares a lineage with debian buster, which should be a well supported platform for PDB
- Alpine seems to still be having issues with DNS resolutions inside an LCOW environment. In an effort to reduce these transient problems, switch the base container to a non-Alpine platform.

  A ticket has been filed with a repro for Alpine DNS issues under LCOW: moby/libnetwork#2371

- While this may increase the image size by about 100MB, the goal here is reliability and robustness

  for the builder container:
    clojure:lein-alpine was about 142MB
    clojure:openjdk-8-lein is about 507MB

  for the target container:
    openjdk:8-jre-alpine was about 85MB
    openjdk:8-buster-slim is about 184MB

- Ubuntu 18.04 shares a lineage with debian buster, which should be a well supported platform for PDB

All OpenJDK container variants are listed at: https://github.com/docker-library/docs/blob/master/openjdk/README.md#supported-tags-and-respective-dockerfile-links
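For reference, the size differences implied by the figures quoted above work out as follows. This is plain arithmetic on the approximate numbers from the commit message; the dictionary layout and the helper name `size_increase_mb` are only for illustration.

```python
# Approximate image sizes in MB, as quoted in the commit message:
# (Alpine-based image size, replacement image size)
IMAGE_SIZES_MB = {
    "builder": (142, 507),  # clojure:lein-alpine -> clojure:openjdk-8-lein
    "target": (85, 184),    # openjdk:8-jre-alpine -> openjdk:8-buster-slim
}


def size_increase_mb(before: int, after: int) -> int:
    """Return how much larger the replacement image is, in MB."""
    return after - before


increases = {
    role: size_increase_mb(before, after)
    for role, (before, after) in IMAGE_SIZES_MB.items()
}
print(increases)
```

On these figures the "about 100MB" increase matches the target (shipped) container, which grows by 99MB. The builder image grows by 365MB, but it is only used at build time.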
@Iristyle Thank you very much for this detailed description of the problem, and workaround. In case the following is useful context for anyone looking to fix this issue, I see this issue while running:
(I am running this antique docker version because to the best of my knowledge it is the only one that supports LCOW on Win Server.) I have a sneaking suspicion that the culprit is busybox, and that using version 1.28 will work; 1.29 is broken in terms of nslookup.
can you ptal @pradipd
@daschott - Would you mind triaging?
@mamezgeb what is the latest supported Docker version with LCOW on Server? @3dbrows is running old 17.10 preview version. I know we have https://docs.docker.com/docker-for-windows/wsl-tech-preview/ for Desktop and experimental feature on Docker-CE, but what is the current recommendation for server? @3dbrows did it work to try older busybox image? Do other container images not work as well?
@daschott Busybox 1.28 works (wider discussion here: docker-library/busybox#48). I've seen this nslookup problem in any image containing this version of busybox. My workaround is (on container startup) to use a command that installs
I specify my DNS resolver (for |
Thanks @3dbrows for confirming. Is it possible at all to rebuild on top of busybox 1.28? @Iristyle the point that this appears to work reliably on ubuntu image + on alpine dig works but only nslookup fails is interesting. Can you confirm which busybox version is being used by alpine? Dig works reliably?
@daschott Could try that, best way might be to obtain latest busybox by upgrading the Alpine base image - the version numbers are as follows: Alpine 3.10 has busybox 1.30.1-r3: https://pkgs.alpinelinux.org/packages?name=busybox&branch=v3.10 I do not right this minute have access to my LCOW/WinServer box, but I imagine a good test would be to take @Iristyle 's script above and modify like this: Expected to fail:
Expected to work: |
- Remove the domain introspection / setting of the AZURE_DOMAIN env var, as this does not work as originally thought. Instead, hardcode the DNS suffix `.internal` for each service in the compose stack, and make sure that `dns_search` for `internal` will use the Docker DNS resolver when dealing with these hosts.

  Note that these compose file settings only affect the configuration of the DNS resolver, *not* resolv.conf. This is different from the docker run behavior, which *does* modify resolv.conf. Also note that config file locations vary depending on whether or not systemd is running in the container.

  It's not "safe" to refer to services in the cluster by only their short service names like `puppet`, `puppetdb` or `postgres`, as they can conflict with hosts on the external network with these names when `resolv.conf` appends DNS search suffixes.

  When docker compose creates the user defined network, it copies the DNS settings from the host to the `resolv.conf` in each of the containers. This often takes search domains from the outside network and applies them to containers. When network resolutions happen, any default search suffix will be applied to short names when the dns option for ndots is not set to 0.

  So for instance, given a `resolv.conf` that contains:

    search delivery.puppetlabs.net

  A DNS request for `puppet` becomes `puppet.delivery.puppetlabs.net`, which will fail to resolve in the Docker DNS resolver, then be sent to the next DNS server in the `nameserver` list, which may resolve it to a different host in the external network. It behaves this way because `resolv.conf` also sets secondary DNS servers from the host.

  While it is possible to try and service requests for an external domain like `delivery.puppetlabs.net` with the embedded Docker DNS resolver, it's better to instead choose a domain suffix to use inside the cluster.

  There are some good details on how various network types configure: docker/for-linux#488 (comment)

- Note that the .internal domain is typically not recommended for production, given the only IANA reserved domains are .example, .test, .invalid or .localhost. However, given the DNS resolver is set to own the resolution of .internal, this is a compromise. In production it's recommended to use a subdomain of a domain that you own, but that's not yet configurable in this compose file. A future commit will make this configurable.

- Another workaround for this problem would be to set the ndots option in resolv.conf to 0, per the documentation at http://man7.org/linux/man-pages/man5/resolv.conf.5.html
  However, that can't be done for two reasons:
  - docker-compose schema doesn't actually support setting DNS options: docker/cli#1557
  - k8s sets ndots to 5 by default, so we don't want to be at odds

- A further, but implausible, workaround would be to modify the host DNS settings to remove any search suffixes.

- The original FQDN change being reverted in this commit was introduced in 8b38620:

  " Lastly, the Windows specific docker-compose.windows.yml sets up a custom alias in the "default" network so that an extra DNS name for puppetserver can be set based on the FQDN that Facter determines. Without this additional DNS reservation, the `puppetserver ca` command will be unable to connect to the REST endpoint. A better long-term solution is making sure puppetserver is setup to point to `puppet` as the host instead of an FQDN. "

  With the PUPPETSERVER_HOSTNAME value set on the puppetserver container, both certname and server are set to puppet.internal inside of puppet.conf, preventing a need to inject a domain name as was done previously.

  This is necessary because of a discrepancy in how Facter 3 behaves vs Facter 2, which creates a mismatch between how the host cert is initially generated (using Facter 3) and how `puppetserver ca` finds the files on disk (using Facter 2), that setting PUPPETSERVER_HOSTNAME will explicitly work around.

  Specifically, Facter 2 may return a different Facter.value('domain') than calling `facter domain` using Facter 3 at the command line. Such is the case inside the puppet network, where Facter 2 returns `ops.puppetlabs.net` while Facter 3 returns `delivery.puppetlabs.net`.

  Without explicitly setting PUPPETSERVER_HOSTNAME, this makes cert files on disk get written as *.delivery.puppetlabs.net, yet the `puppetserver ca` application looks for the client certs on disk as *.ops.puppetlabs.net, which causes `puppetserver ca` to fail.

- Facter 2 should not be included in the puppetserver packages, and changes have been made to packaging for future releases, which may remove the need for the above.

- This PR is also made possible by switching over to using the Ubuntu based container from the Alpine container (performed in a prior commit), due to DNS resolution problems with Alpine inside LCOW: moby/libnetwork#2371 microsoft/opengcs#303

- Another avenue that was investigated to resolve the DNS problem in Alpine was to feed host:ip mappings in through --add-host, but it turns out that Windows doesn't yet support that feature per docker/for-win#1455

- Finally, these changes are also made in preparation for switching the pupperware-commercial repo over to a private builder

- Additionally update k8s / Bolt specs to be consistent with updated naming
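The search-suffix behavior described in the commit message above can be illustrated with a rough sketch. It assumes glibc-style handling of `search` and `ndots` as documented in resolv.conf(5); the function name `candidate_names` is hypothetical, and other resolver libraries (for example the musl libc used by Alpine) may order or skip candidates differently.

```python
def candidate_names(name, search_domains, ndots=1):
    """Return the fully qualified names a glibc-style stub resolver
    would try, in order, for `name` (an approximation, not a real API)."""
    # A trailing dot marks the name as already absolute: no search list.
    if name.endswith("."):
        return [name]

    absolute = name + "."
    suffixed = [f"{name}.{domain.rstrip('.')}." for domain in search_domains]

    # With at least `ndots` dots, the name is tried as-is first;
    # otherwise the search suffixes are tried before the bare name.
    if name.count(".") >= ndots:
        return [absolute] + suffixed
    return suffixed + [absolute]


search = ["delivery.puppetlabs.net"]

# Short service name: the external search suffix is tried first.
print(candidate_names("puppet", search))

# Name carrying the hardcoded suffix: tried as-is first with ndots=1.
print(candidate_names("puppet.internal", search))

# ndots=0: even the short name is tried as-is first.
print(candidate_names("puppet", search, ndots=0))
```

Under these assumptions the short name `puppet` is first looked up as `puppet.delivery.puppetlabs.net.`, which is the conflict the commit message describes, while `puppet.internal` is looked up as-is first.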
Preface - I haven't yet debugged this issue enough to know precisely where the issue lies. I do know that I can very trivially reproduce the problem and wanted to at least get the ticket filed / conversation going. It may be related to some combination of:
I'm pretty sure this has something to do with Alpine in particular, since running the failing scenario with Ubuntu containers instead does not fail.
docker info
The LCOW image is built from linuxkit/lcow@d5dfdbc - it includes kernel 4.19.27 amongst other bits. There is an updated kernel image PR that was merged containing newer versions of OpenGCS, Alpine, kernel and runc BUT when I built it, it didn't launch containers and I had to revert (more info in linuxkit/lcow#45 (comment))
compose file to demonstrate the problem
Output from `compose up`
The problem is that DNS resolution failures occur pretty regularly, i.e. `foo` cannot resolve `bar.internal` and vice versa. While the log also shows some successes, there are a number of failures as well (which vary depending on each run).
Workaround
One way to work around the problem is to have the Alpine container perform a `dig` against the host, which presumably will cache the DNS record for future `nslookup` calls.
compose file
Output from `compose up`
The nslookup results have changed quite a bit from:
To
Here's a longer run from the above compose file showing that nslookup no longer fails intermittently.
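For anyone wanting to quantify how intermittent the failures are rather than eyeballing the logs, a small probe along these lines might help. This is only a sketch and not part of the original repro: the function name `probe` is hypothetical, it uses Python's standard `socket.gethostbyname` rather than busybox `nslookup` (so it may not exercise the same code path), and the resolver is injectable so the counting logic can be checked without a network.

```python
import socket
from collections import Counter


def probe(names, attempts=20, resolve=socket.gethostbyname):
    """Resolve each name `attempts` times and return the failure rate
    per name as a float between 0.0 and 1.0."""
    failures = Counter()
    for _ in range(attempts):
        for name in names:
            try:
                resolve(name)
            except OSError:
                # socket.gaierror (resolution failure) is an OSError subclass.
                failures[name] += 1
    return {name: failures[name] / attempts for name in names}
```

Inside one of the containers, something like `probe(["foo.internal", "bar.internal"])` would then report the fraction of failed lookups per host, assuming Python is available in the image.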
Ubuntu results
Compose file
I'll spare the full log here, but after switching to an Ubuntu container, `nslookup` succeeds from the outset: