Issues creating certificates for subdomain with route53 #1008

Open
armsby opened this issue Nov 12, 2019 · 23 comments · May be fixed by #1180
Comments

@armsby

armsby commented Nov 12, 2019

I have been trying to create a certificate using Let's Encrypt and Route53. The certificate I'm trying to create is for 'server.sub.domain.com'. When using Route53, I get an error saying that it cannot find the hosted zone ID for sub.domain.com. I believe that is a bug, as the domain it should be looking for is domain.com, and that does exist; there are no issues creating certificates for that domain.

I have also tested it with Cloudflare for another domain and that works perfectly, so I believe that the problem is in the API call towards Route53.

@MichaelMure

It's happening to me as well with v3.7.0.

@MichaelMure

After some debugging, it looks to me that what is happening is:

  • the code flow somehow ends up in route53.DNSProvider.Present() at the start of the challenge
  • it calls d.getHostedZoneID(fqdn) to figure out what the name of the hosted zone is
  • the call graph continues to dns01.fetchSoaByFqdn(), which performs a recursive DNS lookup to see if there is a SOA record. For example, for foo.bar.example.org, it will query foo.bar.example.org, then bar.example.org, then example.org, which should have this SOA record.
  • the problem happens when a DNS query in dns01.fetchSoaByFqdn() has a temporary failure (say, a timeout). This error is not handled there; the function just skips that node in the domain.
  • if this failure happens at the domain that should have the SOA record (example.org), the function will end up returning org instead of example.org
  • later, the AWS SDK call to find the Route53 hosted zone by name (ListHostedZonesByName) will be called with org instead of example.org and fail
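
The failure mode in the list above can be reproduced in isolation. What follows is a minimal sketch, not lego's actual code: lookupSOA, findZoneSkippingErrors and fakeResolver are hypothetical names, and the resolver is simulated in memory. It only assumes what the list describes, namely that a failed query is treated the same as "no SOA record here".

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// lookupSOA is a hypothetical stand-in for a single SOA query. It reports
// whether name has an SOA record, or an error if the query itself failed.
type lookupSOA func(name string) (bool, error)

// findZoneSkippingErrors mimics the behaviour described above: a failed
// query is treated like "no SOA here" and the walk moves on to the parent.
func findZoneSkippingErrors(fqdn string, lookup lookupSOA) (string, error) {
	labels := strings.Split(strings.TrimSuffix(fqdn, "."), ".")
	for i := range labels {
		name := strings.Join(labels[i:], ".") + "."
		found, err := lookup(name)
		if err != nil {
			continue // the error is silently dropped
		}
		if found {
			return name, nil
		}
	}
	return "", errors.New("no SOA record found for " + fqdn)
}

// fakeResolver simulates a resolver: names in zones have an SOA record,
// and queries for names in failing return a timeout error.
func fakeResolver(zones, failing map[string]bool) lookupSOA {
	return func(name string) (bool, error) {
		if failing[name] {
			return false, errors.New("i/o timeout")
		}
		return zones[name], nil
	}
}

func main() {
	zones := map[string]bool{"example.org.": true, "org.": true}

	// Every query succeeds: the walk stops at the real zone.
	zone, _ := findZoneSkippingErrors("foo.bar.example.org.", fakeResolver(zones, nil))
	fmt.Println(zone) // example.org.

	// The query for example.org. times out: the walk falls through to org.
	flaky := fakeResolver(zones, map[string]bool{"example.org.": true})
	zone, _ = findZoneSkippingErrors("foo.bar.example.org.", flaky)
	fmt.Println(zone) // org.
}
```

With the second resolver, the caller receives org. with no error at all, which matches the symptom of the hosted zone lookup being made with the wrong name.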

@MichaelMure

So to me this is not a Route53 provider failure, this is a dns01 one.

@ldez
Member

ldez commented Jun 2, 2020

If you have an intermittent timeout, I think you should check your network and its configuration, and the nameservers that you are using.

@ldez
Member

ldez commented Jun 2, 2020

You can also simply configure the DNS timeout.

--dns-timeout

https://go-acme.github.io/lego/usage/cli/#usage

@ldez ldez added the question label Jun 2, 2020
@armsby
Author

armsby commented Jun 3, 2020

It does not look like a timeout issue; it only happens when the host name is on a subdomain. I am updating certificates on both sides of the failures for the main domain, and it is only the lego client that fails. I have no issues when using https://github.com/acmesh-official/acme.sh

@MichaelMure

If you have an intermittent timeout, I think you should check your network and its configuration, and the nameservers that you are using.

Sure, but that doesn't change the fact that fetchSoaByFqdn doesn't handle this kind of failure well. This is especially a problem because, if I'm not mistaken, DNS runs over UDP, that is, without any guarantee of packet delivery.

Here is a screenshot where one of those failures happens:

[Screenshot: Capture-20200603134905-1619x1284]

In that case, the node currently being checked will be silently dropped and the function can return an incorrect result, which later cascades into a bigger problem (complete failure of the certificate issuance).

@MichaelMure

@armsby I got the exact same error message as you; that's even how I found this issue. The root cause might be something other than a timeout, but if an error happens when doing a DNS query, you can eventually end up with this final error.

@ldez
Member

ldez commented Jun 3, 2020

@MichaelMure so your problem is a timeout, so you can change the dnsTimeout:

client.Challenge.SetDNS01Provider(provider, dns01.AddDNSTimeout(30*time.Second))

or

--dns-timeout

@MichaelMure

MichaelMure commented Jun 3, 2020

I understand that but that's only a band-aid on this problem. Networking is unreliable by nature, especially UDP. A DNS request can fail for different reasons and the code doing those requests should handle those errors properly if possible.

@ldez
Member

ldez commented Jun 3, 2020

For me, the best way to handle a timeout error is to configure dnsTimeout: this option exists only for that, it's not a band-aid.

@MichaelMure

What if the UDP packet simply gets lost or dropped somewhere on an unreliable connection? No amount of timeout will fix that and it will still show up as a timeout X minutes later.

@ldez
Member

ldez commented Jun 3, 2020

func sendDNSQuery(m *dns.Msg, ns string) (*dns.Msg, error) {
	udp := &dns.Client{Net: "udp", Timeout: dnsTimeout}
	in, _, err := udp.Exchange(m, ns)
	if in != nil && in.Truncated {
		tcp := &dns.Client{Net: "tcp", Timeout: dnsTimeout}
		// If the TCP request succeeds, the err will reset to nil
		in, _, err = tcp.Exchange(m, ns)
	}
	return in, err
}

@MichaelMure

Note: I certainly don't want to start an argument, and as a free software maintainer myself I know that sometimes people get ... inconsiderate. But we should be able to agree on how the code behaves.

@MichaelMure

MichaelMure commented Jun 3, 2020

My understanding of the code section you linked is that a TCP DNS query will be done as a fallback if the UDP reply is too big. But that implies having a valid UDP response so that doesn't handle a packet loss.

edit: this happens when the reply is larger than 512 bytes: https://serverfault.com/questions/587625/why-dns-through-udp-has-a-512-bytes-limit

@ldez
Member

ldez commented Jun 3, 2020

Yes if not a fallback (I know the Truncated meaning) but it's not a simple DNS call.

Otherwise, creating a blind fix without any information to reproduce the issue does not seem to me a good way to proceed.
I can create a retry system, but I need to understand why (currently, UDP by itself is not enough of a reason for me).

@MichaelMure

MichaelMure commented Jun 3, 2020

Ha I see.

Well, I do not know why this particular DNS query fails so often for me; I have an otherwise reliable internet connection. Maybe it's because the certificates I'm trying to generate have a lot of nodes (they are in the form of *.foo.bar.fuu.boo.example.org)? Or maybe I'm just more exposed to this problem because I generate a bunch of those certs in a row.

In any case, the dns.Client.Exchange() function's doc states that:

// Exchange does not retry a failed query, nor will it fall back to TCP in
// case of truncation.

To me it implies that the possible failure is left to the caller to handle.

@MichaelMure

I can of course test whatever solution you come up with and see if that fixes the problem.

@ldez
Member

ldez commented Jun 3, 2020

I will try to create a retry system.

@ldez ldez self-assigned this Jun 3, 2020
@MichaelMure

Thank you :)

@ldez ldez linked a pull request Jun 3, 2020 that will close this issue
@ldez
Member

ldez commented Jun 3, 2020

@MichaelMure could you try #1180 ?

@MichaelMure

I'll give it a try tomorrow. That looks like a good solution.

@MichaelMure

I'm working from home today and I just don't get any timeouts from here. I'll try again from this other place that apparently has less than optimal networking.

3 participants