No way to refresh DNS information leading to indefinite network failures #41570
Comments
Seems like a lot of big programs have gone through the pain of re-discovering this issue. Here's Mozilla Firefox from 14 years ago. And more recently, Chef (and Ruby).
An interesting decision from that Mozilla bug report is:
That seems pretty reasonable, and maybe something that Rust could do too? Specifically, we should probably do this in |
Opened a PR to libc over at rust-lang/libc#585
Sounds like a reasonable solution to me! (calling Thanks for looking into this @jonhoo! |
Do you think it'd be better to add this behavior into |
Nah I think throwing it into |
As discussed in rust-lang#41570, UNIX systems often cache the contents of /etc/resolv.conf, which can cause lookup failures to persist even after a network connection becomes available. This patch modifies lookup_host to force a reload of the nameserver entries following a lookup failure. This is in line with what many C programs already do (see rust-lang#41570 for details). On systems with nscd, this should not be necessary, but not all systems run nscd. Fixes rust-lang#41570. Depends on rust-lang/libc#585.
As discussed in rust-lang#41570, UNIX systems often cache the contents of /etc/resolv.conf, which can cause lookup failures to persist even after a network connection becomes available. This patch modifies lookup_host to force a reload of the nameserver entries following a lookup failure. This is in line with what many C programs already do (see rust-lang#41570 for details). On systems with nscd, this should not be necessary, but not all systems run nscd. Introduces an std linkage dependency on libresolv on macOS/iOS (which also makes it necessary to update run-make/tools.mk). Fixes rust-lang#41570. Depends on rust-lang/libc#585.
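The reload-on-failure idea described in these commit messages can be sketched roughly as follows. This is only an illustration, not the code from the patch: `lookup_with_reload` is a hypothetical helper, and the reload step is passed in as a closure so the example stays within the standard library. On glibc, a real program might call `unsafe { libc::res_init() }` from the `libc` crate inside that closure.

```rust
use std::io;
use std::net::{SocketAddr, ToSocketAddrs};

/// Hypothetical helper: runs `lookup`, and if it fails, calls `reload`
/// before handing the error back, so that the *next* lookup can see a
/// fresh nameserver list. It does not retry by itself.
fn lookup_with_reload<T, L, R>(lookup: L, reload: R) -> io::Result<T>
where
    L: FnOnce() -> io::Result<T>,
    R: FnOnce(),
{
    match lookup() {
        Ok(value) => Ok(value),
        Err(err) => {
            // The failure may be caused by stale resolver configuration,
            // so refresh it before reporting the error to the caller.
            reload();
            Err(err)
        }
    }
}

fn main() {
    let mut reloads = 0;
    let result = lookup_with_reload(
        // An ordinary standard-library lookup; `localhost:80` is a placeholder.
        || {
            "localhost:80"
                .to_socket_addrs()
                .map(|addrs| addrs.collect::<Vec<SocketAddr>>())
        },
        // Placeholder reload hook (assumption): a real program on glibc
        // might call `unsafe { libc::res_init() }` here instead.
        || reloads += 1,
    );
    println!("lookup ok: {}, reloads: {}", result.is_ok(), reloads);
}
```

Keeping the reload step as a parameter also makes it easy to skip on systems where the libc does not cache nameservers.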
Does anybody have a link for the upstream bug? Because programs, or even Rust runtime, are definitely not supposed to do this. So either:
|
@jan-hudec see #41582 for some further discussion. This is a bug in glibc (other libc implementations do not have this problem as they either do not cache, or they flush the cache when the set of nameservers change). It is reported upstream at https://sourceware.org/bugzilla/show_bug.cgi?id=984, but it seems unlikely that a fix will land any time soon. I would argue strongly against your first point above (further indicating that this is a bug): |
Oh, that's why I haven't seen the issue for ages: Debian carries a fix for it.
Go's DNS resolution often defers to the libc implementation, and glibc's resolver has a serious bug: https://sourceware.org/bugzilla/show_bug.cgi?id=984 It will cache the contents of /etc/resolv.conf, which can put the client in a state where all DNS requests fail forever after a network change. The conditions where Go calls into libc are complicated and platform-specific, and the resolver cache involves thread-local state, so repros tend to be inconsistent. But when you hit this on your laptop on the subway or whatever, the effect is that everything is broken until you restart the process. One way to fix this would be to force using the pure-Go resolver (net.DefaultResolver.PreferGo = true), which refreshes /etc/resolv.conf every 5 seconds. I'm wary of doing that, because the Go devs went through an enormous amount of trouble to enable cgo fallback, for various platform- and environment-specific reasons. See all the comments in net/conf.go::initConfVal() and net/conf.go::hostLookupOrder() in the standard library. Instead, we're trying the same workaround that the Rust standard library chose, where we call libc::res_init() after DNS failures. See rust-lang/rust#41570. The downside here is that we have to remember to do this after we make network calls, and that we have to use cgo in the build, but the upside is that it should never break a DNS environment that was working before.
For future reference, this was finally fixed in a recent glibc release, though this workaround will probably need to be in place for a while longer.
Consider the following simple network client:
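The original listing is not preserved in this copy of the thread. A minimal sketch of what such a client might look like, assuming a simple retry loop around `TcpStream::connect`, is shown below. The host `example.com:80`, the helper name `attempt`, and the attempt count are placeholders chosen for illustration.

```rust
use std::net::TcpStream;
use std::thread;
use std::time::Duration;

/// Tries to connect once and reports the outcome as a short status string.
/// Connecting to a hostname forces a DNS lookup on every attempt.
fn attempt(addr: &str) -> String {
    match TcpStream::connect(addr) {
        Ok(_) => "connected".to_string(),
        Err(err) => format!("failed: {}", err),
    }
}

fn main() {
    // Placeholder target (assumption); any host that needs resolving works.
    let addr = "example.com:80";
    // A long-running client would loop forever; the loop is bounded here
    // only so that the example terminates.
    for _ in 0..3 {
        println!("{}", attempt(addr));
        thread::sleep(Duration::from_secs(1));
    }
}
```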
This works fine if you run it while your internet connection is up and running. However, if you kill your network connection, it (obviously) does not. What is interesting is if you launch the program while your internet is offline (and crucially, while `/etc/resolv.conf` does not contain any nameservers), and then connect to the internet again. I would expect the program to eventually say "connected"; however, this is not the case.

This had me puzzled for a while, until I stumbled on this old issue on the Pidgin bug tracker. It turns out that the set of nameservers available when the program is started is cached, and is never automatically re-read. Instead, `res_init` must be called manually to refresh the nameserver list. Unfortunately, as far as I can tell, there is no way in Rust to call `res_init`, and thus the above program simply cannot be made to work in the presence of network failures.

It's not entirely clear what the "right" fix here is: we could simply provide a way to call `res_init`, or we could do something fancier like a special `connect_uncached` that does it for you. Regardless, this seems like a fairly unfortunate shortcoming.
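To make the second option concrete, here is a rough sketch of what a `connect_uncached`-style helper could look like. No such function exists in the standard library; the signature is an assumption, and the refresh step is passed in as a closure so the example compiles without extra crates. On glibc, that closure might wrap `unsafe { libc::res_init() }` from the `libc` crate.

```rust
use std::io;
use std::net::{TcpListener, TcpStream, ToSocketAddrs};

/// Hypothetical helper: refreshes resolver state via `reload`, then
/// connects, so the lookup does not rely on a stale nameserver list.
fn connect_uncached<A, R>(addr: A, reload: R) -> io::Result<TcpStream>
where
    A: ToSocketAddrs,
    R: FnOnce(),
{
    reload();
    TcpStream::connect(addr)
}

fn main() -> io::Result<()> {
    // A local listener stands in for a remote server so that the example
    // runs without network access.
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let addr = listener.local_addr()?;

    let stream = connect_uncached(addr, || {
        // Placeholder reload hook (assumption): a real program on glibc
        // might call `unsafe { libc::res_init() }` here.
    })?;
    println!("connected to {}", stream.peer_addr()?);
    Ok(())
}
```

Reloading before every connection is the simplest variant, but it re-reads the configuration even when nothing has changed; reloading only after a failed lookup avoids that cost.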