Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

log-servfail fails to log host SERVFAIL responses in Unbound 1.19.2 on Ubuntu 24.04.1 LTS #1193

Closed
shubhhq opened this issue Nov 27, 2024 · 7 comments · Fixed by #1198
Closed
Assignees

Comments

@shubhhq
Copy link

shubhhq commented Nov 27, 2024

Describe the bug
On Ubuntu 24.04.1 LTS Noble, log-servfail is not working for SERVFAIL queries, while log-replies continues to function normally.

To reproduce

  • Turn on log-servfail.
  • On unbound 1.19.2 running on Ubuntu 24.04.1 LTS.
  • Perform DNS queries that result in SERVFAIL.
  • Check unbound logs for SERVFAIL entries.

Expected behavior
SERVFAIL responses should be logged when log-servfail is enabled.

System:

  • Unbound version: 1.19.2
  • OS: Ubuntu 24.04.1 LTS
  • unbound -V output:
Version 1.19.2

Configure line: --build=aarch64-linux-gnu --prefix=/usr --includedir=${prefix}/include --mandir=${prefix}/share/man --infodir=${prefix}/share/info --sysconfdir=/etc --localstatedir=/var --disable-option-checking --disable-silent-rules --libdir=${prefix}/lib/aarch64-linux-gnu --runstatedir=/run --disable-maintainer-mode --disable-dependency-tracking --with-pythonmodule --with-pyunbound --enable-subnet --enable-dnstap --enable-systemd --enable-cachedb --with-libhiredis --with-libnghttp2 --with-chroot-dir= --with-dnstap-socket-path=/run/dnstap.sock --disable-rpath --with-pidfile=/run/unbound.pid --with-libevent --enable-tfo-client --with-rootkey-file=/usr/share/dns/root.key --disable-flto --enable-tfo-server
Linked libs: libevent 2.1.12-stable (it uses epoll), OpenSSL 3.0.13 30 Jan 2024
Linked modules: dns64 python cachedb subnetcache respip validator iterator
TCP Fastopen feature available

BSD licensed, see LICENSE in source package for details.
Report bugs to [email protected] or https://github.com/NLnetLabs/unbound/issues

Additional information

  1. log-replies works correctly for all verbosity levels.
  2. SERVFAIL errors for CNAME records do generate error logs, but this behavior is inconsistent across our infrastructure.
  3. The issue persists even after adjusting system parameters like net.core.wmem_max and net.core.rmem_max.
  4. We've verified that log files are accessible and writable, since log-replies is working fine.
  5. We've tested with different verbosity levels for logging.

We've attempted the following troubleshooting steps:

  • Increased net.core.wmem_max and net.core.rmem_max values.
  • Restarted the Unbound service after making changes.

We would appreciate any guidance on how to further diagnose or resolve this issue. Please let me know if you need any additional information or logs to assist in troubleshooting.

@gthess gthess self-assigned this Nov 27, 2024
@gthess
Copy link
Member

gthess commented Nov 27, 2024

log-servfail does work for me here in both 1.19.2 and the latest version.
Is this perhaps some particular servfail case that does not get logged? Could you share some more information about that and maybe the configuration file (or the relevant bits at least)?

@shubhhq
Copy link
Author

shubhhq commented Nov 27, 2024

Yes. On further debugging we found out that we have

    serve-expired: yes
    serve-expired-ttl: 3600            # one hour, in seconds - max time till expired records are served
    serve-expired-client-timeout: 2000  # 2 seconds, in milliseconds - Query timeout after which expired record will be served

this configuration set.
On doing unbound-control set_option serve-expired no or making the serve-expired-client-timeout: 0 the servfails start getting logged again.
So we know a mitigation but not the root cause.

Not sure if this is the expected behaviour. Can you help finding what's happening under the hood?

Sample config

remote-control:
    control-enable: yes
    control-port: 8953

server:
    do-ip6: no
    do-not-query-localhost: no
    interface: 127.0.0.1
    access-control: 0.0.0.0/0 allow
    username: "unbound"
    module-config: "subnetcache validator iterator"
    send-client-subnet: 10.0.0.0/8
    max-client-subnet-ipv4: 20
    rrset-roundrobin: no
    serve-expired: yes
    serve-expired-ttl: 3600            # one hour, in seconds - max time till expired records are served
    serve-expired-client-timeout: 2000  # 2 seconds, in milliseconds - Query timeout after which expired record will be served
    infra-host-ttl: 120                # 2 minutes - Limit the time till a nameserver can be on the blacklist
    infra-keep-probing: yes            # Test servers on the blacklist for reachability one at a time
    prefetch: yes
    cache-min-ttl: 1
    infra-cache-min-rtt: 250
    num-threads: 4
    msg-cache-slabs: 4
    rrset-cache-slabs: 4
    infra-cache-slabs: 4
    key-cache-slabs: 4
    rrset-cache-size: 1000m
    msg-cache-size: 2000m
    so-rcvbuf: 4m
    so-sndbuf: 4m
    so-reuseport: yes
    edns-buffer-size: 4096
    max-udp-size: 4096
    use-syslog: no
    logfile: /etc/unbound/not_working_unbound.log
    log-servfail: yes
    log-replies: yes
    qname-minimisation: no

@gthess
Copy link
Member

gthess commented Nov 27, 2024

Are you coming from a specific version prior to 1.19.2 that this worked?

@shubhhq
Copy link
Author

shubhhq commented Nov 27, 2024

Yes, it did on 1.17.1

@gthess
Copy link
Member

gthess commented Nov 27, 2024

When serve-expired is enabled it will try to replace a SERVFAIL answer with (expired) data from the cache.
If that replace happens, there will be no logging output for the SERVFAIL that was replaced.
But this behavior was there since 1.10.0 already.

Did you also change settings when moving versions?

I think there is a bug there (since 1.10.0 already) that if picking a recently cached record (probably the same SERVFAIL that we try to replace) the log-servfail option would not work; that is with serve-expired enabled.

@shubhhq
Copy link
Author

shubhhq commented Nov 28, 2024

However, in our case we've not been getting the cached response (there's nothing in cache) and getting SERVFAIL as a response, which doesn't feel right.

@gthess
Copy link
Member

gthess commented Nov 29, 2024

SERVFAILs can also get cached as a backoff mechanism if resolution fails. And there is no useful record in the cache IIRC.
But the interesting question at the moment is if you also changed configuration between the versions, especially for the serve-expired options.

gthess added a commit that referenced this issue Dec 2, 2024
@gthess gthess closed this as completed in c55490c Dec 3, 2024
jedisct1 added a commit to jedisct1/unbound that referenced this issue Dec 26, 2024
* nlnet/master:
  - For NLnetLabs#1207: [FR] Support for RESINFO RRType 261 (RFC9606), add   LDNS_RR_TYPE_RESINFO similar to LDNS_RR_TYPE_TXT.
  Changelog entry for NLnetLabs#1204: - Merge NLnetLabs#1204: ci: set persist-credentials: false for actions/checkout   per zizmor suggestion.
  set persist-credentials: false per zizmor suggestion
  - Fix typo in log_servfail.tdir test.
  Changelog entry for NLnetLabs#1187: - Merge NLnetLabs#1187: Create the SSL_CTX for QUIC before chroot and privilege   drop.
  Create the SSL_CTX for QUIC before chroot and privilege drop (NLnetLabs#1187)
  - Safeguard alias loop while looking in the cache for expired answers.
  - Merge NLnetLabs#1198: Fix log-servfail with serve expired and no useful cache   contents.
  - For NLnetLabs#1175, the default value of serve-expired-ttl is set to 86400   (1 day) as suggested by RFC8767.
  Changelog entry for NLnetLabs#1189, NLnetLabs#1197: - Merge NLnetLabs#1189: Fix the dname_str method to cause conversion errors   when the domain name length is 255. - Merge NLnetLabs#1197: dname_str() fixes.
  - For NLnetLabs#1193, introduce log-servfail.tdir and cleanup the log-servfail   setting from other tests.
  - Fix NLnetLabs#1193: log-servfail fails to log host SERVFAIL responses in   Unbound 1.19.2 on Ubuntu 24.04.1 LTS, by not considering cached   failures when trying to reply with expired data.
  - For NLnetLabs#1189, homogenize the input buffer size for dname_str().
  - For NLnetLabs#1189, add unit tests for dname_str() and debug check the input   buffer size.
  Fix the dname_str method to cause conversion errors when the domain name length is 255
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants