Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Endless wait loop between threads resulting in go-dnscollector (0.45+) to hang #729

Closed
romainw opened this issue Jun 4, 2024 · 7 comments · Fixed by #732
Closed

Endless wait loop between threads resulting in go-dnscollector (0.45+) to hang #729

romainw opened this issue Jun 4, 2024 · 7 comments · Fixed by #732
Labels
bug Something isn't working
Milestone

Comments

@romainw
Copy link

romainw commented Jun 4, 2024

When upgrading from 0.41 to 0.45/0.46b, go-dnscollector is now hanging when processing files.
The flow works well for tiny fstrm files, but as soon as the file reaches a certain size apparently, this creates a condition between the different threads causing an endless wait loop.

If we take the first 50 000 lines of the file below (dnstap-2024-06-03-19:56:54.fstrm), the file is successfully processed and then deleted.
If we take the first 100 000 lines of the very same file (for a total of 30MB), go-dnscollector will hang whilst processing the file, and never complete this work or subsequently delete it:

/usr/bin/go-dnscollector -config /etc/dnscollector/config.yml  
INFO: 2024/06/04 16:39:40.307068 worker - [dnstap-server1] fileingestor - file ready to process /var/pdnssoc_input/server1/dnstap-2024-06-03-19:56:54.fstrm
INFO: 2024/06/04 16:39:40.307282 worker - [dnstap-server1] fileingestor - processing dnstap file [dnstap-2024-06-03-19:56:54.fstrm]

Running strace at this point will show this loop (note the "Connection timed out"):

strace: Process 59329 attached with 9 threads
[pid 59337] futex(0xc000099948, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 59336] futex(0xc000306148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 59335] futex(0x2686ed8, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 59334] futex(0x26870e0, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 59333] futex(0xc000098148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 59331] futex(0xc000072948, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 59330] restart_syscall(<... resuming interrupted read ...> <unfinished ...>
[pid 59332] futex(0xc000073148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 59329] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 59329] epoll_pwait(4, 

 <unfinished ...>
[pid 59330] <... restart_syscall resumed>) = -1 ETIMEDOUT (Connection timed out)
[pid 59330] nanosleep({tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid 59329] <... epoll_pwait resumed>[], 128, 5768, NULL, 0) = 0
[pid 59329] futex(0xc000099948, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 59337] <... futex resumed>)        = 0
[pid 59329] <... futex resumed>)        = 1
[pid 59337] futex(0xc000098148, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 59329] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 59337] <... futex resumed>)        = 1
[pid 59333] <... futex resumed>)        = 0
[pid 59329] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 59337] epoll_pwait(4,  <unfinished ...>
[pid 59329] epoll_pwait(4,  <unfinished ...>
[pid 59337] <... epoll_pwait resumed>[], 128, 0, NULL, 0) = 0
[pid 59337] futex(0xc000099948, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 59333] futex(0xc000098148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 59330] <... nanosleep resumed>NULL) = 0
[pid 59330] futex(0x20fd420, FUTEX_WAIT_PRIVATE, 0, {tv_sec=9, tv_nsec=994486541}) = -1 ETIMEDOUT (Connection timed out)
[pid 59330] futex(0xc000098148, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 59333] <... futex resumed>)        = 0
[pid 59330] <... futex resumed>)        = 1
[pid 59333] sched_yield( <unfinished ...>
[pid 59330] nanosleep({tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid 59333] <... sched_yield resumed>)  = 0
[pid 59333] futex(0x20fd338, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 59333] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 59333] futex(0xc000099948, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 59337] <... futex resumed>)        = 0
[pid 59333] <... futex resumed>)        = 1
[pid 59337] futex(0xc000072948, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 59333] nanosleep({tv_sec=0, tv_nsec=3000},  <unfinished ...>
[pid 59337] <... futex resumed>)        = 1
[pid 59331] <... futex resumed>)        = 0
[pid 59331] nanosleep({tv_sec=0, tv_nsec=3000}, NULL) = 0
[pid 59331] futex(0xc000072948, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 59333] <... nanosleep resumed>NULL) = 0
[pid 59333] futex(0xc000098148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 59329] <... epoll_pwait resumed>[], 128, 9999, NULL, 0) = 0
[pid 59329] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 59329] epoll_pwait(4,  <unfinished ...>
[pid 59330] <... nanosleep resumed>NULL) = 0
[pid 59330] nanosleep({tv_sec=0, tv_nsec=10000000}, NULL) = 0
[pid 59330] getpid()                    = 59329
[pid 59330] tgkill(59329, 59337, SIGURG) = 0
[pid 59337] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=59329, si_uid=0} ---
[pid 59330] nanosleep({tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid 59337] rt_sigreturn({mask=[]})     = 824991665720
[pid 59330] <... nanosleep resumed>NULL) = 0
[pid 59330] getpid()                    = 59329
[pid 59330] tgkill(59329, 59337, SIGURG) = 0
[pid 59330] nanosleep({tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid 59337] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=59329, si_uid=0} ---
[pid 59337] rt_sigreturn({mask=[]})     = 825108996752
[pid 59330] <... nanosleep resumed>NULL) = 0
[pid 59330] getpid()                    = 59329
[pid 59330] tgkill(59329, 59337, SIGURG <unfinished ...>
[pid 59337] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=59329, si_uid=0} ---
[pid 59330] <... tgkill resumed>)       = 0
[pid 59337] rt_sigreturn({mask=[]} <unfinished ...>
[pid 59330] nanosleep({tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid 59337] <... rt_sigreturn resumed>) = 825078279520
[pid 59330] <... nanosleep resumed>NULL) = 0
[pid 59330] getpid()                    = 59329
[pid 59330] tgkill(59329, 59337, SIGURG <unfinished ...>
[pid 59337] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=59329, si_uid=0} ---
[pid 59330] <... tgkill resumed>)       = 0
[pid 59337] rt_sigreturn({mask=[]} <unfinished ...>
[pid 59330] nanosleep({tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid 59337] <... rt_sigreturn resumed>) = 825046974064
[pid 59330] <... nanosleep resumed>NULL) = 0
[pid 59330] getpid()                    = 59329
[pid 59330] tgkill(59329, 59337, SIGURG) = 0
[pid 59337] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=59329, si_uid=0} ---
[pid 59330] nanosleep({tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid 59337] rt_sigreturn({mask=[]})     = 824867546816
[pid 59330] <... nanosleep resumed>NULL) = 0
[pid 59330] getpid()                    = 59329
[pid 59330] tgkill(59329, 59337, SIGURG) = 0
[pid 59337] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=59329, si_uid=0} ---
[pid 59330] nanosleep({tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid 59337] rt_sigreturn({mask=[]})     = 824757653032
[pid 59330] <... nanosleep resumed>NULL) = 0
[pid 59330] getpid()                    = 59329
[pid 59330] tgkill(59329, 59337, SIGURG <unfinished ...>
[pid 59337] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=59329, si_uid=0} ---
[pid 59330] <... tgkill resumed>)       = 0
[pid 59337] rt_sigreturn({mask=[]} <unfinished ...>
[pid 59330] nanosleep({tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid 59337] <... rt_sigreturn resumed>) = 824645210520
[pid 59337] futex(0xc000098148, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 59333] <... futex resumed>)        = 0
[pid 59333] futex(0xc000072948, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 59331] <... futex resumed>)        = 0
[pid 59337] futex(0x20fd438, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=100000} <unfinished ...>
[pid 59331] futex(0xc000072948, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 59337] <... futex resumed>)        = -1 ETIMEDOUT (Connection timed out)
[pid 59337] getpid( <unfinished ...>
[pid 59333] futex(0x20fd438, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 59337] <... getpid resumed>)       = 59329
[pid 59333] <... futex resumed>)        = 0
[pid 59333] futex(0xc000072948, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 59331] <... futex resumed>)        = 0
[pid 59337] tgkill(59329, 59333, SIGURG) = 0
[pid 59333] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=59329, si_uid=0} ---
[pid 59331] futex(0xc000072948, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 59337] futex(0xc000099948, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 59333] rt_sigreturn({mask=[]})     = 824815153384
[pid 59330] <... nanosleep resumed>NULL) = 0
[pid 59330] nanosleep({tv_sec=0, tv_nsec=10000000}, NULL) = 0
[pid 59330] getpid()                    = 59329
[pid 59330] tgkill(59329, 59333, SIGURG <unfinished ...>
[pid 59333] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=59329, si_uid=0} ---
[pid 59330] <... tgkill resumed>)       = 0
[pid 59330] nanosleep({tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid 59333] rt_sigreturn({mask=[]})     = 824908148104
[pid 59333] futex(0xc000099948, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 59337] <... futex resumed>)        = 0
[pid 59333] <... futex resumed>)        = 1
[pid 59337] futex(0xc000072948, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 59333] futex(0xc000098148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 59337] <... futex resumed>)        = 1
[pid 59331] <... futex resumed>)        = 0
[pid 59331] futex(0xc000072948, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 59337] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 59337] futex(0xc000072948, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 59331] <... futex resumed>)        = 0
[pid 59331] sched_yield( <unfinished ...>
[pid 59337] futex(0x20fd438, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=100000} <unfinished ...>
[pid 59331] <... sched_yield resumed>)  = 0
[pid 59331] futex(0x20fd438, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 59337] <... futex resumed>)        = 0
[pid 59331] <... futex resumed>)        = 1
[pid 59337] sched_yield()               = 0
[pid 59331] futex(0xc000098148, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 59337] futex(0x20fd338, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 59331] <... futex resumed>)        = 1
[pid 59337] <... futex resumed>)        = 0
[pid 59333] <... futex resumed>)        = 0
[pid 59331] close(9 <unfinished ...>
[pid 59333] futex(0xc000098148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 59331] <... close resumed>)        = 0
[pid 59331] futex(0xc000072948, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 59337] futex(0xc000099948, FUTEX_WAIT_PRI
<snip>

Trying to exit the go-dnscollector at this point with CTRL+C is also hanging, requiring a full kill -9 of the process:

^CWARNING: 2024/06/04 16:50:38.928759 main - exiting...
INFO: 2024/06/04 16:50:38.928798 worker - [dnstap-server1] fileingestor - stopping monitor...
INFO: 2024/06/04 16:50:38.928811 worker - [dnstap-server1] fileingestor - monitor terminated
INFO: 2024/06/04 16:50:38.928822 worker - [dnstap-server1] fileingestor - stopping collect...
INFO: 2024/06/04 16:50:38.928831 worker - [dnstap-server1] fileingestor - stop to listen...
INFO: 2024/06/04 16:50:38.928970 worker - [dnstap-server1] dns processor - stopping monitor...
INFO: 2024/06/04 16:50:38.928986 worker - [dnstap-server1] dns processor - monitor terminated
INFO: 2024/06/04 16:50:38.928995 worker - [dnstap-server1] dns processor - stopping collect...
INFO: 2024/06/04 16:50:38.929004 worker - [dnstap-server1] dns processor - collection terminated
INFO: 2024/06/04 16:50:38.929017 worker - [dnstap-server1] (conn #0) dnstap processor - stopping monitor...
INFO: 2024/06/04 16:50:38.929025 worker - [dnstap-server1] (conn #0) dnstap processor - monitor terminated
INFO: 2024/06/04 16:50:38.929038 worker - [dnstap-server1] (conn #0) dnstap processor - stopping collect...

Just replacing the go-dnscollector 0.45/0.46b binary with version 0.41 "solves" this issue and the fstrm file is processed instantly and then successfully deleted.

@romainw romainw changed the title Endless wait loop between thread resulting in go-dnscollector (0.45+) to hang Endless wait loop between threads resulting in go-dnscollector (0.45+) to hang Jun 4, 2024
@dmachard dmachard added the bug Something isn't working label Jun 4, 2024
@dmachard
Copy link
Owner

dmachard commented Jun 4, 2024

Thank you for reporting this issue.

Could you please provide more details, such as:

  • The complete configuration file?
  • The dnstap-2024-06-03-19:56:54.fstrm file?
  • Have you tried versions 0.42, v0.43, and 0.44? This information will help us identify the version where the regression was introduced.

@romainw
Copy link
Author

romainw commented Jun 4, 2024

Many thanks for the quick reply.

################################################
# Collector configuration
################################################
global:
  #pid-file: "/var/dnscollector/collector.pid"
  trace:
    verbose: true
  server-identity: "pDNSSOC_C"
  text-format: "timestamp-rfc3339ns qr identity operation rcode queryip queryport protocol qname qtype name"
  # default text field delimiter
  text-format-delimiter: " "
  # default text field boundary
  text-format-boundary: "\""

pipelines:
 - name: dnstap-server1
   file-ingestor:
        watch-dir: /var/pdnssoc_input/server1
        watch-mode: dnstap
        delete-after: true
   transforms:
     filtering:
        log-queries: true
        log-replies: true
        drop-queryip-file: /root/dns_servers.txt
   routing-policy:
          forward: [ filelogdomains, filelogips, fileall ]      
 - name: filelogdomains
   logfile:
    file-path: /var/dnscollector/matches/matches_domains.json
    mode: json
   transforms:
     filtering:
      keep-fqdn-file: '/var/dnscollector/misp_domains.txt'

 - name: filelogips
   logfile:
        file-path: /var/dnscollector/matches/matches_ips.json
        mode: json
   transforms:
     filtering:
      keep-rdata-file: '/var/dnscollector/misp_ips.txt'

 - name: fileall
   logfile:
        file-path: /var/dnscollector/queries/queries.json
        mode: json
        flush-interval: 1
        max-size: 100
        max-files: 10
        chan-buffer-size: 65535
          #postrotate-command: "/var/dnscollector/postrotate_query.sh"
        postrotate-delete-success: true

(this is the configuration file upgraded for 0.46b)

The issue appeared with 0.45.0. In other words, 0.44.0 is the last working version for us.

Thanks again!

@dmachard
Copy link
Owner

dmachard commented Jun 5, 2024

Thank for sharing config,
A refactoring of the transformers was done in v0.45.0, I suspect something wrong on it.
Could you test without the filtering transform ?

Can you share your frstm file ? it will be more easy to reproduce in my side.

@dmachard
Copy link
Owner

dmachard commented Jun 5, 2024

I found the regression, the dnstap packet process is never started... with the ingestor.
I will add a test for this use case to avoid this regression

@romainw
Copy link
Author

romainw commented Jun 5, 2024

wow, that was quick! Good work @dmachard!

I cannot easily share the dnstap file, which is coming from our production systems.
However, I can/will try any test or beta release to confirm the issue is fixed.

Thanks again!

dmachard added a commit that referenced this issue Jun 5, 2024
@dmachard dmachard modified the milestone: v0.46.0 Jun 5, 2024
dmachard added a commit that referenced this issue Jun 5, 2024
@dmachard
Copy link
Owner

dmachard commented Jun 5, 2024

fix pushed in master branch and release v0.46.0-beta2

@dmachard dmachard reopened this Jun 5, 2024
@romainw
Copy link
Author

romainw commented Jun 6, 2024

This is just to confirm that the issue has been fixed successfully starting 0.46.0-beta2. Thank you!

@romainw romainw closed this as completed Jun 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants