Help debugging a td-agent problem #1384

Closed
donato opened this issue Dec 21, 2016 · 15 comments
Comments

@donato

donato commented Dec 21, 2016

Hey all,

fluentd: 0.12.29
fluent-plugin-s3: 0.7.1
td: 0.10.29
OS: Ubuntu 14.04.5 LTS

The problem is that td-agent stopped outputting its logs. When this happened it began growing in size, and over 3 days it was never able to flush the logs out. We tried to force a flush manually by sending SIGUSR1 signals to both the child and parent processes, but this did not work. Ultimately we had to kill the process and accept the data loss in order to prevent harm to our other production systems.
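For reference, the flush attempt amounted to something like the sketch below (the pid-file path is an assumption for a stock Ubuntu td-agent install; SIGUSR1 is the signal fluentd uses to flush its buffers):

  # Identify the supervisor (parent) and worker (child) processes
  ps -ef | grep '[t]d-agent'
  # Ask both to flush their buffers (pid-file path is an assumption)
  kill -USR1 "$(cat /var/run/td-agent/td-agent.pid)"
  kill -USR1 <child_pid>    # replace <child_pid> with the worker PID from ps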

  <source>
    type tail
    format /^(?<message>"(?<remote>[^ ]*)" (?<host>[^ ]*) (?<user>[^ ]*) \[(?<time>[^\]]*)\] *"(?<method>\S+)(?: +(?<path>[^\"]*) +\S*)?" (?<code>[^ ]*) (?<size>[^ ]*)(?: "(?<referer>[^\"]*)" "(?<agent>[^\"]*)") ?-?)$/
    time_format %d/%b/%Y:%H:%M:%S %z
    keep_time_key true
    path /mnt/ltv_track/access_log
    refresh_interval 60s
    read_from_head true
    read_lines_limit 1000
    pos_file /var/log/td-agent/api_info_log.pos
    rotate_wait 5s
    enable_watch_timer true
    tag s3.api.info
  </source>
  <filter s3.api.info>
    @type grep
    exclude1 agent ELB-HealthChecker
  </filter>
  <match s3.api.info>
    type s3
    s3_bucket ltvytics.com
    auto_create_bucket false
    s3_region us-east-1
    s3_object_key_format %{path}/%{time_slice}/image-pwww-PRD.longtailvideo.com.%{index}.log.%{file_extension}
    path current/incoming
    buffer_type memory
    buffer_chunk_limit 750m
    buffer_queue_limit 12
    flush_interval 10m
    flush_at_shutdown true
    retry_wait 1.0
    max_retry_wait 1800
    retry_limit 32
    disable_retry_limit false
    num_threads 1
    time_slice_format %Y/%m/%d/%H
    time_slice_wait 1m
    utc 
    format single_value
  </match>

Logs

 2016-12-17 15:27:10 +0000 [warn]: before_shutdown failed error="can't modify frozen NoMemoryError"
  2016-12-17 15:27:10 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.29/lib/fluent/plugin/buf_memory.rb:61:in `write'
  2016-12-17 15:27:10 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.29/lib/fluent/plugin/buf_memory.rb:61:in `write_to'
  2016-12-17 15:27:10 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluent-plugin-s3-0.7.1/lib/fluent/plugin/out_s3.rb:389:in `compress'
  2016-12-17 15:27:10 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluent-plugin-s3-0.7.1/lib/fluent/plugin/out_s3.rb:228:in `write'
  2016-12-17 15:27:10 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.29/lib/fluent/buffer.rb:354:in `write_chunk'
  2016-12-17 15:27:10 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.29/lib/fluent/buffer.rb:333:in `pop'
  2016-12-17 15:27:10 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.29/lib/fluent/plugin/buf_memory.rb:100:in `block in before_shutdown'
  2016-12-17 15:27:10 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/2.1.0/monitor.rb:211:in `mon_synchronize'
  2016-12-17 15:27:10 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.29/lib/fluent/plugin/buf_memory.rb:96:in `before_shutdown'
  2016-12-17 15:27:10 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.29/lib/fluent/output.rb:413:in `before_shutdown'
  2016-12-17 15:27:10 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.29/lib/fluent/output.rb:169:in `block in run'
  2016-12-17 15:27:10 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.29/lib/fluent/output.rb:168:in `synchronize'
  2016-12-17 15:27:10 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.29/lib/fluent/output.rb:168:in `ensure in run'
  2016-12-17 15:27:10 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.29/lib/fluent/output.rb:168:in `run'
# This is when we went in and started sending signals, to no avail
2016-12-21 23:07:28 +0000 [info]: force flushing buffered events
2016-12-21 23:10:35 +0000 [info]: force flushing buffered events
2016-12-21 23:16:37 +0000 [info]: force flushing buffered events
2016-12-21 23:16:46 +0000 [info]: force flushing buffered events
2016-12-21 23:18:48 +0000 [info]: force flushing buffered events
@tagomoris
Member

tagomoris commented Dec 22, 2016

I want to make the problem clear:

  • "When this happened it began growing in size": what does the "size" mean? buffer chunk or disk usage? memory size? or any other?
  • Are there any errors in logs when this problem happen?
  • Did you try any configuration change?: (changing num_threads or any others)

@repeatedly
Member

We want to see all of the error logs.

2016-12-17 15:27:10 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.29/lib/fluent/plugin/buf_memory.rb:61:in `write'

This line doesn't include the cause of the error. I assume your logs contain more error messages / stack traces.
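As a sketch, one way to pull just the warning/error lines out of the td-agent log (the path is the default for the td-agent package):

  grep -E '\[(warn|error|fatal)\]' /var/log/td-agent/td-agent.log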

@donato
Author

donato commented Dec 22, 2016

@tagomoris -

  • I was referring to resident memory. The service runs in an autoscaling group, and td-agent typically uses ~300MB of memory on other nodes; however, this one was using 2.7 GB.
  • Other errors: I don't think so. Other services on the same node are working properly, and strace showed td-agent in a loop of just sleeping/polling.
  • I have not tinkered with the configuration at all; for reference, the same configuration is currently running successfully (for now?) on 30 identical instances.

@repeatedly - What's above is all we get from /var/log/td-agent/td-agent.log. The system itself was healthy, so I'm not sure where else to look. I'll gladly dig up more, but if you have a clue about where to look, that would help.

@repeatedly
Member

Hmm... could you upload your /var/log/td-agent/td-agent.log?

@donato
Author

donato commented Dec 22, 2016

I noticed that I had missed a line in my copy/paste of the error; I've fixed it above by adding this line to the top:

2016-12-17 15:27:10 +0000 [warn]: before_shutdown failed error="can't modify frozen NoMemoryError"

The full log file is included, but I removed IP address information by running grep -v "pattern not match" on it first.
td-agent.sanitized.txt

@tagomoris
Member

I missed that you're configuring buffer_type memory.

Of course it uses memory, up to 750MB * 12 (roughly 9GB), if output throughput is not high enough.
I suspect your output flush throughput is slower than your data collection traffic.
Try configuring a bigger num_threads.

(And this is not a bug in fluentd.)
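A sketch of what that change could look like in the <match> block above (the values are illustrative rather than recommendations from this thread; a file buffer is shown as one way to keep the queue off the heap):

  <match s3.api.info>
    type s3
    # ... same S3 / bucket / path settings as above ...
    buffer_type file                          # spill chunks to disk instead of RAM
    buffer_path /var/log/td-agent/buffer/s3   # required when buffer_type is file
    buffer_chunk_limit 256m                   # smaller chunks upload sooner
    buffer_queue_limit 64
    flush_interval 10m
    num_threads 4                             # parallel S3 uploads
  </match>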

@donato
Author

donato commented Dec 22, 2016

@tagomoris - Thanks for the tip. I dug into it. Charts below are in EDT.

[two screenshots: output throughput and memory charts]

It looks like the output throughput dropped at about 14:25 UTC, while memory usage was still at a healthy level. This led to the process holding onto memory until it errored out at 15:27 UTC, as shown in the log excerpt below:

[screenshot: td-agent log excerpt showing the 15:27 UTC error]

Could the out-of-memory error be a red herring? Perhaps the real issue is whatever caused output to suddenly stop being written.

@donato
Author

donato commented Dec 28, 2016

@tagomoris - can we reopen this ticket? My post above explains how running out of memory was a symptom, not a cause.

@repeatedly
Member

I want to know the chronology of the problem.
Did you find out why the output throughput dropped?
Was it a network issue, some operation affecting fluentd, or a single thread being unable to handle your traffic?
The OoM is the result of low output throughput, right?

@donato
Author

donato commented Dec 28, 2016

 2016-12-17 14:25 UTC - network output drops (cause unknown)
 2016-12-17 14:45 UTC - memory approaches its limit
 2016-12-17 15:27 UTC - process fails with NoMemoryError

I don't know why the output dropped, although I have been able to identify another node showing the exact same behavior! This time I've kept it alive to poke around a bit. Is there any way to poke around the s3 plugin to determine why it's not outputting?
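One option, sketched here as an assumption rather than anything confirmed in this thread, is raising the plugin's log verbosity to see what the S3 output is doing:

  <match s3.api.info>
    type s3
    log_level debug    # per-plugin verbosity; alternatively run td-agent with -vv for global trace output
    # ... rest of the match section unchanged ...
  </match>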

Edit: I've attached the sigdump logs, but they don't seem very useful.

sigdump-1661.txt
sigdump-1104.txt

@donato
Author

donato commented Jan 3, 2017

@repeatedly @tagomoris - can we reopen this ticket until a resolution is found?

@repeatedly
Member

@donato Yes.

I checked your sigdump result, but does it contain the dump from the actual child process?
There are no thread logs for the S3 output or other plugins.
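For what it's worth, a sketch of how to dump the worker's threads specifically (this assumes the sigdump gem that ships with fluentd, which writes /tmp/sigdump-<pid>.log when the process receives SIGCONT):

  ps -ef | grep '[t]d-agent'          # two processes: supervisor (parent) and worker (child)
  kill -CONT <worker_pid>             # <worker_pid> is a placeholder for the child's PID
  less /tmp/sigdump-<worker_pid>.log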

@donato
Author

donato commented Jan 4, 2017

@repeatedly - I followed the directions here. Is there an alternative way to get better logs?

I've started thinking this could be an issue with logrotate compressing files, so I'm going to tinker with the "delaycompress" option. Have you seen this cause problems before?
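As a sketch, the kind of logrotate stanza being considered might look like the following (the path and rotation schedule are assumptions; delaycompress keeps the most recent rotated file uncompressed for one cycle so in_tail can finish reading it):

  /mnt/ltv_track/access_log {
      daily
      rotate 7
      compress
      delaycompress    # leave access_log.1 uncompressed until the next rotation
      missingok
      notifempty
  }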

@donato
Author

donato commented Jan 6, 2017

@repeatedly @tagomoris - can we reopen this ticket until a resolution is found?

@repeatedly
Member

repeatedly commented Jan 11, 2017

Reopening is okay. I replied "yes" to you in the previous comment :)
Ah sorry, you can't re-open the issue...
