Help debugging a td-agent problem #1384

Closed
donato opened this issue Dec 21, 2016 · 15 comments
Comments

@donato

donato commented Dec 21, 2016

Hey all,

fluentd: 0.12.29
fluent-plugin-s3: 0.7.1
td: 0.10.29
OS: Ubuntu 14.04.5 LTS

The problem is that td-agent stopped outputting its logs. When this happened it began growing in size, and over 3 days it was never able to flush the logs out. We tried to force a flush manually by sending SIGUSR1 signals to both the child and parent processes, but this did not work. Ultimately we had to kill the process and accept the data loss in order to prevent harm to our other production systems.
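For reference, the flush attempt amounted to something like the sketch below (the pid-file path is an assumption for a stock Ubuntu td-agent install; SIGUSR1 is the signal fluentd uses to flush its buffers):

  # Identify the supervisor (parent) and worker (child) processes
  ps -ef | grep '[t]d-agent'
  # Ask both to flush their buffers (pid-file path is an assumption)
  kill -USR1 "$(cat /var/run/td-agent/td-agent.pid)"
  kill -USR1 <child_pid>    # replace <child_pid> with the worker PID from ps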

  <source>
    type tail
    format /^(?<message>"(?<remote>[^ ]*)" (?<host>[^ ]*) (?<user>[^ ]*) \[(?<time>[^\]]*)\] *"(?<method>\S+)(?: +(?<path>[^\"]*) +\S*)?" (?<code>[^ ]*) (?<size>[^ ]*)(?: "(?<referer>[^\"]*)" "(?<agent>[^\"]*)") ?-?)$/
    time_format %d/%b/%Y:%H:%M:%S %z
    keep_time_key true
    path /mnt/ltv_track/access_log
    refresh_interval 60s
    read_from_head true
    read_lines_limit 1000
    pos_file /var/log/td-agent/api_info_log.pos
    rotate_wait 5s
    enable_watch_timer true
    tag s3.api.info
  </source>
  <filter s3.api.info>
    @type grep
    exclude1 agent ELB-HealthChecker
  </filter>
  <match s3.api.info>
    type s3
    s3_bucket ltvytics.com
    auto_create_bucket false
    s3_region us-east-1
    s3_object_key_format %{path}/%{time_slice}/image-pwww-PRD.longtailvideo.com.%{index}.log.%{file_extension}
    path current/incoming
    buffer_type memory
    buffer_chunk_limit 750m
    buffer_queue_limit 12
    flush_interval 10m
    flush_at_shutdown true
    retry_wait 1.0
    max_retry_wait 1800
    retry_limit 32
    disable_retry_limit false
    num_threads 1
    time_slice_format %Y/%m/%d/%H
    time_slice_wait 1m
    utc 
    format single_value
  </match>

Logs

 2016-12-17 15:27:10 +0000 [warn]: before_shutdown failed error="can't modify frozen NoMemoryError"
  2016-12-17 15:27:10 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.29/lib/fluent/plugin/buf_memory.rb:61:in `write'
  2016-12-17 15:27:10 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.29/lib/fluent/plugin/buf_memory.rb:61:in `write_to'
  2016-12-17 15:27:10 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluent-plugin-s3-0.7.1/lib/fluent/plugin/out_s3.rb:389:in `compress'
  2016-12-17 15:27:10 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluent-plugin-s3-0.7.1/lib/fluent/plugin/out_s3.rb:228:in `write'
  2016-12-17 15:27:10 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.29/lib/fluent/buffer.rb:354:in `write_chunk'
  2016-12-17 15:27:10 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.29/lib/fluent/buffer.rb:333:in `pop'
  2016-12-17 15:27:10 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.29/lib/fluent/plugin/buf_memory.rb:100:in `block in before_shutdown'
  2016-12-17 15:27:10 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/2.1.0/monitor.rb:211:in `mon_synchronize'
  2016-12-17 15:27:10 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.29/lib/fluent/plugin/buf_memory.rb:96:in `before_shutdown'
  2016-12-17 15:27:10 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.29/lib/fluent/output.rb:413:in `before_shutdown'
  2016-12-17 15:27:10 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.29/lib/fluent/output.rb:169:in `block in run'
  2016-12-17 15:27:10 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.29/lib/fluent/output.rb:168:in `synchronize'
  2016-12-17 15:27:10 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.29/lib/fluent/output.rb:168:in `ensure in run'
  2016-12-17 15:27:10 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.29/lib/fluent/output.rb:168:in `run'
# This is when we went in and started sending signals, to no avail
2016-12-21 23:07:28 +0000 [info]: force flushing buffered events
2016-12-21 23:10:35 +0000 [info]: force flushing buffered events
2016-12-21 23:16:37 +0000 [info]: force flushing buffered events
2016-12-21 23:16:46 +0000 [info]: force flushing buffered events
2016-12-21 23:18:48 +0000 [info]: force flushing buffered events
@tagomoris
Member

tagomoris commented Dec 22, 2016

I want to make the problem clear:

  • "When this happened it began growing in size": what does the "size" mean? buffer chunk or disk usage? memory size? or any other?
  • Are there any errors in logs when this problem happen?
  • Did you try any configuration change?: (changing num_threads or any others)

@repeatedly
Member

We want to see all of the error logs.

2016-12-17 15:27:10 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.29/lib/fluent/plugin/buf_memory.rb:61:in `write'

This line doesn't include the cause of the error. I assume your logs contain more error messages / stack traces.
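As a sketch, one way to pull just the warning/error lines out of the td-agent log (the path is the default for the td-agent package):

  grep -E '\[(warn|error|fatal)\]' /var/log/td-agent/td-agent.log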

@donato
Author

donato commented Dec 22, 2016

@tagomoris -

  • I was referring to resident memory. The service runs in an autoscaling group, and td-agent typically uses ~300MB of memory on other nodes; however, this one was using 2.7 GB.
  • Other errors: I don't think so. Other services on the same node are working properly, and strace showed td-agent in a loop of just sleeping/polling.
  • I have not tinkered with the configuration at all; for reference, the same configuration is currently running successfully (for now?) on 30 identical instances.

@repeatedly - What's above is all we get from /var/log/td-agent/td-agent.log. The system itself was healthy, so I'm not sure where else to look. I'll gladly dig up more, but if you have a clue about where to look, that would help.

@repeatedly
Member

Hmm... could you upload your /var/log/td-agent/td-agent.log?

@donato
Author

donato commented Dec 22, 2016

I noticed that I had missed a line in my copy/paste of the error; I've fixed it above by adding this line to the top:

2016-12-17 15:27:10 +0000 [warn]: before_shutdown failed error="can't modify frozen NoMemoryError"

The full log file is included, but I removed IP address information by running grep -v "pattern not match" on it first.
td-agent.sanitized.txt

@tagomoris
Member

I missed that you're configuring buffer_type memory.

Of course it uses memory, up to 750MB * 12 (roughly 9GB), if output throughput is not high enough.
I suspect your output flush throughput is slower than your data collection traffic.
Try configuring a bigger num_threads.

(And this is not a bug in fluentd.)
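A sketch of what that change could look like in the <match> block above (the values are illustrative rather than recommendations from this thread; a file buffer is shown as one way to keep the queue off the heap):

  <match s3.api.info>
    type s3
    # ... same S3 / bucket / path settings as above ...
    buffer_type file                          # spill chunks to disk instead of RAM
    buffer_path /var/log/td-agent/buffer/s3   # required when buffer_type is file
    buffer_chunk_limit 256m                   # smaller chunks upload sooner
    buffer_queue_limit 64
    flush_interval 10m
    num_threads 4                             # parallel S3 uploads
  </match>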

@donato
Author

donato commented Dec 22, 2016

@tagomoris - Thanks for the tip. I dug into it. Charts below are in EDT.

[two screenshots: output throughput and memory charts]

It looks like the output throughput dropped at about 14:25 UTC, while memory usage was still at a healthy level. This led to the process holding onto memory until it errored out at 15:27 UTC, as shown in the log excerpt below:

[screenshot: td-agent log excerpt showing the 15:27 UTC error]

Could the out-of-memory error be a red herring? Perhaps the real issue is whatever caused output to suddenly stop being written.

@donato
Author

donato commented Dec 28, 2016

@tagomoris - can we reopen this ticket? My post above explains how running out of memory was a symptom, not a cause.

@repeatedly
Member

I want to know the chronology of the problem.
Did you find out why the output throughput dropped?
Was it a network issue, some operation affecting fluentd, or a single thread being unable to handle your traffic?
The OoM is the result of low output throughput, right?

@donato
Author

donato commented Dec 28, 2016

 2016-12-17 14:25 UTC - network output drops (cause unknown)
 2016-12-17 14:45 UTC - memory approaches its limit
 2016-12-17 15:27 UTC - process fails with NoMemoryError

I don't know why the output dropped, although I have been able to identify another node showing the exact same behavior! This time I've kept it alive to poke around a bit. Is there any way to poke around the s3 plugin to determine why it's not outputting?
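One option, sketched here as an assumption rather than anything confirmed in this thread, is raising the plugin's log verbosity to see what the S3 output is doing:

  <match s3.api.info>
    type s3
    log_level debug    # per-plugin verbosity; alternatively run td-agent with -vv for global trace output
    # ... rest of the match section unchanged ...
  </match>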

Edit: I've attached the sigdump logs, but they don't seem very useful.

sigdump-1661.txt
sigdump-1104.txt

@donato
Author

donato commented Jan 3, 2017

@repeatedly @tagomoris - can we reopen this ticket until a resolution is found?

@repeatedly
Member

@donato Yes.

I checked your sigdump result, but does it contain the dump from the actual child process?
There are no thread logs for the S3 output or other plugins.
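For what it's worth, a sketch of how to dump the worker's threads specifically (this assumes the sigdump gem that ships with fluentd, which writes /tmp/sigdump-<pid>.log when the process receives SIGCONT):

  ps -ef | grep '[t]d-agent'          # two processes: supervisor (parent) and worker (child)
  kill -CONT <worker_pid>             # <worker_pid> is a placeholder for the child's PID
  less /tmp/sigdump-<worker_pid>.log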

@donato
Author

donato commented Jan 4, 2017

@repeatedly - I followed the directions here. Is there an alternative way to get better logs?

I've started thinking this could be an issue with logrotate compressing files, so I'm going to tinker with the "delaycompress" option. Have you seen this cause problems before?
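As a sketch, the kind of logrotate stanza being considered might look like the following (the path and rotation schedule are assumptions; delaycompress keeps the most recent rotated file uncompressed for one cycle so in_tail can finish reading it):

  /mnt/ltv_track/access_log {
      daily
      rotate 7
      compress
      delaycompress    # leave access_log.1 uncompressed until the next rotation
      missingok
      notifempty
  }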

@donato
Author

donato commented Jan 6, 2017

@repeatedly @tagomoris - can we reopen this ticket until a resolution is found?

@repeatedly
Member

repeatedly commented Jan 11, 2017

Reopening is okay. I replied "yes" to you in the previous comment :)
Ah sorry, you can't re-open the issue...
