
Fluentd is skipping logs in kubernetes #366

Closed
mohitanchlia opened this issue Nov 5, 2019 · 12 comments

@mohitanchlia

Describe the bug
We run multiple kubernetes pods on the same node. Under low traffic there is generally no problem, but under high load, when we generate a 128 MB log file every 2 minutes, fluentd starts to skip logs and gaps appear in ingestion. I tracked the last and first ingested events: they correspond to the last line of one log file and the first line of the next log file fluentd picks up; all log files in between are skipped. We have plenty of CPU available; the Ruby process is only at 50%.

We've tried almost all the settings described in other related threads, such as the watcher and rotate_wait settings, but nothing seems to help.
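The in_tail parameters usually meant by "watcher settings" and "rotate wait settings" are enable_stat_watcher / enable_watch_timer and rotate_wait; a minimal sketch with example values, not our exact settings:

<source>
  @type tail
  tag tail.containers.*
  path /var/log/containers/*.log
  pos_file /var/log/splunk-fluentd-containers.log.pos
  # disable the inotify-based watcher and rely on the timer-based one
  enable_stat_watcher false
  enable_watch_timer true
  # keep reading a rotated file for this long before closing it
  rotate_wait 30s
  refresh_interval 5s
  <parse>
    @type none
  </parse>
</source>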

To Reproduce

<source>
  @id containers.log
  @type tail
  @label @SPLUNK
  tag tail.containers.*
  path /var/log/containers/*.log
  refresh_interval 5s
  pos_file /var/log/splunk-fluentd-containers.log.pos
  path_key source
  read_from_head true
  rotate_wait 10s
  <parse>
    @type json
    time_key time
    time_type string
    time_format %Y-%m-%dT%H:%M:%S.%NZ
    localtime false
  </parse>
</source>

Run 2 similar pods on an 8-core node and drive enough load to generate a 128 MB log file every 2 minutes. You will see that POD A's logs are ingested but none from POD B; then after a few minutes POD B's logs are ingested but none from POD A, and this behavior continues.

***** The only thing that worked was hard-coding individual containers in multiple tail input sections using multiple workers (sketched below). This is not a viable solution because container ids are not known ahead of time.
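For reference, a rough sketch of what that per-container, per-worker workaround looks like; the container-id patterns (aaaa..., bbbb...) are placeholders that would have to be known in advance, and the @SPLUNK label/match sections of our config are omitted:

<system>
  workers 2
</system>

<worker 0>
  <source>
    @type tail
    tag tail.containers.*
    # placeholder pattern for the first container's log file
    path /var/log/containers/*aaaaaaaaaaaa*.log
    pos_file /var/log/splunk-fluentd-containers-0.log.pos
    path_key source
    read_from_head true
    <parse>
      @type json
      time_key time
      time_format %Y-%m-%dT%H:%M:%S.%NZ
    </parse>
  </source>
</worker>

<worker 1>
  <source>
    @type tail
    tag tail.containers.*
    # placeholder pattern for the second container's log file
    path /var/log/containers/*bbbbbbbbbbbb*.log
    pos_file /var/log/splunk-fluentd-containers-1.log.pos
    path_key source
    read_from_head true
    <parse>
      @type json
      time_key time
      time_format %Y-%m-%dT%H:%M:%S.%NZ
    </parse>
  </source>
</worker>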

Expected behavior
All logs are forwarded, with none skipped.
Your Environment
fluentd 1.4
Amazon Linux 2


Your Configuration
See the in_tail source configuration under "To Reproduce" above.

Your Error Log
No errors or warnings are logged.

@ganmacs
Member

ganmacs commented Nov 5, 2019

this one? fluent/fluentd#2236 (comment)

@repeatedly
Member

repeatedly commented Nov 5, 2019

I want to know whether this problem is k8s specific or not.
If you can reproduce this problem without k8s, could you write up reproducible steps / an actual example?

@mohitanchlia
Author

mohitanchlia commented Nov 5, 2019

@repeatedly I can try to reproduce it, but the fact that multiple tail sources with multiple workers work just fine leads me to believe fluentd is not able to keep up. However, Ruby CPU is only at 50%. It looks like with multiple pods it is time-slicing the reads instead of using multiple threads to read each log file?

@mohitanchlia
Author

Can I get some guidance on how to reproduce this in a standalone environment? We use Splunk to visualize, so it's easy to see when logs are skipped or not forwarded. We are pumping > 12K log lines per node per fluentd instance.

@repeatedly
Member

To test fluentd locally, the simplest way is to install fluentd or td-agent and simulate your traffic.

https://docs.fluentd.org/installation

For in_tail, we sometimes use dummer or write a specific script for complicated cases, e.g. rotation. I'm not sure how your application generates logs, so if dummer doesn't fit, writing a specific script is better.
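For example, a minimal standalone config (paths are placeholders) that tails a test directory and prints everything to stdout; generate and rotate files under that directory with dummer or your own script, then compare what fluentd emits with what was written:

<source>
  @type tail
  tag test.app
  path /tmp/fluentd-test/*.log
  pos_file /tmp/fluentd-test/test.log.pos
  read_from_head true
  <parse>
    @type none
  </parse>
</source>

<match test.**>
  @type stdout
</match>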

@mohitanchlia
Author

And how do I verify the rate of ingestion into the output plugin? Send it to a file? Even then, how do I verify it? Are there any tools I can use, or do I have to do the verification manually?

@repeatedly
Member

There are several ways: the flowcounter/flowcounter_simple plugins, the prometheus plugin, a file output plus your own counting script, and more.
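For the file approach, one sketch is to wrap your existing match in @type copy and add a second <store> that writes an audit copy (the first store below is only a placeholder so the sketch runs as-is), then compare line counts in the audit file with what the application wrote:

<match tail.containers.**>
  @type copy
  <store>
    # your existing Splunk output <store> goes here unchanged;
    # @type null is only a stand-in for this sketch
    @type null
  </store>
  <store>
    # audit copy: count lines here vs. lines the app wrote
    @type file
    path /var/log/fluentd-audit/containers
    append true
  </store>
</match>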

@penghuiy

I encountered the same situation on GKE, driving enough load to generate a 50 MB log file every minute, with a Docker rotation size of 30 MB. We found that under normal circumstances in_tail detects log rotation and emits two log lines at the same time:

2019-12-19 16:46:07 +0000 [info]: [fluentd-containers.log] detected rotation of /var/log/containers/notification-549bf6d97f-z6lt2_production_notification-1d2adf6306e58363b61ae8b9e1a3de45fb7c1ecd8f377f39cdbd88ed25cd0e7d.log; waiting 5 seconds
2019-12-19 16:46:07 +0000 [info]: [fluentd-containers.log] following tail of /var/log/containers/notification-549bf6d97f-z6lt2_production_notification-1d2adf6306e58363b61ae8b9e1a3de45fb7c1ecd8f377f39cdbd88ed25cd0e7d.log

But when logs are skipped, only the first line appears, and then a few seconds later both lines appear again:

2019-12-19 16:47:27 +0000 [info]: [fluentd-containers.log] detected rotation of /var/log/containers/notification-549bf6d97f-z6lt2_production_notification-1d2adf6306e58363b61ae8b9e1a3de45fb7c1ecd8f377f39cdbd88ed25cd0e7d.log; waiting 5 seconds

2019-12-19 16:47:39 +0000 [info]: [fluentd-containers.log] detected rotation of /var/log/containers/notification-549bf6d97f-z6lt2_production_notification-1d2adf6306e58363b61ae8b9e1a3de45fb7c1ecd8f377f39cdbd88ed25cd0e7d.log; waiting 5 seconds
2019-12-19 16:47:39 +0000 [info]: [fluentd-containers.log] following tail of /var/log/containers/notification-549bf6d97f-z6lt2_production_notification-1d2adf6306e58363b61ae8b9e1a3de45fb7c1ecd8f377f39cdbd88ed25cd0e7d.log

It does not seem to detect rotations at the right time.

version:
fluentd: 1.2.4

@dengalebr

Is there any solution/workaround for this issue? We are stuck with the same issue, and multiple workers are not helping for us either.

We use SCK (Splunk Connect for Kubernetes), but I don't think that changes anything.

@takryo00

takryo00 commented Nov 1, 2020

I have the same problem, and I think it is caused by the k8s logging mechanism.

  1. Containers write their logs to "/var/lib/docker/containers/*".
  2. kubelet creates symbolic links from the log sources under "/var/lib/docker/containers/*" to "/var/log/pods/~~/" and "/var/log/containers/~~.log".
  3. The fluentd configuration (which we set up) points the in_tail plugin at "/var/log/containers/~~.log".

The symbolic links are created one per pod/container.
This means the in_tail plugin cannot detect the rotation of the underlying source logs.

I can think of 2 ways to resolve this issue (not tested):

  1. Set up the in_tail plugin to read the logs directly (set the path to "/var/lib/docker/containers/~~"; see the sketch after this list).
    If we take this approach, we need some way to identify the logs we want to collect, because that directory holds the logs of every container running on the node.

  2. Write logs to other files directly from the application (do not use stdout/stderr).
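A rough sketch of option 1, assuming the default Docker json-file log driver, which writes <container-id>/<container-id>-json.log under /var/lib/docker/containers. Note that the pod and namespace names encoded in the /var/log/containers filenames are lost this way, so some additional filtering or metadata enrichment would still be needed:

<source>
  @type tail
  tag docker.raw.*
  path /var/lib/docker/containers/*/*-json.log
  pos_file /var/log/fluentd-docker-containers.log.pos
  path_key source
  read_from_head true
  <parse>
    @type json
    time_key time
    time_format %Y-%m-%dT%H:%M:%S.%NZ
  </parse>
</source>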


@github-actions

github-actions bot commented Mar 1, 2021

This issue has been automatically marked as stale because it has been open 90 days with no activity. Remove stale label or comment or this issue will be closed in 30 days

@github-actions github-actions bot added the stale label Mar 1, 2021
@github-actions

github-actions bot commented Apr 1, 2021

This issue was automatically closed because of stale in 30 days

@github-actions github-actions bot closed this as completed Apr 1, 2021