Filebeat stops processing log entries after "Read line error: file inactive" #1974

Closed
draenor opened this issue Jul 6, 2016 · 9 comments
@draenor

draenor commented Jul 6, 2016

Filebeat stops processing logs after a while in combination with log rolling; however, I'm not sure whether it is exclusively related to log rolling, as it seemed to happen unrelated to log rolling today (the other harvesters were still fine).

  • Version 1.2.3.
  • OS: Windows Server 2012 R2
  • Steps to reproduce
    • filebeat.exe -c filebeat.yml
    • Wait (until log rolling)

Issue description

  • The state file does not update and no logs are sent from Filebeat (i.e., it's halted indefinitely)
  • This occurred after log rolling (but I'm unsure if exclusively so)
  • Two instances of Filebeat are reading the same log file and both experience the exact same problem, i.e., both get stuck on the exact same offset.
  • When restarting one of the two Filebeat instances, the restarted instance continues to process log lines without problems

Filebeat log:

2016-07-05T15:43:41+02:00 INFO Harvester started for file: ..\logs\foo.log
2016-07-05T15:45:11+02:00 INFO Events sent: 1
2016-07-05T15:45:11+02:00 INFO Registry file updated. 2 states written.
...
2016-07-06T00:09:29+02:00 INFO Not resuming rotated file: ..\logs\foo.log
2016-07-06T00:09:29+02:00 INFO Harvester started for file: ..\logs\foo.log
2016-07-06T00:09:35+02:00 INFO Events sent: 1
2016-07-06T00:09:35+02:00 INFO Registry file updated. 5 states written.
2016-07-06T01:07:08+02:00 INFO Read line error: file inactive
2016-07-06T01:08:51+02:00 INFO Read line error: file inactive

filebeat.yml

filebeat:
  registry_file: .filebeat_registry
  prospectors:
    - paths:
        - "${LOGDIR}/foo.log"
        - "${LOGDIR}/bar.log"
      input_type: log
      document_type: mytype

output:
  logstash:
    hosts: ["host.com:5044"]

shipper:
logging:
  to_syslog: false
  to_files: true

  files:
    path: "${LOGDIR}/filebeat/"
    name: filebeat.log
    rotateeverybytes: 10485760 # = 10MB
    keepfiles: 7
  level: info

If it matters, the crawled logs are generated and rolled by the Java logback library.

@tsg
Contributor

tsg commented Jul 8, 2016

The file inactive error means that Filebeat thinks the file wasn't modified for a long time (1h by default) so it closes it. This discussion seems similar to what you are experiencing.

If it's indeed the same bug, there are two things that you could try:

  • setting a very large value for close_older
  • the 5.0.0-alpha3 release, which should fix this
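
For reference, close_older is configured per prospector; a minimal sketch based on the filebeat.yml above (the 24h value is only an example; the idea is to use something much larger than the default 1h):

filebeat:
  prospectors:
    - paths:
        - "${LOGDIR}/foo.log"
        - "${LOGDIR}/bar.log"
      input_type: log
      document_type: mytype
      close_older: 24h   # keep the file open well past the rotation interval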

@ruflin might have more insights on this.

@ruflin
Contributor

ruflin commented Jul 11, 2016

If I understand this right, Filebeat gets stuck either at the offset where the file got rotated or at the offset where close_older is reached? It seems like the internal state is not properly updated, meaning the harvester does not properly communicate that the crawling was finished.

The state part was completely rewritten in Filebeat 5.0.0-*. It would be really helpful if you could try the 5.0.0 snapshot to see if this solves the problem: https://beats-nightlies.s3.amazonaws.com/index.html?prefix=filebeat/

@draenor
Author

draenor commented Jul 11, 2016

As we can see from the logs, there was at least one event sent after log rolling; however, I can try the latest 5.0 snapshot to see if the problem appears again within a couple of days.

@ruflin
Contributor

ruflin commented Jul 12, 2016

@draenor That would be great. Keep me posted.

@draenor
Author

draenor commented Jul 12, 2016

I've been running version 1.2.3 in parallel with the 5.0.0 snapshot. Looking at the registry file of each for the same prospector, there's this:

  • version 1.2.3: "offset":3141
  • version 5.0.0 snapshot: "offset":44309
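
For context, the 1.x registry file maps each source path to the last acknowledged state; a purely illustrative entry (field names and layout vary between versions and operating systems, the FileStateOS values here are made-up placeholders, and the 5.0 snapshot uses a different format) might look like:

{
  "..\\logs\\foo.log": {
    "source": "..\\logs\\foo.log",
    "offset": 3141,
    "FileStateOS": { "idxhi": 0, "idxlo": 0, "vol": 0 }
  }
}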

The logs from 1.2.3

...
2016-07-12T05:25:10+02:00 INFO Events sent: 2
2016-07-12T05:25:10+02:00 INFO Registry file updated. 5 states written.
2016-07-12T06:24:49+02:00 INFO Read line error: file inactive
2016-07-12T06:25:12+02:00 INFO Read line error: file inactive
2016-07-12T11:27:28+02:00 INFO Not resuming rotated file: ..\logs\foo.log
2016-07-12T11:27:28+02:00 INFO Harvester started for file: ..\logs\foo.log
2016-07-12T11:27:28+02:00 INFO Registry file updated. 5 states written.
2016-07-12T11:27:28+02:00 INFO Events sent: 1
2016-07-12T11:27:28+02:00 INFO Registry file updated. 5 states written.
2016-07-12T12:27:33+02:00 INFO Read line error: file inactive

And the last log entry from version 5.0.0 (which, judging by the timestamp, continues where 1.2.3 doesn't):

2016-07-12T13:08:46+02:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.publish.write_bytes=1665 registrar.state_updates=4 libbeat.logstash.published_and_acked_events=1 libbeat.publisher.published_events=1 publish.events=2 libbeat.logstash.publish.read_bytes=6

Thoughts

  • Version 5 so far doesn't seem to suffer from this issue
  • Next, I'll try to set a higher value for close_older in the 1.2.3 version (24h) to see if that helps. I'll keep you posted on the results.
  • With 5 I get this: 2016-07-12T13:37:55+02:00 ERR Error: The service process could not connect to the service controller. It doesn't seem to have any impact, though.

@ruflin
Contributor

ruflin commented Jul 12, 2016

It could be that you hit the following issue which was fixed in 5.0: #1954

About the error above: I'm not 100% sure where it is coming from. Some googling revealed it could be Windows related. @andrewkroh, do you know more here?

Thanks for doing all the testing.

@draenor
Author

draenor commented Jul 15, 2016

With the setting close_older: 24h I still experience file inactive errors (even though logs are set to roll daily).

The 5.0.0 snapshot does not seem to suffer from the same issue.

@ruflin
Contributor

ruflin commented Jul 18, 2016

@draenor Thanks for testing. I suspect that the crawling hangs because of the return channel not receiving data: https://github.com/elastic/beats/blob/1.3/filebeat/crawler/prospector.go#L461 This code no longer exists in 5.0. I will try to create some test cases to reliably reproduce this locally and find a fix.
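
For illustration only, here is a minimal, hypothetical Go sketch (not the actual Filebeat code) of how a harvester goroutine can block forever on a return channel that nobody reads, which would leave its final offset uncollected and the registry stuck at the old value:

package main

import (
	"fmt"
	"time"
)

// harvest simulates a harvester that reports its final offset on a return
// channel when it stops (e.g. after "file inactive"). If nothing ever reads
// from the channel, the send blocks and the goroutine hangs.
func harvest(offset int64, ret chan int64) {
	// ... read lines until the file is considered inactive ...
	ret <- offset // blocks indefinitely if the receiver stopped listening
}

func main() {
	ret := make(chan int64) // unbuffered: every send needs a matching receive
	go harvest(3141, ret)

	// Simulate a prospector that has moved on and never receives: the
	// harvester above stays blocked on the send, so its final state is
	// never written to the registry.
	time.Sleep(100 * time.Millisecond)
	fmt.Println("prospector continues, but the harvester state was never collected")
}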

@ruflin
Contributor

ruflin commented Sep 2, 2016

@draenor I was playing around with this and tried to fix it in 1.3. Unfortunately, the fix introduced other issues, which is the reason we never merged #2274. To fix the issue completely, the state rewrite that was done for 5.0 is needed.

I'm closing this issue as I think it is fixed in 5.0, but please reopen if you discover it also in 5.0 or need more details.

Thanks a lot for investigating this in detail and providing all the data.
