
Add a new input type to backfill gzipped logs #637

Open · lminaudier opened this issue Jan 6, 2016 · 120 comments
Labels: enhancement, Filebeat, Team:Elastic-Agent-Data-Plane

Comments

@lminaudier

Hi,

Following this discussion on the filebeat forum, I would like to ask whether it is possible to implement a solution to easily backfill old gzipped logs with filebeat.

The proposed solution mentioned in the topic is to add a new dedicated input_type.

It is also mentioned in the topic that when filebeat reaches the end of input on stdin, it does not exit but waits for new lines, which makes backfilling hard to script.

What are your thoughts on this?

Thanks for your hard work.

@ruflin (Contributor) commented Jan 6, 2016

I would see the implementation as follows:

  • Prospector: A gzip prospector would be added. Harvesters would only be opened based on filenames (no inode etc.). It is assumed that the files are not renamed and don't have to be tracked. A file has a completion state, and once it is completed, it is never read again. This simplifies the implementation. If a .gz file with a new filename is found, a harvester is started.
  • Harvester: A gzip harvester is added. It unzips and reads the full file only once. After finishing reading the file, the harvester stops and is never started again. The offset is only stored if reading is interrupted in the middle of the file. In a first implementation, this could even be removed for simplicity (a rough sketch of such a harvester follows below).
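
For concreteness, here is a minimal Go sketch of the harvester half of this design: read a .gz file exactly once, line by line, and report completion. This is a hypothetical illustration, not Filebeat code; harvestGzip, its publish callback, and the example path are made-up names.

package main

import (
	"bufio"
	"compress/gzip"
	"fmt"
	"os"
)

// harvestGzip decompresses path and invokes publish for every line.
// It returns nil only when the whole file was read; at that point the
// caller would mark the file as completed and never harvest it again.
func harvestGzip(path string, publish func(line string)) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	gz, err := gzip.NewReader(f)
	if err != nil {
		return err
	}
	defer gz.Close()

	scanner := bufio.NewScanner(gz)
	for scanner.Scan() {
		publish(scanner.Text())
	}
	return scanner.Err()
}

func main() {
	err := harvestGzip("/var/log/app.log.1.gz", func(line string) {
		fmt.Println(line) // stand-in for forwarding the event to the output
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, "harvest failed:", err)
	}
}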

This would be nice to have, but I think it is not at the top of our priority list. A community contribution here would be more than welcome.

For your second issue, about running filebeat only until first completion, let's refer to this issue: https://github.com/elastic/filebeat/issues/219

@lminaudier (Author)

Thanks for the fast reply and the pointer to the issue.

I will try to look at your implementation proposal. I am still quite new to Golang and the project, so I can't promise anything :)

@ruflin (Contributor) commented Jan 6, 2016

@lminaudier Always here to help.

@monicasarbu added the Filebeat label Jan 6, 2016
@mryanb commented Jan 21, 2016

This would be a great feature addition. Currently the Splunk forwarder does something similar and will automatically index log-rotated files that have been gzipped.

@Ragsboss commented Aug 1, 2016

+1. Is anyone working on this? If not I could possibly take it up..

@ruflin (Contributor) commented Aug 2, 2016

@Ragsboss I don't think anyone is working on that. Would be great to have a PR for that to discuss the details.

@cFire commented Aug 5, 2016

@Ragsboss @ruflin I'm delighted to see there's someone looking to pick this up. Is this happening? The reason I ask is that I may be able to spend some time helping out with this in lieu of building another solution for gzipped logs to use internally.

@Ragsboss commented Aug 5, 2016

@cFire please feel free to take this up. I've started looking at the code, familiarizing myself with Go in general and the Filebeat code, but I haven't started the real work yet. I'll try to help you in any way you want, just let me know.

A few thoughts I had. From a pure functional viewpoint, defining a new input_type doesn't seem ideal, as that would force users to author a new prospector in the config file. Instead, I felt it may be better for the code to automatically deal with compressed files as long as they match the given filename patterns in the config file. The code could instantiate a different harvester (IIUC) based on the file extension/type (see the sketch below). But from an implementation viewpoint, if this is turning out to be difficult, I think it's OK to ask users for some extra config...
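
As a rough illustration of that extension-based dispatch (hedged: openLogReader is a made-up helper, not Filebeat's actual API), the prospector could pick a plain or decompressing reader in Go like this:

package harvester

import (
	"compress/gzip"
	"io"
	"os"
	"path/filepath"
)

// openLogReader opens path and, when the extension is ".gz", wraps it
// in a gzip decompressor so callers read plain text either way. The
// returned cleanup function closes whatever was opened.
func openLogReader(path string) (io.Reader, func() error, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, nil, err
	}
	if filepath.Ext(path) == ".gz" {
		gz, err := gzip.NewReader(f)
		if err != nil {
			f.Close()
			return nil, nil, err
		}
		return gz, func() error {
			gz.Close()
			return f.Close()
		}, nil
	}
	return f, f.Close, nil
}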

@ruflin (Contributor) commented Aug 8, 2016

From an implementation point of view, I think we should go the route of having a separate input type. Currently filebeat is designed so that a prospector and harvester type are tightly coupled: a prospector only starts one type of harvester. It is OK if the gzip harvester reuses lots of code from the log harvester (which I think it will), but tailing a log and reading a file completely exactly once are, from my perspective, two quite different behaviours. The question which will also be raised is whether the gzip files change their name over time (meaning they have to be tracked based on inode/device) or whether it is enough to just store the filename and a read/unread flag in the registry.
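
A minimal in-memory sketch of that "filename plus read/unread flag" registry, in Go (hypothetical names; real Filebeat state handling would also persist to disk):

package registry

import "sync"

// gzipRegistry tracks which .gz files have already been read to EOF.
type gzipRegistry struct {
	mu   sync.Mutex
	done map[string]bool // filename -> fully read?
}

func newGzipRegistry() *gzipRegistry {
	return &gzipRegistry{done: make(map[string]bool)}
}

// ShouldHarvest reports whether the named file still needs reading.
func (r *gzipRegistry) ShouldHarvest(name string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	return !r.done[name]
}

// MarkComplete records that the file was read to EOF once.
func (r *gzipRegistry) MarkComplete(name string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.done[name] = true
}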

Ragsboss pushed a commit to Ragsboss/beats that referenced this issue Aug 10, 2016
This is a proposed approach to fix elastic#637. In this approach, we reuse the existing input type 'log' to transparently process gzipped files, so there is no configuration change for filebeat users except ensuring that the filename patterns match gzipped file names. Filebeat recognizes gzipped files by the '.gz' extension; any other extension will not work.

Ruflin noted an alternative approach of introducing a brand new input type to deal with gzipped files. I'm OK with either approach, but reusing the existing input type seemed more intuitive from the filebeat users' viewpoint, and for me it was easier to implement. I hope this change gives Ruflin a better view of what this approach looks like. If we still feel a new input type is better, we can certainly go down that path.

A few pending things that we can do once we agree this approach is acceptable:
- Test cases where a regular file gets log-rotated and compressed. The compressed file will have a different inode; today this works because filename patterns don't match .gz files, but with this support, .gz files will typically be matched.
- Write new tests.
- Go's compress/gzip package doesn't support seeking, so resuming is not supported. We need to decide whether to support this or handle it gracefully (a sketch of the decompress-and-discard workaround follows).
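
Because gzip streams cannot be seeked, the only way to "resume" at a stored offset is to decompress from the start and throw the already-shipped bytes away. A hedged Go sketch (resumeGzip is a made-up name; offset is assumed to count uncompressed bytes):

package harvester

import (
	"compress/gzip"
	"io"
	"os"
)

// resumeGzip reopens path and discards the first offset uncompressed
// bytes, returning a reader positioned where the previous run stopped.
func resumeGzip(path string, offset int64) (io.Reader, func() error, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, nil, err
	}
	gz, err := gzip.NewReader(f)
	if err != nil {
		f.Close()
		return nil, nil, err
	}
	// No Seek support: re-decompress and discard what was already read.
	if _, err := io.CopyN(io.Discard, gz, offset); err != nil {
		f.Close()
		return nil, nil, err
	}
	return gz, func() error {
		gz.Close()
		return f.Close()
	}, nil
}

(io.Discard is the modern spelling; on the Go versions of that era it was ioutil.Discard.)
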
Ragsboss pushed a commit to Ragsboss/beats that referenced this issue Aug 11, 2016
…ality in filebeat

elastic#637

The test has two gzip files: one that covers the case where the gzip size > raw file size, and another that covers gzip size < raw file size.

This test is failing, as the nasa-360.log.gz is producing duplicated events. A fix for this is coming in the next commit.
Ragsboss pushed a commit to Ragsboss/beats that referenced this issue Aug 11, 2016
I'm not happy about this change, but I'm pushing it temporarily as I'm switching jobs and hence laptops. The problem is that when the gzip file is complete and the next scan finds the state, the offset in the state is greater than the size of the gzip file, so we treat it as a truncated file and reindex from the beginning. The fix here is to add a new ActualSize method to the LogSource interface and have GZipFile return infinity as the actual size. That way, at least the scan will try to seek and fail for gzipped files, thus not doing anything.

I'm thinking a better fix is to modify the state structure to store the fact that we reached EOF on a non-Continuable source, so the prospector can be smart about skipping such files (see the sketch below).

elastic#637
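
A hedged Go sketch of that proposed state change (fileState and needsReread are illustrative names, not the real registry types): once a one-shot source records that EOF was reached, the scan can skip it instead of misreading "offset greater than file size" as truncation.

package prospector

// fileState is a simplified stand-in for a registry entry.
type fileState struct {
	Source   string
	Offset   int64
	Finished bool // EOF reached on a non-continuable (gzip) source
}

// needsReread decides whether the scanner should read the file again.
func needsReread(s fileState, sizeOnDisk int64) bool {
	if s.Finished {
		return false // one-shot source already fully consumed
	}
	// For plain log files, an offset beyond the current size means the
	// file was truncated and must be read again from the beginning.
	return s.Offset > sizeOnDisk
}
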
@ruflin (Contributor) commented Aug 16, 2016

Here is the PR related to the above discussions: #2227

@willsheppard commented Sep 5, 2016

We would like filebeat to be able to read gzipped files, too.

Our main use of filebeat would be to take a set of rotated, gzipped logs that represent the previous day's events, and send them to elasticsearch.

No tailing or running as a service is required, so a "batch mode" would also be good, but other workarounds solely using filebeat would also be acceptable.

@ruflin (Contributor) commented Sep 6, 2016

@willsheppard Thanks for the details. For the batch mode you will be interested in #2456 (in progress)

@ruflin (Contributor) commented Sep 16, 2016

Now that #2456 is merged this feature got even more interesting :-)

@collabccie7

Hello,

Has there been any update regarding support for gzip files? Please let me know.

Thanks.

@cFire commented Oct 18, 2016

Idk about the others, but I've not gotten any time to work on this.

@ruflin (Contributor) commented Oct 24, 2016

No progress yet on this from my side.

@willsheppard

This gzip input filter would be the killer feature for us. We're being forced to consider writing an Elasticsearch ingest script from scratch which writes to the Bulk API, because we need to operate on logs in-place (no space to unzip them), and we would be using batch-mode (#2456) to ingest yesterday's logs from our web clusters.

@maddazzaofdudley

Has there been any movement on this? This is the killer feature for me too.

@woodchalk

Throwing in my support for this feature.

@plumpNation

Would be an awesome feature to have.

@ruflin (Contributor) commented Jan 30, 2017

There is an open PR that still needs some work and also involves quite some discussion: #3070

@jordansissel (Contributor)

> Harvesters would only be opened based on filenames (no inode etc.).

@ruflin I believe inodes may need to be tracked, because logrotate (assuming this is a target use case) renames files and reuses file names, unless another tracking mechanism is used (when is 'hello.txt.1.gz' a "new file"? See the example below).

Example:

% ls -il /tmp/hello.txt*
103196 -rw-rw-r--. 1 jls jls 12 Jan 24 03:17 /tmp/hello.txt
103131 -rw-rw-r--. 1 jls jls 32 Jan 24 03:16 /tmp/hello.txt.1.gz

% cat test.conf
/tmp/hello.txt {
  rotate 5
  compress
}

% logrotate -s /tmp/example.logrotate -f test.conf

% ls -il /tmp/hello.txt*
103218 -rw-rw-r--. 1 jls jls 32 Jan 24 03:17 /tmp/hello.txt.1.gz
103131 -rw-rw-r--. 1 jls jls 32 Jan 24 03:16 /tmp/hello.txt.2.gz

^^ Above, 'hello.txt.2.gz' is the same file (inode) as previous 'hello.txt.1.gz'.

We can probably achieve this without tracking inodes (tracking modification time and only reading .gz files after they have been idle for some small time?), but I think the filename alone is not enough because file names are reused, right?
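
A small Go sketch of that modification-time heuristic (isIdle is a made-up helper): only harvest a .gz file once it has been left untouched for some grace period, so a file that logrotate is still renaming is left alone.

package prospector

import (
	"os"
	"time"
)

// isIdle reports whether path has not been modified for at least idle,
// which suggests rotation/compression of the file has finished.
func isIdle(path string, idle time.Duration) (bool, error) {
	info, err := os.Stat(path)
	if err != nil {
		return false, err
	}
	return time.Since(info.ModTime()) >= idle, nil
}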

@ruflin (Contributor) commented Jan 31, 2017

I hit exactly this issue during the implementation. That is why the implementation in #3070 is not fully consistent with the initial theory in this thread.

The main difference from a "normal" file is that a .gz file is expected to never change, and if it does change, the complete file is read from the beginning again.

@ITSEC-Hescalona

+1 for gzip please

@notzippy commented May 7, 2020

As an alternative, you can use mkfifo to create a named pipe, point filebeat's stdin at it, and pipe all your files into that pipe. That way you do not need to worry about when you are finished ingesting a file, because you can just start piping the next file in. So my zcats go into the pipe with zcat ... > fb_input, and in another terminal I have filebeat < fb_input.

@bunnHHA commented Jun 19, 2020

Any update on gzip files? Using zcat solves some issues, but it has to be done manually every time.

@jsoriano added the Team:Services (Deprecated) label Nov 5, 2020
@elasticmachine (Collaborator)

Pinging @elastic/integrations-services (Team:Services)

@dwdixon commented Nov 19, 2020

+1 to adding this feature...is there any update on the timeline for this being added to Filebeat?

Edit: Sorry for "spam" just saw that comment... : )

@cancer13

+1
any update?

@NemchinovSergey commented Nov 6, 2021

+1
parsing gzip would be great!
files with plain text are huge!

@arasic commented Dec 12, 2021

Hello everyone,

The feature has not been implemented, even though it has been requested by many users. The ticket #2227 was closed because there was no progress for some time. Some progress has been made, but I don't know what's remaining.

I'm not sure whether the feature wasn't implemented because it's a challenging problem or because of a lack of interest.

From what I understood, there are two difficulties:

  • it is difficult to track/keep the offset of what has already been processed
  • the files can be larger when extracted, like 50 GB and more

Did I miss any other challenging question?

So now we've realized that not only can we not process the .gz files, they should also be excluded via exclude_files: [".gz"], otherwise we will get scrambled text.

The most helpful thing I found was posted by Vasek Šulc: https://discuss.elastic.co/t/filebeat-on-demand-loading-compressed-gz-files-from-stdin/184159, which is to use the stdin configuration in filebeat and redirect or pipe the decompressed file contents in using zcat.

@jlind23 removed the Team:Integrations and Team:Services (Deprecated) labels Mar 30, 2022
@botelastic bot added the needs_team label Mar 30, 2022
@botelastic bot commented Mar 30, 2022

This issue doesn't have a Team:<team> label.

@jlind23 added the Team:Elastic-Agent-Data-Plane label and removed the needs_team label Mar 30, 2022
@elasticmachine (Collaborator)

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@jlind23 (Collaborator) commented Mar 30, 2022

@nimarezainia From a pure product perspective, how does this issue fit into our current roadmap and strategy?

@nerophon

Competitors have this feature, e.g. Splunk forwarders and also Grafana Loki: grafana/loki#5956.

@nimarezainia (Contributor)

This is tracked in the enhancements repo; however, it is not slated for development due to other higher-priority items on the list.

@fzyzcjy commented Mar 18, 2023

Hi, are there any updates? Thanks

@nimarezainia (Contributor)

> Hi, are there any updates? Thanks

@fzyzcjy right now we don't have a development date for this.

@mathemaphysics

It would be fabulous to have this feature. It would allow reading log files directly without decompressing the file hierarchy. Text compression is generally very effective, so this would be in line with preserving disk space.

@matschaffer-roblox commented Dec 10, 2023

@jlind23 @nimarezainia in case it's helpful, here's how I found this issue today:

We use a lot of AWS EMR, where Amazon handles ingest of some logs (usually enough) but not all. Last weekend we hit a problem with one of the services in the mix that wasn't ingested. We started a new EMR deployment, scp'd most of the logs (some raw, some gz) from the old deployment, and terminated it. Now I'd like to get the old logs ingested for analysis.

ML's Data Visualizer does a great job on the raw files and even spits out a filebeat config for me. It'd be impressive if filebeat just worked for the .gz'd logs as well so I could slurp up everything I scp'd.

Tangentially, having a filebeat option for https://www.elastic.co/guide/en/elasticsearch/reference/current/docker.html would also make this task easier.

Being able to quickly ingest and analyze a handful of local logs could be a nice on-ramp to continual ingestion as well, provided all the artifacts (filebeat config, ingest pipeline) were easily portable to something like a cloud deployment.

@botelastic bot commented Dec 9, 2024

Hi!
We just realized that we haven't looked into this issue in a while. We're sorry!

We're labeling this issue as Stale to make it hit our filters and make sure we get back to it as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough :+1:.
Thank you for your contribution!

@botelastic bot added the Stalled label Dec 9, 2024
@philhagen

This would be an immensely helpful feature. 👍
