
Add a new input type to backfill gzipped logs #637

Open · lminaudier opened this issue Jan 6, 2016 · 120 comments
Labels: enhancement, Filebeat, Team:Elastic-Agent-Data-Plane

Comments

@lminaudier

Hi,

Following this discussion on the filebeat forum, I would like to ask whether it is possible to implement a solution to easily backfill old gzipped logs with filebeat.

The proposed solution mentioned in the topic is to add a new dedicated input_type.

It is also mentioned in the topic that when filebeat reaches the end of input on stdin, it does not exit but waits for new lines, which makes backfilling hard to script.

What are your thoughts on this?

Thanks for your hard work.

@ruflin (Contributor) commented Jan 6, 2016

I would see the implementation as follows:

  • Prospector: A gzip prospector would be added. Harvesters would only be opened based on filenames (no inode etc.). It is assumed that the files are not renamed and don't have to be tracked. A file has a completion state, and once it is completed, it is never read again. This simplifies the implementation. If a .gz file with a new filename is found, a harvester is started.
  • Harvester: A gzip harvester is added. It unzips and reads the full file only once. After finishing reading the file, the harvester stops and is never started again. The offset is only stored if reading is interrupted in the middle of the file. In a first implementation, this could even be removed for simplicity (a rough sketch of such a harvester follows below).
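
For concreteness, here is a minimal Go sketch of the harvester half of this design: read a .gz file exactly once, line by line, and report completion. This is a hypothetical illustration, not Filebeat code; harvestGzip, its publish callback, and the example path are made-up names.

package main

import (
	"bufio"
	"compress/gzip"
	"fmt"
	"os"
)

// harvestGzip decompresses path and invokes publish for every line.
// It returns nil only when the whole file was read; at that point the
// caller would mark the file as completed and never harvest it again.
func harvestGzip(path string, publish func(line string)) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	gz, err := gzip.NewReader(f)
	if err != nil {
		return err
	}
	defer gz.Close()

	scanner := bufio.NewScanner(gz)
	for scanner.Scan() {
		publish(scanner.Text())
	}
	return scanner.Err()
}

func main() {
	err := harvestGzip("/var/log/app.log.1.gz", func(line string) {
		fmt.Println(line) // stand-in for forwarding the event to the output
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, "harvest failed:", err)
	}
}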

This would be nice to have, but I think it is not at the top of our priority list. A community contribution here would be more than welcome.

For your second issue, about running filebeat only until first completion, let's refer to this issue: https://github.com/elastic/filebeat/issues/219

@lminaudier (Author)

Thanks for the fast reply and the pointer to the issue.

I will try to look at your implementation proposal. I am still quite new to Golang and the project, so I can't promise anything :)

@ruflin (Contributor) commented Jan 6, 2016

@lminaudier Always here to help.

@monicasarbu added the Filebeat label Jan 6, 2016
@mryanb commented Jan 21, 2016

This would be a great feature addition. Currently the Splunk forwarder does something similar and will automatically index log-rotated files that have been gzipped.

@Ragsboss commented Aug 1, 2016

+1. Is anyone working on this? If not I could possibly take it up..

@ruflin (Contributor) commented Aug 2, 2016

@Ragsboss I don't think anyone is working on that. Would be great to have a PR for that to discuss the details.

@cFire commented Aug 5, 2016

@Ragsboss @ruflin I'm delighted to see there's someone looking to pick this up. Is this happening? The reason I ask is that I may be able to spend some time helping out with this in lieu of building another solution for gzipped logs to use internally.

@Ragsboss commented Aug 5, 2016

@cFire please feel free to take this up. I've started looking at the code, familiarizing myself with Go in general and the Filebeat code, but I haven't started the real work yet. I'll try to help you in any way you want, just let me know.

A few thoughts I had. From a pure functional viewpoint, defining a new input_type doesn't seem ideal, as that would force users to author a new prospector in the config file. Instead, I felt it may be better for the code to automatically deal with compressed files as long as they match the given filename patterns in the config file. The code could instantiate a different harvester (IIUC) based on the file extension/type (see the sketch below). But from an implementation viewpoint, if this is turning out to be difficult, I think it's OK to ask users for some extra config...
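
As a rough illustration of that extension-based dispatch (hedged: openLogReader is a made-up helper, not Filebeat's actual API), the prospector could pick a plain or decompressing reader in Go like this:

package harvester

import (
	"compress/gzip"
	"io"
	"os"
	"path/filepath"
)

// openLogReader opens path and, when the extension is ".gz", wraps it
// in a gzip decompressor so callers read plain text either way. The
// returned cleanup function closes whatever was opened.
func openLogReader(path string) (io.Reader, func() error, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, nil, err
	}
	if filepath.Ext(path) == ".gz" {
		gz, err := gzip.NewReader(f)
		if err != nil {
			f.Close()
			return nil, nil, err
		}
		return gz, func() error {
			gz.Close()
			return f.Close()
		}, nil
	}
	return f, f.Close, nil
}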

@ruflin (Contributor) commented Aug 8, 2016

From an implementation point of view, I think we should go the route of having a separate input type. Currently filebeat is designed so that a prospector and harvester type are tightly coupled: a prospector only starts one type of harvester. It is OK if the gzip harvester reuses lots of code from the log harvester (which I think it will), but tailing a log and reading a file completely exactly once are, from my perspective, two quite different behaviours. The question which will also be raised is whether the gzip files change their name over time (meaning they have to be tracked based on inode/device) or whether it is enough to just store the filename and a read/unread flag in the registry.
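
A minimal in-memory sketch of that "filename plus read/unread flag" registry, in Go (hypothetical names; real Filebeat state handling would also persist to disk):

package registry

import "sync"

// gzipRegistry tracks which .gz files have already been read to EOF.
type gzipRegistry struct {
	mu   sync.Mutex
	done map[string]bool // filename -> fully read?
}

func newGzipRegistry() *gzipRegistry {
	return &gzipRegistry{done: make(map[string]bool)}
}

// ShouldHarvest reports whether the named file still needs reading.
func (r *gzipRegistry) ShouldHarvest(name string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	return !r.done[name]
}

// MarkComplete records that the file was read to EOF once.
func (r *gzipRegistry) MarkComplete(name string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.done[name] = true
}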

Ragsboss pushed a commit to Ragsboss/beats that referenced this issue Aug 10, 2016
This is a proposed approach to fix elastic#637. In this approach, we reuse the existing input type 'log' to transparently process gzipped files, so there is no configuration change for filebeat users except ensuring that the filename patterns match gzipped file names. Filebeat recognizes gzipped files by the '.gz' extension; any other extension will not work.

Ruflin noted an alternative approach of introducing a brand new input type to deal with gzipped files. I'm OK with either approach, but reusing the existing input type seemed more intuitive from the filebeat users' viewpoint, and for me it was easier to implement. I hope this change gives Ruflin a better view of what this approach looks like. If we still feel a new input type is better, we can certainly go down that path.

A few pending things that we can do once we agree this approach is acceptable:
- Test cases where a regular file gets log-rotated and compressed. The compressed file will have a different inode; today this works because filename patterns don't match .gz files, but with this support, .gz files will typically be matched.
- Write new tests.
- Go's compress/gzip package doesn't support seeking, so resuming is not supported. We need to decide whether to support this or handle it gracefully (a sketch of the decompress-and-discard workaround follows).
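
Because gzip streams cannot be seeked, the only way to "resume" at a stored offset is to decompress from the start and throw the already-shipped bytes away. A hedged Go sketch (resumeGzip is a made-up name; offset is assumed to count uncompressed bytes):

package harvester

import (
	"compress/gzip"
	"io"
	"os"
)

// resumeGzip reopens path and discards the first offset uncompressed
// bytes, returning a reader positioned where the previous run stopped.
func resumeGzip(path string, offset int64) (io.Reader, func() error, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, nil, err
	}
	gz, err := gzip.NewReader(f)
	if err != nil {
		f.Close()
		return nil, nil, err
	}
	// No Seek support: re-decompress and discard what was already read.
	if _, err := io.CopyN(io.Discard, gz, offset); err != nil {
		f.Close()
		return nil, nil, err
	}
	return gz, func() error {
		gz.Close()
		return f.Close()
	}, nil
}

(io.Discard is the modern spelling; on the Go versions of that era it was ioutil.Discard.)
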
Ragsboss pushed a commit to Ragsboss/beats that referenced this issue Aug 11, 2016
…ality in filebeat

elastic#637

The test has two gzip files: one that covers the case where the gzip size > raw file size, and another that covers gzip size < raw file size.

This test is failing, as the nasa-360.log.gz is producing duplicated events. A fix for this is coming in the next commit.
Ragsboss pushed a commit to Ragsboss/beats that referenced this issue Aug 11, 2016
I'm not happy about this change, but I'm pushing it temporarily as I'm switching jobs and hence laptops. The problem is that when the gzip file is complete and the next scan finds the state, the offset in the state is greater than the size of the gzip file, so we treat it as a truncated file and reindex from the beginning. The fix here is to add a new ActualSize method to the LogSource interface and have GZipFile return infinity as the actual size. That way, at least the scan will try to seek and fail for gzipped files, thus not doing anything.

I'm thinking a better fix is to modify the state structure to store the fact that we reached EOF on a non-Continuable source, so the prospector can be smart about skipping such files (see the sketch below).

elastic#637
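
A hedged Go sketch of that proposed state change (fileState and needsReread are illustrative names, not the real registry types): once a one-shot source records that EOF was reached, the scan can skip it instead of misreading "offset greater than file size" as truncation.

package prospector

// fileState is a simplified stand-in for a registry entry.
type fileState struct {
	Source   string
	Offset   int64
	Finished bool // EOF reached on a non-continuable (gzip) source
}

// needsReread decides whether the scanner should read the file again.
func needsReread(s fileState, sizeOnDisk int64) bool {
	if s.Finished {
		return false // one-shot source already fully consumed
	}
	// For plain log files, an offset beyond the current size means the
	// file was truncated and must be read again from the beginning.
	return s.Offset > sizeOnDisk
}
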
@ruflin (Contributor) commented Aug 16, 2016

Here is the PR related to the above discussions: #2227

@willsheppard commented Sep 5, 2016

We would like filebeat to be able to read gzipped files, too.

Our main use of filebeat would be to take a set of rotated, gzipped logs that represent the previous day's events, and send them to elasticsearch.

No tailing or running as a service is required, so a "batch mode" would also be good, but other workarounds solely using filebeat would also be acceptable.

@ruflin (Contributor) commented Sep 6, 2016

@willsheppard Thanks for the details. For the batch mode you will be interested in #2456 (in progress)

@ruflin (Contributor) commented Sep 16, 2016

Now that #2456 is merged this feature got even more interesting :-)

@collabccie7

Hello,

Has there been any update regarding support for gzip files? Please let me know.

Thanks.

@cFire commented Oct 18, 2016

Idk about the others, but I've not gotten any time to work on this.

@ruflin (Contributor) commented Oct 24, 2016

No progress yet on this from my side.

@willsheppard

This gzip input filter would be the killer feature for us. We're being forced to consider writing an Elasticsearch ingest script from scratch which writes to the Bulk API, because we need to operate on logs in-place (no space to unzip them), and we would be using batch-mode (#2456) to ingest yesterday's logs from our web clusters.

@maddazzaofdudley

Has there been any movement on this? This is the killer feature for me too.

@woodchalk

Throwing in my support for this feature.

@plumpNation

Would be an awesome feature to have.

@ruflin (Contributor) commented Jan 30, 2017

There is an open PR that still needs some work and also involves quite some discussion: #3070

@jordansissel (Contributor)

> Harvesters would only be opened based on filenames (no inode etc.).

@ruflin I believe inodes may need to be tracked, because logrotate (assuming this is a target use case) renames files and reuses file names, unless another tracking mechanism is used (when is 'hello.txt.1.gz' a "new file"? See the example below).

Example:

% ls -il /tmp/hello.txt*
103196 -rw-rw-r--. 1 jls jls 12 Jan 24 03:17 /tmp/hello.txt
103131 -rw-rw-r--. 1 jls jls 32 Jan 24 03:16 /tmp/hello.txt.1.gz

% cat test.conf
/tmp/hello.txt {
  rotate 5
  compress
}

% logrotate -s /tmp/example.logrotate -f test.conf

% ls -il /tmp/hello.txt*
103218 -rw-rw-r--. 1 jls jls 32 Jan 24 03:17 /tmp/hello.txt.1.gz
103131 -rw-rw-r--. 1 jls jls 32 Jan 24 03:16 /tmp/hello.txt.2.gz

^^ Above, 'hello.txt.2.gz' is the same file (inode) as previous 'hello.txt.1.gz'.

We can probably achieve this without tracking inodes (tracking modification time and only reading .gz files after they have been idle for some small time?), but I think the filename alone is not enough because file names are reused, right?
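
A small Go sketch of that modification-time heuristic (isIdle is a made-up helper): only harvest a .gz file once it has been left untouched for some grace period, so a file that logrotate is still renaming is left alone.

package prospector

import (
	"os"
	"time"
)

// isIdle reports whether path has not been modified for at least idle,
// which suggests rotation/compression of the file has finished.
func isIdle(path string, idle time.Duration) (bool, error) {
	info, err := os.Stat(path)
	if err != nil {
		return false, err
	}
	return time.Since(info.ModTime()) >= idle, nil
}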

@ruflin (Contributor) commented Jan 31, 2017

I hit exactly this issue during the implementation. That is why the implementation in #3070 is not fully consistent with the initial theory in this thread.

The main difference from a "normal" file is that a .gz file is expected to never change, and if it does change, the complete file is read from the beginning again.

@ITSEC-Hescalona

+1 for gzip please

@notzippy commented May 7, 2020

As an alternative, you can use mkfifo to create a named pipe, point filebeat's stdin at it, and pipe all your files into that pipe. That way you do not need to worry about when you are finished ingesting a file, because you can just start piping the next file in. So my zcats go into the pipe with zcat ... > fb_input, and in another terminal I have filebeat < fb_input.

@bunnHHA commented Jun 19, 2020

Any update on gzip files? Using zcat solves some issues, but it has to be done manually every time.

@jsoriano added the Team:Services (Deprecated) label Nov 5, 2020
@elasticmachine (Collaborator)

Pinging @elastic/integrations-services (Team:Services)

@dwdixon commented Nov 19, 2020

+1 to adding this feature...is there any update on the timeline for this being added to Filebeat?

Edit: Sorry for "spam" just saw that comment... : )

@cancer13

+1
any update?

@NemchinovSergey commented Nov 6, 2021

+1
parsing gzip would be great!
files with plain text are huge!

@arasic commented Dec 12, 2021

Hello everyone,

The feature has not been implemented, even though it has been requested by many users. The ticket #2227 was closed because there was no progress for some time. Some progress has been made, but I don't know what's remaining.

I'm not sure whether the feature wasn't implemented because it's a challenging problem or because of a lack of interest.

From what I understood, there are two difficulties:

  • it is difficult to track/keep the offset of what has already been processed
  • the files can be larger when extracted, like 50 GB and more

Did I miss any other challenging question?

So now we've realized that not only can we not process the .gz files, they should also be excluded via exclude_files: [".gz"], otherwise we will get scrambled text.

The most helpful thing I found was posted by Vasek Šulc: https://discuss.elastic.co/t/filebeat-on-demand-loading-compressed-gz-files-from-stdin/184159, which is to use the stdin configuration in filebeat and redirect or pipe the decompressed file contents in using zcat.

@jlind23 removed the Team:Integrations and Team:Services (Deprecated) labels Mar 30, 2022
@botelastic bot added the needs_team label Mar 30, 2022
@botelastic bot commented Mar 30, 2022

This issue doesn't have a Team:<team> label.

@jlind23 added the Team:Elastic-Agent-Data-Plane label and removed the needs_team label Mar 30, 2022
@elasticmachine (Collaborator)

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@jlind23 (Collaborator) commented Mar 30, 2022

@nimarezainia From a pure product perspective, how does this issue fit into our current roadmap and strategy?

@nerophon

Competitors have this feature, e.g. Splunk forwarders and also Grafana Loki: grafana/loki#5956.

@nimarezainia (Contributor)

This is tracked in the enhancements repo; however, it is not slated for development due to other higher-priority items on the list.

@fzyzcjy commented Mar 18, 2023

Hi, are there any updates? Thanks

@nimarezainia (Contributor)

> Hi, are there any updates? Thanks

@fzyzcjy right now we don't have a development date for this.

@mathemaphysics

It would be fabulous to have this feature. It would allow reading log files directly without decompressing the file hierarchy. Text compression is generally very effective, so this would be in line with preserving disk space.

@matschaffer-roblox commented Dec 10, 2023

@jlind23 @nimarezainia in case it's helpful, here's how I found this issue today:

We use a lot of AWS EMR, where Amazon handles ingest of some logs (usually enough) but not all. Last weekend we hit a problem with one of the services in the mix that wasn't ingested. We started a new EMR deployment, scp'd most of the logs (some raw, some gz) from the old deployment, and terminated it. Now I'd like to get the old logs ingested for analysis.

ML's Data Visualizer does a great job on the raw files and even spits out a filebeat config for me. It'd be impressive if filebeat just worked for the .gz'd logs as well so I could slurp up everything I scp'd.

Tangentially, having a filebeat option for https://www.elastic.co/guide/en/elasticsearch/reference/current/docker.html would also make this task easier.

Being able to quickly ingest and analyze a handful of local logs could be a nice on-ramp to continual ingestion as well, provided all the artifacts (filebeat config, ingest pipeline) were easily portable to something like a cloud deployment.

@botelastic bot commented Dec 9, 2024

Hi!
We just realized that we haven't looked into this issue in a while. We're sorry!

We're labeling this issue as Stale to make it hit our filters and make sure we get back to it as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough :+1:.
Thank you for your contribution!

@botelastic bot added the Stalled label Dec 9, 2024
@philhagen

This would be an immensely helpful feature. 👍
