
[637] Add support to index gzipped logs #2227

Closed
wants to merge 3 commits

Conversation

Ragsboss

This is a proposed approach to fix #637. In this approach, we reuse the existing input type 'log' to transparently process gzipped files, so there is no configuration change for filebeat users other than ensuring that the filename patterns match gzipped file names. Filebeat recognizes gzipped files by the '.gz' extension; any other extension will not work.

Ruflin noted an alternative approach of introducing a brand new input type to deal with gzipped files. I'm OK with either approach, but reusing the existing input type seemed more intuitive from the filebeat user's viewpoint, and for me it was easier to implement. I hope this change gives Ruflin a better view of what this approach looks like. If we still feel a new input type is better, we can certainly go down that path.
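For illustration, here is a minimal sketch of the idea, not the actual filebeat FileSource code (newLogReader and gzipSource are hypothetical names): detect the '.gz' suffix and transparently wrap the reader so the rest of the harvester sees plain log bytes.

```go
package logsource

import (
	"compress/gzip"
	"io"
	"os"
	"strings"
)

// gzipSource couples the gzip reader with the underlying file so both
// are closed together when the harvester is done with the file.
type gzipSource struct {
	*gzip.Reader
	file *os.File
}

func (s gzipSource) Close() error {
	s.Reader.Close()
	return s.file.Close()
}

// newLogReader opens a path and, if the name ends in ".gz", wraps it in
// a gzip reader; any other extension is read as a plain log file.
func newLogReader(path string) (io.ReadCloser, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	if !strings.HasSuffix(path, ".gz") {
		return f, nil
	}
	gz, err := gzip.NewReader(f)
	if err != nil {
		f.Close()
		return nil, err
	}
	// Note: gzip readers are sequential only, so seeking to a stored
	// offset (resuming) is not possible for compressed files.
	return gzipSource{Reader: gz, file: f}, nil
}
```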

A few pending things we can do once (and if) we agree this approach is acceptable:

  • Test for cases where a regular file gets log-rotated and compressed. In this case the compressed file will have a different inode. Today this works because filename patterns are assumed not to match .gz files, but with this support, .gz files will typically be matched.
  • Write new tests.
  • Go's compress/gzip package doesn't support seeking, so resuming is not supported. We need to decide whether to support this or handle it gracefully.
  • The tabs/indentation seem to be different. I use Sublime 3 on an iMac and tried using both spaces and tabs, but the whitespace diffs are still showing up...

@elasticsearch-release

Jenkins standing by to test this. If you aren't a maintainer, you can ignore this comment. Someone with commit access, please review this and clear it for Jenkins to run.


@ruflin
Contributor

ruflin commented Aug 11, 2016

Thanks a lot for the contribution. It's very good to have the code as a base for discussion. Overall the implementation seems to be simpler than I expected, which is great.

I would prefer that we move the implementation to a different input_type instead of relying on the file ending. Otherwise I already foresee requests to support other file endings, and I'm sure the support request for zip files is also just around the corner. This way we can separate the implementations and avoid adding complexity to the already too complex log harvester. The only problem at the moment is that the separation in the code is not 100% clean, but I'm working on cleaning it up, and parts of it can also happen after this PR gets merged. For inspiration on how to deal with an additional input type, it's best to have a look at stdin.

One additional point why I would like to have it separate, and probably the most important one, is the one you mentioned above about duplicated data. If someone wants to feed in compressed files, they really have to think about how to prevent duplicate data. Currently I don't see a way we could prevent duplicated data if someone uses /var/log/* as the pattern: as soon as a new .gz file is generated, all files will be resent. We are thinking about adding fingerprint support to detect whether a file changed or is the same. This could potentially mitigate the problem, but it will take some more time until we have that feature. Instead of trying to figure out here how to prevent duplicates, I suggest leaving it to the user for the moment.

Which features should be supported by the .gz prospector/harvester? My suggestion would be:

Supported:

  • Multiline
  • JSON
  • File identifier (in case of renaming)

Not supported:

  • Offset

For the offset, this would mean that either the file is fully read or it is not read at all. The state is only updated when the harvester has finished the file, and the file is never opened again (no size comparison etc.). In case filebeat is stopped in the middle of reading the file, it will start over again after the restart.
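A rough sketch of that all-or-nothing behaviour, with hypothetical helper names rather than filebeat's actual state handling: the state is only persisted once EOF is reached, so an interrupted run leaves no offset behind and the file is simply read again on restart.

```go
package logsource

import (
	"bufio"
	"compress/gzip"
	"os"
)

// readGzipOnce streams a gzip file exactly once. publish sends one log
// line and markDone records the file as fully read. There is no
// intermediate offset: if filebeat stops mid-file, markDone is never
// called and the whole file is read again after the restart.
func readGzipOnce(path string, publish func(line string) error, markDone func(path string)) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	gz, err := gzip.NewReader(f)
	if err != nil {
		return err
	}
	defer gz.Close()

	scanner := bufio.NewScanner(gz)
	for scanner.Scan() {
		if err := publish(scanner.Text()); err != nil {
			return err // state untouched: the file stays "unread"
		}
	}
	if err := scanner.Err(); err != nil {
		return err
	}
	markDone(path) // state is updated only here, after a full read
	return nil
}
```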

Let me know your thoughts on the above.

For your indentation issue: We use the golang standard. To do it automatically, just run make fmt and it will do it for you.

Raghavendra Rachamadugu added 2 commits August 11, 2016 16:43
…ality in filebeat

elastic#637

The test has two gzip files: one covers the case where the gzip size is greater than the raw file size, and the other covers the case where the gzip file is smaller than the raw file size.

This test is failing, as I'm seeing that nasa-360.log.gz produces duplicated events. A fix for this is coming in the next commit.
I'm not happy about this change, but I'm pushing it temporarily as I'm switching jobs and hence my laptops. The problem is that when the gzip file is fully read and the next scan finds the state, the offset in the state is greater than the size of the gzip file, so we treat it as a truncated file and reindex from the beginning. The fix is to add a new ActualSize method to the LogSource interface and have GZipFile return infinity as the actual size. This way, at least the scan will try to seek and fail for gzipped files, thus not doing anything.

I'm thinking a better fix is to modify the state structure to store the fact that we reached EOF on a non-Continuable source, so the prospector can be smart about skipping such files.

elastic#637
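For readers following along, here is a sketch of the direction described in the commit messages above. The names LogSource, GZipFile, Continuable and ActualSize come from this thread, but the bodies are illustrative only, not the committed filebeat code:

```go
package logsource

import (
	"io"
	"math"
)

// LogSource abstracts a harvestable file. ActualSize feeds the
// prospector's truncation check: a file is treated as truncated (and
// re-read from the start) when the stored offset exceeds its size.
type LogSource interface {
	io.ReadCloser
	Continuable() bool // can reading resume at a stored offset?
	ActualSize() int64 // size used by the truncation check
}

// GZipFile stands in for the gzip-backed source (fields omitted here).
type GZipFile struct{ /* underlying *os.File and *gzip.Reader */ }

// Gzip sources cannot seek, so they are never resumed at an offset.
func (f *GZipFile) Continuable() bool { return false }

// Reporting an effectively infinite size means the stored offset (which
// counts uncompressed bytes) never exceeds ActualSize, so a fully read
// gzip file is not mistaken for a truncated one and reindexed.
func (f *GZipFile) ActualSize() int64 { return math.MaxInt64 }
```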
@Ragsboss
Author

Ragsboss commented Aug 12, 2016

@ruflin thank you for your insightful review. I understand your approach of using a separate input type, but my thought process was to explore the possibility of using a single harvester with the differences between file types encapsulated behind a single interface. Fortunately, the code already had such an interface, viz. FileSource. Adding support for newer compression types is achieved by having a new type implement this interface appropriately, so it's modular and extensible.

As we add more configuration, features, etc. to the Harvester/Prospector, a separate input type approach will create two problems: one is config duplication for users, and the other is code maintenance cost for filebeat developers. The Splunk forwarder also works this way; see http://docs.splunk.com/Documentation/Splunk/latest/Data/Monitorfilesanddirectories#How_Splunk_software_monitors_archive_files. No special configuration is needed to index compressed files.

But if you still feel a separate input_type is the way to go, I'll perhaps wait for your refactoring changes to go in before picking this up. I agree with your observation that this is trickier to implement on top of the current state of the code.

For event duplication, we can use the file name blacklist/whitelist approach. See https://answers.splunk.com/answers/50653/log-rotation-for-compressed-files.html.

For resumability (a.k.a. offset), it's OK to start with no support. But whether we duplicate events or skip events is a choice. In general, erring on the side of event duplication seems safer, as that means we are at least giving (albeit extra) visibility. But I'm also not comfortable hard-coding that assumption as valid for all use cases. We could consider exposing a config option like 'errOnEventDuplication' or such.

Good to know about make fmt - will do.

@ruflin
Contributor

ruflin commented Aug 15, 2016

@Ragsboss Thanks for sharing the details above.

The config duplication is definitely an issue that is going to happen, but here I prefer the flexibility of having two different prospectors/harvesters. Otherwise a harvester will have some configuration options that only work on some of the file sources and not all, which I think will complicate the code instead. A potential solution for handling duplicated configuration is go-ucfg, which we use for configuration management and which already supports variables...

Another reason I prefer to have it separate is edge cases. The log input already has an impressive number of edge cases which are rare but still have to be handled, and most of them we only learned about over time. I assume the same will be the case for the gz input. The first implementation is quite simple, but as soon as it hits production, interesting cases will pop up. Having it separate will, I hope, make it much easier to find the issue and fix it.

It's interesting to see the way the Splunk forwarder does it. As you can see there, the behaviour of compressed and non-compressed files is also different (which makes total sense from a technical point of view). As we already have the different input_types, I think we should make use of them. When implementing features, I prefer to make options configurable instead of trying to figure out what the right thing could be. One of the reasons is that over the last year, for example, ignore_older, which looks like a simple configuration option, turned out to be not so simple and had to be split up into 3-4 different configuration options, as the one option carried too many assumptions which were not true in all cases.

For the whitelisting/blacklisting: filebeat already supports exclude_files: https://www.elastic.co/guide/en/beats/filebeat/current/configuration-filebeat-options.html#exclude-files
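For example, a prospector configured roughly like this (the config shape is approximated for the filebeat versions of that era; exclude_files is the relevant option) keeps rotated-and-compressed copies out of the normal log input:

```yaml
filebeat.prospectors:
  - input_type: log
    paths:
      - /var/log/*
    # Skip compressed rotations so their contents are not picked up again
    # as seemingly new files (the .gz copy has a different inode).
    exclude_files: ['\.gz$']
```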

I would still prefer to go with a separate prospector/harvester. I'm more than happy if you take the code as it is and "hack" the new harvester type inside. It should not require too many changes to the above implementation; it's best to copy the way stdin does it. This will not be very clean, but it will afterwards allow me to better abstract out the parts needed. I like to have at least 3 different implementations of something before I start to abstract it out, to make sure the abstraction does the right thing.

For the first implementation I suggest the following: we just assume this harvester is only here to read gz-compressed files once and send them to elasticsearch. In case the file changes, it will be treated like a new file. From there we can start to iterate and add more capabilities. This should help us get a very basic implementation soonish without having to cover all the edge cases.

@ruflin added the "in progress" label and removed the "review" label on Aug 22, 2016
@ruflin
Contributor

ruflin commented Nov 29, 2016

@Ragsboss After thinking more about this, I came up with a third solution, which is a "compressed" param in the config. This way it is extendable to other compressions. I took part of your code and built this PR out of it: #3070. It would be good to hear your feedback on that.

@Ragsboss
Author

@ruflin I'm sorry for not being active, as I switched jobs and the new one is quite packed. I went through your PR and provided my comments. It looks good overall.

ruflin added a commit to ruflin/beats that referenced this pull request Feb 3, 2017
This PR adds support for reading gzip files. As the gzip files are expected to be log files, it is part of the log type. All features of the log harvester type are supported. The main difference is that the files are expected never to be updated; they are only read once and then closed.

The implementation of compression support makes it possible to add other compression formats in the future.

This implementation currently has the following limitations:

* No (correct) offset is reported when reading gzip files. This has the consequences listed below.
* In case a gzip file changes its size, the full file will be read again from the beginning.
* If the reading of a gzip file is interrupted, the state will be set to the file size when the harvester closes, and the data will not be sent again.
* Filebeat cannot detect whether content now inside a gzip file was previously read as a normal file, as the inode of the file changes.

This is an alternative implementation of elastic#2227
@monicasarbu added the "help wanted" and "module" labels on Nov 13, 2017
@ruflin removed the "module" label on Aug 27, 2018
@fer-marino

Hello? What is the status of this?

@elasticmachine
Collaborator

Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually?


@kaiyan-sheng
Contributor

Hello @Ragsboss, it seems there has been no progress on this PR for more than 6 months. I will close this PR for now. Thanks!

Labels
Filebeat, help wanted, in progress
Development

Successfully merging this pull request may close these issues.

Add a new input type to backfill gzipped logs
7 participants