[637] Add support to index gzipped logs #2227
Conversation
This is a proposed approach to fix elastic#637. In this approach, we reuse the existing input type 'log' to transparently process gzipped files, so there is no configuration change for filebeat users except ensuring that their filename patterns match gzipped file names. Filebeat recognizes gzipped files by the '.gz' extension; any other extension will not work.

Ruflin noted an alternative approach of introducing a brand new input type to deal with gzipped files. I'm ok with either approach, but reusing the existing input type seemed more intuitive from the filebeat user's viewpoint, and for me it was easier to implement. I hope this change gives Ruflin a better view of what this approach looks like. If we still feel a new input type is better, we can certainly go down that path.

A few pending things that we can do once we agree this approach is acceptable:

- Test cases where a regular file gets log-rotated and compressed. The compressed file will have a different inode; today this works assuming filename patterns don't match .gz files, but with this support, .gz files will typically be matched.
- Write new tests.
- Go's compress/gzip package doesn't support seeking, so resuming is not supported. We need to decide whether to support this or handle it gracefully.
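For readers following along, here is a minimal standalone Go sketch of the idea described above (it is not code from this PR's diff): pick a plain or gzip-wrapped reader based on the '.gz' extension, so the rest of the harvester can read lines the same way in both cases. The `openLogReader` name is purely illustrative.

```go
package main

import (
	"bufio"
	"compress/gzip"
	"fmt"
	"io"
	"log"
	"os"
	"strings"
)

// openLogReader returns a line reader for path, transparently decompressing
// when the file name ends in ".gz". Note that compress/gzip only exposes a
// streaming io.Reader, so there is no way to seek to a saved offset in the
// compressed file.
func openLogReader(path string) (io.ReadCloser, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	if !strings.HasSuffix(path, ".gz") {
		return f, nil
	}
	gz, err := gzip.NewReader(f)
	if err != nil {
		f.Close()
		return nil, err
	}
	// Closing the returned value closes the underlying file.
	return struct {
		io.Reader
		io.Closer
	}{gz, f}, nil
}

func main() {
	if len(os.Args) < 2 {
		log.Fatal("usage: reader <logfile[.gz]>")
	}
	r, err := openLogReader(os.Args[1])
	if err != nil {
		log.Fatal(err)
	}
	defer r.Close()

	scanner := bufio.NewScanner(r)
	for scanner.Scan() {
		fmt.Println(scanner.Text()) // one decompressed log line per iteration
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
}
```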
Jenkins standing by to test this. If you aren't a maintainer, you can ignore this comment. Someone with commit access, please review this and clear it for Jenkins to run.
Thanks a lot for the contribution. It's very good to have the code base for discussion. Overall the implementation seems simpler than I expected, which is great.

I would prefer that we modify the implementation to use a different input_type instead of relying on the file ending. Otherwise I already see the request to support other file endings, and I'm sure the support request for zip files is also only around the corner. Like this we can separate the implementations and not add complexity to the already too complex log harvester. The only problem at the moment is that the separation in the code is currently not 100% clean, but I'm working on cleaning this up, and parts of it can also happen after this PR gets merged. For inspiration on how to deal with an additional input type, it's best to have a look at the stdin input type.

One additional point I would like to treat separately, and probably the most important one, is the one you mentioned above about duplicated data. If someone wants to feed in the compressed files, they should really have to think about how to prevent duplicate data. Currently I don't see an option for how we could prevent duplicated data in that case.

Which features should be supported for gzip files?

Supported:
Not supported:
For the offset this would mean: either the file is fully read or it is not read at all. The state is only updated when the harvester has finished the file. It will never be opened again (no size comparison etc.). In case filebeat is stopped in the middle of reading the file, it will start over again after the restart.

Let me know your thoughts on the above.

For your indentation issue: we use the golang standard. To do it automatically, just run `gofmt`.
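A small illustration of the "fully read or not read at all" semantics proposed above, as a hedged sketch rather than filebeat's actual code: state is only recorded once the whole gzip file has been read, so an interrupted run simply starts over. The `fileState` and `harvestOnce` names are assumptions for the example.

```go
package main

import (
	"bufio"
	"compress/gzip"
	"fmt"
	"os"
)

// fileState mirrors the idea described above: for a gzip file the only
// meaningful states are "not read" and "fully read". Field names are
// illustrative, not filebeat's actual state structure.
type fileState struct {
	Path     string
	Finished bool // set only after the whole file was read and shipped
}

// harvestOnce reads the gzip file from the beginning and returns a finished
// state only on success. If reading is interrupted, no state is recorded,
// so the next run starts over from the beginning.
func harvestOnce(path string, publish func(line string)) (fileState, error) {
	f, err := os.Open(path)
	if err != nil {
		return fileState{}, err
	}
	defer f.Close()

	gz, err := gzip.NewReader(f)
	if err != nil {
		return fileState{}, err
	}
	scanner := bufio.NewScanner(gz)
	for scanner.Scan() {
		publish(scanner.Text())
	}
	if err := scanner.Err(); err != nil {
		return fileState{}, err // interrupted: caller keeps the old (unfinished) state
	}
	return fileState{Path: path, Finished: true}, nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: harvest <logfile.gz>")
		os.Exit(1)
	}
	state, err := harvestOnce(os.Args[1], func(line string) { fmt.Println(line) })
	if err != nil {
		fmt.Fprintln(os.Stderr, "read failed, will retry from start next run:", err)
		os.Exit(1)
	}
	fmt.Fprintf(os.Stderr, "finished %s, it will not be opened again\n", state.Path)
}
```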
…ality in filebeat elastic#637

The test has two gzip files: one where the gzip file is larger than the raw file and another where it is smaller. This test is failing because nasa-360.log.gz is producing duplicated events. The fix for this is coming in the next commit.
I'm not happy about this change, but I'm pushing it temporarily as I'm switching jobs and hence my laptops. The problem is that when the gzip file is completely read and the next scan finds the state, the offset in the state is greater than the size of the gzip file, and we treat it as a truncated file and reindex from the beginning. The fix is to add a new ActualSize method to the LogSource interface and have GZipFile return infinity as the actual size. This way at least the scan will try to seek and fail for gzipped files, thus not doing anything. I'm thinking a better fix is to modify the state structure to store the fact that we reached EOF on a non-Continuable source, so the prospector can be smart about skipping such files. elastic#637
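To make the fix described in this commit message easier to picture, here is a rough sketch of what a `LogSource` abstraction with an `ActualSize` method could look like. Only the interface name and the idea of GZipFile returning "infinity" come from the commit message; the exact method set and types are illustrative, not filebeat's real definitions.

```go
package source

import (
	"io"
	"math"
	"os"
)

// LogSource is named after the interface mentioned above; the methods shown
// here are illustrative, not the exact filebeat definition.
type LogSource interface {
	io.ReadCloser
	Continuable() bool // can reading resume at a saved offset?
	ActualSize() int64 // size to compare the saved offset against
}

// PlainFile reports its real on-disk size, so a saved offset larger than the
// size indicates truncation and triggers re-reading from the start.
type PlainFile struct{ *os.File }

func (p PlainFile) Continuable() bool { return true }
func (p PlainFile) ActualSize() int64 {
	info, err := p.Stat()
	if err != nil {
		return -1
	}
	return info.Size()
}

// GZipFile reports an effectively infinite size so that a saved offset
// (which counts decompressed bytes and is usually larger than the
// compressed file) is never mistaken for a truncated file.
type GZipFile struct{ io.ReadCloser }

func (g GZipFile) Continuable() bool { return false }
func (g GZipFile) ActualSize() int64 { return math.MaxInt64 }

// Compile-time checks that both sources satisfy the interface.
var (
	_ LogSource = PlainFile{}
	_ LogSource = GZipFile{}
)
```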
@ruflin thank you for your insightful review. I understand your approach of using a separate input type. But my thought process was to explore the possibility of using a single harvester, with the differences between file types encapsulated behind a single interface. Fortunately the code already had such an interface, LogSource. As we add more configuration, features etc. to the Harvester/Prospector, a separate input type approach will have two problems: one is config duplication for users and the other is code maintenance cost for filebeat developers.

The Splunk forwarder also works this way, see http://docs.splunk.com/Documentation/Splunk/latest/Data/Monitorfilesanddirectories#How_Splunk_software_monitors_archive_files. No special configuration is needed to index compressed files. But if you still feel a separate input_type is the way to go, I'll perhaps wait for your refactoring changes to go in before picking this up. I agree with your observation that this is trickier to implement on top of the current state of the code.

For event duplication, we can use the file name blacklist/whitelist approach, see https://answers.splunk.com/answers/50653/log-rotation-for-compressed-files.html.

For resumability (aka offset), it's ok to start with no support. But whether we duplicate events or skip events is a choice. In general, erring on the side of event duplication seems safer, as that means we are at least giving (albeit extra) visibility. But I'm also not comfortable making that a hard-coded assumption valid for all use cases. We could consider exposing a config option like 'errOnEventDuplication' or such.

Good to know about `gofmt`.
@Ragsboss Thanks for sharing the details above. The config duplication is definitely an issue that is going to happen, but I prefer here the flexibility of having two different prospectors / harvesters. Otherwise a harvester will have some configuration options that only work on some of the file sources and not all, which I think will complicate the code instead. A potential solution to handle duplicated configuration is go-ucfg, which we use for configuration management and which already supports variables.

Another reason I prefer to have it separate is edge cases. The log input already has an impressive number of edge cases which are rare but still have to be handled, and most of them we only learned about over time. I assume the same will be the case for the gz input. The first implementation is quite simple, but as soon as it hits production, interesting cases will pop up. Having it separate will make it much easier to find and fix the issue, I hope.

It's interesting to see the way the Splunk forwarder is doing it. As you can see there, the behaviour of compressed and non-compressed files is also different (which makes total sense from a technical point of view), and we already have the different input types.

For the white / blacklisting: filebeat already supports exclude_files: https://www.elastic.co/guide/en/beats/filebeat/current/configuration-filebeat-options.html#exclude-files

I would still prefer to go with a separate prospector / harvester. I'm more than happy if you take the code as it is and "hack" the new harvester type inside. It should not require too many changes to the above implementation; best is to copy the way stdin does it. This will not be very clean, but will afterwards allow me to better abstract out the parts needed. I like to have at least 3 different implementations of something before I start to abstract it out, to make sure the abstraction does the right thing.

For the first implementation I suggest the following: we just assume this harvester is only here to read .gz files.
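For the duplicate-data concern, the actual mechanism would be the `exclude_files` setting linked above (a list of regular expressions in the filebeat config); the tiny Go snippet below only illustrates the underlying path-filtering idea, e.g. keeping `.gz` files out of the plain log prospector so a dedicated gzip prospector could claim them.

```go
package main

import (
	"fmt"
	"regexp"
)

func main() {
	// Illustrative only: drop matching paths before harvesting, so ".gz"
	// files are left to a separate gzip prospector and are not also read
	// (and duplicated) by the plain log prospector.
	excludeGz := regexp.MustCompile(`\.gz$`)

	candidates := []string{"/var/log/app.log", "/var/log/app.log.1.gz"}
	for _, path := range candidates {
		if excludeGz.MatchString(path) {
			fmt.Println("excluded:", path)
			continue
		}
		fmt.Println("harvest:", path)
	}
}
```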
@ruflin I'm sorry for not being active, as I have switched jobs and the new one is quite packed. I went through your PR and provided my comments. It looks good overall.
This PR adds support for reading gzip files. As the gzip files are expected to be log files, it is part of the log type. All features of the log harvester type are supported. The main difference is that the files are expected to never be updated, and they are only read once and then closed. The compression implementation allows other compressed formats to be added in the future.

This implementation currently has the following limitations:

- No (correct) offset is reported for reading gzip files. This has the consequences below.
- In case a gzip file changes its size, the full file will be read again from the beginning.
- If the reading of a gzip file is interrupted, the state will be set to the file size on closing the harvester and the data will not be sent again.
- Filebeat cannot detect if content now inside the gzip file was read before as a normal file, as the inode of the file changes.

This is an alternative implementation of elastic#2227
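To make the interplay of the limitations above concrete, here is a hypothetical sketch (not code from the PR) of what recording state on close could look like when no real offset is available: the recorded offset is just the compressed file's size, so the file is left alone until its size changes, at which point it is read again in full. The function name is illustrative.

```go
package main

import (
	"fmt"
	"os"
)

// offsetOnClose sketches the behaviour listed above: since no correct offset
// can be tracked while decompressing, the state recorded when the harvester
// closes is simply the compressed file's on-disk size. The file is then only
// picked up again if that size changes, and in that case it is read again in
// full from the beginning.
func offsetOnClose(path string) (int64, error) {
	info, err := os.Stat(path)
	if err != nil {
		return 0, err
	}
	return info.Size(), nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: state <logfile.gz>")
		os.Exit(1)
	}
	offset, err := offsetOnClose(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("state offset recorded on close: %d (compressed size)\n", offset)
}
```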
Hello? What is the status of this?
Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually?
Hello @Ragsboss, it seems there has been no progress on this PR for more than 6 months. I will close this PR for now. Thanks!