Add a new input type to backfill gzipped logs #637
I would see the implementation as follows:
This would be nice to have, but I think it is not at the top of our priority list. A community contribution here would be more than welcome. For your second issue about running filebeat only until first completion, let's refer to this issue: https://github.com/elastic/filebeat/issues/219
Thanks for the fast reply and the pointer to the issue. I will try to look at your implementation proposal. I am still quite new to Golang and the project, so I can't promise anything :)
@lminaudier Always here to help.
This would be a great feature addition. Currently the splunk-forwarder does something similar and will index log-rotated files that have been gzipped automatically.
+1. Is anyone working on this? If not, I could possibly take it up.
@Ragsboss I don't think anyone is working on that. Would be great to have a PR for that to discuss the details.
@cFire Please feel free to take this up. I've started looking at the code, familiarizing myself with Go in general and the Filebeat code, but I haven't started the real work yet. I'll try to help you in any way you want, just let me know.
From an implementation point of view, I think we should go the route of having a separate input type. Currently filebeat is designed so that a prospector and harvester type are tightly coupled, so a prospector only starts one type of harvester. It is ok if the gzip harvester reuses lots of code from the log harvester (which I think it will), but tailing a log and reading a file completely only once are, from my perspective, two quite different behaviours. The question that will also be raised is whether the gzip files will change their name over time (meaning they have to be tracked based on inode / device) or whether it is enough to just store the filename and a read / unread flag in the registry.
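As a rough sketch of the second option mentioned above (tracking a gzip file by filename plus a read / unread flag instead of inode / device), the snippet below shows what such a registry check could look like; all type and function names are invented for illustration and are not actual filebeat code:

```go
package main

import "fmt"

// gzipState is an illustrative registry record keyed by filename; no byte
// offset is stored, only whether the file has been read completely.
type gzipState struct {
	Filename string // path of the .gz file
	Done     bool   // true once the file has been read to EOF
}

// shouldHarvest decides whether a prospector scan needs to start a
// harvester for the given file: only unknown or not-yet-finished files
// are picked up.
func shouldHarvest(registry map[string]gzipState, path string) bool {
	state, known := registry[path]
	return !known || !state.Done
}

func main() {
	registry := map[string]gzipState{
		"/var/log/app.log.1.gz": {Filename: "/var/log/app.log.1.gz", Done: true},
	}
	fmt.Println(shouldHarvest(registry, "/var/log/app.log.1.gz")) // false: already read
	fmt.Println(shouldHarvest(registry, "/var/log/app.log.2.gz")) // true: new file
}
```

The trade-off discussed in this thread is that filenames get reused by log rotation, which is why inode / device tracking keeps coming up as the alternative.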
This is a proposed approach to fix elastic#637. In this approach, we reuse the existing input type 'log' to transparently process gzipped files, so there is no configuration change for filebeat users other than ensuring that the filename patterns match gzipped file names. Filebeat recognizes gzipped files by the extension '.gz'; any other extension will not work. Ruflin noted an alternative approach of introducing a brand new input type to deal with gzipped files. I'm ok with either approach, but reusing the existing input type seemed more intuitive from the filebeat user's viewpoint, and for me it was easier to implement. I hope this change gives Ruflin a better view of how this approach looks. If we still feel a new input type is better, we can certainly go down that path. A few pending things that we can do once we agree this approach is acceptable:
- Test for cases where a regular file gets log-rotated and compressed. In this case the compressed file will have a different inode; today this works assuming filename patterns don't match .gz files, but with this support, .gz files will typically be matched.
- Write new tests.
- Go's compress/gzip package doesn't support seeking, so resuming is not supported. Need to decide if we want to support this or gracefully handle it.
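For illustration, here is a minimal sketch of the "reuse the log input" idea described in this comment: pick a plain or gzip reader based on the '.gz' extension and hand the rest of the pipeline an ordinary reader. The helper names are hypothetical and this is not the actual PR code:

```go
package main

import (
	"bufio"
	"compress/gzip"
	"fmt"
	"io"
	"os"
	"strings"
)

// gzipFile closes both the gzip stream and the underlying file.
type gzipFile struct {
	*gzip.Reader
	file *os.File
}

func (g *gzipFile) Close() error {
	g.Reader.Close()
	return g.file.Close()
}

// openLogReader returns decompressed bytes for .gz files and raw bytes
// for everything else, so the downstream line reader stays unchanged.
func openLogReader(path string) (io.ReadCloser, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	if !strings.HasSuffix(path, ".gz") {
		return f, nil
	}
	gz, err := gzip.NewReader(f)
	if err != nil {
		f.Close()
		return nil, err
	}
	// compress/gzip cannot Seek, so resuming at a stored offset is not
	// possible; a .gz file is always read from the beginning.
	return &gzipFile{Reader: gz, file: f}, nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: reader <logfile[.gz]>")
		os.Exit(1)
	}
	r, err := openLogReader(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer r.Close()
	scanner := bufio.NewScanner(r)
	for scanner.Scan() {
		fmt.Println(scanner.Text()) // each line would become one event
	}
}
```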
…ality in filebeat elastic#637 The test has two gzip files: one that covers the case where the gzip size is greater than the raw file size, and another where the gzip file is smaller than the raw file size. This test is failing, as nasa-360.log.gz is producing duplicated events. The fix for this is coming in the next commit.
I'm not happy about this change, but I'm pushing it temporarily as I'm switching jobs and hence my laptops. The problem is that when the gzip file is complete and the next scan finds the state, the offset in the state is greater than the size of the gzip file, so we treat it as a truncated file and reindex from the beginning. The fix is to add a new ActualSize method to the LogSource interface and have GZipFile return infinity as the actual size. This way at least the scan will try to seek and fail for gzipped files, thus not doing anything. I'm thinking a better fix is to modify the state structure to store the fact that we reached EOF on a non-Continuable source, so the prospector can be smart about skipping such files. elastic#637
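A rough sketch of the ActualSize idea described in this comment, with stand-in types rather than the real filebeat interfaces:

```go
package main

import (
	"fmt"
	"math"
)

// LogSource is a stand-in for the interface mentioned above; the real
// filebeat types differ, this only illustrates the idea.
type LogSource interface {
	ActualSize() int64 // size used by the "was the file truncated?" check
	Continuable() bool // whether reading can resume at a stored offset
}

type PlainFile struct{ size int64 }

func (p PlainFile) ActualSize() int64 { return p.size }
func (p PlainFile) Continuable() bool { return true }

type GZipFile struct{}

// A gzip stream cannot be seeked, so report an effectively infinite size
// to keep the truncation heuristic from resetting the offset to zero.
func (GZipFile) ActualSize() int64 { return math.MaxInt64 }
func (GZipFile) Continuable() bool { return false }

// isTruncated reproduces the heuristic: a stored offset beyond the
// current file size means the file was truncated and must be re-read.
func isTruncated(src LogSource, storedOffset int64) bool {
	return storedOffset > src.ActualSize()
}

func main() {
	fmt.Println(isTruncated(PlainFile{size: 100}, 250)) // true: re-read from start
	fmt.Println(isTruncated(GZipFile{}, 250))           // false: leave the state alone
}
```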
Here is the PR related to the above discussions: #2227
We would like filebeat to be able to read gzipped files, too. Our main use of filebeat would be to take a set of rotated, gzipped logs that represent the previous day's events and send them to elasticsearch. No tailing or running as a service is required, so a "batch mode" would also be good, but other workarounds solely using filebeat would also be acceptable.
@willsheppard Thanks for the details. For the batch mode you will be interested in #2456 (in progress).
Now that #2456 is merged, this feature got even more interesting :-)
Hello, has there been any update regarding support for gzip files? Please let me know. Thanks.
Idk about the others, but I've not gotten any time to work on this.
No progress yet on this from my side.
This gzip input filter would be the killer feature for us. We're being forced to consider writing an Elasticsearch ingest script from scratch which writes to the Bulk API, because we need to operate on logs in place (no space to unzip them), and we would be using batch mode (#2456) to ingest yesterday's logs from our web clusters.
Has there been any movement on this? This is the killer feature for me too.
Throwing in my support for this feature.
Would be an awesome feature to have.
There is this open PR here that still needs some work and also involves quite some discussion: #3070
@ruflin I believe inodes may need to be tracked. Example:
^^ Above, 'hello.txt.2.gz' is the same file (inode) as the previous 'hello.txt.1.gz'. We can probably achieve this without tracking inodes (tracking modification time and only reading .gz files after they have been idle for some small time?), but I think the filename alone is not enough because file names are reused, right?
I hit exactly this issue during the implementation. That is why the implementation in #3070 is not fully consistent with the initial theory in this thread. The main difference now to a "normal" file is that a gz file is expected to never change, and if it does change, the complete file is read from the beginning again.
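As an illustration of the "a gz file never changes, and if it does, re-read it from the beginning" behaviour described above, here is a small sketch with invented field and function names (not the actual #3070 code):

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// gzRegistryEntry is an invented registry record for this example; it is
// not the actual state structure used by the PR.
type gzRegistryEntry struct {
	Size    int64
	ModTime time.Time
	Done    bool
}

// needsFullReread returns true when the file on disk no longer matches
// the recorded state, or when it was never fully read; in either case
// the whole .gz file is read again from the start.
func needsFullReread(entry gzRegistryEntry, info os.FileInfo) bool {
	changed := info.Size() != entry.Size || !info.ModTime().Equal(entry.ModTime)
	return changed || !entry.Done
}

func main() {
	info, err := os.Stat("/var/log/app.log.1.gz") // example path
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	entry := gzRegistryEntry{Size: info.Size(), ModTime: info.ModTime(), Done: true}
	fmt.Println(needsFullReread(entry, info)) // false: nothing changed since last read
}
```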
+1 for gzip please
As an alternative, you can use zcat to decompress the files and feed the output to filebeat.
Any update on gzip files? Using zcat solves some issues, but it needs to be done manually all the time.
Pinging @elastic/integrations-services (Team:Services)
+1 to adding this feature... is there any update on the timeline for this being added to Filebeat? Edit: Sorry for the "spam", just saw that comment... : )
+1 |
+1 |
Hello everyone, the feature has not been implemented even though it has been requested by many users. I am not sure I understand why it was not implemented: is it because it is a challenging problem, or because of a lack of interest? From what I understood, there are two difficulties:
So now, we realized that not only can we not process the .gz files, but they should also be excluded via exclude_files: [".gz"], otherwise we will get scrambled text. The most helpful thing I found was posted by Vasek Šulc:
This issue doesn't have a
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)
@nimarezainia From a pure product perspective, how does this issue fit into our actual roadmap and strategy?
Competitors have this feature, e.g. Splunk forwarders, and also Grafana Loki: grafana/loki#5956.
This is tracked in the enhancements repo, however it is not slated for development due to other higher-priority items on the list.
Hi, are there any updates? Thanks
@fzyzcjy Right now we don't have a development date for this.
It would be fabulous to have this feature. Accessing files directly would allow reading of log files without decompressing the file hierarchy. Text compression is generally very good, so this would be in line with preserving disk space.
@jlind23 @nimarezainia In case it's helpful, here's how I found this issue today: We use a lot of AWS EMR, where Amazon handles ingest of some logs (usually enough) but not all. Last weekend we hit a problem with one of the services in the mix that weren't ingested. We started a new EMR deployment, scp'd most of the logs (some raw, some gz) from the old deployment, and terminated it. Now I'd like to get the old logs ingested for analysis. ML's Data Visualizer does a great job on the raw files and even spits out a filebeat config for me. It'd be impressive if filebeat just worked for the .gz'd logs as well, so I could slurp up everything I scp'd. Tangentially, having a filebeat option for https://www.elastic.co/guide/en/elasticsearch/reference/current/docker.html would also make this task easier. Being able to quickly ingest and analyze a handful of local logs could be a nice on-ramp to continual ingestion as well, provided all the artifacts (filebeat config, ingest pipeline) were easily portable to something like a cloud deployment.
Hi! We're labeling this issue as
This would be an immensely helpful feature. 👍
Hi,
Following this discussion on the filebeat forum, I would like to ask if it is possible to implement a solution to easily backfill old gzipped logs with filebeat.
The proposed solution mentioned in the topic is to add a new dedicated input_type. It is also mentioned in the topic that when filebeat reaches the end of input on stdin, it does not hand control back and instead waits for new lines, which makes things hard to script in order to perform backfilling. What are your thoughts on this?
Thanks for your hard work.