Accept multiple ingest pipelines in Filebeat #8914
There's still one TODO item left for this PR related to documentation but I think it's ready for a code review. Thanks folks!
CHANGELOG-developer.asciidoc
@@ -64,3 +64,4 @@ The list below covers the major changes between 6.3.0 and master only.
- Allow to disable config resolver using the `Settings.DisableConfigResolver` field when initializing libbeat. {pull}8769[8769]
- Add `mage.AddPlatforms` to allow to specify dependent platforms when building a beat. {pull}8889[8889]
- Add `cfgwarn.CheckRemoved6xSetting(s)` to display a warning for options removed in 7.0. {pull}8909[8909]
- Filesets can now define multiple ingest pipelines, with the first one considered as the root pipeline. {pull}8914[8914]
Is “root” synonymous with entrypoint in this context?
Yes, and I'm happy to change the terminology to whatever is clearest to most folks.
I like "entry point" better because "root" suggests that there is always a hierarchy. I'm not sure if that's true. Is it possible that devs might want to specify multiple pipelines as a way of breaking their ingest pipeline configs into smaller pipelines that encapsulate specific processing tasks? Also, can other pipelines besides the first one delegate processing? If so, I would avoid using "root".
jenkins, test this
@dedemorton I've added docs for this feature in df2488a27e171a8f10d8fac6766ae7a352dbf1a8 but I'm not sure about my language and structure/organization. I'd love for you to take a look, if you have some time. Thanks!
jenkins, test this
What happens if a user uses a module with multiple pipelines against an older version of Elasticsearch?
I tested this PR with Elasticsearch 6.4.0, where neither the There are two related parts to this PR. One worked with ES 6.4.0, but the other did not. Specifically:
Note that the pipeline loading (within a fileset) is short-circuiting. For example, imagine that a module developer has specified 3 pipelines in the list under I can see four options on how to proceed:
My personal preference would be option 1. It is essentially like option 4 but doesn't give up on the remaining pipelines either, so if multiple pipelines have different errors, at least the user can find out about them all at once. Option 2 could lead to an inconsistent state (if any of the |
I would not go with option 1, because it pollutes the ES instance of users with possibly unused pipelines. I prefer the solutions with rollbacks.
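For readers following the rollback idea mentioned above, here is a minimal Go sketch of "load all pipelines or none". It is not taken from this PR: the `pipelineStore` interface, `loadAllOrNothing`, and the in-memory `memStore` fake are hypothetical names introduced only for illustration, standing in for whatever client calls create and delete ingest pipelines.

```go
package main

import (
	"errors"
	"fmt"
)

// pipelineStore is a hypothetical stand-in for the client calls that
// create and delete ingest pipelines in Elasticsearch.
type pipelineStore interface {
	Put(id string) error
	Delete(id string) error
}

// loadAllOrNothing loads the pipelines in order. If one fails, it deletes
// the pipelines loaded so far (in reverse order) and returns the error.
func loadAllOrNothing(store pipelineStore, ids []string) error {
	var loaded []string
	for _, id := range ids {
		if err := store.Put(id); err != nil {
			for i := len(loaded) - 1; i >= 0; i-- {
				// Best-effort cleanup; a failed delete is ignored in this sketch.
				_ = store.Delete(loaded[i])
			}
			return fmt.Errorf("loading pipeline %q: %w", id, err)
		}
		loaded = append(loaded, id)
	}
	return nil
}

// memStore is an in-memory fake used only to exercise the sketch.
type memStore struct {
	pipelines map[string]bool
	failOn    string
}

func (m *memStore) Put(id string) error {
	if id == m.failOn {
		return errors.New("simulated failure")
	}
	m.pipelines[id] = true
	return nil
}

func (m *memStore) Delete(id string) error {
	delete(m.pipelines, id)
	return nil
}

func main() {
	store := &memStore{pipelines: map[string]bool{}, failOn: "pipeline-json"}
	err := loadAllOrNothing(store, []string{"pipeline-ze-boss", "pipeline-plain", "pipeline-json"})
	// The third pipeline fails, so the first two are removed again.
	fmt.Println(err != nil, len(store.pipelines))
}
```

The trade-off discussed in the thread still applies: a rollback that itself fails can leave pipelines behind, which is one reason pre-validation was also considered.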
Can you elaborate on why you think 3 is too expensive? I would expect checking for the pipelines does not happen too often so I would not be worried about the extra load. I just realised this also touches another problem: What if geo or user_agent are not installed? At the moment I think it does not get loaded as we only have one pipeline but not sure how good the error is. We already have requirements for the pipelines in the manifest, there we could also add requirements for the ES version and use option 3?
It might not be. It would depend on how many multi-pipeline filesets we end up with over time. For every such fileset, we would call an ES API (the simulate ingest pipeline API) for each pipeline in that fileset. As you note, all of this would only happen at startup time. So yeah, maybe compared to the overall runtime of the Filebeat process it's not too expensive, just makes the initial startup time a bit longer.
Given the discussion above, I'm inclined to go with option 3 (pre-validation) now. @kvch, since you took the time to weigh in as well, what are your thoughts about that? Would you still prefer option 2 (rollbacks)?
I am fine with that option. Just keep in mind that simulating multiple pipelines is not yet implemented: elastic/elasticsearch#35495
LGTM
jenkins, test this
@ycombinator I'm good with merging. But in case #9777 gets in before this one, a rebase would be nice.
Updating unit test
#9811) Cherry-pick of PR #8914 to 6.x branch. Original message:

Motivated by #8852 (comment).

Starting with 6.5.0, Elasticsearch Ingest Pipelines have gained the ability to:
- run sub-pipelines via the [`pipeline` processor](https://www.elastic.co/guide/en/elasticsearch/reference/6.5/pipeline-processor.html), and
- conditionally run processors via an [`if` field](https://www.elastic.co/guide/en/elasticsearch/reference/6.5/ingest-processors.html).

These abilities combined present the opportunity for a fileset to ingest the same _logical_ information presented in different formats, e.g. plaintext vs. json versions of the same log files. Imagine an entry point ingest pipeline that detects the format of a log entry and then conditionally delegates further processing of that log entry, depending on the format, to another pipeline.

This PR allows filesets to specify one or more ingest pipelines via the `ingest_pipeline` property in their `manifest.yml`. If more than one ingest pipeline is specified, the first one is taken to be the entry point ingest pipeline.

#### Example with multiple pipelines
```yaml
ingest_pipeline:
  - pipeline-ze-boss.json
  - pipeline-plain.json
  - pipeline-json.json
```

#### Example with a single pipeline
_This is just to show that the existing functionality will continue to work as-is._
```yaml
ingest_pipeline: pipeline.json
```

Now, if the root pipeline wants to delegate processing to another pipeline, it must use a `pipeline` processor to do so. This processor's `name` field will need to reference the other pipeline by its name. To ensure correct referencing, the `name` field must be specified as follows:
```json
{
  "pipeline" : {
    "name": "{< IngestPipeline "pipeline-plain" >}"
  }
}
```
This will ensure that the specified name gets correctly converted to the corresponding name in Elasticsearch, since Filebeat prefixes its "raw" Ingest pipeline names with `filebeat-<version>-<module>-<fileset>-` when loading them into Elasticsearch.
… 6.5 (#10001) Follow up to #8914. In #8914, we introduced the ability for Filebeat filesets to have multiple Ingest pipelines, the first one being the entry point. This feature relies on the Elasticsearch Ingest node having a `pipeline` processor and `if` conditions for processors, both of which were introduced in Elasticsearch 6.5.0. This PR implements a check for whether a fileset has multiple Ingest pipelines AND is talking to an Elasticsearch cluster < 6.5.0. If that's the case, we emit an error.
…nes is being used with ES < 6.5 (#10038) Cherry-pick of PR #10001 to 6.x branch. Original message: Follow up to #8914. In #8914, we introduced the ability for Filebeat filesets to have multiple Ingest pipelines, the first one being the entry point. This feature relies on the Elasticsearch Ingest node having a `pipeline` processor and `if` conditions for processors, both of which were introduced in Elasticsearch 6.5.0. This PR implements a check for whether a fileset has multiple Ingest pipelines AND is talking to an Elasticsearch cluster < 6.5.0. If that's the case, we emit an error.
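The check described in the two commits above can be sketched in Go roughly as follows. This is only an illustration of the stated rule (more than one pipeline AND Elasticsearch older than 6.5.0 yields an error), not the code merged in #10001; `parseMajorMinor` and `checkMultiplePipelines` are hypothetical helper names, and the version parsing is deliberately simplistic.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseMajorMinor extracts the major and minor numbers from a version
// string such as "6.4.0". Everything after the minor number is ignored.
func parseMajorMinor(v string) (int, int, error) {
	parts := strings.SplitN(v, ".", 3)
	if len(parts) < 2 {
		return 0, 0, fmt.Errorf("unexpected version format: %q", v)
	}
	major, err := strconv.Atoi(parts[0])
	if err != nil {
		return 0, 0, fmt.Errorf("invalid major version in %q: %w", v, err)
	}
	minor, err := strconv.Atoi(parts[1])
	if err != nil {
		return 0, 0, fmt.Errorf("invalid minor version in %q: %w", v, err)
	}
	return major, minor, nil
}

// checkMultiplePipelines returns an error when a fileset defines more than
// one ingest pipeline but the Elasticsearch version is older than 6.5.0.
func checkMultiplePipelines(esVersion string, pipelineCount int) error {
	if pipelineCount <= 1 {
		return nil // single-pipeline filesets work on older versions
	}
	major, minor, err := parseMajorMinor(esVersion)
	if err != nil {
		return err
	}
	if major < 6 || (major == 6 && minor < 5) {
		return fmt.Errorf("fileset defines %d ingest pipelines, which requires Elasticsearch >= 6.5.0 (found %s)",
			pipelineCount, esVersion)
	}
	return nil
}

func main() {
	fmt.Println(checkMultiplePipelines("6.4.0", 3) != nil) // true: error expected
	fmt.Println(checkMultiplePipelines("6.5.0", 3) != nil) // false: supported
	fmt.Println(checkMultiplePipelines("6.4.0", 1) != nil) // false: single pipeline
}
```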
Motivated by #8852 (comment).
Starting with 6.5.0, Elasticsearch Ingest Pipelines have gained the ability to:
- run sub-pipelines via the `pipeline` processor, and
- conditionally run processors via an `if` field.

These abilities combined present the opportunity for a fileset to ingest the same logical information presented in different formats, e.g. plaintext vs. json versions of the same log files. Imagine an entry point ingest pipeline that detects the format of a log entry and then conditionally delegates further processing of that log entry, depending on the format, to another pipeline.
This PR allows filesets to specify one or more ingest pipelines via the `ingest_pipeline` property in their `manifest.yml`. If more than one ingest pipeline is specified, the first one is taken to be the entry point ingest pipeline.
Example with multiple pipelines: `ingest_pipeline` is a list containing `pipeline-ze-boss.json`, `pipeline-plain.json`, and `pipeline-json.json`, in that order.
Example with a single pipeline: `ingest_pipeline: pipeline.json`
This is just to show that the existing functionality will continue to work as-is.
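Since `ingest_pipeline` may be either a single file name or a list, whatever reads the manifest has to accept both shapes. The Go sketch below shows one way that normalization could look, assuming the YAML has already been decoded into a generic value; `normalizeIngestPipelines` is a hypothetical name and this is not the code from the PR.

```go
package main

import "fmt"

// normalizeIngestPipelines accepts the decoded value of the ingest_pipeline
// manifest property, which may be a single string or a list of strings, and
// returns a non-empty list. The first element is the entry point pipeline.
func normalizeIngestPipelines(raw interface{}) ([]string, error) {
	switch v := raw.(type) {
	case string:
		// Existing single-pipeline form: ingest_pipeline: pipeline.json
		return []string{v}, nil
	case []interface{}:
		out := make([]string, 0, len(v))
		for _, item := range v {
			s, ok := item.(string)
			if !ok {
				return nil, fmt.Errorf("ingest_pipeline entries must be strings, got %T", item)
			}
			out = append(out, s)
		}
		if len(out) == 0 {
			return nil, fmt.Errorf("ingest_pipeline list must not be empty")
		}
		return out, nil
	default:
		return nil, fmt.Errorf("ingest_pipeline must be a string or a list of strings, got %T", raw)
	}
}

func main() {
	single, _ := normalizeIngestPipelines("pipeline.json")
	fmt.Println(single)

	multi, _ := normalizeIngestPipelines([]interface{}{
		"pipeline-ze-boss.json", "pipeline-plain.json", "pipeline-json.json",
	})
	fmt.Println("entry point:", multi[0])
}
```

Treating the single-string form as a one-element list keeps the rest of the loading logic identical for both cases.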
Now, if the root pipeline wants to delegate processing to another pipeline, it must use a `pipeline` processor to do so. This processor's `name` field will need to reference the other pipeline by its name. To ensure correct referencing, the `name` field must be specified as follows: `"name": "{< IngestPipeline "pipeline-plain" >}"`. This will ensure that the specified name gets correctly converted to the corresponding name in Elasticsearch, since Filebeat prefixes its "raw" Ingest pipeline names with `filebeat-<version>-<module>-<fileset>-` when loading them into Elasticsearch.
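To make the naming rule concrete, the following Go sketch builds the Elasticsearch pipeline name from the prefix format quoted above. The function name `pipelineID`, the stripping of the file extension, and the sample module and fileset values are assumptions for illustration; only the `filebeat-<version>-<module>-<fileset>-` prefix comes from the description.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// pipelineID builds the name under which a fileset's pipeline is stored in
// Elasticsearch: filebeat-<version>-<module>-<fileset>-<raw name>.
// Dropping the file extension from the raw name is an assumption here.
func pipelineID(version, module, fileset, file string) string {
	base := filepath.Base(file)
	name := strings.TrimSuffix(base, filepath.Ext(base))
	return fmt.Sprintf("filebeat-%s-%s-%s-%s", version, module, fileset, name)
}

func main() {
	// "mymodule" and "myfileset" are placeholder values.
	fmt.Println(pipelineID("6.5.0", "mymodule", "myfileset", "pipeline-plain.json"))
	// prints "filebeat-6.5.0-mymodule-myfileset-pipeline-plain"
}
```

This is why a hard-coded `"name": "pipeline-plain"` in a `pipeline` processor would not resolve: the pipeline stored in Elasticsearch carries the full prefixed name.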