Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Accept multiple ingest pipelines in Filebeat #8914

Merged
merged 15 commits into from
Dec 27, 2018
Merged

Accept multiple ingest pipelines in Filebeat #8914

merged 15 commits into from
Dec 27, 2018

Conversation

ycombinator
Copy link
Contributor

@ycombinator ycombinator commented Nov 2, 2018

Motivated by #8852 (comment).

Starting with 6.5.0, Elasticsearch Ingest Pipelines have gained the ability to:

These abilities combined present the opportunity for a fileset to ingest the same logical information presented in different formats, e.g. plaintext vs. json versions of the same log files. Imagine an entry point ingest pipeline that detects the format of a log entry and then conditionally delegates further processing of that log entry, depending on the format, to another pipeline.

This PR allows filesets to specify one or more ingest pipelines via the ingest_pipeline property in their manifest.yml. If more than one ingest pipeline is specified, the first one is taken to be the entry point ingest pipeline.

Example with multiple pipelines

ingest_pipeline:
  - pipeline-ze-boss.json 
  - pipeline-plain.json
  - pipeline-json.json

Example with a single pipeline

This is just to show that the existing functionality will continue to work as-is.

ingest_pipeline: pipeline.json

Now, if the root pipeline wants to delegate processing to another pipeline, it must use a pipeline processor to do so. This processor's name field will need to reference the other pipeline by its name. To ensure correct referencing, the name field must be specified as follows:

{
  "pipeline" : {
    "name": "{< IngestPipeline "pipeline-plain" >}"
  }
}

This will ensure that the specified name gets correctly converted to the corresponding name in Elasticsearch, since Filebeat prefixes it's "raw" Ingest pipeline names with filebeat-<version>-<module>-<fileset>- when loading them into Elasticsearch.

@ycombinator ycombinator added enhancement in progress Pull request is currently in progress. Filebeat Filebeat needs_backport PR is waiting to be backported to other branches. v7.0.0-alpha1 v6.6.0 labels Nov 2, 2018
@ycombinator ycombinator changed the title Accept multiple ingest pipelines in Filebeat WIP: Accept multiple ingest pipelines in Filebeat Nov 2, 2018
@ycombinator ycombinator requested review from kvch, webmat and ruflin November 8, 2018 00:57
@ycombinator
Copy link
Contributor Author

There's still one TODO item left for this PR related to documentation but I think it's ready for a code review. Thanks folks!

@@ -64,3 +64,4 @@ The list below covers the major changes between 6.3.0 and master only.
- Allow to disable config resolver using the `Settings.DisableConfigResolver` field when initializing libbeat. {pull}8769[8769]
- Add `mage.AddPlatforms` to allow to specify dependent platforms when building a beat. {pull}8889[8889]
- Add `cfgwarn.CheckRemoved6xSetting(s)` to display a warning for options removed in 7.0. {pull}8909[8909]
- Filesets can now define multiple ingest pipelines, with the first one considered as the root pipeline. {pull}8914[8914]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is “root” synonmous with entrypoint in this context?

Copy link
Contributor Author

@ycombinator ycombinator Nov 8, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, and I'm happy to change the terminology to whatever is clearest to most folks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like entry point better because "root" suggests that there is always a hierarchy. I'm not sure it that's true. Is it possible that devs might want to specify multiple pipelines as a way of breaking their ingest pipeline configs into smaller pipelines that encapsulate specific processing tasks? Also, can other pipelines besides the first one delegate processing? If so, I would avoid using "root".

@ycombinator
Copy link
Contributor Author

jenkins, test this

@ycombinator
Copy link
Contributor Author

@dedemorton I've added docs for this feature in df2488a27e171a8f10d8fac6766ae7a352dbf1a8 but I'm not sure about my language and structure/organization. I'd love for you to take a look, if you have some time. Thanks!

@ycombinator ycombinator changed the title WIP: Accept multiple ingest pipelines in Filebeat Accept multiple ingest pipelines in Filebeat Nov 8, 2018
@ycombinator ycombinator removed the in progress Pull request is currently in progress. label Nov 8, 2018
@ycombinator
Copy link
Contributor Author

jenkins, test this

1 similar comment
@ycombinator
Copy link
Contributor Author

jenkins, test this

@ruflin
Copy link
Contributor

ruflin commented Nov 12, 2018

What happens if a user uses a module with multiple pipelines against and older version of Elasticsearch?

@ycombinator
Copy link
Contributor Author

ycombinator commented Nov 14, 2018

What happens if a user uses a module with multiple pipelines against and older version of Elasticsearch?

I tested this PR with Elasticsearch 6.4.0, where neither the pipelines processor nor the if field on processors exist.

There are two related parts to this PR. One worked with ES 6.4.0, but the other did not. Specifically:

  • The part where multiple ingest pipelines are specified under the ingest_pipeline key in the module's manifest.yml file will work with older versions of ES. All pipelines listed will get loaded into ES.

  • The part where a pipeline uses the pipeline processor or the if processor field (both features introduced in ES 6.5) will not work with older versions of ES. Concretely, when Filebeat attempts to load such a pipeline into ES, ES will return a 400 with an error message that Filebeat emits to its log. FWIW, this can happen today (in master) as well if a pipeline is using a processor that has not been installed in ES as a plugin, e.g. the CSV processor plugin.

Note that the pipeline loading (within a fileset) is short-circuiting. For example, imagine that a module developer has specified 3 pipelines in the list under ingest_pipeline in the module's manifest.yml. Let's say the 2nd one in the list fails loading because it contains a pipeline processor, or an if somewhere, or for some other reason. The 1st pipeline in the list would've been successfully loaded into ES but the 3rd one will not get loaded. And obviously the 2nd one won't get loaded either because it caused the ES error.

I can see four options on how to proceed:

  1. Don't load the pipelines (within a fileset) in a short-circuiting fashion. Instead, even if one pipeline fails to load for some reason, emit the error in the log, but continue to try and load the remaining pipelines in the list.

  2. If a pipeline in the list fails to load, try to DELETE previously-loaded pipelines for the fileset (i.e rollback). This way either all pipelines within the fileset get loaded or none of them do.

  3. Before attempting to load any pipelines for a fileset, ask Elasticsearch to validate all of them. Again, this way either all pipelines within the fileset get loaded or none of them do.

  4. Keep it simple and keep the current short-circuiting approach. Depending on the error, users might need to upgrade ES or perform some other fix, such as loading a plugin, or filing a bug to get the pipeline definition in Filebeat fixed up. Once the fix is made, they can simply re-run Filebeat and any previously-erroring pipelines will get loaded into ES.

My personal preference would be option 1. It is essentially like option 4 but doesn't give up on the remaining pipelines either, so if multiple pipelines have different errors, at least the user can find out about them all at once. Option 2 could lead to an inconsistent state (if any of the DELETEs fail). Option 3 is probably the safest but also unjustifiably expensive.

@kvch
Copy link
Contributor

kvch commented Nov 14, 2018

I would not go with option 1, because it pollutes the ES instance of users with possibly unused pipelines. I prefer the solutions with rollbacks.

@ruflin
Copy link
Contributor

ruflin commented Nov 19, 2018

Can you elaborate on why you think 3 is too expensive? I would expect checking for the pipelines does not happen too often so I would not be worried about the extra load.

I just realised this also touches and other problem: What if geo or user_agent are not installed? At the moment I think it does not get loaded as we only have one pipeline but not sure how good the error is. We already have requirements for the pipelines in the manifest, there we could also add requirements for the ES version and use option 3?

@ycombinator
Copy link
Contributor Author

Can you elaborate on why you think 3 is too expensive? I would expect checking for the pipelines does not happen too often so I would not be worried about the extra load.

It might not be. It would depend on how many multi-pipeline filesets we end up with over time. For every such fileset, we would call an ES API (the simulate ingest pipeline API) for each pipeline in that fileset. As you note, all of this would only happen at startup time. So yeah, maybe compared to the overall runtime of the Filebeat process it's not too expensive, just makes the initial startup time a bit longer.

@ycombinator
Copy link
Contributor Author

Given the discussion above, I'm inclined to go with option 3 (pre-validation) now. @kvch, since you took the time to weigh in as well, what are your thoughts about that? Would you still prefer option 2 (rollbacks)?

@kvch
Copy link
Contributor

kvch commented Nov 21, 2018

I am fine with that option. Just keep in mind that simulating multiple pipelines are not yet implemented: elastic/elasticsearch#35495
Also, I have implemented simulating pipelines in a script: https://github.com/elastic/beats/blob/master/filebeat/scripts/tester/main.go#L252 You might find something useful there.

@ycombinator ycombinator added v6.7.0 and removed v6.6.0 labels Dec 21, 2018
Copy link

@urso urso left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ycombinator
Copy link
Contributor Author

jenkins, test this

@ycombinator
Copy link
Contributor Author

@urso @ruflin I could use your guidance on whether to merge this PR now or not. Travis CI is green but Jenkins CI is not. The failure in Jenkins CI is because of the Too many dynamic script compilations within, max: [75/5m] error which will hopefully be fixed by #9777.

@ruflin
Copy link
Contributor

ruflin commented Dec 27, 2018

@ycombinator I'm good with merging. But in case #9777 gets in before this one, a rebase would be nice.

@ycombinator ycombinator merged commit 5ba1f11 into elastic:master Dec 27, 2018
@ycombinator ycombinator deleted the filebeat-multiple-ingest-pipelines branch December 27, 2018 19:19
@ycombinator ycombinator removed the needs_backport PR is waiting to be backported to other branches. label Dec 27, 2018
ycombinator added a commit that referenced this pull request Dec 28, 2018
#9811)

Cherry-pick of PR #8914 to 6.x branch. Original message: 

Motivated by #8852 (comment).

Starting with 6.5.0, Elasticsearch Ingest Pipelines have gained the ability to:
- run sub-pipelines via the [`pipeline` processor](https://www.elastic.co/guide/en/elasticsearch/reference/6.5/pipeline-processor.html), and
- conditionally run processors via an [`if` field](https://www.elastic.co/guide/en/elasticsearch/reference/6.5/ingest-processors.html).

These abilities combined present the opportunity for a fileset to ingest the same _logical_ information presented in different formats, e.g. plaintext vs. json versions of the same log files. Imagine an entry point ingest pipeline that detects the format of a log entry and then conditionally delegates further processing of that log entry, depending on the format, to another pipeline.

This PR allows filesets to specify one or more ingest pipelines via the `ingest_pipeline` property in their `manifest.yml`. If more than one ingest pipeline is specified, the first one is taken to be the entry point ingest pipeline.

#### Example with multiple pipelines
```yaml
ingest_pipeline:
  - pipeline-ze-boss.json 
  - pipeline-plain.json
  - pipeline-json.json
```
#### Example with a single pipeline
_This is just to show that the existing functionality will continue to work as-is._
```yaml
ingest_pipeline: pipeline.json
```

Now, if the root pipeline wants to delegate processing to another pipeline, it must use a `pipeline` processor to do so. This processor's `name` field will need to reference the other pipeline by its name. To ensure correct referencing, the `name` field must be specified as follows:

```json
{
  "pipeline" : {
    "name": "{< IngestPipeline "pipeline-plain" >}"
  }
}
```

This will ensure that the specified name gets correctly converted to the corresponding name in Elasticsearch, since Filebeat prefixes it's "raw" Ingest pipeline names with `filebeat-<version>-<module>-<fileset>-` when loading them into Elasticsearch.
ycombinator added a commit that referenced this pull request Jan 13, 2019
… 6.5 (#10001)

Follow up to #8914.

In #8914, we introduced the ability for Filebeat filesets to have multiple Ingest pipelines, the first one being the entry point. This feature relies on the Elasticsearch Ingest node having a `pipeline` processor and `if` conditions for processors, both of which were introduced in Elasticsearch 6.5.0.

This PR implements a check for whether a fileset has multiple Ingest pipelines AND is talking to an Elasticsearch cluster < 6.5.0. If that's the case, we emit an error.
ycombinator added a commit that referenced this pull request Jan 14, 2019
…nes is being used with ES < 6.5 (#10038)

Cherry-pick of PR #10001 to 6.x branch. Original message: 

Follow up to #8914.

In #8914, we introduced the ability for Filebeat filesets to have multiple Ingest pipelines, the first one being the entry point. This feature relies on the Elasticsearch Ingest node having a `pipeline` processor and `if` conditions for processors, both of which were introduced in Elasticsearch 6.5.0.

This PR implements a check for whether a fileset has multiple Ingest pipelines AND is talking to an Elasticsearch cluster < 6.5.0. If that's the case, we emit an error.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants