Allow to easily add custom pipelines and templates per integration, currently it is done per dataset. #146792
Pinging @elastic/fleet (Team:Fleet) |
Let's split the problem up into two parts:
There is a chance that the same solution might work for both, but let's discuss them separately.

Extension of integrations with pipelines and templates

My understanding is that the flow you would like to see is:
Same for ingest pipelines: the ones you add do processing after the integration did its work. It reminds me a bit of elastic/elasticsearch#61185. For the custom templates, do these apply to all datasets in an integration, or do you have different ones for metrics vs. logs, as an example? Let's assume for a moment we could offer the following flow. I'll describe it as UI bits, but of course it would have to be available through an API.
When integrations are updated, all your changes stay in place. The same applies to the ingest pipelines. Side note: in case we eventually move forward with elastic/elasticsearch#85692 (comment), it would hopefully remove your need to maintain your own ECS component templates, but it does not get rid of the problem.

ILM policy management

There have been quite a few discussions internally on this, and there are several solutions we discussed. Before diving into solutions, let me ask some more questions:
@leandrojmp Thanks for filing this and for your integration contributions! |
Hello @ruflin. Since ILM policies are controlled by the index template, what I would expect to be able to do when adding an integration is to choose a custom ingest pipeline, a custom template, and a custom lifecycle policy for it.
One of the issues is that there doesn't seem to be a standard for how the integrations work: some integrations have one ingest pipeline that calls other ingest pipelines depending on what the data looks like, while other integrations split the data into multiple datasets and use multiple ingest pipelines and templates. This makes it very hard to organize things.

Also, splitting the data from an integration into multiple datasets and multiple data streams can sometimes result in many small indices. One of the main recommendations from Elastic is to avoid small indices, yet the integrations and many internal indices go against this recommendation.

One example that I have is the Cisco Duo integration; it uses the Cisco Duo API to collect 5 types of logs.
In this integration, Elastic chose to use a different template, data stream, and ingest pipeline for each one of these log types, so if I want to add one of the custom fields that I have in all of our indices, I would need to edit at least 5 templates and 5 ingest pipelines.

And there is the issue that when you update an integration, the old ingest pipelines are not deleted, so you may end up with a lot of unused ingest pipelines that you need to remove manually to keep things organized. |
Do you have some examples here? In general I expect all integrations to work with multiple datasets (if there are multiple datasets, of course) and to have ingest pipelines and templates for each. This is exactly what makes your "per integration" goal tricky.
This is partially changing. Elasticsearch historically struggled with too many shards, but since the introduction of the data stream naming scheme, many improvements have been made on this front. @jpountz Do we have any more public info on this somewhere?
The interesting part here is that these are all logs datasets. I agree with you: we should offer you an easy way to apply a common template / pipeline to all 5 (or more) without having to do all the additional API calls.
Are you referring to your own ingest pipelines here or the ones managed by the integration? The ones from the integration should be cleaned up, AFAIK. @kpollich

Appreciate all the details you provided. In summary: in most scenarios you look for "management per integration" (like you described in the title) and care less about the underlying datasets. I assume that if there were also metrics datasets, the data would likely be grouped into types, logs and metrics, that need separate updating. |
Ingest pipelines tied to a previous version of a since-upgraded integration are removed during the upgrade process; that is correct. |
Yeah, you are right, @ruflin, my mistake. I compared with the Crowdstrike integration, which has a couple of ingest pipelines but in the end only one dataset. So if an integration has multiple datasets, it will have one data stream, one ingest pipeline, and one template per dataset, which is the current issue.
Custom ingest pipelines

I was thinking about this today because I added the Google Workspace integration, which has 8 different datasets; that means 8 more custom ingest pipelines and custom templates to manage.

Currently the custom ingest pipelines work by having a pipeline processor at the end of each managed dataset pipeline that calls an optional `@custom` pipeline for that dataset, as sketched below.
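As a concrete sketch of that current behavior (the dataset name comes from this thread; the exact hook Fleet generates may differ between versions), the user-owned pipeline is created like this and is picked up because the managed pipeline calls it with `ignore_missing_pipeline: true`:

```
// User-owned hook for one dataset; Fleet's managed pipeline for
// logs-google_workspace.admin calls it only if it exists.
PUT _ingest/pipeline/logs-google_workspace.admin@custom
{
  "description": "Custom processors for the google_workspace.admin dataset",
  "processors": [
    {
      "set": {
        "field": "siem.datacenter",
        "value": "dc1"   // example value
      }
    }
  ]
}
```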
In this case the integration is `google_workspace`, and each of its datasets has its own `@custom` pipeline. To make this work for every dataset in this integration without breaking the current behavior, every ingest pipeline of the integration could have an extra pipeline processor.
Then an integration-level ingest pipeline, for example `logs-google_workspace@custom`, would run for every dataset of the integration. Another way would be to have only the integration-level custom pipeline, called from every managed pipeline, instead of one per dataset.
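A minimal sketch of what the first variant could look like for one dataset pipeline, assuming the existing per-dataset hook is kept and an integration-level hook is added (the managed pipeline name with its version suffix is illustrative):

```
PUT _ingest/pipeline/logs-google_workspace.admin-2.0.0
{
  "processors": [
    // ... the integration's managed processors run first ...
    {
      "pipeline": {
        // proposed integration-level hook, shared by all datasets
        "name": "logs-google_workspace@custom",
        "ignore_missing_pipeline": true
      }
    },
    {
      "pipeline": {
        // existing dataset-level hook
        "name": "logs-google_workspace.admin@custom",
        "ignore_missing_pipeline": true
      }
    }
  ]
}
```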
With this you do not break the current behavior, and it becomes easier to add custom processors both per integration and per dataset.

Custom component templates

A similar approach would also work for the component templates. Currently the custom mappings use a `@custom` component template per dataset. You could have an integration-level `@custom` component template in the same way, and I could also add a setting for the ILM policy in it. So maybe the best option would be to allow choosing an already existing component template, regardless of the name, and an already existing ILM policy while adding the integration; this would then automatically add the component template to the managed templates and change the default ILM policy to the custom one.

Duplicated ingest pipelines

@kpollich I've got some leftovers in this case here, but I will manually remove the old ingest pipeline. |
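A sketch of what such an integration-level component template could contain, combining the custom mappings and the ILM override discussed above (the template and policy names follow the convention proposed in this thread and are not an existing Fleet convention):

```
PUT _component_template/logs-google_workspace@custom
{
  "template": {
    "settings": {
      // replace the default policy for all data streams of the integration
      "index.lifecycle.name": "my-google-workspace-policy"
    },
    "mappings": {
      "properties": {
        "siem": {
          "properties": {
            "datacenter": { "type": "keyword" }
          }
        }
      }
    }
  }
}
```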
This is an interesting approach you are proposing here. I especially like it as it follows the logic of how we handle and name ingest pipelines and templates at the moment. Internally we had very similar discussions around how to enable namespace-specific configs and pipelines, so we would have namespace-level `@custom` pipelines and templates following the same naming logic.

@leandrojmp I was looking for some docs around "many shards is more acceptable now" and found the following blog post: https://www.elastic.co/blog/three-ways-improved-elasticsearch-scalability Hope this helps to explain in a bit more detail why the data stream naming scheme is ok. |
Adding a link to #121118: the counterpart to the discussion above, which I mentioned previously, is the namespace-specific feature discussion (more granular). Ideally, these two concepts follow the same underlying logic. |
@ruflin I think that I read this post in the past but forgot about it. There is also this blog post about the number of shards, which was updated since the rule of thumb of 20 shards per 1 GB of heap is not valid anymore. But old habits die hard, and those two things were the truth for so long that sometimes it is hard not to care about them.

What I need, and proposed in the previous comment, is a way to add custom mappings and custom ingest pipelines at the integration level; currently you can only do it at the dataset level, and some integrations have multiple datasets, which multiplies the number of files you need to manage.

This discussion about having custom templates/pipelines per namespace does not change much about how things work now. It seems it would add a way to run the same integration multiple times in different namespaces and use different custom templates/pipelines for each namespace. That adds more granularity, but it doesn't make it easier to manage a lot of custom templates and pipelines; on the contrary, it adds more things to manage. |
Yeah, we've talked about doing something like this before. There are several levels of granularity that may be desired: global, per-type (e.g. logs vs. metrics), per-integration, per-dataset, and per-namespace. It starts to get really complicated really fast from a UX perspective if we add component templates for all of these permutations out of the box. IMO we should prioritize adding the most commonly requested ones within the existing UX and then explore a more tailored UI that lets the user choose which level they want this applied to, with the appropriate templates generated on demand, rather than having many hundreds of templates that are there for use, but empty. |
Yes, and we need to communicate it more on our end.
Agreed. I brought it up more in the context of having "common" conventions, but it seems we are already aligned on this.
@joshdover Besides the UX challenge, the problem here is that currently all component templates referenced by an index template are required to exist, correct? There is no way to reference a component template that may not exist yet, something like an ignore_missing option for composed_of. |
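To illustrate the constraint (names hypothetical): creating an index template that lists a non-existent component template in `composed_of` is rejected, which is why Fleet cannot pre-wire optional `@custom` component templates at every level:

```
// Fails if logs-foo@custom has not been created yet; Elasticsearch
// rejects index templates whose composed_of entries are missing.
PUT _index_template/logs-foo
{
  "index_patterns": ["logs-foo-*"],
  "data_stream": {},
  "composed_of": ["logs-foo@custom"]
}
```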
Just saw you wrote this just before me #146804 (comment). I will open the ES issue |
Just curious: has there been any progress on this?

We are currently starting to use more integrations and planning to use the Elastic Agent/Endpoint to collect logs on our hosts to replace another log collector, but it seems that all the management issues are still the same.

Is there at least an easier way to use separate ILM policies for different integrations? We have some integrations that do not generate 50 GB per year and others that generate 50 GB per week, but Elastic treats both the same since all integrations use the same lifecycle policy. Not being able to easily use different ILM policies for different integrations makes the management of data streams really complicated. Is something like this planned? |
It seems that the following issues may help solve some of the current problems with management of Elastic Agent Integrations. |
Hi @leandrojmp!
This can be done on a per-dataset basis today by adding a custom lifecycle policy to the associated `@custom` component template of each data stream. The issues you mentioned will help provide more of the underlying plumbing needed to make customizations at other levels, rather than only at the dataset level. We still need to add easier ways to make customizations at any level, such as providing a focused UI centered on customizing data streams. |
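For reference, a sketch of that dataset-level workaround, using one Cisco Duo data stream from earlier in the thread (dataset and policy names assumed):

```
// 1. A dedicated policy sized for this integration's data volume.
PUT _ilm/policy/logs-cisco_duo
{
  "policy": {
    "phases": {
      "hot": {
        "actions": { "rollover": { "max_primary_shard_size": "50gb" } }
      },
      "delete": { "min_age": "30d", "actions": { "delete": {} } }
    }
  }
}

// 2. Reference it from the dataset's @custom component template, which
//    Fleet composes into the managed index template. Repeat per dataset.
PUT _component_template/logs-cisco_duo.auth@custom
{
  "template": {
    "settings": { "index.lifecycle.name": "logs-cisco_duo" }
  }
}
```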
Hello @joshdover. Yeah, the way this is done today is at the dataset level; I mentioned this earlier in this issue, and it is the reason for this feature request. Unfortunately, doing customizations at the dataset level is impractical in production for many reasons. For example, the Google Workspace integration has 14 datasets, so to customize anything you would need to clone and manage 14 templates, and that is just for one integration. The customizations need to be possible at least at the integration level to be useful. I just wanted to know if this is still on the roadmap; the newly linked issues make clear that it is. |
Hello, Do we have any update on this after one year? What are the issues tracking the improvements? |
@leandrojmp We added support for custom pipelines at the package, type, and global levels. This is shipping in 8.12: #170270 We have not yet added support for customizing settings and mappings at the integration / package level. We need to make a decision still on whether or not we will be able to support this at the Elasticsearch level in a generic way that isn't specific to only Fleet integrations (elastic/elasticsearch#97664). If we decide not to pursue that in the short-term, we will likely prioritize #149484 to solve this for Fleet integrations. Unfortunately, I do not have a timeline to share at this time. |
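As a sketch of the 8.12 behavior described above (hook names as I understand them from #170270; worth verifying against the current Fleet docs), managed pipelines now call a chain of optional hooks, so a single package-level pipeline covers every dataset:

```
// Hooks called (each with ignore_missing_pipeline: true), broadest first:
//   global@custom                        -> every data stream
//   logs@custom                          -> every logs-* data stream
//   logs-google_workspace@custom         -> every dataset of the package
//   logs-google_workspace.admin@custom   -> one dataset (pre-existing hook)
// Creating the package-level hook is enough to cover all datasets:
PUT _ingest/pipeline/logs-google_workspace@custom
{
  "processors": [
    { "set": { "field": "siem.datacenter", "value": "dc1" } }
  ]
}
```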
With the pipeline aspect done, the remaining work for index settings and mappings is tracked in #149484 |
Hello,
Currently at my company we manage all of our Logstash pipelines and index templates ourselves. Since we are migrating to version 8, we thought about using some Elastic Agent integrations to shift the time spent managing the pipelines/templates to other tasks.
But when we started looking into how the integrations work when you need to add custom processors or custom fields, we saw that we would spend more time maintaining the integrations with our custom processors and fields than we would spend using our own Logstash pipelines and templates.
For example, today we have a couple of component templates with mappings for both ECS fields like `source.*` and `destination.*` and some custom fields like `siem.*` that we add to all of our indices. We also have some Logstash pipelines and ingest pipelines that we simply drop into the pipeline folder or add as an `index.final_pipeline` for some indices.

The first integration we tested was the Cisco Duo integration. If we want to add a single custom field like `siem.datacenter`, we would need to create and edit at least 5 custom ingest pipelines and 5 custom component templates, because the integrations work with a pipeline/template per dataset, not per integration.

Also, all integrations share the same lifecycle policy, which does not work in many cases; we cannot use the same lifecycle policy for an integration that generates 500 MB/day and for another that generates 200 GB/day, but to edit the lifecycle policy you need to edit the template, which in the end results in editing a lot of files.
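For concreteness, a sketch of the kind of shared setup described above (template names and field types are assumptions):

```
// Shared custom mappings added to all of our indices.
PUT _component_template/siem-common-fields
{
  "template": {
    "mappings": {
      "properties": {
        "siem": {
          "properties": { "datacenter": { "type": "keyword" } }
        }
      }
    }
  }
}

// A shared final pipeline attached through index settings.
PUT _component_template/siem-final-pipeline
{
  "template": {
    "settings": { "index.final_pipeline": "siem-final" }
  }
}
```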
I've created this issue after a chat with @ruflin in another issue in the Elasticsearch repository.
I've also made two posts on Discuss explaining these issues in more detail: this one and this one.
Describe the feature:
The feature would allow setting a custom ingest pipeline, a custom template, and a custom lifecycle policy per integration in Kibana while adding it in Fleet; currently there is no easy way to do this.