Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Out of the box ECS field mappings for Custom Input packages #4236

Open
10 tasks
P1llus opened this issue Sep 20, 2022 · 21 comments
Open
10 tasks

Out of the box ECS field mappings for Custom Input packages #4236

P1llus opened this issue Sep 20, 2022 · 21 comments
Labels
discuss Team:Obs-InfraObs Label for the Observability Infrastructure Monitoring team [elastic/obs-infraobs-integrations] Team:Security-Deployment and Devices Deployment and Devices Security team [elastic/sec-deployment-and-devices] Team:Security-Service Integrations Security Service Integrations Team [elastic/security-service-integrations] Team:Security-Windows Platform Security Windows Platform Team [elastic/sec-windows-platform]

Comments

@P1llus
Copy link
Member

P1llus commented Sep 20, 2022

Currently the custom input packages (like TCP/UDP, httpjson etc) comes with the bare minimum of ECS mapping, very similar to how custom inputs worked in Filebeat, however this does not produce the best outcome for the end users, as functionality like add_*_metadata for example produces ECS fields, especially host which is enabled by default.

This issue is to discuss what the best practice would be for all Custom Input packages, and then used to track the status of applying any decided changes.

Packages:

  • httpjson
  • http_endpoint
  • eventhub
  • s3
  • tcp
  • udp
  • filestream
  • log
  • kafka
  • winlog
@elasticmachine
Copy link

Pinging @elastic/security-external-integrations (Team:Security-External Integrations)

@P1llus P1llus changed the title Out of the box mappings for Custom Input packages Out of the box ECS field mappings for Custom Input packages Sep 20, 2022
@narph
Copy link
Contributor

narph commented Oct 18, 2022

@jsoriano , @andrewkroh can you chime in here?

@jsoriano
Copy link
Member

This is a long standing issue. We should have a way to include agent-specific mappings into the index templates of any input, or integration in general.

I don't think that these mappings belong to packages, as they are not produced by them. If these mappings are included in packages, any update on the agent or the processes it manages would require to update and release all packages, and users would need to update them. At some scale this is barely possible. But at this moment this is the only option we have.

In elastic/package-spec#63 and elastic/package-spec#199 something like an "Agent Common Schema" is proposed. This would be an schema including all common fields that an agent can generate, and Fleet would install it along with the mappings included in a given package.

@P1llus
Copy link
Member Author

P1llus commented Oct 19, 2022

@jsoriano I think that one of the issues is that filebeat itself would install these fields when using raw inputs if the processors was used, and it would be nice to be able to at least include the minimum fields.

Isn't there something we could do in the meantime? Or would that just cause more issues later down the line? It's either that or we should disable the add_host_metadata processor by default until we have it resolved, WDYT?

@andrewkroh
Copy link
Member

For as long as Agent is automatically including processors in config without the option of disabling them, then I think our input packages must generate mappings that include the fields produced by these processors. In effect that means including field definitions for fields produced by add_{host,cloud,kubernetes,docker}_metadata to all input packages. I think this needs to happen now even if we don't have solutions for avoiding duplication of the field definitions (these lists of fields are already copy/pasted across nearly every logging integration).


Longer term my preference for Agent is to never enable processors by default. They should always be opt-in whether that is by the integration developer (e.g. a conscious decision to always add specific processors into agent input config template) or by the end-user (e.g. enabling specific processors at integration config time via toggles exposed in the UI or via direct YAML specification). This would simplify the thinking around input configs and and put the developers and users in control of the data. Today you always have to account for these four magic processors that are always enabled.

I think we should solve the issues relating to management and maintenance of mappings for inputs and processors. This will make it easier to scale the number of integrations we maintain.

@jsoriano
Copy link
Member

For as long as Agent is automatically including processors in config without the option of disabling them, then I think our input packages must generate mappings that include the fields produced by these processors. In effect that means including field definitions for fields produced by add_{host,cloud,kubernetes,docker}_metadata to all input packages. I think this needs to happen now even if we don't have solutions for avoiding duplication of the field definitions (these lists of fields are already copy/pasted across nearly every logging integration).

Yeah, I agree that this is the only solution at the moment. I am not sure though if we should do much to support this, as we don't want it long term. The way to do this now is to copy and paste manually.

Perhaps a way to support this mid-term is to implement in elastic-package some kind of import mechanism as the one we have for ECS fields, but that include whole sets of fields. We would need to have the fields definitions of these processors somewhere, this could be the "Agent Common Schema" that has appeared in previous discussions, and would be also useful if later on we make the use of these processors an opt-in feature.

Longer term my preference for Agent is to never enable processors by default. They should always be opt-in whether that is by the integration developer (e.g. a conscious decision to always add specific processors into agent input config template) or by the end-user (e.g. enabling specific processors at integration config time via toggles exposed in the UI or via direct YAML specification).

Agree, but I think that this is not a decision to make by the integration developer. I don't see why a service integration may want some of these processors while others don't. I think that this is or a product decision (as is now), or a user decision, who chooses what metadata to add depending on their necessities and deployments.

This would simplify the thinking around input configs and and put the developers and users in control of the data. Today you always have to account for these four magic processors that are always enabled.

+1, this has to be out of packages development process.

I think we should solve the issues relating to management and maintenance of mappings for inputs and processors. This will make it easier to scale the number of integrations we maintain.

So maybe a plan is:

  • Define something like the "Agent Common Schema" as the source of truth for these mappings. I think we need this in any case.
  • Short/mid-term: implement something in elastic-package to import fields from this schema on build time, this could be initially hard-coded to the current list of included processors, so package developers "only" need to remove duplicated definitions.
  • Longer-term: users are able to select the processors they use from Fleet, mappings are installed by Fleet, and we stop shipping them in packages by default.

@andrewkroh wdyt?


we don't have solutions for avoiding duplication of the field definitions

Duplication of fields in data streams should be already detected when using format_version: 2.0.0, please let us know if this is not working.

@andrewkroh
Copy link
Member

@jsoriano Overall I like this plan.

For the short-term, I would say we should add mappings manually to these custom input packages for the fields that need non-default mappings (e.g host.ip, cloud.account.id (prevent it from being detected as a number)). This way we can address the field conflicts that users are experiencing today.

Mid-term, this sounds great to have sets of fields that can be imported. I would expect this will be useful in the long term as well because we could use it for fields associated with input types that are often reused (like import the fieldset for the "tcp" input or import the fieldset associated to the syslog beat processor).

Long-term, I like the idea of giving the control the user and putting Fleet in charge of the mapping. I can think of some things that make this complicated to manage, but I like the direction.


Duplication of fields in data streams should be already detected when using format_version: 2.0.0, please let us know if this is not working.

I meant duplication in the sense that we are cloning fields.yml files between integrations in order to "import" the set of fields that are associated to agent inputs and processors. Not about the same field being declared more than once. That detection is working.

@jsoriano
Copy link
Member

For the short-term, I would say we should add mappings manually to these custom input packages for the fields that need non-default mappings (e.g host.ip, cloud.account.id (prevent it from being detected as a number)). This way we can address the field conflicts that users are experiencing today.

Do you have a list of such mappings? If there are few of them maybe we can hard-code them by now in elastic-package (or Fleet) if this is a low hanging fix for current issues.

I meant duplication in the sense that we are cloning fields.yml files between integrations in order to "import" the set of fields that are associated to agent inputs and processors.

Ah ok, this would be solved by the proposed plan 👍

I will create the follow-up tasks to implement this.

@P1llus
Copy link
Member Author

P1llus commented Oct 26, 2022

Great progress! Thanks a lot for the nice feedback @jsoriano @andrewkroh .

I have 2 new input integrations planned, which I will put slightly on hold until it is resolved, so that we don't have any unnecessary changes so close after release.

Let me know if there is anywhere that I can help 👍

@jsoriano
Copy link
Member

Let me know if there is anywhere that I can help +1

It'd be great if you could help compiling the list of fields that would be good to include in the hard-coded workaround in elastic/elastic-package#1018.

Thanks!

@P1llus
Copy link
Member Author

P1llus commented Oct 26, 2022

Will put that on the todo list for tomorrow then 👍

@zez3
Copy link

zez3 commented Sep 20, 2023

#3151

related

@joshdover
Copy link
Contributor

@felixbarny Do you think this issue is solved by the ECS enhancements to the default logs-* template?

@felixbarny
Copy link
Member

I think this issue should be solved by that. It seems the missing piece would be to either remove the index templates for input packages (which we probably don't want due to the <package>@custom components and pipelines) or to import the ecs template. The latter should be doable but it means to require a minimum ES version and the ecs template will be versioned with ES, not with the input packages. I don't see an issue with that, however.

@zez3
Copy link

zez3 commented Oct 13, 2023

should be doable but it means to require a minimum ES version and the ecs template will be versioned with ES, not with the input packages. I don't see an issue with that, however.

Nice, so we can hope that in 8.11-12 we'll finally have this
@P1llus ?

@felixbarny
Copy link
Member

@zez3
Copy link

zez3 commented Oct 15, 2023

I'm on 8.10.3 and currently got some mapping conflicts with type long:

network.packets
network.bytes
http.response.status_code
server.port
client.port

In the proposed dynamic mapping, I do not see any mapping for type long

@felixbarny
Copy link
Member

The dynamic ECS mapping is not explicitly mapping fields that have a default mapping, such as number -> long, string -> keyword. This allows the mapping not be very compact and lightweight, was well as not needing a change every time a new field is added.

The tradeoff is that this doesn't guard against documents that come in with the wrong type. For example, a document containing network.bytes: true could skillset up an incorrect mapping. This is somewhat, although not fully, mitigated by ignore_malformed=true.

currently got some mapping conflicts with type long

How did the conflicts come to be? Was that a case of malformed data being shipped? What was the cause for that?

@zez3
Copy link

zez3 commented Oct 17, 2023

How did the conflicts come to be? Was that a case of malformed data being shipped? What was the cause for that?

I started using the new Custom Filestream Logs integration and after building the parser for my custom logs I forgot to add the fields to the logs-filestream.generic@custom component template. This is how I was doing before.
I mean for the other/older custom integrations like TCP UDP http endpoint journald where I also had conflicts in the past.
Now Ideally, I should have a global ...@custom component template and use that, per default (for all new upcoming custom integration), for all of my custom integrations.
@ruflin got some discussion going in the past if I remember correctly

getting back to your example network.bytes: true
If during the parser building phase I see such fields which are specific(long) already defined in ECS https://www.elastic.co/guide/en/ecs/current/ecs-network.html I would rename them to something else. Like network.bytes.enabled
I had such cases in the past, I will probably have in the future.

for the
http.response.status_code
server.port
client.port

from my point of view I could/would add them to the dynamic mapping

*.port and http.*.status_code will be 99% of the time a long number type

When I asked:

Nice, so we can hope that in 8.11-12 we'll finally have this

I was asking if this will be by default for all custom integrations available. Not needing me to do any import.

@zez3
Copy link

zez3 commented Oct 17, 2023

@kpollich Anything from your side?

@narph narph added Team:Obs-InfraObs Label for the Observability Infrastructure Monitoring team [elastic/obs-infraobs-integrations] Team:Security-Service Integrations Security Service Integrations Team [elastic/security-service-integrations] Team:Security-Deployment and Devices Deployment and Devices Security team [elastic/sec-deployment-and-devices] Team:Security-Windows Platform Security Windows Platform Team [elastic/sec-windows-platform] and removed Team:Service-Integrations Label for the Service Integrations team labels Jan 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discuss Team:Obs-InfraObs Label for the Observability Infrastructure Monitoring team [elastic/obs-infraobs-integrations] Team:Security-Deployment and Devices Deployment and Devices Security team [elastic/sec-deployment-and-devices] Team:Security-Service Integrations Security Service Integrations Team [elastic/security-service-integrations] Team:Security-Windows Platform Security Windows Platform Team [elastic/sec-windows-platform]
Projects
None yet
Development

No branches or pull requests

8 participants