-
Notifications
You must be signed in to change notification settings - Fork 73
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Discuss] Avoiding duplication of ECS field definitions #63
Comments
/cc @ruflin @ph @ycombinator Let me transfer this issue to the package-spec repository, which is the right place to talk about the format changes. |
I'm not sure this affects the specification. Assuming that the build step used in |
It depends on the results of the discussion, we may consider to change the fields file format to contain references (pointers?) to ECS fields instead of duplicate entries. Regarding the build step, the long term plan is to get rid of the Magefile in integrations repo and 100% depend on https://github.com/elastic/elastic-package . Ideally the integrations will remain a repository with config files without any Go code inside. The builder tools will be totally separated from the source of integrations. |
@ycombinator and I had a discussion about possible mitigation of ECS duplication. One of the ideas we talked about bases on introducing a new JSON/YAML file in The |
If we add this to the package spec, I think it should not be a requirement. A dev that still wants to create all fields inside the dataset should be able to. I'm ok to make it a generic feature but would at first iterate in integrations on it. |
It must be in the package-spec to preserve the standard across all packages. It's not strictly the package specification, but the
Regarding the requirement, I'm rather fan of hard requirements which make the standard stricter. Let's say the
The feature will appear only in integrations as the |
I think we are talking about the same thing? If a developer builds a package outside integrations in repo "foo", elastic-package can be used but no |
Yes, the
I suppose so, but I'm wondering about the |
I agree with the idea of having an automated way to fetch ECS fields to reduce the manual work. By name or by glob would sound like a reasonable first step. Thanks for opening this, @andrewkroh! We might also want eventually to allow the removal of certain globs, as a second pass. E.g. include I also agree with @ruflin that using this mechanism should be optional. Some packages may have nothing to do with what ECS covers. Some packages might also want to override ECS fields. I think this is also acceptable. Legit reasons of doing so: clarifying a description in the context of this package, or adopting experimental ECS changes that aren't yet in the main ECS spec. How this gets implemented in the package build process is up to you. But it's worth pointing out that the tools used by the ECS team can already help accomplish some of this (see USAGE.md). If this is helpful, we're happy to help you adopt our tool to deliver this, add features you need & so on. If you'd rather implement this from scratch in the package build, let us know if we can help :-) |
This is spot on and one of the reasons we redefine it every time. It is not really about overwriting the field itself, but the information like title and description to make it specific to the dataset. |
It would be great to extend the idea to avoid duplication of any fields (ECS or non ECS) |
Let me propose a solution (healthy one). There are few problems around field duplication we need to address:
Solution: The solution consists of following items:
Field annotations:
Manifest properties:
Comments: Frankly speaking I would be for delegating responsibility of resolving dependency graph to the Kibana than including it Counter-argument to the build phase: Sure, we can resolve dependencies during the package build time, but it means that all packages in the Package Storage |
As you know, I strongly prefer to stay away from package dependencies as long as we can as in general a lot of complexity comes along with these things. @mtojek You mentioned in the beginning that |
I'm concerned about the scaling/maintenance problem here - it's the same situation like with every library version bump. I'm looking for seeing more comments around this puzzle. |
Solving the problem of pulling in a specific version of ECS seems much simpler and self contained then starting a create a dynamic dependency graph that needs live resolving in Kibana and much more error prone. It also brings along the problem, that a package installed today or tomorrow might be different because of a change in ECS and I don't think this should be the case. The same package installed should always be the same. For the ECS version resolution, it would be nice if by default a package creator does not have to think about it. The The next time a user works on the package and runs elastic-package build, ECS will be updated if there is a new version. There should also be an option for the user to specify a constant version of ECS to be used if needed. One odd thing that might happen is that the user built a package with a certain ECS version and when CI is then run, a different version was released and a diff happens. For this, a local file could be used to specify the ECS version used like other build tools do. |
The original package will remain the same, only it's dependency (same concept as library) will be bumped up. Anyway, I see that you're not in favor of introducing package dependenciees. I'm ok with this as I don't have capacity to craft alone something more sophisticated.
In fact this is an external dependency on the ECS repository and the "local file" you mentioned would be exactly a metafile describing the simple dependency graph. It would mean that we'll have build-time dependency, resolved by the elastic-package while processing fields. So the elastic-package would not only process Kibana dashboards, but also render fields. Any more comments around this? @jsoriano @exekias @andrewkroh |
I think we should differentiate two use cases were we may want to apply some magic to bring external fields:
For the first case I think that these fields are unexpected for the integrations developer, and they don't necessarily need to know about them, or be able to reproduce an scenario populating them. For instance, cloud fields will only appear when Agent is running in cloud, still we need to make sure they get the right mapping. I don't think we should be putting the pressure of adding these fields on the integrations developer, for instance, a change in the Agent would mean we need to update all integrations. For the second case, developers know about these fields as they are populating it, so here it would be a matter of providing shortcuts to avoid duplicating the effort of defining the mapping. What problem are we discussing here? 1) or 2)? Do we agree they are different problems with (potentially) different solutions? |
It would be nice if we can find a common/single solution. Actually the 1) is more complex - if the agent starts emitting a new field, we should update ALL packages which may use them, which IMHO is a "no-go" for build-time dependencies. |
Thanks @exekias for the clarification. I agree, these are two different problems, and even though we could think on an unifying solution I would prefer to try to find and implement simpler partial solutions to each of them, learn from the use-cases and based on the evolution of the solution, such as future requirements and evolutions (e.g., will integrations be able to add new processors at some point?) think on how to evolve / unify the solutions in the future if needed. I think the initial issue we are discussing here is (2). In this case, I think we want:
For this case a build time solution seems more than reasonable: the author references (one or more) field libraries (by URL) and decides which fields to import, with or without modification. Achieving reproducible builds (which we should try) implies just adding a hash of the imported content together with the URL. (1), as @mtojek points out, smells like requiring a run-time approach, as processors change with stack/agent versions, independently of packages. This would require more time and thought as there are more problems to solve (conflict resolution, updating mappings for existing packages at upgrade time, etc.) but besides thinking about it (because I think we need to go this route at some point) we would a shorter-term solution. And that can be just using the same approach that for problem (2), with a caveat: this solution will be manageable while most packages are maintained by us and in the same repo, as we can upgrade the packages with a stack release when needed. It would not be perfect, but it might be good enough to make progress. |
TL;DR:
Let's sketch an execution plan for this:
|
That sounds like a good approach to me. How would you expect the "includes" for processors to be specified? In the ECS case we are specifying the individual field names and importing the rest of the field's information from ECS ( |
@andrewkroh would you mind providing links to some existing use cases? It would be much easier for me to analyze these cases. |
Agent uses the default Filebeat config that forces any log dataset to include processors Each of those processors has a set of fields that they can populate. I don't think the Similarly if a dataset's agent config uses a processor then it would be nice to be able to simply include/import all the fields that this processor could populate. This is a less common use-case. The only example I can think of for this is the |
Thank you for the explanation, Andrew.
I'm afraid we'll have to write down these dependencies (processor with their fields) in some mapping file. As a first thought I would recommend to embed it in the proposed "Agent Field Schema".
The processor-fields mappings defined in "Agent Field Schema" and marked as default will be always included (which will result in fields files for processors). We can figure out it while flushing out the repository's draft. Speaking of custom processors, what do you think about introducing additional manifest property for a data stream -
It's a bit too verbose, but may we can reuse it in the UI for smarter navigation to docs (quickly jump to the processor doc page) |
Proposed plan (#63 (comment)) sounds good to me for ECS and initially for common agent fields, though I think that longer term a different solution is needed for agent fields. In principle all datasets can be used with composable providers (autodiscover), that include their own metadata, so with this approach all packages are going to need to include all these fields. Package developers shouldn't need to know what composable providers agent supports, so maybe For that the "Agent Common Schema" should include all the fields defined now in processors (also the x-pack ones for nomad and cloudfoundry), and for autodiscover providers, that also include metadata. Most autodiscover providers happen to don't have any A more future-proof alternative for agent fields can be to rethink them in a way that can be included under some dynamic mapping, so a fields deffinition is not needed for them. For example reserving |
I'm good with the approach. Lets focus first on ECS and continue the discussion on processor, autodiscover and agent fields in a follow up issue / proposal to go step by step. |
Thank you, Jaime for your input.
There is a tiny border between lazy field loading and approach we have in Beats (load-them-all), but totally I see your point.
We may hit the wall with backward-consistency against Beats, but this is also an option. In fact we lose all detailed descriptions here. Summary Looks like we have an agreement on the first part of the solution (ECS fields), so I will proceed with that part and ask you to think about other potential options for agent/autodiscovery/processors. I have feeling that finding a reasonable (best) solution for "autodiscovery" might be tough challenge. |
I thinkg we have to distinguish between short term and long term goals. Long term, there should be no need for an agent based schema. Right now Agent even adds an
What exactly should the Agent Fields Schema include? Just the fields currently added by libbeat (like TBH, I'm quite reluctant to make the Agent Team responsible for tracking changes to internal fields, potentially made by multiple teams. |
Ok, let me bring again the original use case:
We managed to introduce an option to standardize all ECS fields (same description, types, consistency). I believe the same actions should (can?) be performed for processors
I was thinking mainly about this meta: "agent": {
"ephemeral_id": "4d35807f-c708-46e6-97f3-b3369fbc34e8",
"hostname": "docker-fleet-agent",
"id": "d8213996-c24f-495c-96cb-f564b71a2762",
"name": "docker-fleet-agent",
"type": "filebeat",
"version": "7.14.0"
},
Out of curiosity - what about the above JSON?
Agree, but I'm afraid that switching to true input based approach is a long term. That's why we focused on processors and we're exploring options to describe the schema. A possible solution was described above: #63 (comment) . As there is no way to customize/enable additional processors in the Fleet, we may include it in the short term plan. I'm open to work out a solution/plan that will satisfy all our needs. |
I think we can discuss about ownerships later, if we decide to have an "Agent Fields Schema" (or however we call it) I agree that it doesn't need to be necessarily owned by the Agent team. We can leave this out of the discussion by now, let's focus on the technical solution.
Inputs and data providers are distributed with the agent, this would be the reason to couple the "Agent Fields Schema" with the Agent, even if the Agent doesn't include any field, components distributed with it do it. This "Schema" would include all the fields that can be added by all the inputs and data providers included in an Agent release. Integrations should be independent to this schema as they don't have any control on these fields, we would be coupling integrations to agent releases and this is something we want to avoid. Integrations should only include the mappings for the fields with the information they collect. Fields that are currently included in integrations, but provided by the agent (or the inputs or data providers distributed with it), as the mentioned |
As I see it, we have added a mechanism to add ECS fields (which is a "well-known" and "privileged" schema) and we are now looking for a similar way to add additional sets of fields. The fact that those fields currently come from processors or agent itself is an "implementation detail" so we should avoid naming that implies that, as it may change / evolve in the future. We could start with something like "Common Data Providers Schema" or something like that. Shouldn't we go directly for multiple schemas support? It would be identified by and URL (plus some hash for build reproducibility) and an integration package could just import fields from as many as needed. |
This way seems to be intuitive and flexible option, agree here. It isn't hard to extend current implementation.
I'm afraid we can't import every single field with the exported field definition. Processors may potentially add dozens of them. That's why I proposed "uses" directive in #63 (comment) . With "uses" we can mark that the package utilizes processors, so appropriate fields could be imported. Another option would be extending the build.yml file with other "Common Data Providers" with additional flag ( |
@andresrc but take into account that they are set of fields needed at different moments. The mechanism added for ECS fields is for integrations to more easily declare fields that they provide, this is something specific of each integration. The "Common Data Providers Schema" is not specific of integrations, it is specific of the "data providers", integrations don't have any control on these fields, and in my opinion they shouldn't have any reference to them.
SGTM.
+1 to support multiple schemas support in the external fields mechanism introduced for ECS, but it wouldn't solve the other issue.
This
This would force all integrations to maintain a list with all the possible data providers, in my opinion this is a no-go. Actually I think that packages or the the package-spec shouldn't include anything related to the fields provided by data providers. Maybe we should move this conversation to more specific issues. We could extract something like these different tasks:
|
So, there are two approaches here:
The main disadvantage of opt-in is that we need to explicitly list providers in integrations and update the package versions as those providers evolve. The main advantage is that everything can be solved at build time and it's deterministic (there are no surprises between what I built and transformations performed at runtime). The main advantage of opt-out is that there is higher decoupling between integration and provider development. The disadvantage is that conflicts can appear at runtime and that we need more runtime support. Because of this I think we should start with the opt-in approach:
|
Thank you for joining the discussion, it seems that we reached an agreement - a package My candidates:
|
One question: Can some of those providers reference ECS fields? If so, how would the reference work? Can we send an email with a summary of the final proposal? |
They have to refer to ECS fields, but no fields will be stored in resolved form. I wouldn't resolve them on this level to prevent dependency hell (packages uses ECS 1.10, some processors uses ECS 1.9, some ones uses ECS 1.8 - it's impossible to keep consistency between field definitions). I think we can adopt a similar format to packages - fields directory. Every YAML file represents a data provider (e.g. Beats processor), contains field definitions and references to ECS fields. When an integration
Sure I can do this. I will also highlight that format of |
@andresrc, for the data providers fields, this implies to copy all fields definitions of all the providers to all the packages, are we ok with this solution by now? For clarity, I am refering to the fields added by composable data providers, such as these ones: https://github.com/elastic/beats/blob/master/x-pack/elastic-agent/pkg/composable/providers/kubernetes/kubernetes.go#L103-L106 I agree that this solution is easier to implement (or looks easier), but this couples packages to agent releases, we should release all packages every time a new field is added to a composable provider included in the agent. And users should update all their integrations to these new versions every time an agent is upgraded to a version including new fields. |
I had a chat about the solution with @jsoriano . Jaime raised concerns around kubernetes.go, which means that there are some fields added directly by elastic-agent, so we can't just trust fields defined in integrations. Even if we document all common fields used by integrations, values/types for some of them may differ depending on the elastic-agent's version. @andresrc Is it an acknowledge risk or we should raise this issue to the Agent team? EDIT: Oh, it looks we posted it in the same time :) |
Let's send a summary of the current proposal to the whole team and continue the conversation there, thanks! |
One point I'm not totally clear on (apologies if this point was discussed, I may have missed it): why is it preferred that these fields be maintained in a separate repo rather than contributed to ecs itself? |
If the issue is e.g. syncing different Agents to different versions of ECS I would imagine that would be easier to manage by using the ECS toolset to generate &/or extend the desired field subsets, rather than creating these externally and then e.g. adding new tools to reconcile these against ECS |
So, for example, generating a new JSON/YAML file in <data_stream>/_dev directory with listed ECS fields looks interesting |
Currently all these fields are scattered across Beats code and most likely won't be good candidates to import 1:1 into ECS. Whenever a field is a good fit into the ECS, we'll go for it. Also, we need a place that we own, similarly to Beats, but with higher consistency rate.
I'm not quite sure if we like to follow this way. We have fields that are product oriented, e.g. Kubernetes, Docker. I'm not sure if these are good candidates.
This feature has been already delivered. Integrations can import fields directly from ECS (based on ECS version). If you're interested, here is the HOWTO guide. |
Please notice that this issue actually includes two issues (see this comment):
|
re Kubernetes, there is an completed RFC 0012: Orchestrator field set creation |
Yes, there will likely always be fields that are not candidates for ECS, and it also seems reasonable to have the list of fields in one place vs. scattered. However, I do worry that over time as new fields are considered and added to the "agent common schema", those fields may become less likely to make their way into ECS. But we also already have similar challenges today. For example, |
In general I like the deterministic approach we are following here. There is a concern I have about the number of fields this will add to each dataset and with it increase size of all templates and mapping. This multiplies quickly with many dataset and if we just add k8s fields to all integrations even if someone does not use it, it is not great. There might be a partial way out here: dynamic mappings. Instead of configuring all fields for k8s in the referenced fields it is a dynamic mapping that makes sure the fields are dynamically mapped correctly, most are keywords anyways so likely do not even need the mapping to be set as this is the default. One completely different alternative is to use more recent feature in Elasticsearch that the mapping can be sent as part of the request. Like this the creation of these mappings would be delegated to Beats as part of the pull request. But it would have to be investigated if this causes issues with the permissions. PS: I don't like that we keep mixing two discussions into a single issue. It keeps creating confusion. We should close this issue and have a separate one for the "current" discussion. |
Yes, this also worries me. I didn't want to add factors to the discussion, because this is in progress, but having many fields (and many sources of fields) may also complicate the controlled annotation of dimensions (elastic/elasticsearch#74014). We have also proposed different approaches based on dynamic templates and mappings during this thread (#63 (comment), #63 (comment), #63 (comment)). This would be actually the ideal situation because each data stream would only contain the mappings needed for the data it stores, we should dedicate some efforts to explore this line.
+1, and probably this discussion doesn't belong to the package spec, because as I have said before I don't think packages or integrations should have any reference to fields they don't control. |
@ruflin good idea about splitting the thread as ECS deduplication is done. Let's continue in the follow-up issue. |
Sorry to re-open, I've been reading through the thread in detail & there are a couple of things I'd like to understand better. I understand:
I'm not sure about
My concern is that we will move towards 2 distinct technologies for maintaining schemas. Perhaps more importantly than the field definitions themselves, if one of these technologies adds/removes a feature it may then be incumbent on the other technology to implement this, which adds, not removes more dependencies. The ECS Toolset currently provides for both a common schema AND extensions to that schema However, while the ECS repository currently accomodates Common Schema I fully understand there is a Gap - a repository to house Agent fields. So, here's my question - To prevent divergence (without getting into the detail of design &/or reuse) is it technically feasible that the ECS and Agent fields might both share tooling derived from a common parent? |
If you look into elastic/integrations, you will notice that in many cases developers tend to copy same ECS fields definitions (name, type, description, other properties) into multiple places. This way you can introduce some inconsistencies or errors. We needed an adjustment here, so we decided to improve our builder tool - elastic-package.
I'm happy to review the ECS toolset, didn't know about it, thanks! Please keep in mind that package development is relatively dynamic and we wouldn't like to get blocked, as some fields are waiting for an official ECS release. We're just trying to organize them on our end (as they're scattered now across the entire Beats repository - field validation is impossible). If you have any tools to create and manage field schema extensions in external repositories, please share them.
I can review the ECS toolset you're offering, but we have one principle/hard requirement on our end - elastic-package is the only tool a developer needs to build a package/integration. I'm not sure about a common parent. Our intention was introducing a lightweight solution to deduplicate field definitions and keep consistency, we definitely don't want to over-engineer here :) I hope you don't mind if I close the issue and we can continue the discussion in the linked follow-up - #199 . |
Currently each dataset duplicates the definitions of the ECS fields that it uses (data type, descriptions, examples, etc). This puts a burden on the package maintainers to copy the ECS definitions into the dataset and keep it in sync with ECS.
It would be simpler to develop and maintain a package if the dataset only required listing the names of ECS fields that the module uses (and the ECS version). When the package is build the full field definitions for the specified fields can be imported from ECS.
The dataset would declare:
host.*
, but that's part of the discussion)The build step would create field declarations in YAML format that conform the package spec.
The text was updated successfully, but these errors were encountered: