Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Logs+] Adding ECS dynamic templates #96171

Merged
merged 23 commits into from
Jun 7, 2023

Conversation

eyalkoren
Copy link
Contributor

Closes #95538

@elasticsearchmachine elasticsearchmachine added v8.9.0 external-contributor Pull request authored by a developer outside the Elasticsearch team labels May 16, 2023
@eyalkoren eyalkoren changed the title Adding ecs dynamic templates [Logs+] Adding ECS dynamic templates May 16, 2023
Copy link
Contributor

@ruflin ruflin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my perspective we cover too many fields in here by default. I would not map multi fields by default for example. To simplify the mappings, I think we can be even more generic on *.{field} and _{field}. Like this, ECS becomes more a pattern then specific fields. The nice part is, if we have *.ip as keyword, it makes sure that also foo.ip is a keyword by default.

There are some key fields we don't map like host.name because these are keyword. I would still map these specifically, to ensure nobody is indexing host at first. As soon as we have subobject: true, this becomes less of a problem.

@eyalkoren
Copy link
Contributor Author

@ruflin Thanks for all the input!
I'll go over the comments one by one, investigate a bit and address them.

@eyalkoren
Copy link
Contributor Author

eyalkoren commented May 17, 2023

I would not map multi fields by default for example

In general? For specific ones?
I think there are clear cases where the user experience for logs use case will be highly impacted by this. For example, I think that multi-field mapping for fields ending with .name makes a lot of sense. Take thread name as one example- it's very likely that you'd want to search through your logs for something on a specific thread, or anything that happens on a thread pool (with threads named with a pattern, like thread-pool-0, thread-pool-1 and so forth. EDIT: even if this is a pattern that is searchable by keywords, assume some name that the user roughly remembers and wants to search for). I could make the same argument for service names.
Similarly, there are other fields where this logic may apply.

The nice part is, if we have *.ip as keyword, it makes sure that also foo.ip is a keyword by default.

Wouldn't "match": "ip" cover this?

@eyalkoren
Copy link
Contributor Author

Summarizing my discussions with @felixbarny and @ruflin:
Our aim is to provide ECS mappings component that:

  • is reusable
  • imposes minimal overhead
  • requires minimal maintenance effort

By relying on dynamic templates, we make sure we minimize the number of field mappings to only those that are actually encountered.
The other overhead we want to minimize is the number of dynamic templates that are managed within the cluster state, but with the ability to configure multiple names/patterns for each dynamic template, we can manage lots of mappings with very small number of entries. Therefore, we are dismissed from trying to minimize the names/patterns we use from ECS, meaning that we don't need to overthink which to include.
What this also means is that there is no value in splitting it into core and extended, as this will only increase the number of entries we'd need to use and maintain. Moreover, since we want to reduce maintenance effort, we're leaning towards using patterns, which may span fields from both core and extended, which will make such separation even less useful.

One thing I've noticed during our discussions is that there may be a problem with using a generic fallback mapping for all field of a specific type, such as:

{
  "strings_as_keyword": {
    "mapping": {
      "ignore_above": 1024,
      "type": "keyword"
    },
    "match_mapping_type": "string"
  }
}

If such is included within a component template that is used in combination with another component template that includes dynamic mappings, it may override its configurations. This is something I'll test, but there are probably two options:

  1. Elasticsearch does not allow multiple sources of dynamic templates, meaning it only allows one
  2. Multiple sources for dynamic templates are allowed, in which case the order matters

In both cases, we'll probably have to REPLACE data-streams-mappings.json with our new ECS dynamic templates.
The remaining question is whether we include a generic fallback mapping as the one above within the ECS templates (@ruflin wants us to, as it is currently considered a convention for ECS users) and if so- how do we allow users to add custom dynamic templates?

@eyalkoren
Copy link
Contributor Author

eyalkoren commented May 28, 2023

Conclusions from tests:
Dynamic templates that arrive from multiple component templates of a single composable template are concatenated to each other, based on their declaration order within the composed_of section , creating a single list of entries. Since a dynamic mapping for a field is determined based on the first matched name/pattern, having a generic fallback entry like described above would affect all following entries.
For example, if logs-template.json is composed of data-streams-mappings and ecs-dynamic-mappings, in this order, a field called *.test_ip will be matched to a keyword, because it will use the fallback and not a dynamic template that matches *_ip, as specified in ecs-dynamic-mappings. Reverting the order of these component templates' declaration within logs-template.json's composed_of array will ensure that *_ip fields are mapped to ip type.

Some options we have going forward (assuming that we don't want to change data-streams-mappings):

  • not use data-streams-mappings for logs and make sure that all of its configurations are covered by other logs component templates
  • not include a generic keyword-fallback mapping within the ECS dynamic templates and make sure that data-streams-mappings is used after the ECS dynamic templates for logs data streams

@eyalkoren
Copy link
Contributor Author

With regards to the following comment, I chose to go with the second option:

not use data-streams-mappings for logs and make sure that all of its configurations are covered by other logs component templates

This accommodates @ruflin's request to include a default fallback mapping of string fields to keyword.
Some differences we'll have when replacing the data-stream-mapping with the new ECS templates:

NOTE that by using the string-to-keyword default mapping, we introduce the same limitation we now encountered with the data-stream component template- anyone that would want to use the ECS template, will have to be aware of this default and the related limitation (not adding dynamic templates AFTER it).

I hope I addressed all requirements and queries so far. This PR is ready for review.
You can ignore the failing test for now. Essentially, CoreWithSecurityClientYamlTestSuiteIT fails with No mapper found for type [wildcard] when trying to parse the new template.
My guess is that something is missing from the cluster settings for this test to ensure that the wildcard data-type is enabled, however I can’t figure out what. I'll keep looking, also hoping for @jbaiera help on that. If anyone knows what needs to be explicitly enabled - I'll be happy to know.

@eyalkoren eyalkoren requested review from ruflin and felixbarny May 30, 2023 08:46
@eyalkoren eyalkoren self-assigned this May 30, 2023
@eyalkoren eyalkoren marked this pull request as ready for review May 30, 2023 08:47
@elasticsearchmachine elasticsearchmachine added the needs:triage Requires assignment of a team area label label May 30, 2023
@eyalkoren eyalkoren added the :Data Management/Data streams Data streams and their lifecycles label May 30, 2023
@elasticsearchmachine elasticsearchmachine added Team:Data Management Meta label for data/management team and removed needs:triage Requires assignment of a team area label labels May 30, 2023
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@elasticsearchmachine
Copy link
Collaborator

Hi @eyalkoren, I've created a changelog YAML for you.

Copy link
Contributor

@ruflin ruflin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change LGTM. As soon as this in, we should watch for any unexpected regressions. Few specific things I reviewed / checked:

  • Is the data-streams-mappings mapping still around? Yes, this ensures even though we don't use it, if someone referenced it, they keep having the same experience
  • No constant_keyword fields in the ECS mappings: There are non, it was taken out which is expected
  • Are all ECS fields mapped: The tool from Marius confirmed this.

I have a concern that we have not landed yet on a naming convention for templates and pipelines #96267 We are introducing 3 new component templates, good time to have one. The PR can go in and we can still change it as a follow up but it must happen before we release it.

Copy link
Contributor

@ruflin ruflin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change LGTM. As soon as this in, we should watch for any unexpected regressions. Few specific things I reviewed / checked:

  • Is the data-streams-mappings mapping still around? Yes, this ensures even though we don't use it, if someone referenced it, they keep having the same experience
  • No constant_keyword fields in the ECS mappings: There are non, it was taken out which is expected
  • Are all ECS fields mapped: The tool from Marius confirmed this.

I have a concern that we have not landed yet on a naming convention for templates and pipelines #96267 We are introducing 3 new component templates, good time to have one. The PR can go in and we can still change it as a follow up but it must happen before we release it.

@eyalkoren
Copy link
Contributor Author

Summarizing the feedback received offline:

  1. ECS mappings should not contain constant_keyword
  2. data streams should contain data_stream.* field explicit mappings to constant_keyword
  3. make templates as customizable as possible, allowing users to override default settings as much as possible
  4. the default string -> keyword mapping should be explicitly part of ECS
  5. restore "date_detection": false that was originally set through data-streams-mappings.json

38cbb38 and 150b0ab contain the implementation of this feedback.

Rationale:

  • a new component template contains the general data_stream field mappings to constant_keyword. It contains the three subfields explicitly so that the mappings are as complete AND specific as possible. I decided to use dynamic templates for them because this ensures that explicit mappings, like we have for data_stream.type, will have precedence, while still enforcing the correct mapping. The tests verify that all and only these three data_stream fields are mapped correctly and that the explicit data_stream.type is applied
  • the ECS mappings component contains the default string -> keyword as its last entry and it is added to logs-template.json as the last component, after logs@custom
  • after some thought, I decided that since "date_detection": false is not a feature of data streams, it needs to come in conjugation with the string -> keyword mapping, that implicitly disables it anyhow, so I added it to the ECS mappings

@ruflin @felixbarny please provide feedback on the above.
Once we finalize this, we will decide if we merge or block until we finalize naming convention.

@eyalkoren eyalkoren requested a review from ruflin June 4, 2023 17:28
@ruflin
Copy link
Contributor

ruflin commented Jun 5, 2023

To conclude the naming discussion, I put a comment in #96267 (comment) My specific proposal for this PR is:

    "logs-mappings" -> keep for legacy reasons, should be logs@mappings
    "data-streams-mappings" -> keep for legacy reasons, should be deprecated
    "ecs-data-stream-mappings" -> ecs@data_stream-mappings (see underscore because fields in there have one)
    "logs-settings" -> keep for legacy reasons, should be logs@settings
    "logs@custom", -> keep
    "ecs-dynamic-mappings" -> ecs@dynamic-mappings

@eyalkoren eyalkoren requested a review from felixbarny June 5, 2023 12:18
Copy link
Contributor

@ruflin ruflin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ok to get this merged as is and have a quick follow up PR with ecs@dynamic_templates.

@eyalkoren eyalkoren requested review from dakrone and jbaiera June 6, 2023 06:54
Copy link
Member

@jbaiera jbaiera left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Data Management/Data streams Data streams and their lifecycles >enhancement external-contributor Pull request authored by a developer outside the Elasticsearch team Team:Data Management Meta label for data/management team v8.9.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Logs+] Add default ECS mappings to logs-*-*
6 participants