Better default mappings for logs #88777

Closed
felixbarny opened this issue Jul 25, 2022 · 6 comments
Assignees
Labels
>enhancement :Search Foundations/Mapping Index mappings, including merging and defining field types Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch

Comments

@felixbarny
Member

felixbarny commented Jul 25, 2022

There are several pitfalls when using the default mappings for logs (the logs-*-* index template):

  • Data loss
    • Mapping issues due to object vs scalar conflicts ("host": "foo", "host.name": "foo")
    • Mapping issues due to conflicting types ("foo": 42, "foo": "bar")
    • Mapping explosions
  • Ingestion and disk overhead because all fields are indexed by default, even fields that are rarely or never searched (for example process.argv)
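
To make the object vs scalar conflict concrete, here is a toy Python model. It is not Elasticsearch code: `ToyMapping` and its `index` method are made up for illustration, and documents are assumed to be flat dicts with dotted field names. It only mimics the rule that every prefix of a dotted name must be an object, so `host` cannot be both a scalar and the parent of `host.name`.

```python
class ToyMapping:
    """Toy model (hypothetical, not an Elasticsearch API): tracks whether
    each field path is an object or a scalar."""

    def __init__(self):
        self.kinds = {}  # dotted path -> "object" or "scalar"

    def index(self, doc):
        # Work on a copy so a rejected document leaves the mapping untouched.
        pending = dict(self.kinds)
        for field_name in doc:
            parts = field_name.split(".")
            # Every proper prefix of a dotted name has to be an object ...
            required = [(".".join(parts[:i]), "object") for i in range(1, len(parts))]
            # ... and the full name itself holds a scalar value.
            required.append((field_name, "scalar"))
            for path, kind in required:
                existing = pending.setdefault(path, kind)
                if existing != kind:
                    raise ValueError(
                        f"{path!r} is mapped as {existing}, cannot use it as {kind}"
                    )
        self.kinds = pending


mapping = ToyMapping()
mapping.index({"host": "foo"})           # "host" becomes a scalar
try:
    mapping.index({"host.name": "foo"})  # needs "host" to be an object
except ValueError as err:
    print("rejected:", err)
```

Whichever shape arrives first wins, and documents using the other shape are rejected, which is the data loss described above.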
@felixbarny felixbarny added >enhancement :Search Foundations/Mapping Index mappings, including merging and defining field types Team:Search Meta label for search team labels Jul 25, 2022
@felixbarny felixbarny self-assigned this Jul 25, 2022
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search (Team:Search)

@felixbarny felixbarny linked a pull request Jul 25, 2022 that will close this issue
@felixbarny
Member Author

felixbarny commented Jul 25, 2022

Some ideas and options on how we can mitigate the challenges:


Mapping issues due to object vs scalar conflicts ("host": "foo", "host.name": "foo")

Set subobjects: false at the root of the mapping
Pros:

  • Both host and host.name can be indexed and searched

Cons:

  • Currently, Elasticsearch rejects documents that contain nested objects instead of transparently converting them to flattened fields. See also Automatically flatten objects when subobjects: false #88934
  • Exists queries on object fields are not possible (probably not a big issue?)
  • When synthetic source is enabled, the re-constructed source is completely flat (probably not a big issue?)
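
A minimal sketch of this option, written as Python dicts. The `mappings` fragment uses the `subobjects` parameter named above. The `flatten` helper is hypothetical: it shows what converting nested objects to dotted field names could look like if done on the client side, since Elasticsearch currently rejects such documents. It does not handle arrays of objects.

```python
# Mapping fragment for the option above: no field is treated as an object,
# so dotted names are kept as plain leaf fields.
mappings = {
    "subobjects": False,
}


def flatten(doc, prefix=""):
    """Hypothetical client-side helper: turn nested objects into dotted keys."""
    flat = {}
    for key, value in doc.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into the object and prefix its keys with the parent path.
            flat.update(flatten(value, prefix=f"{path}."))
        else:
            flat[path] = value
    return flat


nested = {"host": {"name": "foo", "os": {"family": "linux"}}, "message": "hi"}
print(flatten(nested))

# With flat field names, "host" and "host.name" are simply two different keys.
print(flatten({"host": "foo", "host.name": "foo"}))
```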

Make ignore_malformed ignore invalid object fields (#12366)

Pros:

  • Avoids the cons of subobjects: false

Cons:

  • Depending on the order of document ingestion, either host or host.name will win and the other can't be indexed

Mapping issues due to conflicting types ("foo": 42, "foo": "bar")

Set ignore_malformed as an index-level default.
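
As a rough sketch, the index-level default is the `index.mapping.ignore_malformed` setting, shown here as a Python dict. The `index_doc` function below is a toy model and not Elasticsearch code: it assumes the first value seen decides the field type and reduces the type system to `long` and `keyword`, which is a simplification of real dynamic mapping.

```python
# Index settings fragment: ignore malformed values by default for all fields.
settings = {"index": {"mapping": {"ignore_malformed": True}}}


def index_doc(doc, field_types, ignore_malformed):
    """Toy model (hypothetical): return (indexed_fields, ignored_field_names)."""
    indexed, ignored = {}, []
    for name, value in doc.items():
        # The first value seen for a field decides its type.
        field_type = field_types.setdefault(
            name, "long" if isinstance(value, int) else "keyword"
        )
        if field_type == "long":
            try:
                indexed[name] = int(value)
            except (TypeError, ValueError):
                if not ignore_malformed:
                    # Without the setting, the whole document is rejected.
                    raise ValueError(f"failed to parse field {name!r} as long")
                # With the setting, only this field is dropped from the index.
                ignored.append(name)
        else:
            indexed[name] = str(value)
    return indexed, ignored


field_types = {}
print(index_doc({"foo": 42}, field_types, ignore_malformed=True))
print(index_doc({"foo": "bar"}, field_types, ignore_malformed=True))
```

In this model the second document is still accepted, but `foo` is not indexed for it, so the document is kept while the conflicting value is not searchable.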


Mapping explosions
Ingestion and disk overhead because all fields are indexed by default, even fields that are rarely or never searched (for example process.argv)

Only index fields that are commonly used for searching or filtering. Either disable dynamic mappings or use dynamic: runtime.
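
A sketch of what such a mapping could look like, as a Python dict. The choice of explicitly mapped fields (`@timestamp`, `message`, `log.level`) is an assumption for illustration, not a proposal from this issue. The `is_indexed` helper is hypothetical and only inspects the dict.

```python
# Only the listed fields are indexed; any other field that shows up is added
# as a runtime field (dynamic: runtime) instead of being indexed.
mappings = {
    "dynamic": "runtime",
    "properties": {
        "@timestamp": {"type": "date"},
        "message": {"type": "text"},
        "log": {"properties": {"level": {"type": "keyword"}}},
    },
}


def is_indexed(mappings, dotted_name):
    """Hypothetical helper: is this field explicitly mapped with a type?"""
    node = mappings
    for part in dotted_name.split("."):
        node = node.get("properties", {}).get(part)
        if node is None:
            return False
    return "type" in node


print(is_indexed(mappings, "log.level"))     # explicitly mapped
print(is_indexed(mappings, "process.argv"))  # would become a runtime field
```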

See also #85692 (comment)

@felixbarny
Member Author

@giladgal raised the question of whether we can leverage synthetic _source by default. That's definitely a question we should discuss in more detail.

One challenge is that we wouldn't want any dynamic field to be indexed, so we can't set dynamic: true. When we set dynamic to false or runtime, runtime fields rely on _source to look up the field value. Not only does this make it impossible to disable _source and still use runtime fields, but _source-based runtime fields are also very slow.

But what if there were a way to customize the dynamic setting further, so that dynamic fields are not indexed but only stored or only added to doc_values?

Storing each value of the original doc may still consume less disk space compared to storing the full _source.

Not sure what that would mean for being able to query across conflicting types, though. Could you still index two docs like {"foo": 42} and {"foo": "bar"} and successfully find the second document by doing a query like foo:bar?

@javanna
Member

javanna commented Aug 16, 2022

The fact that runtime fields rely on _source does not necessarily mean that _source cannot be synthetic. The feature that you are describing reminds me of dynamic templates. Though it sounds like what you are looking for is a less manual way to map all fields so that they are compatible with synthetic source? You would, for instance, want to enable doc_values whenever supported, but fall back to _source for field types that don't support doc_values. This is something that we have been considering for a while as a follow-up to the current synthetic source work.

@ruflin
Contributor

ruflin commented Aug 17, 2022

I'm not sure using synthetic source would be the right solution here. In general, when things go wrong, we should always keep the original _source of the event (ideally even from before the ingest pipeline(s)). Even though synthetic source could mostly reconstruct the data, it might look different, which I don't think is desirable in an error scenario.

So my take is that having "only" _source for most of the fields is actually a feature, even though it might mean more storage than synthetic source. Optimising storage can happen for "known" datasets where the original source might not be needed anymore.

@javanna
Member

javanna commented Aug 22, 2022

I went through this issue one more time, and while I agree with the high-level "better default mappings for logs" goal, I think that we should get together and discuss goals and options, and once we have a clearer idea, open more focused issues against the ES repo. Closing for now.

@javanna javanna closed this as completed Aug 22, 2022
@javanna javanna added Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch and removed Team:Search Meta label for search team labels Jul 16, 2024