Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce a constant_keyword field. #49713

Merged
merged 29 commits into from
Mar 2, 2020

Conversation

jpountz
Copy link
Contributor

@jpountz jpountz commented Nov 29, 2019

This field is a specialization of the keyword field for the case when all
documents have the same value. It typically performs more efficiently than
keywords at query time by figuring out whether all or none of the documents
match at rewrite time, like term queries on _index.

The name is up for discussion. I liked including keyword in it, so that we
still have room for a constant_numeric in the future. However I'm unsure
whether to call it constant, singleton or something else, any opinions?

For this field there is a choice between

  1. accepting values in _source when they are equal to the value configured
    in mappings, but rejecting mapping updates
  2. rejecting values in _source but then allowing updates to the value that
    is configured in the mapping
    This commit implements option 1, so that it is possible to reindex from/to an
    index that has the field mapped as a keyword with no changes to the source.

This field is a specialization of the `keyword` field for the case when all
documents have the same value. It typically performs more efficiently than
keywords at query time by figuring out whether all or none of the documents
match at rewrite time, like `term` queries on `_index`.

The name is up for discussion. I liked including `keyword` in it, so that we
still have room for a `singleton_numeric` in the future. However I'm unsure
whether to call it `singleton`, `constant` or something else, any opinions?

For this field there is a choice between
 1. accepting values in `_source` when they are equal to the value configured
    in mappings, but rejecting mapping updates
 2. rejecting values in `_source` but then allowing updates to the value that
    is configured in the mapping
This commit implements option 1, so that it is possible to reindex from/to an
index that has the field mapped as a keyword with no changes to the source.
@jpountz jpountz added >feature :Search Foundations/Mapping Index mappings, including merging and defining field types v8.0.0 v7.6.0 labels Nov 29, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search (:Search/Mapping)

@jpountz jpountz added the WIP label Nov 29, 2019
@jpountz jpountz changed the title Introduce a singleton_keyword field. Introduce a constant_keyword field. Nov 29, 2019
@jpountz jpountz removed the WIP label Dec 2, 2019
@jtibshirani
Copy link
Contributor

This looks like a helpful addition! My first impression is that there is some overlap with existing mapping options:

  • There are similarities to the _meta field -- an alternative approach would be to allow queries and aggregations on data in the _meta field. This could help in a set-up where each index corresponds to a different combination of metadata fields (like type-label1-label2). However it's not a good fit if the user needs the data available in the _source -- perhaps this is part of the use case?
  • We could have also added this as an option to the keyword field. I'm guessing you didn’t do this because it would be confusing/ messy... certain keyword options like eager_global_ordinals would no longer make sense.

Related to the above, would you be able to summarize the motivating use case? I think I understand it from looking over internal discussions, but it would be nice to verify (and a summary would be helpful for our external contributors taking a look).

@jpountz
Copy link
Contributor Author

jpountz commented Dec 3, 2019

Thanks for looking @jtibshirani. I added more documentation that should help answer your questions. Queries on _meta could be an interesting idea, but I think it addresses a different need. One major benefit here is that constant_keyword fields look like real fields and can be easily be used in index topologies where some indices have the field mapped as a keyword and other indices as a constant_keyword.

I have a preference for a separate field over an option on keyword for the reasons you mentioned. In my opinion, most keyword options don't make sense anymore as soon as you have a single value, e.g. index, doc_values, ...

@jtibshirani
Copy link
Contributor

jtibshirani commented Dec 5, 2019

Thanks @jpountz, I understand the motivation better now. I could also see this field type being useful when dealing with the 'document type' migration: perhaps a user had a 5.x index containing two document types, then they separated each type into its own index in 6.x. Modelling the type information as a constant_keyword would let them refer to the type in searches + aggregations in an inexpensive way.

Some other high-level thoughts before I work on a detailed review:

  • I personally like name constant_keyword, it’s accurate and simple. It leaves the door open for other field types like constant_date_range (although I would have to think harder about whether that makes sense :))
  • If the intention is to be able to search across a field mapped as keyword and constant_keyword in different indices, should we update the field capabilities response to present constant_keyword as keyword? This made me think more about what the field caps response actually represents -- maybe it doesn't claim to return the mapping type exactly, but rather what simple type it acts like during a search.
  • We are lenient about whether the field must be specified in the _source. This was initially confusing to me -- at first glance I thought that certain documents would be allowed to not contain the keyword at all. Is there a good use case for this leniency? Maybe the user reindexes from a keyword, but then going forward wants to drop the value from the _source?

@jpountz
Copy link
Contributor Author

jpountz commented Dec 5, 2019

should we update the field capabilities response to present constant_keyword as keyword?

Hmm I thought I had raised this question on this PR, but I forgot. I agree this is a good question. One way to handle this would be to let clients know that a keyword vs constant_keyword conflict is not a problem, and the other way that you are suggesting is to use keyword as a type in _field_caps for constant_keyword. The latter is maybe a bit more user-friendly, but also gives less information to the clients. There is a precedent to this, which is that we are not collapsing the long, double, ... types to a single numeric type even though they can generally be searched and aggregated together. I'm a bit undecided on this issue and am leaning towards keeping it its own type in the _field_caps response, but could be convinced otherwise.

We are lenient about whether the field must be specified in the _source. This was initially confusing to me -- at first glance I thought that certain documents would be allowed to not contain the keyword at all. Is there a good use case for this leniency?

I raised this question to @clintongormley. For the use-cases we started envisioning for this field, like a data_type field that could take either logs or metrics as a value, we'd like to have the field in the _source all the time, because it might be used for detection rules, routing (not Elasticsearch routing, but rather for data shippers to know to which index to send the data), etc. Ideally Elasticsearch could do this automatically for constant_keyword fields, ie. remove the field from source just before indexing and adding it back in search results, but it would require a much bigger change.

@jpountz
Copy link
Contributor Author

jpountz commented Dec 5, 2019

@timroes mentioned to me that it doesn't matter much to Kibana whether this field is exposed as a keyword or a constant_keyword in _field_caps. It would be easy for Kibana to handle this on top of the _field_caps API.

Thinking more about this, I wonder whether we will want to handle keyword and constant_keyword differently in the context of #49264 (though I don't have concrete ideas), so it might be more future-proof to return constant_keyword fields as constant_keyword rather than keyword in _field_caps.

@colings86
Copy link
Contributor

accepting values in _source when they are equal to the value configured
in mappings, but rejecting mapping updates

My vote would be for this option. The reason being, I may decide to split my data so I have all my event.type: foo data in indexes prefixed with foo. However I may not be able to control whether the ingest source includes that field in the documents it ingests. For example if I am using FooBeat, it will include the event.type field whether I want it to or not. But I would still want to define a constant field for event.type so I can take advantage of the efficiencies at query time (and to save a small amount of disk space).

So I think we need the field to accept values in _source when they are equal to the value configured otherwise it might make the feature somewhat hard to use in practice. Another benefit this gives is that if I accidentally point BarBeat at the same index I will get rejections and know that I've made a mistake.

@jpountz
Copy link
Contributor Author

jpountz commented Feb 28, 2020

I'm wondering if we have other field types that perform 'autoconfiguration' like this? It seems that currently our dynamic mapping updates always add a new field, instead of modifying the mappings of an existing one. I'm just curious, I couldn't think of special issues/ considerations that might pop up with this 'autoconfiguration' behavior.

We only use it to introduce new fields today indeed, though it's not so different in that an introduction of a new field is a modification of the parent object field.

@jpountz
Copy link
Contributor Author

jpountz commented Feb 28, 2020

@jtibshirani You said the change looks good to you, but didn't approve it. Let me know if there are other things that you would like to discuss regarding this change?

Copy link
Contributor

@jtibshirani jtibshirani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had wanted to double-check one more thing (which I just did). Approved!

@jpountz
Copy link
Contributor Author

jpountz commented Feb 28, 2020

Thanks!

@jpountz
Copy link
Contributor Author

jpountz commented Mar 2, 2020

@elasticmachine run elasticsearch-ci/2

@jpountz jpountz merged commit 0b2a628 into elastic:master Mar 2, 2020
@jpountz jpountz deleted the feature/singleton_keyword_field branch March 2, 2020 15:46
@ruflin
Copy link
Contributor

ruflin commented Mar 2, 2020

@jpountz Thanks for making this happening. Looking forward to play around with it!
@neptunian This will be useful for our templates!

@jpountz
Copy link
Contributor Author

jpountz commented Mar 2, 2020

@ruflin You're welcome. Note that the backport is still pending in case you plan to play with a 7.x snapshot.

jpountz added a commit to jpountz/elasticsearch that referenced this pull request Mar 2, 2020
This field is a specialization of the `keyword` field for the case when all
documents have the same value. It typically performs more efficiently than
keywords at query time by figuring out whether all or none of the documents
match at rewrite time, like `term` queries on `_index`.

The name is up for discussion. I liked including `keyword` in it, so that we
still have room for a `singleton_numeric` in the future. However I'm unsure
whether to call it `singleton`, `constant` or something else, any opinions?

For this field there is a choice between
 1. accepting values in `_source` when they are equal to the value configured
    in mappings, but rejecting mapping updates
 2. rejecting values in `_source` but then allowing updates to the value that
    is configured in the mapping
This commit implements option 1, so that it is possible to reindex from/to an
index that has the field mapped as a keyword with no changes to the source.

Backport of elastic#49713
jpountz added a commit that referenced this pull request Mar 3, 2020
This field is a specialization of the `keyword` field for the case when all
documents have the same value. It typically performs more efficiently than
keywords at query time by figuring out whether all or none of the documents
match at rewrite time, like `term` queries on `_index`.

The name is up for discussion. I liked including `keyword` in it, so that we
still have room for a `singleton_numeric` in the future. However I'm unsure
whether to call it `singleton`, `constant` or something else, any opinions?

For this field there is a choice between
 1. accepting values in `_source` when they are equal to the value configured
    in mappings, but rejecting mapping updates
 2. rejecting values in `_source` but then allowing updates to the value that
    is configured in the mapping
This commit implements option 1, so that it is possible to reindex from/to an
index that has the field mapped as a keyword with no changes to the source.

Backport of #49713
jpountz added a commit to jpountz/elasticsearch that referenced this pull request Mar 3, 2020
These tests can be enabled now that the change has been backported.
jpountz added a commit that referenced this pull request Mar 3, 2020
These tests can be enabled now that the change has been backported.
russcam added a commit to elastic/elasticsearch-net that referenced this pull request Apr 15, 2020
Relates: elastic/elasticsearch#49713

This commit adds the ConstantKeyword property to the client.
Value is exposed as type Object as it can be a string or numeric
value.
@abraxxa
Copy link

abraxxa commented May 14, 2020

I hope a late comment is better than none: to me 'constant' means the value can't change, like read-only after the document was indexed, not that all documents of an index have the same value. In that case why store it at all?

@jpountz
Copy link
Contributor Author

jpountz commented May 14, 2020

Do you have suggestions for a better name?

@abraxxa
Copy link

abraxxa commented May 14, 2020

I'm no native English speaker but 'equal' and 'same' are the terms that come up in translators that I would understand.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>feature :Search Foundations/Mapping Index mappings, including merging and defining field types v7.7.0 v8.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants