sub keyword field to string dynamic mappings - name and intent discussion #18195

djschny · 2016-05-07T02:10:36Z

As discussed with @jpountz in #17188 (comment) opening up a separate ticket for discussion here.

Some items for consideration:

Defaulting this way will continue the pattern of users seeing increased disk utilization out of the box as they upgrade versions of Elasticsearch.
By using keyword for the multi-field name we are tightly coupling it to what tokenizer is used. For example, if we ever rename the keyword tokenizer to noop (which I would love to see, since it more accurately describes what it does and is also how we tend to explain it to folks) then the multi-field option.

clintongormley · 2016-05-07T13:13:58Z

In the original issue (#12394) I went into great detail to explain the reasoning behind this change, but to address your questions here:

Defaulting this way will continue the pattern of users seeing increased disk utilization out of the box as they upgrade versions of Elasticsearch.

In the past, the string field could be used for full text search and for aggregations, by loading all the terms into the heap in fielddata. The behaviour of these fields depended largely on the type of value that was specified, eg "The quick brown fox..." implied the use of full text search (but not aggregations or sorting), while "London" might be a single identifier used for single-term lookups, aggregations and sorting. But "New York", which is probably intended for the second use case, could actually only be used for the first, because an analyzed field splits it into two separate terms.
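
To make the "New York" problem concrete, here is a rough Python sketch of how the same values turn into index terms under the two treatments. `analyze_as_text` is a deliberately simplified stand-in for a full-text analyzer (lowercase, split on non-word characters), not Elasticsearch's actual analysis chain, and both function names are made up for illustration.

```python
import re
from collections import Counter


def analyze_as_text(value: str) -> list[str]:
    """Simplified stand-in for a full-text analyzer: lowercase, split into words."""
    return re.findall(r"\w+", value.lower())


def index_as_keyword(value: str) -> list[str]:
    """Keyword-style treatment: the whole value is kept as one term."""
    return [value]


cities = ["New York", "London", "New York"]

# Counting terms approximates what a terms aggregation would bucket on.
text_buckets = Counter(term for city in cities for term in analyze_as_text(city))
keyword_buckets = Counter(term for city in cities for term in index_as_keyword(city))

print(text_buckets)     # buckets for "new", "york" and "london"
print(keyword_buckets)  # buckets for "New York" and "London"
```

Aggregating on the analyzed terms yields separate buckets for "new" and "york", which is almost never what someone counting cities wants.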

We can't deduce which use case a user intends when we receive a string field - it could be either. The solution for this is to provide a main text field for full text search (with fielddata disabled so that users don't unwittingly flood their heap by trying to run aggregations or sorting on that field), and a sub-field of type keyword for the single-term lookup, sorting, and aggregations use case.
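
As a sketch of what that looks like in a mapping, the Python snippet below builds the field definition as a plain dict and prints it as JSON. The field name `city` and the helper name are hypothetical, chosen only for illustration; the structure (a `text` field with a `keyword` sub-field under `fields`) follows the description above.

```python
import json


def string_field_mapping(subfield_name: str = "keyword") -> dict:
    """Build a text field with one keyword sub-field (hypothetical helper)."""
    return {
        "type": "text",  # main field: full text search
        "fields": {
            # sub-field: single-term lookups, sorting, aggregations
            subfield_name: {"type": "keyword"},
        },
    }


# "city" is an example field name, not something from the original discussion.
mapping = {"properties": {"city": string_field_mapping()}}
print(json.dumps(mapping, indent=2))
```

With a mapping like this, full text queries would target `city` while sorting and aggregations would target `city.keyword`.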

The benefit of this is that, without any config, you get both access patterns for string fields out of the box. The downside is that you index string values twice. This is exactly the same pattern that Logstash has used for string fields for a long time, so users of Logstash are unlikely to see any change.

It is very easy to optimize disk space usage here: just map your fields as text or keyword, or add a dynamic mapping for text which specifies whether a field should be only text or only keyword.
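
A minimal sketch of such a dynamic mapping, again as a Python dict: the helper name and the template name are invented for this example, while `dynamic_templates` and `match_mapping_type` are the standard dynamic template settings. Exactly where the fragment sits in a full index definition depends on the Elasticsearch version, so treat this as the shape of the idea rather than a drop-in config.

```python
import json


def strings_as(field_type: str) -> dict:
    """Dynamic template mapping every new string field to a single type."""
    if field_type not in ("text", "keyword"):
        raise ValueError("field_type must be 'text' or 'keyword'")
    return {
        "dynamic_templates": [
            {
                # The template name is arbitrary; this one is made up.
                f"strings_as_{field_type}": {
                    "match_mapping_type": "string",
                    "mapping": {"type": field_type},
                }
            }
        ]
    }


# Index each string once, as keyword only, instead of text plus keyword.
print(json.dumps(strings_as("keyword"), indent=2))
```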

By using keyword for the multi-field name we are tightly coupling it to what tokenizer is used. For example, if we ever rename the keyword tokenizer to noop (which I would love to see, since it more accurately describes what it does and is also how we tend to explain it to folks) then the multi-field option.

No we aren't. This field is not named after the keyword analyzer, it is named after the field type keyword. The field type got its name in the same way as the keyword analyzer did: we don't want full text, we want to treat this value as a single keyword. What other name would you recommend to describe the datatype for this field?

And keyword fields in the future will not be restricted to the keyword analyzer. We will add support for limited analysis which allows, eg lowercasing or performing unicode normalization, or unicode collations.
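
To illustrate what "limited analysis" could mean in practice, the sketch below normalizes a value while still keeping it as a single term. It uses Python's `unicodedata` module purely as an illustration of the concept; the function name and the choice of NFKC are assumptions, not a description of how Elasticsearch implements this.

```python
import unicodedata


def normalize_keyword(value: str) -> str:
    """Unicode-normalize and lowercase a value without splitting it into words."""
    return unicodedata.normalize("NFKC", value).lower()


# Different spellings of the same identifier collapse to one term,
# but multi-word values stay whole.
print(normalize_keyword("New York"))
print(normalize_keyword("NEW YORK") == normalize_keyword("new york"))
# "E" followed by a combining acute accent vs. the precomposed character
print(normalize_keyword("CAFE\u0301") == normalize_keyword("caf\u00e9"))
```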

For me, the only debate is whether this sub-field should be called keyword or raw, which is the name used today in Logstash. For bwc, raw would probably be better, but I think that keyword is more descriptive. My current feeling is that we should continue to use keyword. Logstash is free to keep their index template which uses raw instead.

jpountz · 2016-05-08T15:29:21Z

+1 to what Clinton said. The fact that we did not map strings both for text search and keyword search/aggs in the past caused bad out-of-the-box experiences since you almost certainly had to reindex once you realized that you could not aggregate on whole string values.

Regarding disk usage, it will be higher with default mappings for sure, but the problem is mitigated by the use of ignore_above: 256. There is a trade-off for sure, but I think having to reindex to run aggregations is more disappointing than higher-than-expected disk usage.
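
As a simplified model of how `ignore_above: 256` limits the extra disk cost, the sketch below returns the terms the keyword sub-field would index for a value: the whole value, or nothing when it is longer than the limit. The function name is hypothetical and the length check is a plain character count, which glosses over lower-level details.

```python
IGNORE_ABOVE = 256  # the limit mentioned above


def keyword_terms(value: str, ignore_above: int = IGNORE_ABOVE) -> list[str]:
    """Terms the keyword sub-field would index; values over the limit are skipped."""
    return [value] if len(value) <= ignore_above else []


short_value = "London"
long_value = "x" * 1000  # e.g. a long log message

print(len(keyword_terms(short_value)))  # indexed as a single term
print(len(keyword_terms(long_value)))   # skipped by the keyword sub-field
```

Long free-text values, which are the expensive ones to index twice, are therefore only indexed in the main text field.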

However I'm also open to changing the name to either raw like logstash or original like @rjernst suggested. I have a slight preference for keyword though.

clintongormley · 2016-05-13T09:26:31Z

Discussed it in Fix it Friday - we prefer the keyword field. Logstash can continue to use raw with dynamic templates, should they so choose.

I will improve the docs to explain that we're optimising for the OOB experience, but disk usage can be improved with some simple mappings.

djschny · 2016-05-19T20:26:03Z

What other name would you recommend to describe the datatype for this field?

not_tokenized

jordansissel · 2016-08-04T21:52:02Z

Logstash can continue to use raw

Much of the road to 5.0 has been a theme of consistency. We've used raw for a long long time, and now are suddenly calling this thing keyword -- this is inconsistent. Logstash should not keep inconsistency and is looking at fixing that very soon, which is why I'm here talking about our new friend keyword. I do not believe Logstash can continue using raw because after 5.0 this becomes a user experience problem that ES uses keyword for strings where Logstash uses raw.

That said, for me personally, keyword is the wrong name. "United States" is two words, "San Jose Sharks" is three words, and yet the keyword name implies a singular word. A user agent string is even further from something I would consider a keyword, and yet I use Logstash's raw feature to allow me to do aggregations on user agents. My chief concern in naming things is how much I expect the name to confuse users.

With the hands-on-workshop, I teach people about analyzers/tokenizers by showing what happens to a string by default in Elasticsearch, then we talk about treating these entire strings as a single field value (or "term"). Because we're on the topic of analyzers, it is easy to say "We solve this by using this thing called not_analyzed, and logstash calls this field the 'raw' value". It is early for this keyword feature, but I have trouble coming up with such a story for teaching.

dadoonet · 2016-08-05T03:59:17Z

And raw is a shorter name :)

I think consistency is a good point here.

But I'd like to be able to apply some token filters on this type of field at some point, so I don't think that having "raw" + an analyzer would make sense in terms of meaning.
"Keyword" + an analyzer has more meaning IMO.

I think we should mark this discussion as a blocker for the next release because it will be hard to change after we release the beta.

jordansissel · 2016-08-08T20:50:58Z

I've been thinking the past few days how to find a way to convince myself that keyword is the right name. Here is the story on how I can explain to myself why keyword might be the right name:

I thought keyword was poor because I view Elasticsearch field mappings as a way to say "The data is of this type". This worked well for me to understand and explain various obvious-to-me data types in Elasticsearch such as dates, longs, floats, strings, etc.

In this model, I was telling Elasticsearch what the data is, and trying to distinguish strings vs keyword vs text was not fitting my mental model.

The Elasticsearch documentation on mappings says this:

Mapping is the process of defining how a document, and the fields it contains, are stored and indexed.

In this description, it seems that the mapping is presented as how Elasticsearch uses the data, not what the data is. If I view things with the how in mind, instead of the what, I think keyword makes sense -- I can tell Elasticsearch how to treat something like "United States" (such as text or keyword).

The above explanation may be confusing, but I think I can use this model -- how instead of what -- to tell stories in trainings, etc, about reasons for using text vs keyword. "Treat it as a keyword", for example.

I am still nervous about the difficult schema change this will require on the Logstash side; in the battle for consistency, Logstash will want to change the multifield .raw to match what Elasticsearch uses: .keyword.

jpountz · 2016-08-09T07:16:24Z

If this proves to be a challenge to logstash, I'd personally be ok with keeping the field called keyword but having it named xxx.raw in the default mappings. Am I right to assume this is something you'd be happy with?

cdahlqvist · 2016-08-09T08:31:51Z

There are a lot of users with massive amounts of data ingested through Logstash where the current .raw field convention is used. Changing the default from .raw has the potential to unnecessarily break a lot of systems and cause problems for users using the default templates or custom index templates based on these. Please take this into consideration before deciding to change the existing .raw field naming convention.

jordansissel · 2016-08-09T23:34:03Z

@cdahlqvist We're discussing the options and impacts of .raw vs .keyword over on logstash-plugins/logstash-output-elasticsearch#462

I have a rough draft of a proposal here: logstash-plugins/logstash-output-elasticsearch#462 (comment)

jordansissel · 2016-08-09T23:35:42Z

@jpountz I'd be OK having ES's default to xxx.raw, yes. The benefit there is to not divide users across the release boundary of 5.0 (new users and old users would both get .raw if we did this)

clintongormley · 2016-08-10T11:38:15Z

@jordansissel I agree with the conclusion you reached in #18195 (comment) and I think that keyword is fundamentally the right name for this field (including for the reasons cited in #18195 (comment)). Long term it makes the purpose of the field easier to explain.

While I'm not completely against keeping the field as raw, I think that (unfettered by history) we'd choose keyword today instead.

All that said, I obviously recognise that this makes for a painful transition in Logstash. I don't have great suggestions for how to make this easier, but the options are probably as follows:

New users - use keyword from the outset
Existing users with custom templates - most of these won't be much impacted
Existing users with short retention periods - could use raw and keyword for the duration of the transition
Existing users with long retention periods - could change the template to just use raw going forwards
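
The options above differ only in which sub-field names the string template defines. A rough sketch follows, with a hypothetical helper and template name; the `ignore_above` value mirrors the default discussed earlier in the thread, and nothing here is taken from the actual Logstash template.

```python
import json


def string_template(subfields: list[str]) -> dict:
    """Dynamic template giving every string field the listed keyword sub-fields."""
    return {
        "string_fields": {  # template name is made up for this example
            "match_mapping_type": "string",
            "mapping": {
                "type": "text",
                "fields": {
                    name: {"type": "keyword", "ignore_above": 256}
                    for name in subfields
                },
            },
        }
    }


new_users = string_template(["keyword"])
short_retention = string_template(["raw", "keyword"])  # both, during the transition
long_retention = string_template(["raw"])

print(json.dumps(short_retention, indent=2))
```

Defining both sub-fields means indexing each string three times for the duration of the transition, which is why that option is suggested only for short retention periods.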

jordansissel · 2016-08-10T14:28:55Z

+1 clint's comments and keeping 'keyword'.

I think we can help users through this period of transition. It may be hard, but I think it's the right direction.

clintongormley added the discuss and :Search Foundations/Mapping labels May 7, 2016

clintongormley added the >docs label and removed the discuss label May 13, 2016

clintongormley self-assigned this May 13, 2016

rjernst mentioned this issue Aug 4, 2016

The keyword name is confusing to me. #19812

Closed

clintongormley closed this as completed Aug 11, 2016

clintongormley mentioned this issue Aug 26, 2016

Field Aliases #17511

Closed

jordansissel mentioned this issue Sep 8, 2016

revert to use .raw instead of .keyword logstash-plugins/logstash-output-elasticsearch#474

Closed

javanna added the Team:Search Foundations label Jul 16, 2024
