Add pre-configured “lowercase” normalizer #53882

markharwood · 2020-03-20T15:57:23Z

Simplify the common scenario of wanting to lower-case values.
Closes #53872

elasticmachine · 2020-03-20T15:57:25Z

Pinging @elastic/es-search (:Search/Mapping)

jtibshirani

This seems like a really helpful addition. Overall, I think it would be good to add a test to make sure everything is 'wired up' as expected -- one option is to add a test case to KeywordFieldMapperTests that exercises the built-in lowercase normalizer.

It's possible that users already have a normalizer named lowercase configured in the settings. (Note that in the future we plan to ban users from defining analysis components with the same names as built-in ones, but we currently allow this behavior: #22263.) Some suggestions to help start the discussion on how this should be handled:

- We should make sure that we at least don't error out in this case, since it could be a common set-up. Ideally, I think we'd prefer the user-defined normalizer so that there aren't any surprising changes in behavior during an upgrade.
- We can add an entry to the migration documentation encouraging users to remove their custom-defined 'lowercase' normalizer in favor of using the built-in one, or to rename it.
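
The preference suggested above (a user-defined normalizer wins over a built-in one of the same name) can be sketched roughly as follows. This is only an illustration under stated assumptions: `NormalizerLookupSketch`, `BUILT_IN`, and `resolve` are hypothetical names, and normalizers are modelled as plain string functions rather than Elasticsearch's analysis classes.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical sketch; this is NOT the Elasticsearch AnalysisRegistry API.
class NormalizerLookupSketch {
    // Stand-in for the pre-configured (built-in) normalizers.
    static final Map<String, UnaryOperator<String>> BUILT_IN = new HashMap<>();
    static {
        BUILT_IN.put("lowercase", s -> s.toLowerCase(Locale.ROOT));
    }

    // Resolve a normalizer by name, preferring the index's own definition
    // so that an upgrade does not silently change existing behaviour.
    static UnaryOperator<String> resolve(String name, Map<String, UnaryOperator<String>> userDefined) {
        UnaryOperator<String> custom = userDefined.get(name);
        if (custom != null) {
            return custom;
        }
        UnaryOperator<String> builtIn = BUILT_IN.get(name);
        if (builtIn == null) {
            throw new IllegalArgumentException("unknown normalizer [" + name + "]");
        }
        return builtIn;
    }

    public static void main(String[] args) {
        Map<String, UnaryOperator<String>> user = new HashMap<>();
        // No custom definition: the built-in lowercasing applies.
        System.out.println(resolve("lowercase", user).apply("Quick FOX"));
        // A custom "lowercase" that only trims: the custom definition applies.
        user.put("lowercase", s -> s.trim());
        System.out.println(resolve("lowercase", user).apply("  Quick FOX "));
    }
}
```

With no custom entry the lookup falls through to the built-in definition; once an index defines its own "lowercase", that definition is used and the name clash does not raise an error.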

server/src/main/java/org/elasticsearch/index/analysis/LowercaseNormalizer.java

docs/reference/analysis/normalizers.asciidoc

markharwood · 2020-03-30T10:24:59Z

Thanks for the review, @jtibshirani!

> We should make sure that we at least don't error out in this case, since it could be a common set-up. Ideally, I think we'd prefer the user-defined normalizer so that there aren't any surprising changes in behavior during an upgrade.

I checked this with a pre-existing "lowercase" field that didn't actually lowercase (only ascii-folding). The behaviour was what I would have hoped for:

Pre-upgrade:
- Create a test index with a custom "lowercase" normalizer that only does ASCII folding (no lowercasing).
- Create doc test/_doc/1 with value "Publiées".
- A terms agg on the field returns "Publiees".
Post-upgrade:
- Create doc test/_doc/2 with value "Publiées 2".
- A terms agg on the field returns "Publiees" and "Publiees 2".
- Create a test_oob index that uses the new out-of-the-box "lowercase" normalizer rather than a custom analysis setting.
- Create doc test_oob/_doc/1 with value "Publiées 3".
- A terms agg on the test_oob index returns "publiées 3".
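
The two behaviours compared in the steps above can be approximated with plain JDK calls. This is a rough sketch, not the Lucene filters Elasticsearch actually uses: `asciiFoldOnly` imitates ASCII folding by stripping combining marks after Unicode decomposition, and the class and method names are hypothetical.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical sketch; real normalizers are built from Lucene token/char filters.
class NormalizerBehaviourSketch {
    // Approximates an ASCII-folding-only normalizer: decompose accented
    // characters, then drop the combining marks. Case is left untouched.
    static String asciiFoldOnly(String value) {
        String decomposed = Normalizer.normalize(value, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    // Approximates a lowercase-only normalizer: accents are preserved.
    static String lowercaseOnly(String value) {
        return value.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // \u00e9 is the e-acute character used in the test values above.
        System.out.println(asciiFoldOnly("Publi\u00e9es 2"));
        System.out.println(lowercaseOnly("Publi\u00e9es 3"));
    }
}
```

The first call keeps the capital P and drops the accent, matching what the custom normalizer on the test index returned; the second lowercases but keeps the accent, matching the test_oob result.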

jtibshirani

> I checked this with a pre-existing "lowercase" field that didn't actually lowercase (only ascii-folding). The behaviour was what I would have hoped for

That behavior makes sense to me too. A couple last comments:

- We could add a test to verify the behavior that we always prefer a user's custom analyzer definition. This would guard against accidental changes to the upgrade behavior that we want. Perhaps AnalysisRegistryTests would be a good place to add a check.
- I think we should mention the change in the migration documentation. Otherwise users won't know that they can clean up the index settings and remove a custom lowercase analyzer.

Finally, I wonder if it's worth checking with the team that we're happy with this approach. It would set a precedent for adding future built-in analyzers (or perhaps there's already a precedent I don't know about?)

markharwood · 2020-03-31T10:54:36Z

Thanks for the comments, Julie.
I may have overlooked something. I'm not sure what this code is for given this PR seems functional without me adding anything here.
I also checked that the pre-upgrade steps listed above still work if repeated post-upgrade, i.e. you can knowingly override the built-in lowercase definition with an index-specific one. This works (until we choose to prevent this behaviour with #22263).

jtibshirani

I left one comment. Other than that it looks good to me, thanks for all the iterations.

server/src/test/java/org/elasticsearch/index/mapper/KeywordFieldMapperTests.java


markharwood added the >enhancement, :Search Foundations/Mapping, and :Search Relevance/Analysis labels Mar 20, 2020

markharwood self-assigned this Mar 20, 2020

markharwood force-pushed the fix/53872 branch from a67a9a7 to 562a730 Compare March 20, 2020 17:41

jtibshirani reviewed Mar 23, 2020

server/src/main/java/org/elasticsearch/index/analysis/LowercaseNormalizer.java (outdated)

docs/reference/analysis/normalizers.asciidoc (outdated)

markharwood force-pushed the fix/53872 branch 4 times, most recently from 37a3d02 to 9f07520 Compare March 30, 2020 13:49

jtibshirani reviewed Mar 30, 2020

jtibshirani reviewed Mar 31, 2020

server/src/test/java/org/elasticsearch/index/mapper/KeywordFieldMapperTests.java (outdated)

markharwood force-pushed the fix/53872 branch 3 times, most recently from de9f454 to 5387a51 Compare April 1, 2020 15:36

jtibshirani approved these changes Apr 1, 2020

server/src/test/java/org/elasticsearch/index/mapper/KeywordFieldMapperTests.java (outdated)

markharwood force-pushed the fix/53872 branch from 5387a51 to 5a411ee Compare April 1, 2020 16:18

markharwood added the v8.0.0 label Apr 1, 2020

markharwood force-pushed the fix/53872 branch from 65798a6 to e8e7f83 Compare April 2, 2020 16:22

markharwood added the v7.8.0 label Apr 2, 2020

markharwood mentioned this pull request Apr 2, 2020

Custom analysis can't have same names as built-in #43252

Open

markharwood added 6 commits April 3, 2020 09:06

- 56731ca Add pre-configured “lowercase” normalizer. Closes elastic#53872
- 67da5d4 Doc changes to reference new lowercase normalizer.
- cb86000 Address review comments - typo and test for new in-built normalizer
- b58b2b7 Added name-clash test for normalizer, migration change and removed superfluous TODO.
- e4962a7 Shifted name-clash test logic from KeywordFieldMapperTests to AnalysisRegistryTests
- a6d319b Whitespace fix
- 88f5f72 Reverse this comment now that we’re introducing lowercase normalizer first in 7x

markharwood force-pushed the fix/53872 branch from e8e7f83 to 88f5f72 Compare April 3, 2020 08:07

markharwood merged commit d83798f into elastic:master Apr 3, 2020

markharwood added the backport pending label Apr 3, 2020

markharwood mentioned this pull request Apr 3, 2020

Backport of lowercase normalizer #54707

Merged

markharwood added a commit that referenced this pull request Apr 3, 2020

Backport of lowercase normalizer PR #53882

2da2305

A pre-configured normalizer for lower-casing. Closes #53872

markharwood removed the backport pending label Apr 3, 2020

russcam mentioned this pull request May 29, 2020

7.8.0 Meta ticket elastic/elasticsearch-net#4718

Closed

17 tasks

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021
