Add pre-configured “lowercase” normalizer #53882

Merged: 7 commits merged on Apr 3, 2020

Conversation

markharwood
Contributor

Simplify the common scenario of wanting to lower-case values.
Closes #53872
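
For context, here is a minimal sketch of what this is meant to simplify, written as Python dicts that mirror create-index request bodies. The field name `tag` and the custom normalizer name `my_lowercase` are hypothetical and chosen only for illustration; the second body assumes the pre-configured normalizer is referenced by the name `lowercase`, as the PR title suggests.

```python
import json

# Without a pre-configured normalizer, lower-casing a keyword field means
# declaring a custom normalizer in the index settings first.
# "my_lowercase" and "tag" are hypothetical names.
custom_body = {
    "settings": {
        "analysis": {
            "normalizer": {
                "my_lowercase": {"type": "custom", "filter": ["lowercase"]}
            }
        }
    },
    "mappings": {
        "properties": {
            "tag": {"type": "keyword", "normalizer": "my_lowercase"}
        }
    },
}

# With the pre-configured normalizer, the mapping can refer to "lowercase"
# directly and the analysis settings block is no longer needed.
builtin_body = {
    "mappings": {
        "properties": {
            "tag": {"type": "keyword", "normalizer": "lowercase"}
        }
    }
}

if __name__ == "__main__":
    print(json.dumps(builtin_body, indent=2))
```

Either body would be sent as the payload of a create-index request; only the first one needs an `analysis` section.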

@markharwood markharwood added the >enhancement, :Search Foundations/Mapping, and :Search Relevance/Analysis labels Mar 20, 2020
@markharwood markharwood self-assigned this Mar 20, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-search (:Search/Mapping)

Contributor

@jtibshirani jtibshirani left a comment

This seems like a really helpful addition. Overall, I think it would be good to add a test to make sure everything is 'wired up' as expected -- one option is to add a test case to KeywordFieldMapperTests that exercises the built-in lowercase normalizer.

It's possible that users already have a normalizer named "lowercase" configured in their index settings. (Note that in the future we plan to ban users from defining analysis components with the same names as built-in ones, but we currently allow this behavior: #22263.) Some suggestions to help start the discussion on how this should be handled:

  • We should make sure that we at least don't error out in this case, since it could be a common set-up. Ideally, I think we'd prefer the user-defined normalizer so that there aren't any surprising changes in behavior during an upgrade.
  • We can add an entry to the migration documentation encouraging users to remove their custom-defined 'lowercase' normalizer in favor of using the built-in one, or to rename it.
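
As a rough illustration of the migration point, the sketch below scans a settings response for custom normalizers whose names collide with built-in ones. It assumes the response is shaped like the output of the get-settings API (index name, then `settings.index.analysis.normalizer`); the helper name `indices_shadowing_built_ins` and the sample index names are hypothetical.

```python
from typing import Dict, List

# Names of pre-configured normalizers; "lowercase" is the one this PR adds.
BUILT_IN_NORMALIZERS = {"lowercase"}


def indices_shadowing_built_ins(settings_response: Dict[str, dict]) -> Dict[str, List[str]]:
    """Map each index name to its custom normalizers that reuse a built-in name."""
    collisions: Dict[str, List[str]] = {}
    for index_name, body in settings_response.items():
        # Assumed layout: {index: {"settings": {"index": {"analysis": {"normalizer": {...}}}}}}
        normalizers = (
            body.get("settings", {})
            .get("index", {})
            .get("analysis", {})
            .get("normalizer", {})
        )
        shadowed = sorted(name for name in normalizers if name in BUILT_IN_NORMALIZERS)
        if shadowed:
            collisions[index_name] = shadowed
    return collisions


if __name__ == "__main__":
    sample = {
        "logs": {"settings": {"index": {"analysis": {"normalizer": {
            "lowercase": {"type": "custom", "filter": ["asciifolding"]}}}}}},
        "metrics": {"settings": {"index": {}}},
    }
    print(indices_shadowing_built_ins(sample))  # {'logs': ['lowercase']}
```

Indices reported this way are candidates for either renaming the custom normalizer or dropping it in favor of the built-in one.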

@markharwood
Contributor Author

markharwood commented Mar 30, 2020

Thanks for the review, @jtibshirani!

> We should make sure that we at least don't error out in this case, since it could be a common set-up. Ideally, I think we'd prefer the user-defined normalizer so that there aren't any surprising changes in behavior during an upgrade.

I checked this with a pre-existing custom "lowercase" normalizer that didn't actually lowercase (it only did ASCII folding). The behaviour was what I would have hoped for:

  • Pre-upgrade
    • Create a test index with a custom "lowercase" normalizer that only does ASCII folding (no lowercasing)
    • Create doc test/_doc/1 with value "Publiées"
    • Running a terms agg on the field returns "Publiees"
  • Post-upgrade
    • Create doc test/_doc/2 with value "Publiées 2"
    • Running a terms agg on the field returns "Publiees" and "Publiees 2"
    • Create a test_oob index that uses the new out-of-the-box "lowercase" normalizer rather than a custom analysis setting
    • Create doc test_oob/_doc/1 with value "Publiées 3"
    • Running a terms agg on the test_oob index returns "publiées 3"
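
The observed behaviour can be modelled with a small Python sketch. It is a simulation rather than the actual Elasticsearch code path: `ascii_fold` is a rough stand-in for the asciifolding filter, and `resolve_normalizer` is a hypothetical helper that assumes an index-level definition takes precedence over a built-in one of the same name.

```python
import unicodedata


def ascii_fold(value: str) -> str:
    """Approximate ASCII folding: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", value)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Stand-in for the pre-configured normalizers.
BUILT_IN = {"lowercase": str.lower}


def resolve_normalizer(name, index_normalizers):
    """Prefer the index's own definition; fall back to the built-in one."""
    if name in index_normalizers:
        return index_normalizers[name]
    return BUILT_IN[name]


# "test" defines its own "lowercase" that only folds; "test_oob" defines nothing.
test_index = {"lowercase": ascii_fold}
test_oob_index = {}

if __name__ == "__main__":
    custom = resolve_normalizer("lowercase", test_index)
    built_in = resolve_normalizer("lowercase", test_oob_index)
    print(custom("Publiées"))      # Publiees
    print(custom("Publiées 2"))    # Publiees 2
    print(built_in("Publiées 3"))  # publiées 3
```

The custom definition keeps producing the same terms before and after the upgrade, while an index without one picks up the built-in lower-casing.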

@markharwood markharwood force-pushed the fix/53872 branch 4 times, most recently from 37a3d02 to 9f07520 Compare March 30, 2020 13:49
Contributor

@jtibshirani jtibshirani left a comment

> I checked this with a pre-existing "lowercase" field that didn't actually lowercase (only ascii-folding). The behaviour was what I would have hoped for

That behavior makes sense to me too. A couple last comments:

  • We could add a test verifying that we always prefer a user's custom normalizer definition over a built-in one with the same name. This would guard against accidental changes to the upgrade behavior that we want. Perhaps AnalysisRegistryTests would be a good place to add a check.
  • I think we should mention the change in the migration documentation. Otherwise users won't know that they can clean up their index settings and remove a custom lowercase normalizer.

Finally, I wonder if it's worth checking with the team that we're happy with this approach. It would set a precedent for adding future built-in analyzers (or perhaps there's already a precedent I don't know about?)

@markharwood
Contributor Author

markharwood commented Mar 31, 2020

Thanks for the comments, Julie.
I may have overlooked something. I'm not sure what this code is for, given that this PR seems functional without me adding anything here.
I also checked that the pre-upgrade steps listed above still work if repeated post-upgrade, i.e. you can knowingly override the built-in lowercase definition with an index-specific one. This works (until we choose to prevent this behaviour with #22263).

Contributor

@jtibshirani jtibshirani left a comment

I left one comment. Other than that, it looks good to me. Thanks for all the iterations.

@markharwood markharwood force-pushed the fix/53872 branch 3 times, most recently from de9f454 to 5387a51 Compare April 1, 2020 15:36
@markharwood markharwood merged commit d83798f into elastic:master Apr 3, 2020
markharwood added a commit that referenced this pull request Apr 3, 2020
A pre-configured normalizer for lower-casing.
Closes #53872
Labels
>enhancement, :Search Foundations/Mapping, :Search Relevance/Analysis, v7.8.0, v8.0.0-alpha1
Development

Successfully merging this pull request may close these issues.

Add pre-built named normalizers
4 participants