Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add preserve_original setting in edge ngram token filter #55766

Conversation

amitmbm
Copy link
Contributor

@amitmbm amitmbm commented Apr 25, 2020

Raised this PR to address the bug #55767

@amitmbm amitmbm force-pushed the feature/expose-preserve-original-in-edge-ngram-token-filter branch from 1c4cf4e to ba598da Compare April 25, 2020 09:05
@astefan astefan added the :Search Relevance/Analysis How text is split into tokens label Apr 27, 2020
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search (:Search/Analysis)

@cbuescher cbuescher self-assigned this Apr 27, 2020
Copy link
Member

@cbuescher cbuescher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HI @amitmbm, thanks for opening this PR, looks great. I only left a few very minor remarks around formatting etc., the rest is okay. I will enabling running the tests so everything should be run past CI once you push another commit.

package org.elasticsearch.analysis.common;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this seems unused, our checkstyle rules will complain about unused imports, so better to remove it now before running the tests.

Copy link
Contributor Author

@amitmbm amitmbm Apr 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My intelliJ removed unused import wasn't configured for elasticsearch project, enabled it now :)

@@ -173,6 +173,10 @@ See <<analysis-edgengram-tokenfilter-max-gram-limits>>.
(Optional, integer)
Minimum character length of a gram. Defaults to `1`.

`preserve_original`::
(Optional, boolean)
If set to true then it would also emit the original token. Defaults to `false`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: wording might be better sth like "Emits original token then set to true. Defaults to false.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed to Emits original token when set to true. Defaults to false. --> notice changed to when from then in the suggested edit.

/**
* Test class for edge_ngram token filter.
*
* @author Amit Khandelwal
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we usually don't add @author tags to classes or test classes but rely on the commit history rather than code comments to track authors

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, removed it.

* @author Amit Khandelwal
*/
public class EdgeNGramTokenFilterFactoryTests extends ESTokenStreamTestCase {
public void testDefault() throws IOException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe add newline befor first test method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just observed this in so many other test classes and copy-pasted the initial test setup :). nvm removed this.

@cbuescher
Copy link
Member

@elasticmahine ok to test

@cbuescher
Copy link
Member

@elasticmachine run elasticsearch-ci/bwc

@amitmbm
Copy link
Contributor Author

amitmbm commented Apr 27, 2020

@cbuescher thanks for kicking another test try for elasticsearch-ci/bwc, I looked at the test failures and it was related to UpgradeClusterClientYamlTestSuiteIT class which no way related to the code I've written and seems got failure due to timeout.

@cbuescher
Copy link
Member

... which no way related to the code I've written

I agree, we'd still like to get a clean test run. Lets try this again.
@elasticmachine run elasticsearch-ci/bwc

@cbuescher
Copy link
Member

@elasticmachine update branch

@amitmbm
Copy link
Contributor Author

amitmbm commented Apr 27, 2020

@cbuescher looks like merging master into my feature branch fixed the test failures. Let me know if you can merge it if all looks OK.

@cbuescher cbuescher merged commit 9e41fed into elastic:master Apr 28, 2020
@cbuescher
Copy link
Member

Hi @amitmbm, I merged your change to master and will also port it to the latest 7.x branch. Thanks for picking this up.

cbuescher pushed a commit that referenced this pull request Apr 28, 2020
The Lucene `preserve_original` setting is currently not supported in the `edge_ngram`
token filter. This change adds it with a default value of `false`.

Closes #55767
@amitmbm
Copy link
Contributor Author

amitmbm commented Apr 28, 2020

@cbuescher I'm really glad as it's my first commit merged to Elastic code base, I had raised another similar PR #55432 which is almost reviewed by your colleague Mark Harwood, but then there is no update on this PR from last 4 days. Hope he is safe and if you get time please look into this.

After this, I want to pick some more changes and one of them is deprecating XLowerCaseTokenizerFactory mentioned in
https://github.com/elastic/elasticsearch/blob/master/modules/analysis-common/src/main/java/org/elasticsearch/analysis/common/CommonAnalysisPlugin.java#L372 Please let me know how if there is any documentation on the deprecation process at Elastic? So that I can pick this issue and several others related to deprecation.

@cbuescher
Copy link
Member

Thanks, great to hear you enjoyed working on the PR. We try to review user PRs in a timely manner but please don't expect anyone to respond to new commits etc... immediately because we all handle this differently and asynchronously.
Regarding deprecation processes: there is not one clear-cut approach, we generally aim at not changing / remove existing functionality in a minor version, and if we do so in a major version (e.g. 8.0) it is still preferred to provide a clear upgrade scenario, e.g. when removing a functionality, then we try to warn users on 7.x about the upcoming change of behaviour for example by returning warning messages with each http requerst and logging deprecation warnings. In the case that you mentioned, it's even a bit more complicated since existing indices (e.g. the ones from 7.x) still need to work with the analysis components used when they were created, so simply removing them on 8.0 isn't an option. We'd probably have to discuss the approach here in more detail on an issue.

@amitmbm
Copy link
Contributor Author

amitmbm commented Apr 28, 2020

@cbuescher I understand that Elastic as a whole company work in async mode and my intent is not to push my PRs for review, it was stuck so I thought to bring this to you notice. Anyway thanks a lot for explaining this and I would keep this in mind.

Also, reg. the deprecation changes, As you pointed out it requires more discussion, I would open a new issue and will discuss it there. Have a great day ahead 😄

@pugnascotia pugnascotia changed the title Feature/expose preserve original in edge ngram token filter Add preserve_original setting in edge ngram token filter May 7, 2020
@mmoreram
Copy link

mmoreram commented Mar 7, 2022

Do we have any update on this PR? Did you translate the discussion?
Could you please take a look again and consider merging?
Thanks!

@amitmbm
Copy link
Contributor Author

amitmbm commented Mar 8, 2022

@mmoreram I raised PR but as @cbuescher mention in this comment , AS it's related to deprecation, hence he wanted to discuss this more with his team, I am also waiting for his update.

@cbuescher let me know if you have any update.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants