
Expose preserveOriginal in NGramTokenFilterFactory which is marked as TODO in master code. #55431

Closed
amitmbm opened this issue Apr 19, 2020 · 3 comments
Labels: :Search Relevance/Analysis (How text is split into tokens), Team:Search Relevance (Meta label for the Search Relevance team in Elasticsearch)

Comments


amitmbm commented Apr 19, 2020

The preserveOriginal setting is currently not supported by the ngram token filter (https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-ngram-tokenfilter.html), and there is a TODO comment in the Elasticsearch master code (as of 19 Apr 2020) to expose preserveOriginal, see https://github.com/elastic/elasticsearch/blob/master/modules/analysis-common/src/main/java/org/elasticsearch/analysis/common/NGramTokenFilterFactory.java#L53

Elasticsearch version (bin/elasticsearch --version):
8.0.0-SNAPSHOT

Plugins installed: []
N/A

JVM version (java -version):
openjdk 14.0.1 2020-04-14
OpenJDK Runtime Environment (build 14.0.1+7)
OpenJDK 64-Bit Server VM (build 14.0.1+7, mixed mode, sharing)

OS version (uname -a if on a Unix-like system):
Darwin LT6577 19.3.0 Darwin Kernel Version 19.3.0: Thu Jan 9 20:58:23 PST 2020; root:xnu-6153.81.5~1/RELEASE_X86_64 x86_64

Description of the problem including expected versus actual behavior:
This is a feature request, already noted as a TODO in the Elasticsearch master code: expose the preserve-original functionality so that it can be used with the n-gram token filter.
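For illustration, this is roughly how the filter definition could look once the option is exposed. The setting name preserve_original is an assumption here (mirroring other token filters such as word_delimiter_graph); it is not available in current releases, which is the point of this request.

curl --user elastic:password -X PUT "localhost:9200/preserveoriginal?pretty" -H 'Content-Type: application/json' -d'
{
    "settings": {
        "analysis": {
            "filter": {
                "ngram_filter": {
                    "type": "ngram",
                    "min_gram": 1,
                    "max_gram": 2,
                    "preserve_original": true
                }
            }
        }
    }
}
'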
Steps to reproduce:

Please include a minimal but complete recreation of the problem, including
(e.g.) index creation, mappings, settings, query etc. The easier you make it for
us to reproduce, the more likely it is that somebody will take the time to look at it.

  1. Delete the existing index with the name preserveoriginal to test this feature.
    curl --user elastic:password -XDELETE localhost:9200/preserveoriginal
  2. Create a new index with a custom analyzer which uses the ngram token filter.
curl --user elastic:password -X PUT "localhost:9200/preserveoriginal?pretty" -H 'Content-Type: application/json' -d'
{
    "settings": {
        "max_ngram_diff": 50,
        "analysis": {
            "filter": {
                "ngram_filter": {
                    "type": "ngram",
                    "min_gram": 1,
                    "max_gram": 2
                }
            },
            "analyzer": {
                "ngram_analyzer": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": [
                        "lowercase",
                        "ngram_filter"
                    ]
                }
            }
        }
    }
}
'
  3. Check the tokens generated by the ngram_analyzer created in the above step:
curl --user elastic:password -X GET "localhost:9200/preserveoriginal/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
  "analyzer" : "ngram_analyzer",
  "text" : "foo"
}
'
  4. Output of the above _analyze API:
{
  "tokens" : [
    {
      "token" : "f",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "fo",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "o",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "oo",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "o",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    }
  ]
}

Note that the original token foo isn't present in the result.
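For comparison, if the filter supported preserve_original and it were enabled, the same _analyze call would be expected to also emit the original term alongside the n-grams, roughly like this additional entry (a sketch; the offsets and position are assumed to mirror the other tokens):

    {
      "token" : "foo",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    }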

Provide logs (if relevant):
N/A


amitmbm commented Apr 19, 2020

Raised PR #55432 to address this. Please have a look and comment if more information is required.

@markharwood added the :Search Relevance/Analysis label Apr 20, 2020
@elasticmachine (Collaborator)

Pinging @elastic/es-search (:Search/Analysis)


amitmbm commented May 4, 2020

Addressed this issue in my PR #55432, hence closing this.

@amitmbm closed this as completed May 4, 2020
@javanna added the Team:Search Relevance label Jul 16, 2024