Expose preserve_original setting in edge ngram token filter #55767

Closed
amitmbm opened this issue Apr 25, 2020 · 1 comment
Labels
:Search Relevance/Analysis (How text is split into tokens)
Team:Search Relevance (Meta label for the Search Relevance team in Elasticsearch)

Comments

amitmbm (Contributor) commented Apr 25, 2020

The preserve_original setting is currently not supported in the edge_ngram token filter (EdgeNGramTokenFilter): https://www.elastic.co/guide/en/elasticsearch/reference/master/analysis-edgengram-tokenfilter.html#analysis-edgengram-tokenfilter
There is even a TODO comment in the Elasticsearch master code (as of Apr 25, 2020) to expose preserve_original:
https://github.com/elastic/elasticsearch/blob/master/modules/analysis-common/src/main/java/org/elasticsearch/analysis/common/EdgeNGramTokenFilterFactory.java#L66

Elasticsearch version (bin/elasticsearch --version):
8.0.0-SNAPSHOT

Plugins installed: []
N/A

JVM version (java -version):
openjdk 14.0.1 2020-04-14
OpenJDK Runtime Environment (build 14.0.1+7)
OpenJDK 64-Bit Server VM (build 14.0.1+7, mixed mode, sharing)

OS version (uname -a if on a Unix-like system):
Darwin LT6577 19.3.0 Darwin Kernel Version 19.3.0: Thu Jan 9 20:58:23 PST 2020; root:xnu-6153.81.5~1/RELEASE_X86_64 x86_64

Description of the problem including expected versus actual behavior:
This is a feature request, and it is already noted as a TODO in the Elasticsearch master code. If the setting were exposed, the preserve-original functionality would work with the edge n-gram token filter.
Steps to reproduce:


1. Delete any existing index named preserveoriginal so the test starts clean.
curl --user elastic:123456 -X DELETE "localhost:9200/preserveoriginal"
2. Create a new index with a custom analyzer that uses the edge_ngram token filter.

curl --user elastic:123456 -X PUT "localhost:9200/preserveoriginal?pretty" -H 'Content-Type: application/json' -d'
{
    "settings": {
        "max_ngram_diff": 50,
        "analysis": {
            "filter": {
                "edge_ngram_filter": {
                    "type": "edge_ngram",
                    "min_gram": 1,
                    "max_gram": 2
                }
            },
            "analyzer": {
                "edge_ngram_analyzer": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": [
                        "lowercase",
                        "edge_ngram_filter"
                    ]
                }
            }
        }
    }
}
'
3. Check the tokens generated by the edge_ngram_analyzer created in the previous step:
curl --user elastic:123456 -X GET "localhost:9200/preserveoriginal/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
  "analyzer" : "edge_ngram_analyzer",
  "text" : "foo"
}
'

4. The output of the _analyze API call above:

{
  "tokens" : [
    {
      "token" : "f",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "fo",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    }
  ]
}

Please note that the original token foo isn't present in the result.
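To make the request concrete, here is a small bash sketch of the token stream for a single token, with and without the requested behaviour. The edge_ngrams function is a hypothetical helper written for this illustration, not part of Elasticsearch. It assumes the semantics described in the Lucene javadoc for EdgeNGramTokenFilter: the original term is kept when it is shorter than min_gram or longer than max_gram. Offsets and positions are not modelled.

```shell
#!/usr/bin/env bash
# Hypothetical helper: emulates the edge n-grams of one token.
# Usage: edge_ngrams TOKEN MIN_GRAM MAX_GRAM [PRESERVE_ORIGINAL=true|false]
edge_ngrams() {
  local token=$1 min=$2 max=$3 preserve=${4:-false}
  local len=${#token} n

  # Emit every prefix whose length lies between min_gram and max_gram.
  for (( n = min; n <= max && n <= len; n++ )); do
    printf '%s\n' "${token:0:n}"
  done

  # Assumed preserve_original semantics: also emit the full token when
  # it falls outside the [min_gram, max_gram] length range.
  if [[ $preserve == true ]] && (( len > max || len < min )); then
    printf '%s\n' "$token"
  fi
}

edge_ngrams foo 1 2 false   # prints: f, fo (what the issue observes)
edge_ngrams foo 1 2 true    # prints: f, fo, foo (what the issue asks for)
```

With min_gram 1 and max_gram 2, the token foo is longer than max_gram, so it is only emitted when the flag is true.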

amitmbm changed the title from "expose preserve_original setting in edge ngram token filter" to "Expose preserve_original setting in edge ngram token filter" on Apr 25, 2020
dnhatn added the :Search Relevance/Analysis label on Apr 27, 2020
elasticmachine (Collaborator) commented:

Pinging @elastic/es-search (:Search/Analysis)

elasticmachine added the Team:Search label on Apr 27, 2020
cbuescher assigned and then unassigned cbuescher on Apr 27, 2020
cbuescher pushed a commit that referenced this issue Apr 28, 2020
The Lucene `preserve_original` setting is currently not supported in the `edge_ngram`
token filter. This change adds it with a default value of `false`.

Closes #55767
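For readers who want to try the setting described in this commit message, the following is a sketch of the index-creation request from the issue with preserve_original added to the filter. It assumes an Elasticsearch build that already contains this change. The host, credentials, and index name are taken from the reproduction steps, and create_index is a hypothetical wrapper introduced only for this example.

```shell
#!/usr/bin/env bash
# Same settings body as in the issue, plus preserve_original on the filter.
SETTINGS_JSON=$(cat <<'EOF'
{
    "settings": {
        "max_ngram_diff": 50,
        "analysis": {
            "filter": {
                "edge_ngram_filter": {
                    "type": "edge_ngram",
                    "min_gram": 1,
                    "max_gram": 2,
                    "preserve_original": true
                }
            },
            "analyzer": {
                "edge_ngram_analyzer": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": ["lowercase", "edge_ngram_filter"]
                }
            }
        }
    }
}
EOF
)

# Hypothetical wrapper: sends the body above to a local cluster.
# Host, credentials, and index name are the ones used in the issue.
create_index() {
  curl --user elastic:123456 -X PUT "localhost:9200/preserveoriginal?pretty" \
    -H 'Content-Type: application/json' -d "$SETTINGS_JSON"
}

# Call create_index against a running cluster, then repeat the
# _analyze request from the issue to inspect the resulting tokens.
```

If the setting behaves as the commit message describes, rerunning the _analyze request for foo should then also return the original token; leaving the setting out keeps the default of false and the output shown in the issue.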
javanna added the Team:Search Relevance label and removed the Team:Search label on Jul 12, 2024