Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FVH highlighter duplicates the highlight text #29654

Open
jacool opened this issue Apr 23, 2018 · 7 comments
Open

FVH highlighter duplicates the highlight text #29654

jacool opened this issue Apr 23, 2018 · 7 comments
Labels
>bug priority:normal A label for assessing bug priority to be used by ES engineers :Search Relevance/Highlighting How a query matched a document Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch

Comments

@jacool
Copy link

jacool commented Apr 23, 2018

Elasticsearch version (bin/elasticsearch --version):
6.2.3

Plugins installed: []

JVM version (java -version):
openjdk version "1.8.0_161"

OS version (uname -a if on a Unix-like system):
Linux 5137c3a21142 4.9.87-linuxkit-aufs

Description of the problem including expected versus actual behavior:
When using the "sentence" mode with FVH highlighter some highlight texts are returned fully or partially duplicated. See the reproduction example below. The second returned highlight contains the first one fully, thus users would expect only one highlight being returned (the second one) with the first occurrence of the word "go" emphasized as well as the second occurrence. (As is usually the case with this highlighter when the searched word appears nearby several times). Another acceptable solution would be that the first sentence would not appear in the second highlight at all.

Expected result:
"I don't have access to his calendar but let me <em>go</em> and have a chat to him because I'm I'm really came to get just to get this up and running. So if let me <em>go</em> on the when he comes in I'll have to put it down and try and look at a time to do this and then I'll let you know as soon as possible candidates. "

In this specific case there is a partial duplication of highlight texts, we have observed full duplication as well in other cases.

Steps to reproduce:

PUT /test
{
  "mappings": {
    "t": {
      "properties": {
        "message": {
          "type": "text",
          "term_vector": "with_positions_offsets"
        }
      }
    }
  }
}

POST /test/t/1
{
    "message": "I don't have access to his calendar but let me go and have a chat to him because I'm I'm really came to get just to get this up and running. So if let me go on the when he comes in I'll have to put it down and try and look at a time to do this and then I'll let you know as soon as possible candidates."
}

GET /test/_search
{
  "version": true,
  "query": {
    "match_phrase": {
      "message": {
        "query": "go"
      }
    }
  },
  "highlight": {
    "fields": {
      "message": {
        "boundary_max_scan":100,
        "fragment_size":100,
        "type":"fvh",
        "number_of_fragments":20,
        "boundary_scanner":"sentence"
      }
    }
  }
}

Results:

"highlight": {
          "message": [
            "I don't have access to his calendar but let me <em>go</em> and have a chat to him because I'm I'm really came to get just to get this up and running. ",
            "I don't have access to his calendar but let me go and have a chat to him because I'm I'm really came to get just to get this up and running. So if let me <em>go</em> on the when he comes in I'll have to put it down and try and look at a time to do this and then I'll let you know as soon as possible candidates. "
          ]
        }
@javanna javanna added the :Search Relevance/Highlighting How a query matched a document label Apr 23, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search-aggs

@mayya-sharipova
Copy link
Contributor

mayya-sharipova commented Apr 23, 2018

@jacool Thanks for reporting this issue. I can reproduce this as well.

@jimczi The bug is on Lucene level, how fragments are being built in BaseFragmentsBuilder.java.
First, we get temporary fragInfos with the offsets [0,100] and [105,205] with requested fragment_size.
Then here, we go through each fragInfo, and building textual fragments from it, using Java's sentence breakIterator. Java's breakIterator finds the end_offset of the 1st textual fragment as 141. But we don't update the second fragInfo with this new information, that it now should start with 142, and not 105 anymore.

@jimczi do you think, we should be fixing it, or with will wait when @romseygeek rewrites highlighters with his new match info?

@colings86 colings86 added the >bug label Apr 24, 2018
@jimczi
Copy link
Contributor

jimczi commented Apr 24, 2018

@jimczi do you think, we should be fixing it, or with will wait when @romseygeek rewrites highlighters with his new match info?

The unified highlighter should produce the expected snippets so if the fix is not trivial I'd advise to switch to the default highlighter in 6. @jacool any reason to use the fvh highlighter rather than the unified ? The unified highlighter automatically detects if a field is indexed with term_vectors and can also detect sentences so you should be able to get what you want.

@jacool
Copy link
Author

jacool commented Apr 24, 2018

@jimczi The unified highlighter is of no value to us due to Issue #29561

@jimczi
Copy link
Contributor

jimczi commented Apr 24, 2018

ok thanks @jacool, #29561 needs a fix in Lucene, I'll dig.

@rjernst rjernst added the Team:Search Meta label for search team label May 4, 2020
@mayya-sharipova
Copy link
Contributor

mayya-sharipova commented Mar 14, 2024

Still an issue in Elasticsearch 8.12 with fvh highlighter, no such problem for unified highlighter

@javanna javanna added Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch and removed Team:Search Meta label for search team labels Jul 12, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-search-relevance (Team:Search Relevance)

@javanna javanna added the priority:normal A label for assessing bug priority to be used by ES engineers label Jul 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug priority:normal A label for assessing bug priority to be used by ES engineers :Search Relevance/Highlighting How a query matched a document Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch
Projects
None yet
Development

No branches or pull requests

8 participants