Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Term Vectors with artificial docs and keyword fields #53504

Merged
merged 8 commits into from
Mar 13, 2020

Conversation

matriv
Copy link
Contributor

@matriv matriv commented Mar 12, 2020

Previously, Term Vectors API was returning empty results for
artificial documents with keyword fields. Checking only for string()
on IndexableField is not enough, since for KeywordFieldType
binaryValue() must be used instead.

Fixes #53494

Previously, Term Vectors API was returning empty results for
artificial documents with keyword fields. Checking only for `string()`
on `IndexableField` is not enough, since for `KeywordFieldType`
`binaryValue()` must be used instead.

Fixes elastic#53494
@matriv matriv added >bug :Search/Search Search-related issues that do not fall into other categories v8.0.0 v7.7.0 v7.6.2 labels Mar 12, 2020
@matriv matriv requested a review from jimczi March 12, 2020 17:59
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search (:Search/Search)

Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch @matriv, I left a comment regarding the applicability of the fix.

if (field.name().equals(name)) {
if (field.stringValue() != null) {
result.add(field.stringValue());
} else if (field.binaryValue() != null) { // KeywordFieldType
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We cannot use the binary value without checking the type of the field. It could be a real binary field that doesn't store utf8 bytes. Since this function is only used in the terms vector service, can you move it there and adds a check to extract binary values only if the field is of type keyword ?

Copy link
Contributor Author

@matriv matriv Mar 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The method is called as you said only by TermVectorService and it's guarded by isValidField:
https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/elasticsearch/index/termvectors/TermVectorsService.java#L304 but I agree, it makes sense to move the method close to this so it's visible.

@matriv
Copy link
Contributor Author

matriv commented Mar 12, 2020

@jimczi I moved it to TermVectorService but I'm unsure if it worths avoiding the instanceof.

Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for moving the function. I left one comment regarding the field value extraction.

for (IndexableField field : fields) {
if (field.stringValue() != null) {
result.add(field.stringValue());
} else if (field.binaryValue() != null) { // KeywordFieldType
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that this will create duplicates due to doc values field that also use a binary value (SortedSetDocValuesField). You should change this to something like:

if (field.fieldType() instanceof KeywordFieldMapper.KeywordFieldType) {
  result.add(field.binaryValue().utf8ToString());
} else if (field.fieldType() instanceof StringFieldType) {
  result.add(field.stringValue());
}

@matriv matriv requested a review from jimczi March 13, 2020 12:58
Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@matriv matriv merged commit 1fc3fe3 into elastic:master Mar 13, 2020
@matriv matriv deleted the fix-53494 branch March 13, 2020 15:17
matriv added a commit to matriv/elasticsearch that referenced this pull request Mar 13, 2020
Previously, Term Vectors API was returning empty results for
artificial documents with keyword fields. Checking only for `string()`
on `IndexableField` is not enough, since for `KeywordFieldType`
`binaryValue()` must be used instead.

Fixes elastic#53494

(cherry picked from commit 1fc3fe3)
matriv added a commit that referenced this pull request Mar 13, 2020
…3550)

Previously, Term Vectors API was returning empty results for
artificial documents with keyword fields. Checking only for `string()`
on `IndexableField` is not enough, since for `KeywordFieldType`
`binaryValue()` must be used instead.

Fixes #53494

(cherry picked from commit 1fc3fe3)
matriv added a commit that referenced this pull request Mar 13, 2020
…3551)

Previously, Term Vectors API was returning empty results for
artificial documents with keyword fields. Checking only for `string()`
on `IndexableField` is not enough, since for `KeywordFieldType`
`binaryValue()` must be used instead.

Fixes #53494

(cherry picked from commit 1fc3fe3)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Search/Search Search-related issues that do not fall into other categories v7.6.2 v7.7.0 v8.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Term Vectors doesn't work on artificial docs with keyword fields
4 participants