-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix Term Vectors with artificial docs and keyword fields #53504
Conversation
Previously, Term Vectors API was returning empty results for artificial documents with keyword fields. Checking only for `string()` on `IndexableField` is not enough, since for `KeywordFieldType` `binaryValue()` must be used instead. Fixes elastic#53494
Pinging @elastic/es-search (:Search/Search) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch @matriv, I left a comment regarding the applicability of the fix.
if (field.name().equals(name)) { | ||
if (field.stringValue() != null) { | ||
result.add(field.stringValue()); | ||
} else if (field.binaryValue() != null) { // KeywordFieldType |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We cannot use the binary value without checking the type of the field. It could be a real binary
field that doesn't store utf8 bytes. Since this function is only used in the terms vector service, can you move it there and adds a check to extract binary values only if the field is of type keyword
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The method is called as you said only by TermVectorService and it's guarded by isValidField:
https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/elasticsearch/index/termvectors/TermVectorsService.java#L304 but I agree, it makes sense to move the method close to this so it's visible.
@jimczi I moved it to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for moving the function. I left one comment regarding the field value extraction.
for (IndexableField field : fields) { | ||
if (field.stringValue() != null) { | ||
result.add(field.stringValue()); | ||
} else if (field.binaryValue() != null) { // KeywordFieldType |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that this will create duplicates due to doc values field that also use a binary value (SortedSetDocValuesField
). You should change this to something like:
if (field.fieldType() instanceof KeywordFieldMapper.KeywordFieldType) {
result.add(field.binaryValue().utf8ToString());
} else if (field.fieldType() instanceof StringFieldType) {
result.add(field.stringValue());
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Previously, Term Vectors API was returning empty results for artificial documents with keyword fields. Checking only for `string()` on `IndexableField` is not enough, since for `KeywordFieldType` `binaryValue()` must be used instead. Fixes elastic#53494 (cherry picked from commit 1fc3fe3)
Previously, Term Vectors API was returning empty results for
artificial documents with keyword fields. Checking only for
string()
on
IndexableField
is not enough, since forKeywordFieldType
binaryValue()
must be used instead.Fixes #53494