Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch from using docvalue_fields to extracting values from _source #44062

Merged
merged 13 commits into from
Jul 24, 2019

Conversation

astefan
Copy link
Contributor

@astefan astefan commented Jul 8, 2019

Doing this means parsing the _source and handling any fields that need to be parsed and have its outcome potentially changed by the parsers (for example, the numbers when coerce=true), just like Elasticsearch is doing it when it's indexing a document.

Fields that will not be taken from _source (but from docvalue_fields) are: keyword, date, scaled_float, geo_point, geo_shape.

This change also introduces a limitation: aliases type of fields that are NOT part of a tree of sub-fields will not be able to be retrieved anymore. field_caps API doesn't shed any light into a field being an
alias or not and at _source parsing time there is no way to know if aroot field is an alias or not. Fields of the type a.b.c.alias can be extracted from docvalue_fields, only if the field they point to can be
extracted from docvalue_fields. Also, not all fields in a hierarchy of fields can be evaluated to being an alias.

Fixes #41852.

where applicable. Doing this means parsing the _source and handling the
numbers parsing just like Elasticsearch is doing it when it's indexing
a document.

This also introduces a minor limitation: aliases type of fields that
are NOT part of a tree of sub-fields will not be able to be retrieved
anymore. field_caps API doesn't shed any light into a field being an
alias or not and at _source parsing time there is no way to know if a
root field is an alias or not. Fields of the type "a.b.c.alias" can be
extracted from docvalue_fields, only if the field they point to can be
extracted from docvalue_fields. Also, not all fields in a hierarchy of
fields can be evaluated to being an alias.
@astefan astefan requested review from costin and matriv July 8, 2019 10:14
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search

@jtibshirani jtibshirani self-requested a review July 8, 2019 11:30
@astefan
Copy link
Contributor Author

astefan commented Jul 8, 2019

@elasticmachine run elasticsearch-ci/1

Copy link
Contributor

@matriv matriv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work overall!
Left some comments but nothing major.

Also, is there a need to have the isAggregatable as an attribute of EsField class? I was thinking it can just delegate to dataType.hasDocValues(), or there is some distinction between them?

@@ -520,9 +522,26 @@ public void resolveAsSeparateMappings(String indexWildcard, String javaRegex, bo
}
EsField field = indexFields.flattedMapping.get(fieldName);
if (field == null || (invalidField != null && (field instanceof InvalidMappedField) == false)) {
int dot = fieldName.lastIndexOf('.');
/*
* Looking up the "tree" at the parent fields here to see if the field is an alias.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We discussed offline that this isn't a very solid way of checking whether a field is an alias. The fact that we don't return parent information about field aliases is kind of an 'accident' of the field caps logic, and I don't think it was an intentional API decision.

I filed #44298 to propose adding explicit field alias information to the field caps API, which would help make this more robust. It might also give enough information to be able to resolve + extract the alias field values from _source.

@astefan
Copy link
Contributor Author

astefan commented Jul 18, 2019

@matriv thank you for the review. I've addressed those and added more tests and covered some more scenarios in code. Would you mind having a second look, please?

@astefan astefan requested a review from matriv July 18, 2019 07:02
Copy link
Contributor

@matriv matriv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome work with the extensive testing and code comments.
Left a few more comments.

`date`, `scaled_float`, `geo_point`, `geo_shape`.
Most of {es-sql}'s columns are retrieved from the document's `_source` and there is no attempt to get the columns content from
`docvalue_fields` not even in the case <<mapping-source-field,`_source`>> field is disabled in the mapping explicitly.
If a column, for which there is no source stored, is asked for in a query, {es-sql} will not return it. Field types NOT returned
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe: Field types that don't follow this restriction are: .... since they are NOT returned from _source but from doc values.

@@ -84,6 +88,7 @@ public FieldHitExtractor(String name, DataType dataType, ZoneId zoneId, boolean

FieldHitExtractor(StreamInput in) throws IOException {
fieldName = in.readString();
fullFieldName = in.readOptionalString();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we need some version check here to ensure backwards compatibility?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we have this in place in other place in code? I mean, at the moment we enforce strict version matching for jdbc and, I think, for cli as well....

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But this is not for server-client mismatch but for version mismatch between nodes during a rolling upgrade.

return l;
Number result = null;
try {
result = NumberType.valueOf(dataType.esType.toUpperCase(Locale.ROOT)).parse(values, true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we extract this to a method in DataType?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which part are you referring to?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NumberType.valueOf(dataType.esType.toUpperCase(Locale.ROOT))
so it's a class method of a DataType and one can do:
dataType.numberType()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But, please, it's minor and future oriented in case we want to reuse it somewhere else.

}
}
while (rootField.parent() != null) {
fullFieldName = rootField.parent().field().getName() + "." + fullFieldName;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor, but I think it makes sense to make fullFieldName a StringBuilder because of the potential iterations in the while loop.

// Only if the field is not an alias (in which case it will be taken out from docvalue_fields if it's isAggregatable()),
// go up the tree of parents until a non-object (and non-nested) type of field is found and use that specific parent
// as the field to extract data from, from _source. We do it like this because sub-fields are not in the _source, only
// the root field to which those sub-fields belong to, are.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be great to add also an example with a subfield so it's more clear for someone that reads it in the future.

@matriv
Copy link
Contributor

matriv commented Jul 18, 2019

@astefan Should we also add a sentence in limitations page to say that if there are many docvalue fields (like date) a use can still experience the issue when doing a SELECT * ?

Copy link
Contributor

@matriv matriv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Once again: great work!

@astefan astefan requested review from jimczi and removed request for costin July 22, 2019 17:06
@astefan
Copy link
Contributor Author

astefan commented Jul 23, 2019

@elasticmachine run elasticsearch-ci/2

@astefan astefan merged commit 8bf8a05 into elastic:master Jul 24, 2019
astefan added a commit to astefan/elasticsearch that referenced this pull request Jul 24, 2019
…lastic#44062)

* Switch from using docvalue_fields to extracting values from _source
where applicable. Doing this means parsing the _source and handling the
numbers parsing just like Elasticsearch is doing it when it's indexing
a document.
* This also introduces a minor limitation: aliases type of fields that
are NOT part of a tree of sub-fields will not be able to be retrieved
anymore. field_caps API doesn't shed any light into a field being an
alias or not and at _source parsing time there is no way to know if a
root field is an alias or not. Fields of the type "a.b.c.alias" can be
extracted from docvalue_fields, only if the field they point to can be
extracted from docvalue_fields. Also, not all fields in a hierarchy of
fields can be evaluated to being an alias.

(cherry picked from commit 8bf8a05)
astefan added a commit that referenced this pull request Jul 25, 2019
…44062) (#44804)

* Switch from using docvalue_fields to extracting values from _source
where applicable. Doing this means parsing the _source and handling the
numbers parsing just like Elasticsearch is doing it when it's indexing
a document.
* This also introduces a minor limitation: aliases type of fields that
are NOT part of a tree of sub-fields will not be able to be retrieved
anymore. field_caps API doesn't shed any light into a field being an
alias or not and at _source parsing time there is no way to know if a
root field is an alias or not. Fields of the type "a.b.c.alias" can be
extracted from docvalue_fields, only if the field they point to can be
extracted from docvalue_fields. Also, not all fields in a hierarchy of
fields can be evaluated to being an alias.

(cherry picked from commit 8bf8a05)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SQL: consider _source instead of docvalue_fields for most of the fields' retrieval
5 participants