New Histogram field mapper that supports percentiles aggregations. #48580
Conversation
Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)
I like it a lot. My main concern is that the parsing of aggregations was made lenient for this to work by accepting any field in the parser. I think it's fine that we use generic objects internally like `ValuesSource` and do `instanceof` calls to see how these objects should be handled. However, I'd like the parser to be strict and fail if a field is provided that is neither a numeric field nor a histogram field.
Let's add docs?
> Accuracy and suitability of the format.
It looks good to me.
> The field mapper does not support an array of histograms as input; does it need to support it?
I don't think it needs to. We could document that this field only supports single values, like we do for vector fields (you might want to borrow the logic that vectors use to make sure fields are actually single-valued).
> Overflow of aggregations? As we are adding pre-aggregated data, it can theoretically happen that the aggregation histogram overflows more easily (I think the maximum number of total counts is `Long.MAX_VALUE`).
I think we should ignore this on our end and push the problem to the upstream libraries, since it's a general problem they have, given that they allow collecting multiple instances of the same value at once. The fact that you require counts to be integers and reject longs sounds like the right thing to do on our end.
(Outdated, resolved review comment on ...ain/java/org/elasticsearch/search/aggregations/metrics/AbstractHDRPercentilesAggregator.java.)
return new HDRPercentilesAggregator(name, valuesSource, searchContext, parent, percents, numberOfSignificantValueDigits, keyed,
    config.format(), pipelineAggregators, metaData);
Can you try to undo the formatting changes in this file and a couple of others to keep the diff readable?
@@ -80,7 +79,7 @@
 static {
     PARSER = new ConstructingObjectParser<>(PercentileRanksAggregationBuilder.NAME, false,
         (a, context) -> new PercentileRanksAggregationBuilder(context, (List) a[0]));
-    ValuesSourceParserHelper.declareNumericFields(PARSER, true, false, false);
+    ValuesSourceParserHelper.declareAnyFields(PARSER, true, true);
Can we try to be less lenient and require a field that is either numeric or a histogram?
Yes, that makes sense and that is the tricky part. @not-napoleon I might want to borrow a bit of your thoughts on how to do this in the best way possible.
The parser doesn't currently support that, and right now the convention is to use ANY for aggregations that can accept more than one type of input. Cleaning that up is a goal of the ValuesSource refactor effort I'm currently working on (see #42949)
Thanks @not-napoleon, let's keep this as a TODO for later then, @iverase?
Agreed. I think that to support this we need the effort that @not-napoleon is working on (thanks for taking a look!).
        return getHistogramValue(values.binaryValue());
    } catch (IOException e) {
        throw new IllegalStateException("Cannot load doc value", e);
    }
Let's make `histogram()` throw an `IOException` for consistency with `BinaryDocValues#binaryValue`?
    } catch (IOException e) {
        throw new IllegalStateException("Cannot load doc values", e);
    }
}
We should throw an `UnsupportedOperationException` in the two methods above.
@Override
public SortField sortField(Object missingValue, MultiValueMode sortMode,
                           XFieldComparatorSource.Nested nested, boolean reverse) {
    return null;
This should throw an `UnsupportedOperationException` too.
private HistogramValue getHistogramValue(final BytesRef bytesRef) throws IOException {
    final ByteBufferStreamInput streamInput = new ByteBufferStreamInput(
        ByteBuffer.wrap(bytesRef.bytes, bytesRef.offset, bytesRef.length));
    final int numValues = streamInput.readVInt();
We could also avoid storing the length and consider the iterator exhausted when all bytes have been read.
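A rough sketch of that alternative, assuming the stream input can report its remaining bytes via `available()`:

```java
// instead of a numValues prefix, read (value, count) pairs until the
// wrapped buffer is exhausted
final ByteBufferStreamInput streamInput = new ByteBufferStreamInput(
    ByteBuffer.wrap(bytesRef.bytes, bytesRef.offset, bytesRef.length));
while (streamInput.available() > 0) {
    final double value = streamInput.readDouble();
    final int count = streamInput.readVInt();
    // expose (value, count) as the next entry of the histogram iterator
}
```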
context.path().add(simpleName());
try {
    List<Double> values = null;
    List<Integer> counts = null;
Let's use native variants, e.g. `IntArrayList` and `DoubleArrayList`?
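For example, a sketch using the HPPC primitive collections (assuming they are available on the classpath):

```java
import com.carrotsearch.hppc.DoubleArrayList;
import com.carrotsearch.hppc.IntArrayList;

// primitive-backed lists avoid boxing every parsed value and count
DoubleArrayList values = new DoubleArrayList();
IntArrayList counts = new IntArrayList();
values.add(0.5);
counts.add(12);
```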
if (values.size() == 0) {
    throw new MapperParsingException("error parsing field ["
        + name() + "], arrays for values and counts cannot be empty");
}
I wonder whether we should actually fail here or not.
On the other hand, maybe we should require that values come in order and fail if there are duplicate values?
Regarding empty arrays: yes, we can just ignore them instead of failing.
I like the idea of requiring values to be ordered and disallowing duplicates. Before implementing I will wait for more input.
I updated the code so that in the case of empty arrays we just ignore them, as we do with null values.
In addition, we now require that values are provided in increasing order.
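A minimal sketch of that validation (variable names and the error message are illustrative, not the exact committed code):

```java
// while parsing, reject values that are not in increasing order;
// duplicates are still allowed (see the t-digest discussion below)
double previousValue = -Double.MAX_VALUE;
for (int i = 0; i < values.size(); i++) {
    final double current = values.get(i);
    if (current < previousValue) {
        throw new MapperParsingException("error parsing field [" + name()
            + "], values must be provided in increasing order");
    }
    previousValue = current;
}
```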
Really looking forward to this 😍 A few questions (see the sketch after this list for an illustration):
- Is it possible to extend the range in subsequent documents?
- Is it possible to use different buckets in subsequent documents?
- Is it possible to omit buckets with a value of zero?

These are not necessarily requirements we have in APM, just trying to get a feel for what's possible.
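For illustration, a hypothetical pair of documents with different bucket layouts (the `latency` field name and all numbers are made up):

```json
{ "latency": { "values": [0.1, 0.2, 0.3], "counts": [3, 7, 23] } }
{ "latency": { "values": [0.2, 0.4, 0.8, 1.6], "counts": [5, 2, 1, 1] } }
```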
(Resolved review comment on ...java/org/elasticsearch/search/aggregations/metrics/AbstractTDigestPercentilesAggregator.java.)
import java.util.List;

public class HistogramAggregationTests extends ESSingleNodeTestCase {
This naming is a bit confusing, since there is also a histogram aggregation which is completely different from this new field. Maybe we should call this `HistogramPercentileAggregationTests` or something else?
Originally I created this test to check different approaches. I renamed it following your suggestion.

> and use that to decide the collecting method

Yes, that is possible, as every histogram is independent of the others, so it can have different buckets and a different number of buckets.
The count for a bucket can be zero, but the bucket must be present. In general, the lengths of the `values` and `counts` arrays must be the same.
One thing to note is that it's possible, in TDigest particularly, to have multiple buckets with the same value. This can happen if the same value is inserted enough times to exceed the threshold for the centroid, which forces a new centroid with the same value to be created. If we move to rejecting histograms with multiples of the same bucket, it will make the code for extracting the TDigest histogram (from your PR description) less simple, since the user will need to keep track of whether the value has changed when moving to the next centroid. I don't see a problem with requiring ordered buckets though.
Good point Colin, I agree that rejecting duplicates would make it tricky to work with t-digests.
If we allow duplicates, I don't think we need to require buckets to be ordered. WDYT?
I think we should at least make sure values are sorted in the doc-value representation. Then we could either sort buckets ourselves or require users to provide us with sorted buckets.
I'd like the `ignoreMalformed` logic to be a bit more robust, but other than that it looks good to me.
(Outdated, resolved review comment on ...ain/java/org/elasticsearch/search/aggregations/metrics/AbstractHDRPercentilesAggregator.java.)
-can be extracted either from specific numeric fields in the documents, or
-be generated by a provided script.
+over numeric values extracted from the aggregated documents. These values can be
+generated by a provided script or extracted from specific numeric or histogram
Add a link to the histogram field?
        Map<String, Object> node, ParserContext parserContext)
        throws MapperParsingException {
    Builder builder = new HistogramFieldMapper.Builder(name);
    parseField(builder, name, node, parserContext);
Let's not call `parseField`; none of the properties that it parses are supported by this field?
I am supporting them:
- The user can disable doc values.
- If the user sets index, index options, or store, an error will be thrown. Note that setting index or store to false is allowed.

I am not handling boost, similarity, and copy_to. Shall we throw an error if the user defines those fields?
} else if (count > 0) {
    // we do not add elements with count == 0
    streamOutput.writeDouble(values.get(i));
    streamOutput.writeVInt(count);
I'd suggest putting the count before the value; it might make it easier to compress better in the future by stealing bits of the count.
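That is, something like this sketch:

```java
// count first: a future encoding could steal low bits of the vint for flags
streamOutput.writeVInt(count);
streamOutput.writeDouble(values.get(i));
```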
if (ignoreMalformed.value() == false) {
    throw new MapperParsingException("failed to parse field [{}] of type [{}]",
        ex, fieldType().name(), fieldType().typeName());
}
I think the docs look good. I just left a few minor comments/tweaks to make a certain aspect more explicit; otherwise I think they are fine 👍
I've been following the code changes from afar; I think Adrien has the code review aspect covered, so I'll defer to him there :)
histogram was constructed. It is important to consider the percentiles aggregation mode that will be used
to build it. Some possibilities include:

- For the <<search-aggregations-metrics-percentile-aggregation, T-Digest>> mode, histograms
Hmm, trying to tweak this a little to make it more explicit, so the user knows what the value/count fields do:

- For the <<search-aggregations-metrics-percentile-aggregation, T-Digest>> mode, the `values` array represents the mean centroid positions and the `counts` array represents the number of values that are attributed to each centroid. If the algorithm has already started to approximate the percentiles, this inaccuracy is carried over in the histogram.

WDYT?
can be built by using the mean value of the centroids and the centroid's count. If the algorithm has already
started to approximate the percentiles, this inaccuracy is carried over in the histogram.

- For the <<_hdr_histogram,High Dynamic Range (HDR)>> histogram mode, histograms
Similarly:

- For the <<_hdr_histogram,High Dynamic Range (HDR)>> histogram mode, the `values` array represents fixed upper limits of each bucket interval, and the `counts` array represents the number of values that are attributed to each interval. This implementation maintains a fixed worst-case percentage error (specified as a number of significant digits), therefore the value used when generating the histogram would be the maximum accuracy you can achieve at aggregation time.

??
can be created by using the recorded values and the count at that value. This implementation maintains a fixed worst-case
percentage error (specified as a number of significant digits), therefore the value used when generating the histogram
would be the maximum accuracy you can achieve at aggregation time.
Perhaps another sentence/paragraph at the end?

The `histogram` field is "algorithm agnostic" and does not store data specific to either T-Digest or HDRHistogram. While this means the field can technically be aggregated with either algorithm, in practice the user should choose one algorithm and index data in that manner (e.g. centroids for T-Digest or intervals for HDRHistogram) to ensure best accuracy.

Or something similar... trying to convey to the user that how they index the data is important and that they should choose upfront.
@jpountz, I changed the logic so that I am now using `XContentSubParser` to handle ignore_malformed (as it should be). My only doubt is about parse fields. I am supporting doc values so a user can disable them. It might be useful if a user wants to store the histogram fields (already parsed) without doc values, so that at a later stage they can be reindexed with them. WDYT? Should we handle boost, similarity, and copy_to and throw an error if the user defines them?
Given that it's always easier to add features than to remove them, I'd be in favor of only supporting …
Great, that makes things simpler. I removed support for parse fields.
token = context.parser().nextToken();
if (subParser != null) {
    while (token != null) {
        token = subParser.nextToken();
    }
Do `subParser.close()` instead?
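i.e. a sketch of the suggested change:

```java
if (subParser != null) {
    // skips to the end of the sub-object, consuming any remaining tokens
    subParser.close();
}
```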
`histogram` fields are primarily intended for use with aggregations. To make it
more readily accessible for aggregations, `histogram` field data is stored as
binary <<doc-values,doc values>> and not indexed. Its size in bytes is at most
`12 * numValues`, where `numValues` is the length of the provided arrays.
I think it's actually 13, since vints can take up to 5 bytes: 8 bytes for the double value plus up to 5 bytes for the vint count.
@elasticmachine run elasticsearch-ci/bwc
@elasticmachine update branch
This PR explores the addition of a new histogram field mapper that stores numerical data in a pre-aggregated format to be used in percentiles aggregations.
Mapper
The new field is defined in a mapping using the following structure:
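The original snippet is not reproduced here; a sketch based on the `values`/`counts` description below (`my_index` and `my_histogram` are placeholder names):

```json
PUT my_index
{
  "mappings": {
    "properties": {
      "my_histogram": {
        "type": "histogram"
      }
    }
  }
}
```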
And it can be populated using the following structure:
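Again a sketch; the two parallel arrays follow the description below, and the numbers are made up:

```json
PUT my_index/_doc/1
{
  "my_histogram": {
    "values": [0.1, 0.2, 0.3, 0.4, 0.5],
    "counts": [3, 7, 23, 12, 6]
  }
}
```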
where `values` is an array of doubles and `counts` is an array of integers, and both must have the same length. This format is up for discussion, but the reason for choosing it is that such arrays can easily be generated from the existing histogram implementations.

For TDigest, they can be generated using the following code (see example):
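A sketch of the idea, assuming the `com.tdunning.math.stats` t-digest API (not necessarily the PR's exact snippet):

```java
import com.tdunning.math.stats.Centroid;
import com.tdunning.math.stats.TDigest;
import java.util.ArrayList;
import java.util.List;

double[] samples = {0.1, 0.2, 0.2, 0.5, 1.5};
TDigest digest = TDigest.createAvlTreeDigest(100);
for (double sample : samples) {
    digest.add(sample);
}
// each centroid becomes one (value, count) bucket of the histogram field
List<Double> values = new ArrayList<>();
List<Integer> counts = new ArrayList<>();
for (Centroid centroid : digest.centroids()) {
    values.add(centroid.mean());
    counts.add(centroid.count());
}
```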
For HDR histograms, they can be generated using the following code (see example):
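Similarly, a sketch assuming the `org.HdrHistogram` API:

```java
import org.HdrHistogram.DoubleHistogram;
import org.HdrHistogram.DoubleHistogramIterationValue;
import java.util.ArrayList;
import java.util.List;

double[] samples = {0.1, 0.2, 0.2, 0.5, 1.5};
DoubleHistogram histogram = new DoubleHistogram(3); // 3 significant digits
for (double sample : samples) {
    histogram.recordValue(sample);
}
// each recorded value becomes one (value, count) bucket of the histogram field
List<Double> values = new ArrayList<>();
List<Integer> counts = new ArrayList<>();
for (DoubleHistogramIterationValue value : histogram.recordedValues()) {
    values.add(value.getValueIteratedTo());
    counts.add((int) value.getCountAtValueIteratedTo());
}
```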
This structure is stored as a binary doc value.
Aggregations
This field can be used in standard `percentile` and `percentile_ranks` aggregations, and it can be used together with standard numeric fields. In order for these aggregations to support this new format, the following interfaces have been created:

- `IndexHistogramFieldData`: a new specialisation of field data for histograms.
- `AtomicHistogramFieldData`: a new specialisation of atomic field data for histograms.
- `HistogramValues`: the doc values returned by the atomic field data.
- `HistogramValue`: a doc value representing one of those histograms.

The aggregations have been updated accordingly to be able to understand this new field data.
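For a feel of how these fit together, a hedged sketch of an aggregator consuming the new doc values; the accessor names (`getHistogramValues`, `advanceExact`, `histogram`, `next`, `value`, `count`) and the `state` object are assumptions based on the interfaces above and the review comments:

```java
// per-document collection: replay each pre-aggregated (value, count)
// pair into the percentiles state, as if value had occurred count times
HistogramValues values = atomicHistogramFieldData.getHistogramValues();
if (values.advanceExact(doc)) {
    HistogramValue histogram = values.histogram();
    while (histogram.next()) {
        state.add(histogram.value(), histogram.count());
    }
}
```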
Open questions
- Accuracy and suitability of the format.
- The field mapper does not support an array of histograms as input; does it need to support it?
- Overflow of aggregations? As we are adding pre-aggregated data, it can theoretically happen that the aggregation histogram overflows more easily (I think the maximum number of total counts is `Long.MAX_VALUE`).
Relates #48578